Mekelle University-Mekelle Institute of Technology Department of Information Technology Data Mining and Knowledge Discovery

The document provides an overview of data mining and knowledge discovery. It defines data mining as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining involves techniques from machine learning, statistics, databases, and other fields to discover patterns in large data sets. It discusses how vast amounts of data are now collected and stored, creating opportunities to apply data mining to gain useful knowledge and insights. The document outlines some common data mining tasks like classification, clustering, and association rule mining and the types of patterns they can reveal in databases, data warehouses, and transactional data.

Uploaded by

Yrga Weldegiwergs


MEKELLE UNIVERSITY-MEKELLE INSTITUTE OF TECHNOLOGY

DEPARTMENT OF INFORMATION TECHNOLOGY

DATA MINING AND KNOWLEDGE DISCOVERY

Halefom Tekle
Friday, February 5, 2021
Outlines
Chapter 1: Definition
 Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
 Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
 What is not data mining?
 Look up a phone number in a phone directory
 Query a Web search engine for information about “Amazon”
 What is data mining?
 Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
 Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
Con.
 Data mining is a technique for discovering interesting
patterns from data
 Data mining is also known as knowledge discovery from data (KDD).
 It is a multi-disciplinary field involving
 Machine learning
 Statistics
 Databases
 Artificial intelligence
 Information retrieval, and
 Visualization
1.1 Why Data Mining? Commercial view

 We live in a world where vast amounts of data are
collected daily.
 Lots of data is being collected and warehoused
 Web data, e-commerce
 purchases at department/grocery stores
 Bank/Credit Card transactions

 Computers have become cheaper and more powerful

 Competitive Pressure is Strong
 Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
1.3 Motivation

 There is often information “hidden” in the data that is
not readily evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all
1.4 Data Mining as the Evolution of Information
Technology
 Data mining can be viewed as a result of the natural evolution of
information technology.
 The main stages of this evolution are
 Data collection and database creation
 Database management systems
 Advanced database systems
 Advanced data analysis
 The early development of data collection and database creation
mechanisms served as a prerequisite for the later development of
effective mechanisms for data storage and retrieval, as well as query
and transaction processing.
 Nowadays numerous database systems offer query and transaction
processing as common practice.
 Advanced data analysis has naturally become the next step.
Con.
Con.
This means the world is data rich but information poor.

So, we need tools to extract the valuable knowledge
embedded in the vast amounts of data to support the
decision maker’s intuition.
Con.

Data mining
 Is the process of discovering interesting patterns and
knowledge from large amounts of data.
 Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or
KDD, while others view data mining as merely an
essential step in the process of knowledge discovery.
 The data sources can include databases, data warehouses,
the Web, other information repositories, or data that are
streamed into the system dynamically.
 The knowledge discovery process is an iterative sequence
of steps.
Con.
 Pre-processing:
 The raw data is usually not suitable for mining due to
various reasons.
 Data mining:
 The processed data is then fed to a data mining
algorithm which will produce patterns or knowledge.
 Post-processing:
 In many applications, not all discovered patterns are
useful. This step identifies those useful ones for
applications. Various evaluation and visualization
techniques are used to make the decision.
Con.
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are
retrieved from the database
4. Data transformation: where data are transformed and consolidated
into forms appropriate for mining by performing summary or
aggregation operations
5. Data mining: an essential process where intelligent methods are
applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns
representing knowledge based on interestingness measures
7. Knowledge presentation: where visualization and knowledge
representation techniques are used to present mined knowledge to
users
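The seven steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two record lists, the field names, and every function name below are hypothetical, and the "mining" step is a simple count rather than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records from two stores; None marks a missing (noisy) value.
store_a = [{"cust": "c1", "item": "milk", "amount": 2.5},
           {"cust": "c2", "item": "bread", "amount": None}]
store_b = [{"cust": "c1", "item": "bread", "amount": 1.5},
           {"cust": "c3", "item": "milk", "amount": 2.5}]

def clean(records):
    """Step 1, data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3, data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4, data transformation: consolidate into one item set per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["cust"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5, data mining (toy version): count customers per item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, min_count):
    """Step 6, pattern evaluation: keep patterns that meet a threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Step 7, knowledge presentation: render patterns as readable lines."""
    return [f"{item}: bought by {n} customers" for item, n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, ["cust", "item"]))
patterns = evaluate(mine(baskets), min_count=2)
report = present(patterns)
print(report)
```

In practice the process is iterative: the outcome of pattern evaluation often sends the analyst back to cleaning or selection with different settings.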
1.5 What Kinds of Data Can Be Mined?
 Data mining can be applied to any kind of data as long as the data
are meaningful for a target application.
 The most basic forms of data for mining applications are
 Database data
 Data warehouse data
 Transactional data
 Can also be applied to other forms of data
 data streams
 ordered/sequence data
 graph or networked data
 text data
 multimedia data (audio, video, image)
 and WWW
Con.
1.5.1 Database data
 Consider a relational database for AllElectronics.
Customer: (cust_ID, name, address, age, occupation,
annual income, credit information, category, . . .)
Item: (item_ID, brand, category, type, price, place made,
supplier, cost, . . . )
Employee: (empl_ID, name, category, group, salary,
commission, . . . )
Branch: (branch_ID, name, address, . . . )
Purchases: (trans_ID, cust_ID, empl_ID, date, time, method
paid, amount)
Items_sold: (trans_ID, item_ID, qty)
Works_at: (empl_ID, branch_ID)
Con.
 Database data
 Relational data can be accessed by database queries written in a
relational query language such as SQL (supported by systems such as PostgreSQL, …) or
 With the assistance of graphical user interfaces.
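As a rough sketch of accessing relational data with a query, the example below builds a cut-down version of the AllElectronics schema in an in-memory SQLite database and asks for total purchases per customer. The sample rows and the reduced column lists are assumptions made for illustration; only the table and key names come from the schema above.

```python
import sqlite3

# In-memory database with a reduced, hypothetical subset of the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer  (cust_ID TEXT PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE purchases (trans_ID TEXT PRIMARY KEY, cust_ID TEXT, amount REAL);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [("C1", "Smith", 34), ("C2", "Jones", 27)])
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)",
                 [("T100", "C1", 1200.0), ("T200", "C1", 300.0), ("T300", "C2", 80.0)])

# Relational query: join the two tables and aggregate per customer.
rows = conn.execute("""
SELECT c.name, SUM(p.amount) AS total
FROM customer AS c
JOIN purchases AS p ON p.cust_ID = c.cust_ID
GROUP BY c.cust_ID, c.name
ORDER BY total DESC
""").fetchall()

for name, total in rows:
    print(name, total)
```

A query like this answers a question the analyst already knows how to ask; data mining goes further by searching for patterns that were not specified in advance.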

 The mining tasks fall into two groups
 Prediction methods
 Use some variables to predict unknown or future values of
other variables.
 Example: predict the credit risk of new customers
 Example: detect deviations, that is, items with sales that are far from
those expected in comparison with the previous year
 Description methods
 Find human-interpretable patterns that describe the data.
Con.

 Predictive: classification, regression, deviation detection
 Descriptive: clustering, association rule discovery, sequential pattern discovery
Con.
1.5.2 Data warehouse
 A data warehouse is a repository of information collected from
multiple heterogeneous data sources, organized under a unified
schema at a single site to facilitate management decision making.
 Data warehouse technology includes data cleaning, data
integration, and online analytical processing (OLAP).
 OLAP is a set of analysis techniques with functionalities such
as summarization, consolidation, and aggregation, as well
as the ability to view information from different angles.
Con.

 Although OLAP tools support multidimensional analysis and
decision making, additional data analysis tools are required
for in-depth analysis, for example, data mining tools that
provide data classification, clustering, outlier/anomaly
detection, and the characterization of changes in data over
time.
 A data warehouse is usually modeled by a multidimensional
data structure, called a data cube, in which each dimension
corresponds to an attribute or a set of attributes in the schema,
and each cell stores the value of some aggregate measure such
as count or sum (sales_amount).
 A data cube provides a multidimensional view of data and
allows the precomputation and fast access of summarized data.
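To make the data cube idea concrete, the sketch below precomputes sum(sales_amount) for every combination of dimensions, so that any summarized view can then be read with a dictionary lookup. The fact rows, the dimension names (quarter, branch, item_type), and the function `build_cube` are hypothetical; real warehouses use far more compact storage than a dictionary per cuboid.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: three dimensions plus one measure.
sales = [
    {"quarter": "Q1", "branch": "B1", "item_type": "computer", "sales_amount": 400},
    {"quarter": "Q1", "branch": "B2", "item_type": "phone", "sales_amount": 150},
    {"quarter": "Q2", "branch": "B1", "item_type": "computer", "sales_amount": 350},
    {"quarter": "Q2", "branch": "B1", "item_type": "phone", "sales_amount": 100},
]
DIMENSIONS = ("quarter", "branch", "item_type")

def build_cube(rows, dimensions, measure):
    """Precompute the sum of `measure` for every subset of `dimensions`.

    Each subset of dimensions gives one cuboid; each cell of a cuboid is
    keyed by the tuple of dimension values and stores the aggregated sum.
    """
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(float)
            for row in rows:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube

cube = build_cube(sales, DIMENSIONS, "sales_amount")

print(cube[()][()])                                       # grand total
print(cube[("quarter",)][("Q1",)])                        # total for Q1
print(cube[("branch", "item_type")][("B1", "computer")])  # one 2-D cell
```

With three dimensions there are 2^3 = 8 cuboids, which shows why precomputation becomes expensive as dimensions are added and why warehouses often materialize only some of them.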
Con.
 Suppose AllElectronics has a data warehouse.
Con.
1.5.3 Transactional Data
 Each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a
web page.
 A transaction typically includes
 a unique transaction identity number (trans ID) and
 a list of the items making up the transaction, such as the items
purchased in the transaction.
 A transactional database may have additional tables, which
contain other information related to the transactions
 such as item description,
 information about the salesperson or the branch, and so on.
1.6 What Kinds of Patterns Can Be Mined?
 There are a number of data mining functionalities. These include
 Characterization and discrimination
 Mining of frequent patterns, associations, and correlations

 Classification and regression

 Clustering analysis

 Outlier analysis

 Data mining functionalities are used to specify the kinds of patterns to
be found in data mining tasks.
 Such tasks can be classified into two categories:
 Descriptive and
 Predictive.
 Descriptive mining tasks characterize properties of the data in a target
data set.
 Predictive mining tasks perform induction on the current data in order
to make predictions.
Con.
1.6.1 Class/Concept Description: Characterization and Discrimination
 Data entries can be associated with classes or concepts.
 For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.
 It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms.
 Such descriptions of a class or a concept are called class/concept
descriptions.
 These descriptions can be derived using
 Data characterization, by summarizing the data of the class under study
(often called the target class) in general terms
 Data discrimination, by comparison of the target class with one or a set of
comparative classes (often called the contrasting classes) or
 both data characterization and discrimination.
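A very small sketch of both ideas, assuming that "summarizing in general terms" can be approximated by averaging numeric attributes. The customer rows, the attribute names, and the functions `characterize` and `discriminate` are hypothetical; only the concept names bigSpenders and budgetSpenders come from the example above.

```python
from statistics import mean

# Hypothetical customers, each already assigned to a concept.
customers = [
    {"concept": "bigSpenders", "age": 45, "annual_income": 90000},
    {"concept": "bigSpenders", "age": 50, "annual_income": 110000},
    {"concept": "budgetSpenders", "age": 25, "annual_income": 30000},
    {"concept": "budgetSpenders", "age": 35, "annual_income": 40000},
]

def characterize(rows, concept, attributes):
    """Data characterization: summarize the target class by attribute means."""
    members = [r for r in rows if r["concept"] == concept]
    return {a: mean(r[a] for r in members) for a in attributes}

def discriminate(rows, target, contrast, attributes):
    """Data discrimination: compare the target class with a contrasting class."""
    t = characterize(rows, target, attributes)
    c = characterize(rows, contrast, attributes)
    return {a: t[a] - c[a] for a in attributes}

ATTRS = ["age", "annual_income"]
profile = characterize(customers, "bigSpenders", ATTRS)
gap = discriminate(customers, "bigSpenders", "budgetSpenders", ATTRS)
print(profile)
print(gap)
```

Averages are only one possible summary; counts, ranges, or generalized categories could be used in the same way.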
Con.
1.6.2 Mining Frequent Patterns, Associations, and
Correlations
 Frequent patterns, as the name suggests, are patterns that
occur frequently in data.
 There are many kinds of frequent patterns
 Frequent itemsets
 a set of items that often appear together in a transactional data set, e.g., milk
and bread
 Frequent subsequences (also known as sequential patterns)
 e.g., customers tend to purchase first a laptop, followed by a digital camera, and then a
memory card
 Frequent substructures
 can refer to different structural forms (e.g., graphs, trees, or lattices) that
may be combined with itemsets or subsequences.

Con.

 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
 Association analysis.
 Suppose that, as a marketing manager at AllElectronics, you want to
know which items are frequently purchased together (i.e., within the
same transaction).
 buys(X, “computer”) => buys(X, “software”) [support = 1%,
confidence = 50%]
 This is a single-dimensional association rule (one predicate: buys).
 Support of 1% means that 1% of all transactions contain both items;
confidence of 50% means that a customer who buys a computer also buys
software in 50% of cases.
 age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”)
[support = 2%, confidence = 60%]
 This is a multidimensional association rule (predicates: age, income, buys).
 Typically, association rules are discarded as uninteresting if they
do not satisfy both a minimum support threshold and a minimum
confidence threshold.
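Support and confidence can be computed directly from a list of transactions. The sketch below uses five made-up transactions, so its numbers differ from the 1% and 50% quoted above; the function names and the thresholds are assumptions chosen for illustration.

```python
# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "software"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / base

def is_interesting(antecedent, consequent, transactions, min_support, min_confidence):
    """Keep a rule only if it meets both thresholds."""
    both = frozenset(antecedent) | frozenset(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)

# Rule: buys(X, "computer") => buys(X, "software")
print(support({"computer", "software"}, transactions))       # 2 of 5 transactions
print(confidence({"computer"}, {"software"}, transactions))  # 2 of the 3 computer buyers
print(is_interesting({"computer"}, {"software"}, transactions, 0.3, 0.6))
```

Raising the minimum confidence to 0.7 would discard this rule, which illustrates how the two thresholds act as a filter on candidate rules.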
Con.
1.6.3 Classification and Regression for Predictive Analysis
 Classification (e.g., naïve Bayesian, SVM, and KNN)
 Is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
 The model is derived based on the analysis of a set of training
data (i.e., data objects for which the class labels are known).
 The model is used to predict the class label of objects for which
the class label is unknown.
 It predicts categorical (discrete, unordered) labels.
 Regression analysis
 Is a statistical methodology that is most often used for
numeric prediction.
 It predicts continuous-valued functions, i.e., numeric values rather than class labels.
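The contrast between the two tasks can be sketched with one simple method for each: a k-nearest-neighbour (KNN) classifier that predicts a categorical label, and a least-squares line that predicts a number. The training data, the feature meaning (age, income in thousands), and the function names are hypothetical; neither function is tuned or validated.

```python
from collections import Counter
from math import dist

def knn_predict(training, query, k=3):
    """Classification: majority label among the k training objects nearest to `query`.

    `training` is a list of (feature_tuple, class_label) pairs.
    """
    nearest = sorted(training, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical labelled customers: (age, income in thousands) -> credit risk.
training = [
    ((25, 20), "high"), ((30, 25), "high"), ((28, 22), "high"),
    ((45, 90), "low"), ((50, 110), "low"),
]
print(knn_predict(training, (27, 21)))   # categorical prediction
print(knn_predict(training, (48, 100)))

# Hypothetical numeric data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)             # numeric prediction for x = 5
```

Both functions learn from objects whose answers are known, which is what makes them predictive rather than descriptive.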
Con.
Con.

1.6.4 Cluster Analysis

 Unlike classification and regression, which analyze class-labeled
(training) data sets, clustering analyzes data objects without
consulting class labels.
 In many cases, class-labeled data may simply not exist at the
beginning.
 Clustering can be used to generate class labels for a group of
data.
 The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the
interclass similarity.
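One common way to approximate this principle is k-means, which is not named in the slides and is used here only as an illustration. The points, the starting centroids, and the function `kmeans` are assumptions; the sketch uses Euclidean distance and a fixed number of iterations.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Group points around centroids without using any class labels.

    Each round assigns every point to its nearest centroid, then moves
    each centroid to the mean of the points assigned to it.
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]          # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabelled 2-D objects forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The cluster index of each object can then serve as a generated class label for later classification.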
Con.
Con.
1.6.5 Outlier Analysis
 A data set may contain objects that do not comply with the
general behavior or model of the data.
 These data objects are outliers.
 Many data mining methods discard outliers as noise or
exceptions.
 However, in some applications (e.g., fraud detection) the rare
events can be more interesting than the more regularly
occurring ones.
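A minimal sketch of flagging such objects, assuming a simple z-score rule: a value is treated as an outlier when it lies more than a chosen number of standard deviations from the mean. The transaction amounts, the threshold of 2.0, and the function name are hypothetical, and this rule is only one of many possible outlier tests.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold * std dev."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the usual behavior.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = zscore_outliers(amounts)
print(suspicious)
```

In a fraud detection setting the flagged value is the interesting result, whereas a method that treats outliers as noise would drop it before further analysis.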
1.7 Which Technologies Are Used?
Con.

 A statistical model
 Is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their
associated probability distributions.
 Machine Learning
 Machine learning investigates how computers can learn (or improve
their performance) based on data.
 A main research area is for computer programs to automatically learn
to recognize complex patterns and make intelligent decisions based on
data.
 Learning methods
 Supervised
 Unsupervised
 Semi-supervised
 Reinforcement
Which Kinds of Applications Are Targeted?

 Business Intelligence
 Helps an organization understand its commercial context:
customers, the market, supply and resources, and
competitors
 Provides historical, current, and predictive views of business
operations
 Web Search Engines
 Have to handle
 a huge and ever-growing amount of data
 online data
 queries that are asked only a very small number of times

 Bioinformatics and health informatics

 Finance, digital libraries, and digital governments.
1.8 Major Issues in Data Mining
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multidimensional space
 Data mining—an interdisciplinary effort
 Boosting the power of discovery in a networked environment
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Ad hoc data mining and data mining query languages
 Presentation and visualization of data mining results
 Efficiency and Scalability
 Efficiency, scalability, performance, optimization, ability to execute in real time
 Parallel, distributed, and incremental mining algorithms
 Diversity of Database Types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data Mining and Society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
Exercises
 How is a data warehouse different from a database? How are
they similar?
 What are the major challenges of mining a huge amount of
data (e.g., billions of tuples) in comparison with mining a
small amount of data (e.g., a data set of a few hundred tuples)?
 Define each of the following data mining functionalities:
characterization, discrimination, association and correlation
analysis, classification, regression, clustering, and outlier
analysis. Give examples of each data mining functionality,
using a real-life database that you are familiar with.
