Data Warehousing
and Data Mining
Lecture 1 Introduction
CITS3401
CITS5504
Wei Liu
School of Computer
Science and Software
Engineering
Faculty of Engineering,
Computing and
Mathematics
Acknowledgement: The lecture slides are adapted from the original slides accompanying Han's textbook.
Administrative
Unit Coordinator & Lecturer
Dr. Wei Liu
Email: wei.liu@uwa.edu.au
Office: CSSE Room 2.18
Phone: 64883095
The Unit Materials are for both CITS3401 and CITS5504
CITS3401 Bachelor of Science (Data Science Major)
CITS5504 Master of Information Technology
Common Lecture Hours:
Tuesdays 10:00-11:45am
CITS3401 and CITS5504
Common Consultation Hour:
Tuesdays 2:00-3:00pm (Walk in - No appointment)
Find me either in CSSE Room 2.18 or Lab 2.01
Common Teaching Material
Lecture slides, lab sheets and projects
Different websites
http://teaching.csse.uwa.edu.au/units/CITS3401
http://teaching.csse.uwa.edu.au/units/CITS5504
Different Lab Sessions (from Week 2 onward):
CITS3401: Tuesdays 2:00-4:00pm Dr. Syed Mohammed Shamsul Islam
(Shams)
CITS5504: Mondays 9:00-11:00am Dr. Wei Liu
Common Assessment Structures
Two projects : 20% each
An analysis of a business scenario through an OLAP tool.
We will be using Jedox, an Excel plug-in, for the Data Warehousing Project.
http://www.jedox.com/en/services/downloads
An analysis of a data mining and exploration problem using WEKA.
Weka is a collection of machine learning algorithms for data mining tasks.
The algorithms can either be applied directly to a dataset or called from your
own Java code.
http://www.cs.waikato.ac.nz/ml/weka/
Mid-semester Test: 10%
at the lecture venue after the study break
Final Examination: 50%
Project Specifications and Instructions will be available on the
course website.
Textbook and Recommended Readings
Course Text Book:
Data Mining: Concepts and Techniques
2nd ed., Jiawei Han and Micheline Kamber, 2006
3rd ed., Jiawei Han, Micheline Kamber and Jian Pei, 2011
Jiawei Han's web page:
http://web.engr.illinois.edu/~hanj/
References:
Data Mining: Methods and Techniques, by A. Shawkat Ali and
Saleh Wasimi, Thomson, 2007
Data Mining: The Textbook, by Charu C. Aggarwal, Springer,
May 2015
Introduction to Data Mining
Why Data Mining?
What Is Data Mining? A Knowledge Discovery (KDD) Process
A Multi-Dimensional View of Data Mining/ classification
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Are all the patterns interesting?
Integration of Data Mining System with Data Warehousing System
Major Issues in Data Mining
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data Explosion
Our capability of generating, collecting, storing and managing data has
grown tremendously in the last 50 years.
Data collection and data availability
Automated data collection tools, database systems, Web, computerized
society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, etc.
Science: remote sensing, bioinformatics, scientific simulation, etc.
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
Necessity is the mother of invention: data mining
Automated and scalable analysis of massive data sets
Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
Example 1: Market Analysis
Where does the data come from?
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus
(public) lifestyle studies, etc.
Target marketing
Find clusters of model customers who share the same characteristics:
interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis: find associations/correlations between product
sales, and predict based on such associations
Customer profiling: what types of customers buy what products
(clustering or classification)
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
Provision of summary Information:
Multidimensional summary reports
Statistical summary information (data central tendency and variation)
Example 2: Corporate Analysis and
Risk Management
Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and set a class-based pricing
procedure
set pricing strategy in a highly competitive market
Example 3: Fraud Detection and
Mining Unusual Patterns
Approaches: Clustering & model construction for frauds,
outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Money laundering: suspicious monetary transactions
Medical insurance:
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration, time of day
or week. Analyze patterns that deviate from an expected norm
Retail industry:
Analysts estimate that 38% of retail shrink is due to dishonest
employees
Anti-terrorism:
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
Each discipline has grown a theoretical component. Theoretical models often motivate
experiments and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational branch (e.g.
empirical, theoretical, and computational ecology, or physics, or linguistics.)
Computational Science traditionally meant simulation. It grew out of our inability to find
closed-form solutions for complex mathematical models.
1990-now, data science (data-driven science)
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Why Data Mining
Summary:
Data is abundant, yet data archives are seldom visited.
The volume of data far exceeds human ability for comprehension.
Intuitive decisions are prone to biases and errors, and manual analysis is
extremely time-consuming and costly.
Data mining tools perform data analysis and uncover important
data patterns, contributing greatly to business strategies,
knowledge bases, and scientific and medical research.
[Figure: data mining turns data tombs into nuggets of knowledge]
What is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Data mining: a misnomer? (Knowledge Mining from data)
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything data mining? The following are not:
Simple search and query processing
(Deductive) expert systems
What is Data Mining?
Tremendous amount of data (terabyte-petabyte)
High-dimensionality and high complexity of data
Structured, un-structured, heterogeneous data
Scalable
Data mining involves integration of multiple disciplines:
Machine learning
Pattern recognition
Statistics
Databases
Business Intelligence
Big data
Efficient: Derived knowledge is new, interesting, informative and
can be used for sophisticated applications (decision making,
process control, information management, ...)
Data Mining: Confluence of Multiple
Disciplines
[Figure: data mining at the confluence of database technology, machine learning, pattern recognition, statistics, algorithms, visualization and other disciplines]
Steps of Knowledge Discovery
(KDD) Process
This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data Warehousing and Mining
Framework
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing functions of data mining
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
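The steps above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: the records, the field names (`customer`, `amount`) and the threshold are hypothetical, and the "mining" step is reduced to a simple per-group summarization.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"customer": "a", "amount": 120.0},
    {"customer": "b", "amount": None},
    {"customer": "a", "amount": 80.0},
    {"customer": "c", "amount": 15.0},
    {"customer": "c", "amount": 25.0},
]


def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]


def select(records, field):
    """Data selection/transformation: group amounts by the chosen field."""
    grouped = {}
    for r in records:
        grouped.setdefault(r[field], []).append(r["amount"])
    return grouped


def mine(grouped):
    """Data mining (here only summarization): mean amount per group."""
    return {key: mean(values) for key, values in grouped.items()}


def evaluate(patterns, threshold):
    """Pattern evaluation: keep groups whose mean exceeds the threshold."""
    return {k: v for k, v in patterns.items() if v > threshold}


def kdd(records, threshold=50.0):
    """Chain the steps: clean -> select -> mine -> evaluate."""
    return evaluate(mine(select(clean(records), "customer")), threshold)


if __name__ == "__main__":
    print(kdd(RAW))
```

In practice each stage is far more involved (cleaning alone may dominate the effort), but the chaining of stages is the point of the sketch.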
Multi-Dimensional View of Data
Mining
Data to be mined
Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional data,
stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs & social and information networks
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized (methodologies)
Data-intensive, data warehouse (OLAP), machine learning,
statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of
Data?
Structured and semi-structured data
Relational database/ Object-relational data
Data Warehouse,
Transactional Database
Unstructured data
Data streams and sensor data
Text data and web data
Time-series data, temporal data, sequence data (incl. biosequences)
Graphs, social networks and information networks
Spatial data, spatiotemporal data and multimedia data
Relational Database
A relational database is a collection of tables, each of which is
assigned a unique name.
Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows).
Each tuple in a relational table represents an object identified
by unique key and described by a set of attribute values.
A semantic data model, such as the entity relationship data
model, is often constructed for relational databases.
An ER data model represents the database as a set of entities
and their relationships.
Relational Database
Relational data can be accessed by database queries
written in a relational language such as SQL.
A given query is transformed into a set of relational
operations such as join, selection and projection,
and is then optimized for efficient processing.
Efficiency of retrieval, efficiency of update and
integrity are the key requirements of a good
relational database.
An Example - AllElectronics
Four relational tables: customer, item, employee and
branch.
Each relation consists of a set of attributes.
Example of Queries
Show me a list of all items that were sold in the last
quarter
Show me the total sales of the last month, grouped
by branch
Which sales person has the highest amount of
sales?
How many sales transactions occurred in the month
of September?
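Queries like these map directly onto SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The single `sales` table, its columns and its rows are hypothetical simplifications for illustration, not the actual AllElectronics schema.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical sales table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE sales (trans_id INTEGER, branch TEXT, "
        "salesperson TEXT, month INTEGER, amount REAL)"
    )
    rows = [
        (1, "Perth", "Ann", 9, 200.0),
        (2, "Perth", "Bob", 9, 50.0),
        (3, "Sydney", "Ann", 9, 300.0),
        (4, "Sydney", "Cat", 10, 120.0),
    ]
    con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)
    return con


def total_sales_by_branch(con, month):
    """Total sales of one month, grouped by branch."""
    cur = con.execute(
        "SELECT branch, SUM(amount) FROM sales "
        "WHERE month = ? GROUP BY branch ORDER BY branch",
        (month,),
    )
    return cur.fetchall()


def top_salesperson(con):
    """Salesperson with the highest total sales amount."""
    cur = con.execute(
        "SELECT salesperson, SUM(amount) AS total FROM sales "
        "GROUP BY salesperson ORDER BY total DESC LIMIT 1"
    )
    return cur.fetchone()


def count_transactions(con, month):
    """Number of sales transactions in a given month."""
    cur = con.execute("SELECT COUNT(*) FROM sales WHERE month = ?", (month,))
    return cur.fetchone()[0]
```

Each function only reports what is stored; none of them says anything about why sales differ between branches or how they are trending.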
Purpose of relational databases
The main purpose of a relational database is to store
data correctly and retrieve data on demand.
This type of data processing is sometimes called
Online Transaction Processing (OLTP).
Relational databases are passive data repositories in
the sense that a query only shows you what is
stored in the database, but cannot tell you much
about the meaning or trend of the data.
Data Warehouse of AllElectronics
A data warehouse is a repository of information collected
from multiple sources, stored under a unified schema,
and that usually resides at a single site.
The need is to provide an analysis of the company's sales per
item type per branch for a specified period.
Data Warehouse
The data warehouse
may store a summary
of the transactions per
item type for each
store or, summarized
to a higher level, for
each sales region.
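Summarizing at the branch level and then at the coarser region level can be sketched as follows. The transactions, branch names and region assignments are hypothetical, and a real warehouse would precompute such summaries rather than scan raw rows.

```python
from collections import defaultdict

# Hypothetical transactions: (branch, region, item_type, amount).
TRANSACTIONS = [
    ("Perth", "West", "computer", 1000.0),
    ("Perth", "West", "printer", 200.0),
    ("Perth", "West", "computer", 500.0),
    ("Sydney", "East", "computer", 700.0),
    ("Melbourne", "East", "computer", 300.0),
]


def summarize(transactions, level):
    """Sum sales per (location, item_type); level is 'branch' or 'region'."""
    index = {"branch": 0, "region": 1}[level]
    totals = defaultdict(float)
    for row in transactions:
        totals[(row[index], row[2])] += row[3]
    return dict(totals)


if __name__ == "__main__":
    print(summarize(TRANSACTIONS, "branch"))
    print(summarize(TRANSACTIONS, "region"))  # the same data at a higher level
```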
Transactional Database
A transactional database consists of a file where each
record represents a transaction.
Supports nested relations
Transaction id: Items, Customer name, date
Sample Queries:
Show me all the items purchased by X
How many transactions include item number Y?
Market basket data analysis: which items sold well
together? (frequent itemsets)
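The market basket question can be illustrated by counting how often pairs of items appear in the same transaction. This is a naive sketch on hypothetical transactions. It enumerates every pair in every basket, which does not scale to large item catalogues.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: transaction id -> set of purchased items.
BASKETS = {
    "T100": {"milk", "bread", "butter"},
    "T200": {"milk", "bread"},
    "T300": {"bread", "butter"},
    "T400": {"milk", "bread", "jam"},
}


def frequent_pairs(baskets, min_count):
    """Return item pairs that occur together in at least min_count baskets."""
    counts = Counter()
    for items in baskets.values():
        # Sorting gives each pair one canonical (alphabetical) form.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    print(frequent_pairs(BASKETS, min_count=2))
```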
Knowledge View: What Knowledge to be
mined?
Data summary in multidimensional space
Data cube and OLAP (On-Line Analytical Processing)
Pattern discovery
Mining frequent patterns, association and correlation
Applying pattern mining in many other tasks
Classification and predictive modelling
Model construction based on some training examples
Prediction of new data based on constructed models
Cluster analysis: How to group data to form new categories?
Outlier analysis: Discovery of anomalies and rare events
Trend and evolution analysis
Data Mining Function: (1)
Characterization and Discrimination
Data can be associated with classes or concepts (e.g.,
classes of items: computers, printers; concepts of
customers: bigSpenders, budgetSpenders).
Multidimensional concept description:
Characterization: summarizing the class in general (e.g., general
specification of products whose sales increased by 10%, or a
profile of customers who spend more than $1000 a year).
Discrimination: comparison of a target class with a contrast class
(e.g., compare two groups of customers: those who shop for computer
products regularly versus those who rarely shop for such products),
drilling down on dimensions such as occupation, age, etc.
Data Mining Function: (2)
Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together?
Association, correlation vs. causality
A typical association rule
Milk → Bread [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and/or rules efficiently in
large datasets? (single- or multi-dimensional
association, minimum support threshold)
How to use such patterns for classification, clustering,
and other applications?
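For a rule A → B, support is the fraction of transactions that contain both A and B, and confidence is the fraction of transactions containing A that also contain B. The sketch below computes both on hypothetical baskets. It also computes lift (confidence divided by the support of B), one common way to check whether an association reflects correlation. A lift near 1 suggests the items are roughly independent.

```python
def support(baskets, items):
    """Fraction of baskets containing every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent); antecedent must occur at least once."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


def lift(baskets, antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)


# Hypothetical baskets, for illustration only.
BASKETS = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"bread", "butter"},
]

if __name__ == "__main__":
    print(support(BASKETS, {"milk", "bread"}))
    print(confidence(BASKETS, {"milk"}, {"bread"}))
    print(lift(BASKETS, {"milk"}, {"bread"}))
```

In this toy data the rule milk → bread has a confidence of about 67%, yet its lift is below 1 because bread is in 80% of all baskets. A seemingly strong rule can therefore hide a negative correlation.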
Data Mining Function: (3)
Classification
Classification and label prediction
Construct models (functions) based on some training examples or
rules [example: kind of response (good, mild, no) to a sales
campaign, based on price, brand, category, place_made]
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naive Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, etc.
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web pages, etc.
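The idea of building a model from labelled training examples and then predicting unknown labels can be shown with a deliberately simple technique. The sketch below uses a 1-nearest-neighbour rule, which is not among the methods listed above but is chosen here for brevity. The car data, the two features and the class names are hypothetical, and the features are left unscaled.

```python
import math

# Hypothetical training examples:
# (gas mileage in mpg, weight in tonnes) -> class label.
TRAINING = [
    ((45.0, 1.0), "economy"),
    ((40.0, 1.1), "economy"),
    ((18.0, 2.2), "guzzler"),
    ((15.0, 2.5), "guzzler"),
]


def predict(training, point):
    """Label a new point with the class of its nearest training example."""
    nearest = min(training, key=lambda example: math.dist(example[0], point))
    return nearest[1]


if __name__ == "__main__":
    print(predict(TRAINING, (42.0, 1.0)))
    print(predict(TRAINING, (16.0, 2.4)))
```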
Data Mining Function: (4) Cluster
Analysis
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters),
e.g., cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity &
minimizing inter-class similarity
Example: homogeneous sub-populations of
AllElectronics customers (customer attributes: city,
age, income, ...)
Many methods and applications
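One of the many clustering methods is k-means. The sketch below runs it on a single hypothetical attribute (customer income in thousands). It assumes the starting centers are supplied by the caller. Real implementations handle multiple dimensions, choose starting centers more carefully and test for convergence.

```python
def kmeans_1d(values, centers, iterations=10):
    """k-means (Lloyd's algorithm) on 1-D data from the given start centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its closest center.
            i = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[i].append(v)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)
        ]
    return centers, clusters


if __name__ == "__main__":
    incomes = [20, 22, 25, 80, 85, 90]  # hypothetical incomes, in thousands
    print(kmeans_1d(incomes, centers=[20, 90]))
```

No class labels are used anywhere: the two groups emerge only from how similar the values are to each other.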
Data Mining Function: (5) Outlier
Analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Most data mining methods discard outliers as noise or
exceptions.
Noise or exception? One person's garbage could be
another person's treasure
Methods: by-product of clustering or regression analysis,
distance analysis, statistical or probability model, etc.
Useful in fraud detection, where rare events are more interesting
Example: detecting a purchase of an extremely large
amount for a given account number.
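The purchase example can be sketched with a simple statistical test. The code below flags amounts whose modified z-score, based on the median and the median absolute deviation (MAD), exceeds a threshold. The purchase amounts are hypothetical, and the constant 0.6745 and threshold 3.5 follow a common rule of thumb rather than anything prescribed in these slides.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure deviation against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


if __name__ == "__main__":
    purchases = [50, 60, 55, 45, 52, 5000]  # hypothetical amounts for one account
    print(mad_outliers(purchases))
```

The median is used instead of the mean because a single extreme value would otherwise inflate the very statistics used to detect it.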
Time and Ordering: Sequential
Pattern, Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g., regression
and value prediction
Sequential pattern mining
e.g., first buy digital camera, then buy large SD
memory cards
Periodicity analysis (e.g., overall stock market evolution
regularities or for particular companies)
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite, data streams
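Trend analysis by regression and value prediction can be illustrated with an ordinary least-squares line fitted to a short series. The sales figures are hypothetical, and the sketch assumes equally spaced observations and a linear trend.

```python
def fit_trend(ys):
    """Least-squares line y = a + b*t for t = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    variance = sum((t - t_mean) ** 2 for t in ts)
    b = covariance / variance  # slope; requires at least two observations
    a = y_mean - b * t_mean    # intercept
    return a, b


def predict_next(ys):
    """Extrapolate the fitted line one step past the end of the series."""
    a, b = fit_trend(ys)
    return a + b * len(ys)


if __name__ == "__main__":
    monthly_sales = [10, 12, 14, 16]  # hypothetical values
    print(fit_trend(monthly_sales))
    print(predict_next(monthly_sales))
```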
Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends, family,
classmates, etc.
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, etc.
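To make the view of the Web as an information network concrete, the sketch below ranks the pages of a tiny link graph with a simplified PageRank computed by power iteration. The graph is hypothetical, the damping factor 0.85 is the commonly quoted default, and the code assumes every linked page also appears as a key of `links`. It is a teaching sketch, not a description of how any search engine works today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(graph))
```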
Methodology View: Confluence of
Multiple Disciplines
[Figure: data mining at the confluence of machine learning, applications, algorithms, pattern recognition, database technology, statistics, visualization and distributed/cloud computing]
Why Confluence of Multiple
Disciplines?
Tremendous amount of data
Algorithms must be scalable to handle big data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social and information networks
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Application View: Diverse Applications
Mining text data and mining the Web
Web page classification and ranking, Weblog analysis,
recommender systems, etc.
Mining business data
Transaction data, market basket analysis, fraud detection, etc.
Data mining and software/system engineering, e.g.,
mining software bugs, optimizing system performance,
helping in computer vision
Mining biological and medical data
Gene, protein, microarray data, biological networks
Mining social and information networks
Community discovery, information propagation, etc.
Invisible data mining: web search, stock market analysis
Classification of Data Mining Systems
According to the kinds of database mined:
relational, transactional, spatial, text, stream data or World Wide Web
According to the kinds of knowledge mined:
Based on mining functionalities, e.g., characterization, discrimination,
association, etc.; a system can offer multiple and/or integrated data mining functions,
and can be distinguished based on granularity, or on regular versus irregular
pattern (outlier) mining
According to the techniques utilized:
degree of user interaction involved (autonomous, interactive, query-driven),
method of analysis (machine learning, pattern recognition, statistics, neural
networks), combining the merits of individual approaches
According to the applications adapted:
Finance, telecommunication, DNA, stock market; an all-purpose data mining
system may not fit domain-specific mining.
Summary (so far)
Data mining: Discovering interesting patterns and knowledge
from massive amounts of data
A natural evolution of science and information technology, in
great demand, with wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, trend and outlier
analysis, etc.
Data mining technologies and applications
Evaluation of Knowledge
Is all mined knowledge interesting?
One can mine a tremendous number of patterns
Some may fit only a certain dimension space
(time, location, etc.)
Some may not be representative, may be transient, etc.
Evaluation of mined knowledge: can we directly mine only
interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
Are All the Discovered Patterns
Interesting?
Data mining may generate thousands of patterns: Not all of them
are interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on the user's belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
Find All and Only Interesting
Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns? Do we
need to find all of the interesting patterns?
Heuristic vs. exhaustive search
Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the uninteresting
ones
Generate only the interesting patterns: mining query
optimization
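The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "occurs in at least `min_count` baskets" (an assumption made only for this sketch). The first function enumerates every possible itemset and filters afterwards. The second grows itemsets level by level in the spirit of Apriori and never extends an infrequent one. The baskets are hypothetical.

```python
from itertools import combinations


def count(baskets, itemset):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b)


def generate_then_filter(baskets, min_count):
    """Approach 1: enumerate every itemset, then filter by support count."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            n = count(baskets, frozenset(combo))
            if n >= min_count:
                result[frozenset(combo)] = n
    return result


def levelwise(baskets, min_count):
    """Approach 2: grow itemsets level by level, extending only frequent ones."""
    items = sorted(set().union(*baskets))
    level = {frozenset([i]) for i in items}
    result = {}
    while level:
        frequent = {}
        for s in level:
            n = count(baskets, s)
            if n >= min_count:
                frequent[s] = n
        result.update(frequent)
        # Candidates: unions of two frequent sets that are one item larger.
        level = {
            a | b for a in frequent for b in frequent if len(a | b) == len(a) + 1
        }
    return result


BASKETS = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "jam"},
]

if __name__ == "__main__":
    print(levelwise(BASKETS, min_count=2))
```

Both functions return the same itemsets, but the first examines all 2^n - 1 combinations of n items, while the second skips any candidate that would extend an infrequent itemset.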
Integration of Data Mining and Data
Warehousing
Data mining systems, DBMS, Data warehouse systems coupling
No coupling, loose-coupling, semi-tight-coupling, tight-coupling
On-line analytical mining of data:
integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
Necessity of mining knowledge and patterns at different levels of
abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
Characterized classification, first clustering and then association
Coupling Data Mining with DB/DW
Systems
No coupling: flat file processing for developing efficient and effective
algorithms; a poor design, as much time may be spent in preprocessing.
Loose coupling: fetching data from DB/DW. Mining does not exploit the
data structures and optimization methods provided by DB & DW. High
scalability is difficult to achieve.
Semi-tight coupling: enhanced DM performance
Provide efficient implementations of a few data mining primitives in a DB/DW
system, e.g., sorting, indexing, aggregation, histogram analysis, multiway
join, precomputation of some statistical functions
Tight coupling: a uniform processing environment
DM is smoothly integrated into a DB/DW system; mining queries are optimized
based on mining query analysis, indexing, query processing methods, etc.
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space at multiple levels of
abstraction
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Background knowledge (integrity constraints & deduction rules)
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD'95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), WSDM (2008), etc.
ACM Transactions on KDD (2007)