CS501: DATABASE AND DATA MINING
Data Preprocessing and Data Warehouse
WHY DATA PREPROCESSING?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42”, Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
WHY IS DATA DIRTY?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was collected and when it is analyzed
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
WHY IS DATA PREPROCESSING IMPORTANT?
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading statistics
 Data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
MULTI-DIMENSIONAL MEASURE OF DATA QUALITY
 A well-accepted multidimensional view:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility
 Broad categories:
 Intrinsic, contextual, representational, and accessibility
MAJOR TASKS IN DATA PREPROCESSING
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains a reduced representation in volume but produces the same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for numerical data
FORMS OF DATA PREPROCESSING
(Figure: the forms of data preprocessing: data cleaning, data integration, data transformation, and data reduction.)
DATA CLEANING
 Importance
 “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
 “Data cleaning is the number one problem in data warehousing”—DCI survey
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
MISSING DATA
 Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 failure to register history or changes of the data
 Missing data may need to be inferred.
HOW TO HANDLE MISSING DATA?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant: e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or decision tree
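A minimal pandas sketch of these automatic fill-in strategies (the column names and values below are invented for illustration):

```python
import pandas as pd

# Hypothetical customer data with a class label and missing income values
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [50_000, None, 62_000, None, 58_000],
})

# Global constant: mark every missing value as "unknown" (effectively a new class)
filled_constant = df["income"].fillna("unknown")

# Attribute mean over all tuples
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean per class: smarter, because it uses the class label
filled_class_mean = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_class_mean)
```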
NOISY DATA
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
HOW TO HANDLE NOISY DATA?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with possible outliers)
BINNING METHODS FOR DATA SMOOTHING
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
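A short Python sketch that reproduces the example above, with equal-frequency binning and smoothing by bin means and by bin boundaries (the helper names are mine):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(data, n_bins):
    """Partition sorted data into n_bins bins with (roughly) equal counts."""
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```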
REGRESSION
(Figure: data points smoothed by fitting a regression line, e.g., y = x + 1.)
CLUSTER ANALYSIS
(Figure: data points grouped into clusters; values falling outside the clusters are candidate outliers.)
DATA CLEANING AS A PROCESS
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule, and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationships to detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel)
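A toy illustration of rule-based discrepancy detection in Python/pandas; the records, rules, and column names are made up, and real scrubbing and auditing tools are far more elaborate:

```python
import pandas as pd

# Hypothetical customer records
df = pd.DataFrame({
    "cust_id":     [1, 2, 2, 4],
    "postal_code": ["94305", None, "9430", "94041"],
})

# Uniqueness rule: cust_id must be unique
duplicate_ids = df[df["cust_id"].duplicated(keep=False)]

# Null rule: postal_code must be present
missing_postal = df[df["postal_code"].isna()]

# Data scrubbing with simple domain knowledge: US postal codes have 5 digits
bad_postal = df[df["postal_code"].notna() & df["postal_code"].str.len().ne(5)]

print(duplicate_ids, missing_postal, bad_postal, sep="\n\n")
```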
DATA INTEGRATION
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources are different
 Possible reasons: different representations, different scales, e.g., metric vs. British units
HANDLING REDUNDANCY IN DATA INTEGRATION
 Redundant data occur often when integrating multiple databases
 Object identification: The same attribute or object may have different names in different databases
 Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis
 Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
CORRELATION ANALYSIS (NUMERICAL DATA)
 Correlation coefficient (also called Pearson’s product moment coefficient):

   r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

 where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.
 If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
 r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
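As a quick check, the same coefficient computed directly in Python (toy data only):

```python
import math

def pearson_r(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample standard deviations (divide by n - 1, matching the formula above)
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    return cov / (sd_a * sd_b)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0  (perfectly positively correlated)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (negatively correlated)
```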
CORRELATION ANALYSIS (CATEGORICAL DATA)
 Χ² (chi-square) test:

   \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

 The larger the Χ² value, the more likely the variables are related
 The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-thefts in a city are correlated
 Both are causally linked to the third variable: population
CHI-SQUARE CALCULATION: AN EXAMPLE

                            Play chess   Not play chess   Sum (row)
 Like science fiction       250 (90)      200 (360)         450
 Not like science fiction    50 (210)    1000 (840)        1050
 Sum (col.)                 300          1200              1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

   \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

 It shows that like_science_fiction and play_chess are correlated in the group
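The same arithmetic in a small Python sketch (scipy.stats.chi2_contingency(observed, correction=False) should report the same statistic):

```python
# Observed contingency table: rows = like / not like science fiction,
# columns = play / not play chess
observed = [[250, 200],
            [50, 1000]]

row_sums = [sum(row) for row in observed]            # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]      # [300, 1200]
total = sum(row_sums)                                # 1500

# Expected count for each cell: row_sum * col_sum / total
expected = [[r * c / total for c in col_sums] for r in row_sums]   # [[90, 360], [210, 840]]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(chi2)  # 507.936..., i.e., the ~507.93 reported above
```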
DATA TRANSFORMATION: NORMALIZATION
 Min-max normalization: to [new_min_A, new_max_A]

   v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):

   v' = \frac{v - \mu_A}{\sigma_A}

 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000)/16,000 = 1.225
 Normalization by decimal scaling:

   v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
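The three normalizations as a small Python sketch, using the income figures from the examples above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1 over the whole attribute
    return v / (10 ** j)

print(min_max(73_600, 12_000, 98_000))   # 0.716...
print(z_score(73_600, 54_000, 16_000))   # 1.225
print(decimal_scaling(73_600, 5))        # 0.736 (j = 5 works here, since the max income 98,000 < 10^5)
```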
WHAT IS A DATA WAREHOUSE?
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately from the organization’s operational database
 Supports information processing by providing a solid platform of consolidated, historical data for analysis
 “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
 Data warehousing:
 The process of constructing and using data warehouses
DATA WAREHOUSE—SUBJECT-ORIENTED
 Organized around major subjects, such as customer, product, sales
 Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
 Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
DATA WAREHOUSE—INTEGRATED
 Constructed by integrating multiple, heterogeneous data sources
 relational databases, flat files, on-line transaction records
 Data cleaning and data integration techniques are applied
 Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
 E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is converted.
DATA WAREHOUSE—TIME VARIANT
 The time horizon for the data warehouse is
significantly longer than that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not
contain “time element”
DATA WAREHOUSE—NONVOLATILE
 A physically separate store of data transformed from
the operational environment
 Operational update of data does not occur in the data
warehouse environment
 Does not require transaction processing, recovery,
and concurrency control mechanisms
 Requires only two operations in data accessing: initial loading of data and access of data
DATA WAREHOUSE VS. HETEROGENEOUS DBMS
 Traditional heterogeneous DB integration: a query-driven approach
 Build wrappers/mediators on top of heterogeneous databases
 When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
 Complex information filtering, compete for resources
 Data warehouse: update-driven, high performance
 Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
DATA WAREHOUSE VS. OPERATIONAL DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
OLTP VS. OLAP

                      OLTP                                      OLAP
 users                clerk, IT professional                    knowledge worker
 function             day-to-day operations                     decision support
 DB design            application-oriented                      subject-oriented
 data                 current, up-to-date; detailed,            historical; summarized, multidimensional;
                      flat relational; isolated                 integrated, consolidated
 usage                repetitive                                ad-hoc
 access               read/write; index/hash on primary key     lots of scans
 unit of work         short, simple transaction                 complex query
 # records accessed   tens                                      millions
 # users              thousands                                 hundreds
 DB size              100MB-GB                                  100GB-TB
 metric               transaction throughput                    query throughput, response time
FROM TABLES AND SPREADSHEETS TO DATA CUBES
 A data warehouse is based on a multidimensional data model which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
 In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
CUBE: A LATTICE OF CUBOIDS
(Figure: the lattice of cuboids for the dimensions time, item, location, supplier)
 0-D (apex) cuboid: all
 1-D cuboids: time; item; location; supplier
 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
 4-D (base) cuboid: time, item, location, supplier
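The lattice is just the power set of the dimensions, so it can be enumerated directly; a minimal Python sketch:

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Enumerate all 2^4 = 16 cuboids, grouped by dimensionality (0-D apex ... 4-D base)
for k in range(len(dimensions) + 1):
    cuboids = [",".join(c) if c else "all (apex)" for c in combinations(dimensions, k)]
    print(f"{k}-D cuboids:", cuboids)
```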
CONCEPTUAL MODELING OF DATA WAREHOUSES
 Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a set of dimension tables
 Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
 Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
EXAMPLE OF STAR SCHEMA
(Figure: star schema for sales)
 Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
 time dimension: time_key, day, day_of_the_week, month, quarter, year
 item dimension: item_key, item_name, brand, type, supplier_type
 branch dimension: branch_key, branch_name, branch_type
 location dimension: location_key, street, city, state_or_province, country
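As a rough sketch, this star schema could be expressed as SQL DDL issued from Python's sqlite3 module (column lists abbreviated; this is an illustration, not the course's reference schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (abbreviated)
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT);

-- Fact table: a foreign key to every dimension plus the measures
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim,
    item_key     INTEGER REFERENCES item_dim,
    branch_key   INTEGER REFERENCES branch_dim,
    location_key INTEGER REFERENCES location_dim,
    units_sold   INTEGER,
    dollars_sold REAL,
    avg_sales    REAL
);
""")
```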
EXAMPLE OF SNOWFLAKE SCHEMA
(Figure: snowflake schema for sales; the item and location dimensions are further normalized)
 Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
 time dimension: time_key, day, day_of_the_week, month, quarter, year
 item dimension: item_key, item_name, brand, type, supplier_key; supplier: supplier_key, supplier_type
 branch dimension: branch_key, branch_name, branch_type
 location dimension: location_key, street, city_key; city: city_key, city, state_or_province, country
EXAMPLE OF FACT CONSTELLATION
(Figure: fact constellation with two fact tables sharing dimension tables)
 Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
 Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
 Shared dimensions: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), location (location_key, street, city, province_or_state, country)
 Other dimensions: branch (branch_key, branch_name, branch_type), shipper (shipper_key, shipper_name, location_key, shipper_type)
MEASURES OF DATA CUBE: THREE CATEGORIES
 Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
 E.g., count(), sum(), min(), max()
 Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
 E.g., avg(), min_N(), standard_deviation()
 Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
 E.g., median(), mode(), rank()
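A small Python illustration of the distinction, computing aggregates from two data partitions (the values are invented):

```python
import statistics

part1, part2 = [4, 8, 9, 15], [21, 21, 24, 25]
full = part1 + part2

# Distributive: the sum of per-partition sums equals the sum over all the data
assert sum([sum(part1), sum(part2)]) == sum(full)

# Algebraic: avg() is computable from a bounded number of distributive values (sum and count)
avg = (sum(part1) + sum(part2)) / (len(part1) + len(part2))
assert avg == sum(full) / len(full)

# Holistic: the median of per-partition medians is NOT the median of the full data,
# so no constant-size subaggregate suffices
median_of_medians = statistics.median([statistics.median(part1), statistics.median(part2)])
print(statistics.median(full), median_of_medians)  # 18.0 vs 15.5: they differ
```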
A CONCEPT HIERARCHY: DIMENSION (LOCATION)
(Figure: concept hierarchy for the location dimension)
 all → region (Europe, North_America, ...) → country (Germany, Spain, Canada, Mexico, ...) → city (Frankfurt, Vancouver, Toronto, ...) → office (L. Chan, M. Wind, ...)
MULTIDIMENSIONAL DATA
 Sales volume as a function of product, month, and region
 Dimensions: Product, Location, Time
 Hierarchical summarization paths:
 Product: Industry → Category → Product
 Location: Region → Country → City → Office
 Time: Year → Quarter → Month / Week → Day
A SAMPLE DATA CUBE
(Figure: a 3-D sales cube with dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico); summing along a dimension gives aggregates such as the total annual sales of TV in the U.S.A.)
CUBOIDS CORRESPONDING TO THE CUBE
 0-D (apex) cuboid: all
 1-D cuboids: product; date; country
 2-D cuboids: product,date; product,country; date,country
 3-D (base) cuboid: product, date, country
BROWSING A DATA CUBE
 Visualization
 OLAP capabilities
 Interactive manipulation
TYPICAL OLAP OPERATIONS
 Roll up (drill-up): summarize data
 by climbing up a hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
 Slice and dice: project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
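These operations map naturally onto grouping, selection, and pivoting over the fact data; a rough pandas sketch on an invented sales table:

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "item":    ["computer", "phone", "computer", "security"],
    "dollars_sold": [600, 250, 825, 400],
})

# Roll-up on location: climb the hierarchy from cities to countries
rollup = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Drill-down would go the other way (e.g., quarters to months, if months were stored)

# Slice: select on a single dimension
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on several dimensions at once
dice = sales[sales["city"].isin(["Toronto", "Vancouver"]) & sales["quarter"].isin(["Q1", "Q2"])]

# Pivot: reorient, e.g., cities as rows and items as columns
pivot = sales.pivot_table(index="city", columns="item", values="dollars_sold", aggfunc="sum")
print(rollup, pivot, sep="\n\n")
```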
(Fig. 3.10 Typical OLAP operations on a sales cube with dimensions location (cities/countries), time (quarters/months), and item (types): dice for (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”); roll-up on location from cities to countries; slice for time = “Q1”; drill-down on time from quarters to months; pivot.)
DATA WAREHOUSE DESIGN PROCESS
 Typical data warehouse design process:
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact table record
 Choose the measure that will populate each fact table record