CS501: DATABASE AND DATA MINING
Data Preprocessing and Data Warehouse
WHY DATA PREPROCESSING?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42”, Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
WHY IS DATA DIRTY?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was collected and when it is analyzed
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
WHY IS DATA PREPROCESSING IMPORTANT?
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading statistics
 Data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
MULTI-DIMENSIONAL MEASURE OF DATA QUALITY
 A well-accepted multidimensional view:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility
 Broad categories:
 Intrinsic, contextual, representational, and accessibility
MAJOR TASKS IN DATA PREPROCESSING
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains a reduced representation in volume but produces the same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for numerical data
FORMS OF DATA PREPROCESSING
(Figure: the forms of data preprocessing: data cleaning, data integration, data transformation, and data reduction.)
DATA CLEANING
 Importance
 “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
 “Data cleaning is the number one problem in data warehousing”—DCI survey
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
MISSING DATA
 Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 failure to register history or changes of the data
 Missing data may need to be inferred.
HOW TO HANDLE MISSING DATA?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant: e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or decision tree
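A minimal pandas sketch of these automatic fill-in strategies (the column names and values below are invented for illustration):

```python
import pandas as pd

# Hypothetical customer data with a class label and missing income values
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [50_000, None, 62_000, None, 58_000],
})

# Global constant: mark every missing value as "unknown" (effectively a new class)
filled_constant = df["income"].fillna("unknown")

# Attribute mean over all tuples
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean per class: smarter, because it uses the class label
filled_class_mean = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_class_mean)
```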
NOISY DATA
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
HOW TO HANDLE NOISY DATA?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with possible outliers)
BINNING METHODS FOR DATA SMOOTHING
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
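A short Python sketch that reproduces the example above, with equal-frequency binning and smoothing by bin means and by bin boundaries (the helper names are mine):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(data, n_bins):
    """Partition sorted data into n_bins bins with (roughly) equal counts."""
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```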
REGRESSION
(Figure: data points smoothed by fitting a regression line, e.g., y = x + 1.)
CLUSTER ANALYSIS
(Figure: data points grouped into clusters; values falling outside the clusters are candidate outliers.)
DATA CLEANING AS A PROCESS
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule, and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationships to detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel)
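A toy illustration of rule-based discrepancy detection in Python/pandas; the records, rules, and column names are made up, and real scrubbing and auditing tools are far more elaborate:

```python
import pandas as pd

# Hypothetical customer records
df = pd.DataFrame({
    "cust_id":     [1, 2, 2, 4],
    "postal_code": ["94305", None, "9430", "94041"],
})

# Uniqueness rule: cust_id must be unique
duplicate_ids = df[df["cust_id"].duplicated(keep=False)]

# Null rule: postal_code must be present
missing_postal = df[df["postal_code"].isna()]

# Data scrubbing with simple domain knowledge: US postal codes have 5 digits
bad_postal = df[df["postal_code"].notna() & df["postal_code"].str.len().ne(5)]

print(duplicate_ids, missing_postal, bad_postal, sep="\n\n")
```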
DATA INTEGRATION
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources are different
 Possible reasons: different representations, different scales, e.g., metric vs. British units
HANDLING REDUNDANCY IN DATA INTEGRATION
 Redundant data occur often when integrating multiple databases
 Object identification: The same attribute or object may have different names in different databases
 Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis
 Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
CORRELATION ANALYSIS (NUMERICAL DATA)
 Correlation coefficient (also called Pearson’s product moment coefficient):

   r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

 where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.
 If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
 r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
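As a quick check, the same coefficient computed directly in Python (toy data only):

```python
import math

def pearson_r(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample standard deviations (divide by n - 1, matching the formula above)
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    return cov / (sd_a * sd_b)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0  (perfectly positively correlated)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (negatively correlated)
```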
CORRELATION ANALYSIS (CATEGORICAL DATA)
 Χ² (chi-square) test:

   \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

 The larger the Χ² value, the more likely the variables are related
 The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-thefts in a city are correlated
 Both are causally linked to the third variable: population
CHI-SQUARE CALCULATION: AN EXAMPLE

                            Play chess   Not play chess   Sum (row)
 Like science fiction       250 (90)      200 (360)         450
 Not like science fiction    50 (210)    1000 (840)        1050
 Sum (col.)                 300          1200              1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

   \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

 It shows that like_science_fiction and play_chess are correlated in the group
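The same arithmetic in a small Python sketch (scipy.stats.chi2_contingency(observed, correction=False) should report the same statistic):

```python
# Observed contingency table: rows = like / not like science fiction,
# columns = play / not play chess
observed = [[250, 200],
            [50, 1000]]

row_sums = [sum(row) for row in observed]            # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]      # [300, 1200]
total = sum(row_sums)                                # 1500

# Expected count for each cell: row_sum * col_sum / total
expected = [[r * c / total for c in col_sums] for r in row_sums]   # [[90, 360], [210, 840]]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(chi2)  # 507.936..., i.e., the ~507.93 reported above
```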
DATA TRANSFORMATION: NORMALIZATION
 Min-max normalization: to [new_min_A, new_max_A]

   v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):

   v' = \frac{v - \mu_A}{\sigma_A}

 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000)/16,000 = 1.225
 Normalization by decimal scaling:

   v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
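The three normalizations as a small Python sketch, using the income figures from the examples above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1 over the whole attribute
    return v / (10 ** j)

print(min_max(73_600, 12_000, 98_000))   # 0.716...
print(z_score(73_600, 54_000, 16_000))   # 1.225
print(decimal_scaling(73_600, 5))        # 0.736 (j = 5 works here, since the max income 98,000 < 10^5)
```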
WHAT IS A DATA WAREHOUSE?
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately from the organization’s operational database
 Supports information processing by providing a solid platform of consolidated, historical data for analysis
 “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
 Data warehousing:
 The process of constructing and using data warehouses
DATA WAREHOUSE—SUBJECT-ORIENTED
 Organized around major subjects, such as customer, product, sales
 Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
 Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
DATA WAREHOUSE—INTEGRATED
 Constructed by integrating multiple, heterogeneous data sources
 relational databases, flat files, on-line transaction records
 Data cleaning and data integration techniques are applied
 Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
 E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is converted.
DATA WAREHOUSE—TIME VARIANT
 The time horizon for the data warehouse is
significantly longer than that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not
contain “time element”
DATA WAREHOUSE—NONVOLATILE
 A physically separate store of data transformed from
the operational environment
 Operational update of data does not occur in the data
warehouse environment
 Does not require transaction processing, recovery,
and concurrency control mechanisms
 Requires only two operations in data accessing: initial loading of data and access of data
DATA WAREHOUSE VS. HETEROGENEOUS DBMS
 Traditional heterogeneous DB integration: a query-driven approach
 Build wrappers/mediators on top of heterogeneous databases
 When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
 Complex information filtering, compete for resources
 Data warehouse: update-driven, high performance
 Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
DATA WAREHOUSE VS. OPERATIONAL DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
OLTP VS. OLAP

                      OLTP                                      OLAP
 users                clerk, IT professional                    knowledge worker
 function             day-to-day operations                     decision support
 DB design            application-oriented                      subject-oriented
 data                 current, up-to-date; detailed,            historical; summarized, multidimensional;
                      flat relational; isolated                 integrated, consolidated
 usage                repetitive                                ad-hoc
 access               read/write; index/hash on primary key     lots of scans
 unit of work         short, simple transaction                 complex query
 # records accessed   tens                                      millions
 # users              thousands                                 hundreds
 DB size              100MB-GB                                  100GB-TB
 metric               transaction throughput                    query throughput, response time
FROM TABLES AND SPREADSHEETS TO DATA CUBES
 A data warehouse is based on a multidimensional data model which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
 In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
CUBE: A LATTICE OF CUBOIDS
(Figure: the lattice of cuboids for the dimensions time, item, location, supplier)
 0-D (apex) cuboid: all
 1-D cuboids: time; item; location; supplier
 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
 4-D (base) cuboid: time, item, location, supplier
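The lattice is just the power set of the dimensions, so it can be enumerated directly; a minimal Python sketch:

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Enumerate all 2^4 = 16 cuboids, grouped by dimensionality (0-D apex ... 4-D base)
for k in range(len(dimensions) + 1):
    cuboids = [",".join(c) if c else "all (apex)" for c in combinations(dimensions, k)]
    print(f"{k}-D cuboids:", cuboids)
```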
CONCEPTUAL MODELING OF DATA WAREHOUSES
 Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a set of dimension tables
 Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
 Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
EXAMPLE OF STAR SCHEMA
(Figure: star schema for sales)
 Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
 time dimension: time_key, day, day_of_the_week, month, quarter, year
 item dimension: item_key, item_name, brand, type, supplier_type
 branch dimension: branch_key, branch_name, branch_type
 location dimension: location_key, street, city, state_or_province, country
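As a rough sketch, this star schema could be expressed as SQL DDL issued from Python's sqlite3 module (column lists abbreviated; this is an illustration, not the course's reference schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (abbreviated)
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT);

-- Fact table: a foreign key to every dimension plus the measures
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim,
    item_key     INTEGER REFERENCES item_dim,
    branch_key   INTEGER REFERENCES branch_dim,
    location_key INTEGER REFERENCES location_dim,
    units_sold   INTEGER,
    dollars_sold REAL,
    avg_sales    REAL
);
""")
```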
EXAMPLE OF SNOWFLAKE SCHEMA
(Figure: snowflake schema for sales; the item and location dimensions are further normalized)
 Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
 time dimension: time_key, day, day_of_the_week, month, quarter, year
 item dimension: item_key, item_name, brand, type, supplier_key; supplier: supplier_key, supplier_type
 branch dimension: branch_key, branch_name, branch_type
 location dimension: location_key, street, city_key; city: city_key, city, state_or_province, country
EXAMPLE OF FACT CONSTELLATION
(Figure: fact constellation with two fact tables sharing dimension tables)
 Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
 Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
 Shared dimensions: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), location (location_key, street, city, province_or_state, country)
 Other dimensions: branch (branch_key, branch_name, branch_type), shipper (shipper_key, shipper_name, location_key, shipper_type)
MEASURES OF DATA CUBE: THREE CATEGORIES
 Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
 E.g., count(), sum(), min(), max()
 Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
 E.g., avg(), min_N(), standard_deviation()
 Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
 E.g., median(), mode(), rank()
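A small Python illustration of the distinction, computing aggregates from two data partitions (the values are invented):

```python
import statistics

part1, part2 = [4, 8, 9, 15], [21, 21, 24, 25]
full = part1 + part2

# Distributive: the sum of per-partition sums equals the sum over all the data
assert sum([sum(part1), sum(part2)]) == sum(full)

# Algebraic: avg() is computable from a bounded number of distributive values (sum and count)
avg = (sum(part1) + sum(part2)) / (len(part1) + len(part2))
assert avg == sum(full) / len(full)

# Holistic: the median of per-partition medians is NOT the median of the full data,
# so no constant-size subaggregate suffices
median_of_medians = statistics.median([statistics.median(part1), statistics.median(part2)])
print(statistics.median(full), median_of_medians)  # 18.0 vs 15.5: they differ
```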
A CONCEPT HIERARCHY: DIMENSION (LOCATION)
(Figure: concept hierarchy for the location dimension)
 all → region (Europe, North_America, ...) → country (Germany, Spain, Canada, Mexico, ...) → city (Frankfurt, Vancouver, Toronto, ...) → office (L. Chan, M. Wind, ...)
MULTIDIMENSIONAL DATA
 Sales volume as a function of product, month, and region
 Dimensions: Product, Location, Time
 Hierarchical summarization paths:
 Product: Industry → Category → Product
 Location: Region → Country → City → Office
 Time: Year → Quarter → Month / Week → Day
A SAMPLE DATA CUBE
(Figure: a 3-D sales cube with dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico); summing along a dimension gives aggregates such as the total annual sales of TV in the U.S.A.)
CUBOIDS CORRESPONDING TO THE CUBE
 0-D (apex) cuboid: all
 1-D cuboids: product; date; country
 2-D cuboids: product,date; product,country; date,country
 3-D (base) cuboid: product, date, country
BROWSING A DATA CUBE
 Visualization
 OLAP capabilities
 Interactive manipulation
TYPICAL OLAP OPERATIONS
 Roll up (drill-up): summarize data
 by climbing up a hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
 Slice and dice: project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
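These operations map naturally onto grouping, selection, and pivoting over the fact data; a rough pandas sketch on an invented sales table:

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "item":    ["computer", "phone", "computer", "security"],
    "dollars_sold": [600, 250, 825, 400],
})

# Roll-up on location: climb the hierarchy from cities to countries
rollup = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Drill-down would go the other way (e.g., quarters to months, if months were stored)

# Slice: select on a single dimension
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on several dimensions at once
dice = sales[sales["city"].isin(["Toronto", "Vancouver"]) & sales["quarter"].isin(["Q1", "Q2"])]

# Pivot: reorient, e.g., cities as rows and items as columns
pivot = sales.pivot_table(index="city", columns="item", values="dollars_sold", aggfunc="sum")
print(rollup, pivot, sep="\n\n")
```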
(Fig. 3.10 Typical OLAP operations on a sales cube with dimensions location (cities/countries), time (quarters/months), and item (types): dice for (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”); roll-up on location from cities to countries; slice for time = “Q1”; drill-down on time from quarters to months; pivot.)
DATA WAREHOUSE DESIGN PROCESS
 Typical data warehouse design process:
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact table record
 Choose the measure that will populate each fact table record