Data ch2

Uploaded by

ranashahzaibtariq709

Data Mining:

Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

   Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time-series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data
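The term-frequency representation of document data can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequency_vector`, the five-word vocabulary, and the sample sentence are assumptions made for the example, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Return how often each vocabulary term occurs in the text."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    counts = Counter(text.lower().split())
    # A Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "The coach told the team to play the ball and the team did score"

vector = term_frequency_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms, which is the layout of the Document 1 to Document 3 table.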
Important Characteristics of Structured Data

 Dimensionality
 Curse of dimensionality
 Sparsity
 Only presence counts
 Resolution
 Patterns depend on the scale
 Distribution
 Centrality and dispersion

Data Objects

 Data sets are made up of data objects.

 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points,
objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Attributes
 Attribute (also called dimension, feature, or variable):
a data field representing a characteristic or feature
of a data object.
   E.g., customer_ID, name, address
 Types:
   Nominal
   Binary
   Ordinal
   Numeric: quantitative
     Interval-scaled
     Ratio-scaled
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of a value as being a multiple (or ratio)
of another value
(10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts,
monetary quantities
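The difference between interval and ratio scales can be checked numerically. The sketch below uses two hypothetical temperatures; the helper `celsius_to_kelvin` and the variable names are assumptions for this example. Differences agree on both scales, but only the Kelvin ratio is physically meaningful, because only Kelvin has an inherent zero-point.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


# Hypothetical readings, chosen only for illustration.
low_c, high_c = 10.0, 20.0

# Differences are meaningful on an interval scale and match in Kelvin.
difference_c = high_c - low_c
difference_k = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)

# The Celsius ratio is an artifact of where 0 °C happens to sit.
naive_ratio = high_c / low_c
# The Kelvin ratio compares against a true zero-point.
kelvin_ratio = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

print(f"difference: {difference_c:.2f} °C, {difference_k:.2f} K")
print(f"ratio in °C: {naive_ratio:.3f}, ratio in K: {kelvin_ratio:.3f}")
```

Although 20 °C is numerically twice 10 °C, the Kelvin ratio shows it is only a few percent higher on the ratio scale.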
Discrete vs. Continuous Attributes
 Discrete Attribute
   Has only a finite or countably infinite set of values
   E.g., zip codes, profession, or the set of words in a collection of documents
   Sometimes represented as integer variables
   Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
   Has real numbers as attribute values
   E.g., temperature, height, or weight
   Practically, real values can only be measured and represented using a finite number of digits
   Continuous attributes are typically represented as floating-point variables
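The point about finite precision is easy to observe with floating-point variables. The following is a minimal sketch using Python's built-in `float` (an IEEE 754 double on common platforms); the variable names are assumptions for the example.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no finite binary expansion.
exactly_equal = (total == 0.3)

# Comparing within a tolerance is the usual remedy for continuous values.
approximately_equal = math.isclose(total, 0.3)

# Decimal(0.1) exposes the value actually stored for the literal 0.1.
stored_value = Decimal(0.1)

print(exactly_equal, approximately_equal)
print(stored_value)
```

For continuous attributes it is therefore safer to compare values within a tolerance than to test them for exact equality.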
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities
of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Basic Statistical Descriptions of Data
 For data preprocessing to be successful, it is essential to
have an overall picture of your data.
 Basic statistical descriptions can be used to identify properties
of the data and to highlight noise or outliers.
 Three areas of basic statistical descriptions:
   Measures of central tendency: the middle or center of a data
distribution, i.e., where most of its values fall (e.g., mean,
median, mode).
   Dispersion of the data: how the data are spread out (range,
quartiles, boxplots, variance, and standard deviation).
   Graphic displays of basic statistical descriptions: used to
inspect the data visually (bar charts, pie charts, line graphs,
histograms, scatter plots, etc.).
Measuring the Central Tendency
 Mean (algebraic measure), sample vs. population:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
Note: n is the sample size and N is the population size.

 Weighted arithmetic mean: the weights reflect the significance,
importance, or occurrence frequency attached to their respective values.
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
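The two formulas translate directly into code. The sketch below is illustrative only: the function names `mean` and `weighted_mean` and the sample scores and credit weights are assumptions made for this example.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical exam scores, weighted by course credits.
scores = [70, 80, 90]
credits = [1, 1, 2]

print(mean(scores))                    # (70 + 80 + 90) / 3
print(weighted_mean(scores, credits))  # (70 + 80 + 180) / 4
```

With equal weights the weighted mean reduces to the ordinary mean, which is a quick sanity check on any implementation.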

Measuring the Central Tendency
 Issues: A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values.
 For example, the mean salary at a company may be substantially pushed up
by that of a few highly paid managers.
 Similarly, the mean score of a class in an exam could be pulled down quite a
bit by a few very low scores.

 Trimmed mean: Mean obtained after chopping off values at the high and low
extremes, to offset the effect caused by a small number of extreme values.
 For example, we can sort the values observed for salary and remove the top
and bottom 2% before computing the mean.
 We should avoid trimming too large a portion (such as 20%) at both ends,
as this can result in the loss of valuable information.
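A trimmed mean can be sketched as follows. The function name `trimmed_mean`, the salary figures, and the 10% trimming proportion are assumptions for this example; 10% is used instead of the 2% mentioned above only because, with ten values, 2% would remove nothing.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values at each end."""
    if not values:
        raise ValueError("trimmed_mean requires at least one value")
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of values to drop at each end (rounded down).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical salaries in thousands, with one highly paid manager.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]

plain_mean = sum(salaries) / len(salaries)
robust_mean = trimmed_mean(salaries, 0.10)

print(plain_mean)   # pulled up by the single extreme salary
print(robust_mean)  # lowest and highest values removed first
```

Here the single extreme salary nearly doubles the plain mean, while the trimmed mean stays close to the typical salary.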

Measuring the Central Tendency
 Median:
   Middle value of the ordered data if the number of values is odd,
or the average of the two middle values otherwise
   Estimated by interpolation (for grouped data)
 Mode:
   Value that occurs most frequently in the data
   Data sets with one, two, or three modes are unimodal, bimodal,
or trimodal
   A data set with two or more modes is multimodal
   At the other extreme, if each data value occurs only
once, then there is no mode
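The definitions of median and mode can be sketched with the Python standard library. The helper `modes` is an assumption written for this example; it follows the convention above that a data set in which every value occurs only once has no mode, and returns an empty list in that case. Interpolation for grouped data is not covered here.

```python
from collections import Counter
from statistics import median


def modes(values):
    """Return all most-frequent values, or [] if each value occurs once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs only once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(median([5, 1, 3]))       # odd count: the middle value after sorting
print(median([7, 1, 5, 3]))    # even count: average of the two middle values
print(modes([1, 2, 2, 3]))     # unimodal
print(modes([1, 1, 2, 2, 3]))  # bimodal
print(modes([1, 2, 3]))        # no mode
```

`statistics.median` sorts its input internally, so the values do not need to be ordered beforehand.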

Symmetric vs. Skewed Data
 Median, mean, and mode of symmetric, positively skewed,
and negatively skewed data

[Figure: distribution curves for symmetric, positively skewed, and negatively skewed data]
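The typical ordering of the three measures can be illustrated on small samples. The data sets and the helper `summarize` below are assumptions made for this example. The orderings shown (mode < median < mean for positive skew, the reverse for negative skew) are a common rule of thumb and do not hold for every distribution.

```python
from statistics import mean, median, mode


def summarize(values):
    """Return (mean, median, mode) for a data set with a single mode."""
    return mean(values), median(values), mode(values)


# Hypothetical samples, chosen only for illustration.
symmetric = [1, 2, 3, 3, 3, 4, 5]
positively_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]    # long tail to the right
negatively_skewed = [-x for x in positively_skewed]    # mirror image

for name, data in [
    ("symmetric", symmetric),
    ("positively skewed", positively_skewed),
    ("negatively skewed", negatively_skewed),
]:
    m, med, mo = summarize(data)
    print(f"{name}: mean={m}, median={med}, mode={mo}")
```

In the symmetric sample all three measures coincide, while the long tail pulls the mean furthest in the direction of the skew.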

May 24, 2024 Data Mining: Concepts and Techniques 16
