02data Part2

Here are the steps to calculate the standard deviation from a frequency table:
1) Calculate the mean (sum of all scores / total number of scores).
2) Calculate the deviation of each score from the mean (X - Mean).
3) Square each deviation and multiply it by the frequency of that score.
4) Sum the results from step 3.
5) Divide the result from step 4 by the total number of scores.
6) Take the square root of the result from step 5.
The result is the standard deviation, which measures how dispersed the scores are around the mean.

Uploaded by

baigsalman251


Data Mining

Dr. Shahid Mahmood Awan

http://turing.cs.pub.ro/mas_11
curs.cs.pub.ro
shahid.awan@umt.edu.pk
University of Management and Technology

Fall 2018
Chapter 2: Getting to Know Your Data

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

2.2 Basic Statistical Descriptions of Data
• Motivation
  - To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  - Median, max, min, quantiles, outliers, variance, etc.

Basic Statistical Descriptions of Data
• Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
  $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
  Note: n is sample size and N is population size.
• Weighted arithmetic mean:
  $$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
• Trimmed mean: chopping extreme values

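The three kinds of mean on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names `weighted_mean` and `trimmed_mean`, the weights, and the sample values are assumptions made for the example, not part of the lecture.

```python
from statistics import mean

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

scores = [3, 1, 5]                             # illustrative values
plain = mean(scores)                           # (3 + 1 + 5) / 3 = 3
weighted = weighted_mean(scores, [1, 1, 2])    # (3 + 1 + 10) / 4 = 3.5
trimmed = trimmed_mean([1, 2, 3, 4, 100], 1)   # mean of [2, 3, 4] = 3
```

The trimmed mean shows why chopping extremes matters: the plain mean of 1, 2, 3, 4, 100 is 22, while dropping one value from each end gives 3.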
Activity
• Calculate the mean for each data set:
  - Data: 3, 1, 5
  - Data: Class CGPA
  - Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

Sample Grade Data

Grade         A      A-      B+      B       B-      C+      C       C-      F      SA
Score range   > 85   80-84   75-79   70-74   63-69   60-62   55-59   50-54   < 50   …..
Count         6      10      14      18      15      12      8       6       2      1

[Figure: bar chart of the count per grade, with bars labeled A, A-, B+, B, B-, C+ and a y-axis from 0 to 20]

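Measuring the Central Tendency…

• Mode
  - Value that occurs most frequently in the data
  - 1, 2, 3, 3, 3, 4, 4, 5 (mode = 3)
• Unimodal, bimodal, trimodal
• Mode for grouped data, where L is the lower boundary of the mode interval, F_m is its frequency, and F_{m-1} and F_{m+1} are the frequencies of the intervals just below and just above it:
  $$\text{mode} = L + \left( \frac{F_m - F_{m-1}}{(F_m - F_{m-1}) + (F_m - F_{m+1})} \right) \times \text{width}$$
• Empirical formula (for moderately skewed, unimodal data):
  $$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$
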
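A minimal sketch of both ways to find a mode: directly from raw values, and by interpolation inside the mode interval for grouped data. The function name `grouped_mode` and the interval figures passed to it (lower boundary 70, width 5, frequencies 15, 18, 14) are assumptions chosen for illustration.

```python
from statistics import multimode

def grouped_mode(lower, width, f_prev, f_modal, f_next):
    """Interpolated mode inside the interval with the highest frequency."""
    d_prev = f_modal - f_prev   # excess over the interval just below
    d_next = f_modal - f_next   # excess over the interval just above
    if d_prev + d_next == 0:
        raise ValueError("neighbouring intervals have the same frequency")
    return lower + d_prev / (d_prev + d_next) * width

# Raw data from the slide: 3 occurs three times.
raw_modes = multimode([1, 2, 3, 3, 3, 4, 4, 5])   # [3]

# Hypothetical mode interval 70-75 with frequency 18, neighbours 15 and 14.
estimate = grouped_mode(lower=70, width=5, f_prev=15, f_modal=18, f_next=14)
# 70 + 3 / (3 + 4) * 5, about 72.14
```

`multimode` returns a list, so bimodal or trimodal data yields more than one entry.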
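Measuring the Central Tendency…
• Median:
  - Middle value if odd number of values, or average of the middle two values otherwise
  - Estimated by interpolation (for grouped data):
    $$\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$$
    Here L_1 is the lower boundary of the median interval, n is the number of values, (Σ freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.
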
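The interpolation formula for grouped data can be sketched as follows. The function name `grouped_median` and the four intervals are hypothetical; the intervals are assumed to be sorted and non-overlapping.

```python
def grouped_median(intervals):
    """Interpolated median for (lower, upper, freq) intervals sorted by lower bound."""
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    below = 0  # sum of the frequencies of all intervals below the current one
    for lower, upper, freq in intervals:
        # The median interval is the first one whose cumulative count reaches n / 2.
        if freq > 0 and below + freq >= n / 2:
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq

# Hypothetical grouped data: 30 values in four intervals of width 10.
bins = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 5)]
estimate = grouped_median(bins)   # 20 + (15 - 13) / 12 * 10, about 21.67
```

Here n / 2 = 15, and the cumulative counts 5, 13, 25 place the median in the interval from 20 to 30.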
Measuring the Central Tendency…
• Midrange
  - Average of the max and min values
  - (Max + Min) / 2

Activity
• Calculate the median, mode, and midrange for each data set:
  - Data: 3, 1, 5
  - Data: Class CGPA
  - Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

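One way to check answers for the salary data is Python's standard `statistics` module. The variable names are illustrative; the data is the salary list from the activity.

```python
from statistics import median, multimode

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Even number of values: the median is the average of the two middle ones.
salary_median = median(salary)                       # (52 + 56) / 2 = 54.0

# Both 52 and 70 occur twice, so the data is bimodal.
salary_modes = multimode(salary)                     # [52, 70]

salary_midrange = (min(salary) + max(salary)) / 2    # (30 + 110) / 2 = 70.0
```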
Class Activity
• A student has gotten the following grades on his tests: 87, 95, 76, and 88.
• He wants an 85 or better overall. What is the minimum grade he must get on the last test in order to achieve that average?

The unknown score is "x". Then the desired average is:

(87 + 95 + 76 + 88 + x) ÷ 5 = 85
Multiplying through by 5 and simplifying, I get:
87 + 95 + 76 + 88 + x = 425
346 + x = 425
x = 79
• He needs to get at least a 79 on the last test.

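The same rearrangement works for any target average: multiply the target by the total number of tests and subtract the scores already earned. A small sketch, with `min_last_score` as a hypothetical helper name:

```python
def min_last_score(scores, target_average):
    """Score needed on one more test so the overall mean reaches the target."""
    total_tests = len(scores) + 1
    return target_average * total_tests - sum(scores)

needed = min_last_score([87, 95, 76, 88], 85)   # 425 - 346 = 79
```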
Symmetric vs. Skewed Data

• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three distribution curves labeled symmetric, positively skewed, and negatively skewed]

November 13, 2023 Data Mining: Concepts and Techniques 17

Measuring the Dispersion of Data

• Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 − Q1
  - Five-number summary: min, Q1, median, Q3, max
  - Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
  - Outlier: usually, a value at least 1.5 × IQR above the third quartile or below the first quartile

Measuring the Dispersion of Data
• Variance and standard deviation (sample: s, population: σ)
  - Variance: (algebraic, scalable computation)
    - The average of the squared differences from the mean.
  - Standard deviation s (or σ) is the square root of variance s² (or σ²)
  - The standard deviation is a measure of how spread out numbers are.

  $$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$

  $$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$$

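Each variance formula has two algebraically equal forms: one built from deviations around the mean, and one built from running sums of x and x², which is what makes the computation "scalable". The sketch below checks that they agree on a tiny data set. The function names and the data are assumptions for illustration.

```python
from math import isclose, sqrt

def sample_variance(xs):
    """s^2 from the definition: squared deviations divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_variance_from_sums(xs):
    """s^2 from the sums of x and x^2 only, so it can be accumulated in one pass."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

def population_variance(xs):
    """sigma^2 as the mean of x^2 minus the squared mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2

data = [1, 2, 3]                                 # illustrative values
s2 = sample_variance(data)                       # (1 + 0 + 1) / 2 = 1.0
s2_sums = sample_variance_from_sums(data)        # (14 - 36 / 3) / 2 = 1.0
sigma2 = population_variance(data)               # 14 / 3 - 4, about 0.667
s = sqrt(s2)                                     # 1.0
```

The sum-based form needs only one pass over the data, but with large values and small spread it can lose precision to cancellation, so the deviation form is often the safer choice when the data fits in memory.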
Standard Deviation
• http://standard-deviation.appspot.com/

Python Code Examples
• Describing a numeric Series.
    >>> import pandas as pd
    >>> s = pd.Series([1, 2, 3])
    >>> s.describe()
    count    3.0
    mean     2.0
    std      1.0
    min      1.0
    25%      1.5
    50%      2.0
    75%      2.5
    max      3.0
    dtype: float64
• Describing a categorical Series.
    >>> s = pd.Series(['a', 'a', 'b', 'c'])
    >>> s.describe()
    count     4
    unique    3
    top       a
    freq      2
    dtype: object

Standard Deviation

(A) Test Score (X)   (C) X − Mean (d)   (E) d²
100                  -50
110                  -40
120                  -30
130                  -20
140                  -10
150                  0
160                  10
170                  20
180                  30
190                  40
200                  50
SUM

Standard Deviation

(A) Test Score (X)   (B) Frequency (f)   (C) X − Mean (d)   (D) fd   (E) fd²
100                  8                   -50                -400     20,000
110                  13                  -40                -520     20,800
120                  17                  -30                -510     15,300
130                  20                  -20                -400     8,000
140                  21                  -10                -210     2,100
150                  22                  0                  0        0
160                  21                  10                 210      2,100
170                  20                  20                 400      8,000
180                  17                  30                 510      15,300
190                  13                  40                 520      20,800
200                  8                   50                 400      20,000
SUM                  180                                             132,400

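The frequency table can be turned into a standard deviation by dividing the fd² total by the number of scores and taking the square root. A short sketch using the table's scores and frequencies; dividing by n (the population form) is an assumption, since the slide does not say whether the sample or population formula is intended.

```python
from math import sqrt

scores = list(range(100, 201, 10))                  # 100, 110, ..., 200
freqs = [8, 13, 17, 20, 21, 22, 21, 20, 17, 13, 8]

n = sum(freqs)                                                        # 180
mean = sum(f * x for x, f in zip(scores, freqs)) / n                  # 150.0
sum_fd2 = sum(f * (x - mean) ** 2 for x, f in zip(scores, freqs))     # 132400.0

variance = sum_fd2 / n        # 132400 / 180, about 735.56
std_dev = sqrt(variance)      # about 27.12
```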
Example: Dispersion of Data
• 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
• Q1
• Q2
• Q3
• IQR
• Five-number summary
• Variance
• SD

Boxplot Analysis
• Five-number summary of a distribution
  - Minimum, Q1, Median, Q3, Maximum
• Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extended to Minimum and Maximum
  - Outliers: points beyond a specified outlier threshold, plotted individually

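The quantities a boxplot draws can be computed for the salary data with the standard library. Quartile conventions differ between textbooks and libraries, so hand-computed quartiles may not match these values exactly; this sketch assumes the "inclusive" linear-interpolation method of `statistics.quantiles` and the population form of variance.

```python
from statistics import pstdev, pvariance, quantiles

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles by linear interpolation between neighbouring sorted values.
q1, q2, q3 = quantiles(salary, n=4, method="inclusive")   # 49.25, 54.0, 64.75
iqr = q3 - q1                                             # 15.5

five_number = (min(salary), q1, q2, q3, max(salary))

# The 1.5 x IQR rule: anything outside the fences is plotted individually.
lower_fence = q1 - 1.5 * iqr                              # 26.0
upper_fence = q3 + 1.5 * iqr                              # 88.0
outliers = [x for x in salary if x < lower_fence or x > upper_fence]   # [110]

variance = pvariance(salary)   # 4550 / 12, about 379.17
std_dev = pstdev(salary)       # about 19.47
```

Under this convention the whiskers would stop at 30 and 70, the most extreme values inside the fences, and 110 would be drawn as a separate point.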
Visualization of Data Dispersion: 3-D Boxplots


Properties of Normal Distribution Curve

• The normal (distribution) curve
  - From μ − σ to μ + σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  - From μ − 2σ to μ + 2σ: contains about 95% of the measurements
  - From μ − 3σ to μ + 3σ: contains about 99.7% of the measurements

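These percentages can be reproduced from the normal cumulative distribution function. A minimal sketch using `statistics.NormalDist`; the helper name `coverage` is an assumption, and the result does not depend on the μ and σ chosen.

```python
from statistics import NormalDist

def coverage(k, mu=0.0, sigma=1.0):
    """Probability that a normal value falls within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

within = {k: coverage(k) for k in (1, 2, 3)}
# roughly 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3
```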
Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary
• Histogram: x-axis shows values, y-axis represents frequencies
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

[Figure: histogram with x-axis ticks at 10000, 30000, 50000, 70000, 90000 and a y-axis from 0 to 40]

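The area-versus-height distinction can be made concrete with bins of unequal width: the bar height is the count divided by the bin width, so that height times width gives back the count. The function name `histogram`, the values, and the bin edges below are hypothetical.

```python
from bisect import bisect_right

def histogram(values, edges):
    """Return (counts, heights) for adjacent bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:
            # bisect_right finds the first edge greater than v; its bin is one to the left.
            counts[bisect_right(edges, v) - 1] += 1
    # Height is chosen so that area (height * width) equals the count.
    heights = [c / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, heights

values = [1, 2, 2, 3, 4, 6, 7, 12, 15, 18]   # illustrative data
edges = [0, 5, 10, 20]                       # last bin is twice as wide
counts, heights = histogram(values, edges)
# counts  is [5, 2, 3]
# heights is [1.0, 0.4, 0.3]
```

The last bin holds more values than the middle one (3 versus 2) yet gets a lower bar (0.3 versus 0.4), because it spans twice the width.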
Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation
  - The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
• Plots quantile information
  - For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
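The points of a quantile plot can be generated by pairing each sorted value with its fraction f_i. The slide does not say how f_i is computed; this sketch assumes the common convention f_i = (i − 0.5) / N, and `quantile_points` is a hypothetical helper name.

```python
def quantile_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
points = quantile_points(salary)
# first point: (0.5 / 12, 30); last point: (11.5 / 12, 110)
```

Plotting f_i on the x-axis against x_i on the y-axis gives the quantile plot; the jump to 110 at the right end is the kind of unusual occurrence the plot is meant to reveal.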

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

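The points of a q-q plot are matching quantiles from the two samples, which lets samples of different sizes be compared. In the sketch below the two price lists are invented for illustration (the slide's actual branch data is not given), and `qq_pairs` is a hypothetical helper built on `statistics.quantiles`.

```python
from statistics import quantiles

def qq_pairs(sample_a, sample_b, n=4):
    """Pair the n - 1 cut points of sample_a with those of sample_b."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical unit prices; the two samples have different sizes.
branch1 = [40, 45, 50, 55, 60, 70, 80, 95, 110]
branch2 = [50, 55, 60, 65, 75, 85, 95, 110, 120, 130, 140]

pairs = qq_pairs(branch1, branch2)
# [(50.0, 62.5), (60.0, 85.0), (80.0, 115.0)]

# Every Branch 1 quartile is below the matching Branch 2 quartile: a shift.
shifted = all(a < b for a, b in pairs)
```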
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated

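The direction of a linear relationship seen in a scatter plot can also be summarized by a single number. The slides do not name a measure; this sketch uses the Pearson correlation coefficient as one common choice, with invented data and a hypothetical helper name `pearson_r`.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: positive when y rises with x, negative when it falls."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least two points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]     # illustrative, tends to increase with x
falling = [6, 4, 5, 4, 2]    # the same values in reverse order

r_up = pearson_r(xs, rising)      # 8 / sqrt(88), about 0.85
r_down = pearson_r(xs, falling)   # about -0.85
```

A value near zero would correspond to the uncorrelated case, where the points show no overall upward or downward trend.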
Uncorrelated Data
