Data Exploration

Data mining involves exploring large datasets to identify valid, novel, useful, and understandable patterns. Data exploration is a preliminary analysis aimed at understanding data characteristics, guiding the selection of analysis tools, and leveraging human pattern recognition. Key concepts include types of data, measures of central tendency and variability, the importance of outliers, and the role of visualization in data analysis.

Uploaded by

dumi dlam

Data exploration

Definition
Data mining is the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable
patterns in data.

Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend
the patterns.
What is data exploration?
A preliminary exploration of the data to better
understand its characteristics. It is related to
Exploratory Data Analysis (EDA), an approach created
by the statistician John Tukey.
Key motivations of data exploration include:
•Helping to select the right tool for preprocessing or
analysis;
•Making use of humans’ abilities to recognize
patterns: people can recognize patterns not captured
by data analysis tools.
Exploratory Data Analysis
Get Data -> Exploratory Data Analysis -> Preprocessing -> Data Mining
What is data?
Categorical (Qualitative)
• Nominal scales – number is just a symbol that
identifies a quality
• 0=male, 1=female
• 1=green, 2=blue, 3=red, 4=white
• Ordinal – rank order
Quantitative (continuous and discrete)
• Interval – units are of identical size (e.g. years)
• Ratio – distance from an absolute zero (e.g.
age, reaction time)
Exploratory Data Analysis

Exploratory vs. Confirmatory
[Table: Frequency of Ad Recall, with columns Value Label, Value, Frequency, Percent, Valid Percent, Cumulative Percent]
[Figure: Bar Chart]
[Figure: Pie Chart]
What is a measurement?

Every measurement has 2 parts:
The True Score (the actual state of things in the world)
and
ERROR! (mistakes, bad measurement, report bias, etc.)
X = T + e
Frequency and Mode

• The frequency of an attribute value is the percentage of
the time the value occurs in the data set
• For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time.
• The mode of an attribute is the most frequent
attribute value
• The notions of frequency and mode are typically used
with categorical data
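The two definitions above can be sketched in a few lines of Python. This is only an illustration: the `colours` sample is invented for the example, and the helper names `frequencies` and `modes` are assumptions, not part of the slides.

```python
from collections import Counter

def frequencies(values):
    """Return each distinct value's share of the data set, as a percentage."""
    counts = Counter(values)
    total = len(values)
    return {value: 100 * count / total for value, count in counts.items()}

def modes(values):
    """Return every value tied for the highest count (there may be several)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical categorical sample, invented for illustration only.
colours = ["green", "blue", "red", "blue", "white", "blue", "red", "green"]

colour_freq = frequencies(colours)
colour_modes = modes(colours)
print(colour_freq)
print(colour_modes)
```

Returning a list from `modes` keeps ties visible instead of silently picking one value.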
Organizing your data in a
spreadsheet
Stacked data: multiple cases (rows) for each subject

Subject  condition  score
1        before     3
1        during     2
1        after      5
2        before     3
2        during     8
2        after      4
3        before     3
3        during     7
3        after      1

Unstacked data: only one case (row) per subject

Subject  before  during  after
1        3       2       5
2        3       8       4
3        3       7       1
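As a rough sketch of how the two layouts relate, the stacked rows from this slide can be regrouped into one record per subject with plain Python. The function names `unstack` and `stack` are chosen for this example only.

```python
# Stacked layout from the slide: one (subject, condition, score) row per measurement.
stacked = [
    (1, "before", 3), (1, "during", 2), (1, "after", 5),
    (2, "before", 3), (2, "during", 8), (2, "after", 4),
    (3, "before", 3), (3, "during", 7), (3, "after", 1),
]

def unstack(rows):
    """Group stacked rows into one dictionary of scores per subject."""
    wide = {}
    for subject, condition, score in rows:
        wide.setdefault(subject, {})[condition] = score
    return wide

def stack(wide, conditions=("before", "during", "after")):
    """Turn the per-subject layout back into stacked rows."""
    return [(subject, condition, wide[subject][condition])
            for subject in sorted(wide)
            for condition in conditions]

unstacked = unstack(stacked)
for subject, scores in unstacked.items():
    print(subject, scores["before"], scores["during"], scores["after"])
```

Converting back with `stack(unstacked)` reproduces the original rows, so no information is lost in either direction.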
Variable Summaries
• Indices of central tendency:
• Mean – the average value (sum of the data values divided by
the number of data items).
• Median – the middle value when the data items are arranged
in order. For an even number of data items, the
median is the average of the two middle values.
• Mode – value or values that occur most often. When all the
data values occur the same number of times, there is no
mode.
• Indices of variability:
• Range – the difference between the greatest and least
values. It is used to show the spread of the data in a data set.
• Variance – the average squared deviation from the mean
• Standard deviation – the square root of the variance
• Standard error of the mean (estimate)
Measures of Location:
Mean and Median
• The mean is the most common measure of the location of a set
of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.
Mean
The mean or average of the data values is
$$\text{mean} = \frac{\text{sum of all data values}}{\text{number of data values}}$$
$$\text{mean} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$$
Finding the Mean of Data
Find the mean of the data set given below:
4, 7, 8, 2, 1, 2, 4, 2

Mean:
4 + 7 + 8 + 2 + 1 + 2 + 4 + 2 = 30    Add the values.
30 ÷ 8 = 3.75    Divide the sum by the number of items (8 items).

The mean is 3.75.
Median

The median, m, is the middle value when the data
are ordered.
If there are an even number of values, the median
is the average of the two middle values.
The median splits the data in half.
Finding the Median of Data
(odd number of values)
Find the median of the data set given below:

3, 1, 10, 7, 13

The values are arranged in order as 1, 3, 7, 10, 13. Then the
median is 7.
Finding the Median of Data
(even number of values)
Find the median of the data set given below:
4, 7, 8, 2, 1, 2, 4, 2

Median:
1, 2, 2, 2, 4, 4, 7, 8    Arrange the values in order.

There are two middle values (2 and 4), so find the
mean of these two values.
2 + 4 = 6
6 ÷ 2 = 3

The median is 3.
Mode of Data
Mode – the number that appears most
frequently in a set of numbers.

1, 1, 3, 7, 10, 13
Mode = 1
Finding the Mode of Data
Find the mode of the data set given below:
4, 7, 8, 2, 1, 2, 4, 2

Mode:

1, 2, 2, 2, 4, 4, 7, 8 The value 2 occurs three times.

The mode is 2.
Range of Data
Range – the difference between the
greatest and the least value in a set of
numbers.

1, 1, 3, 7, 10, 13
Range = 12
Copyright © 2000 by Monica Yuskaitis
Finding the Range of Data
Find the range of the data set given below:
4, 7, 8, 2, 1, 2, 4, 2

Range:
1, 2, 2, 2, 4, 4, 7, 8    Subtract the least value
from the greatest value.

8 – 1 = 7

The range is 7.
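The four worked examples above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for illustration.

```python
import statistics

# The data set used in the worked examples.
data = [4, 7, 8, 2, 1, 2, 4, 2]

mean = statistics.mean(data)        # sum of the values divided by their count
median = statistics.median(data)    # average of the two middle values when n is even
mode = statistics.mode(data)        # the most frequent value
data_range = max(data) - min(data)  # greatest value minus least value

print(mean, median, mode, data_range)

# The shorter set from the Mode and Range definition slides.
small = [1, 1, 3, 7, 10, 13]
small_mode = statistics.multimode(small)  # list of all most-frequent values
small_range = max(small) - min(small)
print(small_mode, small_range)
```

`statistics.multimode` (Python 3.8 and later) returns a list, which is convenient when several values tie for most frequent.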
Outlier
“An outlier is an observation which deviates so much from the
other observations as to arouse suspicions that it was generated
by a different mechanism.” (Hawkins, 1980).

Outlier analyses include investigating whether the data are valid
or invalid.

States define what value or combination of values are outside the
expected norm.

Valid outliers may appear to be outside the norm, but
investigation demonstrates that the data are not in error.

Valid outliers may occur due to random variation, which occurs
due to chance and is inherent in a system.
Outlier
In the data set below, the value 12 is much less than
the other values in the set. An extreme value such as
this is called an outlier.

35, 38, 27, 12, 30, 41, 31, 35

[Figure: dot plot of the eight values on a number line from 10 to 42; the point at 12 lies far to the left of the rest]
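To see why the mean is called sensitive to outliers while the median and a trimmed mean are not, the sketch below summarizes this data set twice: once as given, and once with the outlier pushed much further out. The value -100 and the helper names `trimmed_mean` and `summarize` are assumptions made for this illustration.

```python
import statistics

def trimmed_mean(data, k=1):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(data)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this data set")
    return statistics.mean(ordered[k:len(ordered) - k])

def summarize(data):
    """Collect the three measures of location discussed in the slides."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "trimmed_mean": trimmed_mean(data),
    }

values = [35, 38, 27, 12, 30, 41, 31, 35]
# Hypothetical variant: the outlier 12 is replaced by a far more extreme -100.
extreme = [-100 if v == 12 else v for v in values]

original_summary = summarize(values)
extreme_summary = summarize(extreme)
print(original_summary)
print(extreme_summary)
```

With these inputs the mean falls from 31.125 to 17.125, while the median stays at 33 and the trimmed mean does not change at all.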
The Mean
Subject before during after
1 3 2 7
2 3 8 4
3 3 7 3
4 3 2 6
5 3 8 4
6 3 1 6
7 3 9 3
8 3 3 6
9 3 9 4
10 3 1 7

Sum = 30 50 50
/n 10 10 10
Mean = 3 5 5
The Variance: Sum of the squared
deviations divided by number of scores
Subject  before  during  after  Before-mean  (Before-mean)^2  During-mean  (During-mean)^2  After-mean  (After-mean)^2
1        3       2       7      0            0                -3           9                2           4
2        3       8       4      0            0                3            9                -1          1
3        3       7       3      0            0                2            4                -2          4
4        3       2       6      0            0                -3           9                1           1
5        3       8       4      0            0                3            9                -1          1
6        3       1       6      0            0                -4           16               1           1
7        3       9       3      0            0                4            16               -2          4
8        3       3       6      0            0                -2           4                1           1
9        3       9       4      0            0                4            16               -1          1
10       3       1       7      0            0                -4           16               2           4

Sum =    30      50      50     0            0                0            108              0           22
/n       10      10      10                  10                            10                           10
Mean =   3       5       5      VAR =        0                             10.8                         2.2
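The table can be reproduced with a short Python sketch. It divides by n, as the slide does, so it computes the population variance; the function name `population_variance` is chosen for this example.

```python
# Scores from the table, one list per condition (subjects 1 to 10).
scores = {
    "before": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "during": [2, 8, 7, 2, 8, 1, 9, 3, 9, 1],
    "after":  [7, 4, 3, 6, 4, 6, 3, 6, 4, 7],
}

def population_variance(values):
    """Sum of squared deviations from the mean, divided by the number of scores."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / n

variances = {name: population_variance(values) for name, values in scores.items()}
for name, variance in variances.items():
    print(name, variance)
```

The deviations themselves always sum to zero, which is why they are squared before being averaged.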
Variance
[Figure: three panels plotting the before, during, and after scores against subject (1 to 10), with the mean marked in each panel]
Distribution
•Means and variances are ways to
describe a distribution of scores.
•Knowing about your distributions is one
of the best ways to understand your
data
•Normality is the most common
distributional assumption in statistics, so it
is often important to check whether your data
are normally distributed.
Summary Statistics
•Summary statistics are numbers that summarize
properties of the data
•Summarized properties include frequency,
location and spread
• Examples: location - mean
spread - standard deviation
•Most summary statistics can be calculated in a
single pass through the data
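One way to see the single-pass claim is a running update of the count, mean, and variance. The sketch below uses Welford's online method, which is a technique chosen for this illustration and is not named in the slides; the function name `running_summary` is likewise an assumption.

```python
def running_summary(stream):
    """Compute count, mean, population variance, min and max in one pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    lowest = highest = None
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the mean before and after the update
        lowest = x if lowest is None else min(lowest, x)
        highest = x if highest is None else max(highest, x)
    if count == 0:
        raise ValueError("stream is empty")
    return {"count": count, "mean": mean, "variance": m2 / count,
            "min": lowest, "max": highest}

# The "during" scores from the variance table, consumed as a one-shot iterator.
summary = running_summary(iter([2, 8, 7, 2, 8, 1, 9, 3, 9, 1]))
print(summary)
```

Because each value is visited once and then discarded, the same function works on data that does not fit in memory.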
Standard deviation
Variance, as calculated earlier, is hard to interpret
on its own because it is expressed in squared units.
What does it mean to have a variance of 10.8?
Or 2.2? Or 1459.092? Or 0.000001?
Not much by itself. But if you could “standardize” that
value, you could talk about any variance (i.e.
deviation) in equivalent terms.
The standard deviation is simply the square
root of the variance, which puts the spread back
into the units of the original scores.
Standard deviation
The process of standardizing deviations goes like this:
1. Score (in the units that are meaningful)
2. Mean
3. Each score’s deviation from the mean
4. Square that deviation
5. Sum all the squared deviations (Sum of Squares)
6. Divide by n (if population) or n-1 (if sample)
7. Square root – now the value is in the units we
started with!!!
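The seven steps can be followed literally in code. This is a sketch under the assumption that the scores are the "during" column from the variance table; the `sample` flag is a name introduced here to switch between dividing by n and by n-1.

```python
import math

def standard_deviation(scores, sample=False):
    """Follow the slide's steps: deviations, squares, sum, divide, square root."""
    n = len(scores)
    if n < 2 and sample:
        raise ValueError("a sample standard deviation needs at least two scores")
    mean = sum(scores) / n                       # step 2
    deviations = [x - mean for x in scores]      # step 3
    squared = [d ** 2 for d in deviations]       # step 4
    sum_of_squares = sum(squared)                # step 5
    divisor = n - 1 if sample else n             # step 6
    return math.sqrt(sum_of_squares / divisor)   # step 7

during = [2, 8, 7, 2, 8, 1, 9, 3, 9, 1]
population_sd = standard_deviation(during)           # square root of 10.8
sample_sd = standard_deviation(during, sample=True)  # square root of 12
print(population_sd, sample_sd)
```

The sample version is always slightly larger because it divides the same sum of squares by a smaller number.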
Interpreting standard deviation (SD)
First, the SD will let you know about the distribution
of scores around the mean.
High SDs (relative to the mean) indicate the scores
are spread out
Low SDs tell you that most scores are very near the
mean.

[Figure: High SD vs. Low SD]
Interpreting standard deviation (SD)
Second, you can then interpret any individual score
in terms of the SD.
For example: mean = 50, SD = 10
versus mean = 50, SD = 1
A score of 55 is:
0.5 standard deviation units from the mean when SD = 10
(not much) OR
5 standard deviation units from the mean when SD = 1 (a lot!)
Standardized scores (Z)
Third, you can use SDs to create standardized
scores, that is, put every score on a common
scale by expressing it in units of SD.
Standardizing changes the scale of the scores,
not the shape of their distribution.
Subtract the mean from each score and divide
by SD
Z = (X – mean)/SD
This is truly an amazing thing
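The formula Z = (X – mean)/SD is a one-liner in code. The sketch below applies it to the score of 55 from the earlier slide, once with SD = 10 and once with SD = 1; the function name `z_score` is chosen for this example.

```python
def z_score(x, mean, sd):
    """Express a score as a number of standard deviations from the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd

z_wide = z_score(55, mean=50, sd=10)   # 0.5 SD above the mean
z_narrow = z_score(55, mean=50, sd=1)  # 5 SD above the mean
print(z_wide, z_narrow)
```

A negative result simply means the score lies below the mean.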
Standardized normal distribution
A set of Z-scores always has a mean of 0 and an SD of 1. Nice and
simple.
If the scores are normally distributed, we can get from this
the proportion of scores anywhere in the distribution.
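Both statements can be checked numerically. The sketch below standardizes the "during" scores from the variance table and then, purely under the assumption that scores are normally distributed, looks up the proportion lying below Z = 0.5 with `statistics.NormalDist` (Python 3.8 and later).

```python
import statistics
from statistics import NormalDist

during = [2, 8, 7, 2, 8, 1, 9, 3, 9, 1]

mean = statistics.fmean(during)
sd = statistics.pstdev(during)  # population SD, dividing by n as in the table
z_scores = [(x - mean) / sd for x in during]

z_mean = statistics.fmean(z_scores)  # 0, up to floating-point rounding
z_sd = statistics.pstdev(z_scores)   # 1, up to floating-point rounding

# Only meaningful if the underlying scores really are normal.
share_below = NormalDist().cdf(0.5)

print(z_mean, z_sd, share_below)
```

The first two results hold for any data set with a non-zero SD; the third depends entirely on the normality assumption.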
The trouble with normal
We violate the assumptions of many statistical tests
if the distributions of our variables are not
approximately normal.
Thus, we must first examine each variable’s
distribution and make adjustments when
necessary so that assumptions are met.
Visualization
Visualization is the conversion of data into a
visual or tabular format so that the characteristics
of the data and the relationships among data
items or attributes can be analyzed or reported.
Visualization of data is one of the most
powerful and appealing techniques for data
exploration.
Humans have a well-developed ability to
analyze large amounts of information that is
presented visually.
Can detect general patterns and trends.
Can detect outliers and unusual patterns.
History
1137 - earliest known map (China)
1603 - first star charts by Johann Bayer
1637 - Cartesian coordinate system
(Descartes)
History (2) - Statistical
1686 - first meteorological chart (Halley)
1693 - mortality tables of city of Breslau (Halley) ->
first attempt to correlate two variables
History (3) - 2D
Approx. 1750 - contour lines (height)
1817 - isotherms (temperature)
1829 - isochromatic lines (color)
1864 - isobars (pressure)
History (4) - 3D Imaging
1895 - X rays by W. Röntgen
1912 - x-ray crystallography (Laue) - position of
atoms in a crystal
1938 - x-ray sections or slices (3D!)
History (5) - Computer Graphics
1949 - SAGE air defense - tracked position of
aircraft by radar, analyzed results and displayed them on
CRT
1963 - Sketchpad (Sutherland) - interactive
graphical drawing system
Used to be BIG and EXPENSIVE
History (6) - Scientific Visualization
1987 - NSF report [McCormick87]

Personal/exploratory graphics - to enable a scientist to
gain more knowledge (interact with data)

Peer graphics - enable scientist to show information to
their colleagues and to collaborate

Presentation graphics - communicate information and
results (high quality, fully annotated)

Publication of visualization - enable others to use the
data (replicable)
History (7) - Augmented Reality
1983 - responsive environments (Myron Krueger)
1992 - CAVE
Visualization
Acquisition

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination
Visualization

A graphic is a composition of data-representative marks


Arrangement
Is the placement of visual elements within a display.
Can make a large difference in how easy it is to
understand the data
Example:
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with the
Iris Plant data set.
• Can be obtained from the UCI Machine Learning Repository
(from the statistician R. A. Fisher)
• Three flower types (classes):
• Setosa
• Virginica
• Versicolour
• Four (non-class) attributes
• Sepal width and length
• Petal width and length
[Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species.]
Visualization Techniques
• Star Plots
• Similar approach to parallel coordinates, but axes
radiate from a central point
• The line connecting the values of an object is a
polygon
• Chernoff Faces
• Approach created by Herman Chernoff
• This approach associates each attribute with a
characteristic of a face
• The values of each attribute determine the
appearance of the corresponding facial characteristic
• Each object becomes a separate face
• Relies on humans’ ability to distinguish faces
Star Plots for Iris Data
[Figure: star plots for Setosa, Versicolour, and Virginica flowers]
Chernoff Faces for Iris Data
[Figure: Chernoff faces for Setosa, Versicolour, and Virginica flowers]
Example: Iris data
• We show how the attributes, petal length, petal width, and
species type can be converted to a multidimensional array
• First, we discretized the petal width and length to have
categorical values: low, medium, and high
• We get the following table - note the count attribute
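The discretize-then-count step might look like the sketch below. Everything specific in it is an assumption made for illustration: the rows are invented placeholders rather than real Iris measurements, and the cut points in `LENGTH_CUTS` and `WIDTH_CUTS` are arbitrary choices, not values taken from the slides.

```python
from collections import Counter

def discretize(value, low_cut, high_cut):
    """Map a measurement to 'low', 'medium', or 'high' using two cut points."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"

# Invented placeholder rows: (petal length in cm, petal width in cm, species).
flowers = [
    (1.4, 0.2, "Setosa"),
    (1.5, 0.3, "Setosa"),
    (4.5, 1.5, "Versicolour"),
    (4.7, 1.4, "Versicolour"),
    (4.9, 1.8, "Virginica"),
    (5.1, 1.9, "Virginica"),
    (6.0, 2.5, "Virginica"),
]

# Assumed cut points; pick values that suit your own data.
LENGTH_CUTS = (2.5, 5.0)
WIDTH_CUTS = (0.75, 1.75)

# Each distinct (length, width, species) combination becomes one table row,
# and its count is the "count attribute" of that row.
counts = Counter(
    (discretize(length, *LENGTH_CUTS), discretize(width, *WIDTH_CUTS), species)
    for length, width, species in flowers
)

for (length_level, width_level, species), count in sorted(counts.items()):
    print(length_level, width_level, species, count)
```

Each key of `counts` addresses one cell of a three-dimensional array indexed by petal length, petal width, and species.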
Example: Sea Surface Temperature
The following shows the Sea Surface Temperature (SST)
for July 1982.
Tens of thousands of data points are summarized in a
single figure.
Scanning - Domains
Medical scanners (MRI, CT, SPECT, PET,
ultrasound)
Scanning - Applications
Primary education
Medical education for surgery, anesthesia
Illustration of medical procedures to the patient
Scanning - Applications
Surgical simulation for treatment planning
Tele-medicine
Intra-operative visualization in brain surgery, biopsies
Industrial purposes (quality control, security)
Games with realistic 3D effects?
Scientific Computation - Apps
Computational fluid dynamics (CFD)
Computational field simulations (CFS)
Vector Field Visualization Applications

Computational Fluid Dynamics Weather modeling


Measuring - Domains

• Orbiting satellites
• Spacecraft
• Seismic devices
• Statistical Data
Measuring - Applications

• for military intelligence
• weather and atmospheric studies
• planetary and interplanetary exploration
• oil, precious metal exploitation
• earthquake studies
• Statistical Analysis - Info Vis (Financial Data …)
