Exploring Data
Mrs. Watkins
      AP Statistics
  Unit 1: Chapters 1-4
   Statistics is the study of
DATA and VARIATION
Data: systematically recorded information
Variation: concept that data values will
be different from subject to subject
    Who or what do we study?
• Population: entire collection of subjects
  about which information is desired
 Ex: teenagers who purchase shoes
         without laces
• Sample: a subset of population which is
  used to gather information
     Ex: a random sample of 500 teenagers who
          purchased shoes without laces in the
          last month
Sample is a subset of
the population
Each member of the
sample is called an
observational unit (or
case or subject or
experimental unit)
           Samples
• We sample because it is simpler to
  use a small group
• We sample because it is cheaper
  to use a small group
We must ensure our sample is
  RANDOM and
  REPRESENTATIVE
             Measures
Population Parameters: a measure of the
population, like the population MEAN
(average)
      Symbol: μ (mu)
Sample Statistic: a measure of the sample,
like the sample MEAN (average)
      Symbol: x̄ (x bar)
            Variables
• Variable: characteristic of observational
  unit
• Quantitative: uses numerical values that
  are quantities
• Categorical: uses labels or groups that
  are not quantities
 Quantitative Variables
Two Types :
 Discrete-- “countable”
   # of people, # of plants, # of cars
 Continuous-- “measurable”
   cost, pulse rate, temperature,
 weight
   Categorical Variables
Two Types :
 Binary– two labels
 Yes/No or other two-group measures
 Non-binary– More than two labels possible
     Most variables are non-binary
For some companies now, data on gender is non-
 binary.
Example:
A company which makes sneakers
wants to survey 500 teens about
preferences for no-lace sneakers. Identify
each:
     Population:
     Sample:
     Observational Unit:
     Possible Variables:
Example:
Classify each variable:
     a)Weight of package delivered
    b) Age of customer
    c) Highest Level of Education of patient
Example:
Identify the Observational Unit:
a)Survival rates are collected about the 10
most common cancers in the US
b)Color preference for cars purchased by
200 customers at local Toyota dealership
c)Customers at outlet store are asked their
zip code for marketing purposes
Categorical
Data Graphs
             Bar Chart
Best Used: To display counts or percentages
  for categorical data
**should have space between bars
Advantage: Easiest to make
Example: Elementary School
          Circle Graph
Best Used: To display counts or percentages for
categorical data
**should be labeled with percents
Advantage: visually appealing
 Example: Elementary School
[Circle graph categories: Conewago, Londonderry, East Hanover, South Hanover, Nye]
      Frequency Tables
Variable   Tally       Freq.   Rel. Freq.
 9th       III
 10th      IIIIIII
 11th      IIIIIIIII
 12th      IIIII
Purpose: To organize raw data
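A minimal Python sketch of how the Freq. and Rel. Freq. columns could be filled in. The counts (3, 7, 9, 5) are an assumption read off the tally marks as they appear in the table, and the variable names are illustrative only.

```python
# Counts assumed from the tally marks in the table above.
counts = {"9th": 3, "10th": 7, "11th": 9, "12th": 5}

total = sum(counts.values())  # total number of observations

# Relative frequency = category count / total count.
rel_freq = {grade: n / total for grade, n in counts.items()}

for grade, n in counts.items():
    print(f"{grade:<5} {n:>3} {rel_freq[grade]:.3f}")
```

The relative frequencies should sum to 1.00, which is a quick check that no category was missed.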
    Cumulative Relative
       Frequency
Variable Rel. Freq Cum. Rel.Freq
 9th
 10th
 11th
 12th
Should add up to 100% or 1.00 (or close)
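As a rough illustration, cumulative relative frequency is a running total of the relative frequencies. The sketch below assumes the same counts read from the tally marks in the frequency table (3, 7, 9, 5).

```python
from itertools import accumulate

grades = ["9th", "10th", "11th", "12th"]
counts = [3, 7, 9, 5]  # assumed from the tally marks in the frequency table

total = sum(counts)
rel_freq = [n / total for n in counts]

# Running total of the relative frequencies, one entry per category.
cum_rel_freq = list(accumulate(rel_freq))

for grade, rf, crf in zip(grades, rel_freq, cum_rel_freq):
    print(f"{grade:<5} {rf:.3f} {crf:.3f}")
```

The last cumulative value is the one that should come out to 1.00 (or very close, after rounding).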
Quantitative
Data Graphs
Graphs for Quantitative Data
Dotplot
Stem/Leaf Plot
Histogram
              Dot Plot
Best used: small data sets with small range
Advantage: Show distribution of discrete
 values; shows gaps in data
        Stem/Leaf Plot
Best used: Small data set, two digits
Advantage: data values are preserved, quick
Back to Back Stem/Leaf Plot
  These graphs share a common stem
            Histogram
Best used: Large range, large amount of
 data
Advantage: Usually made by
 computer/calculator, can see shape easily
 Histograms—two types
FREQUENCY
   showing actual counts for each variable
   value on vertical axis
RELATIVE FREQUENCY
   showing proportion/percent for each
   variable value on vertical axis
   How to make histogram on
      TI---pages 42 and 71 in
                textbook
1. Stat—Edit—type in values
2. To sort—Stat—Edit—Sort A
3. To make histogram—StatPlot—turn on
  Select Histogram—Name X List as which
  list you want to display
  Freq: 1 (leave as 1)
  Press Zoom 9
  Press Trace to see values and adjust by
  using Window
            SOCS
When describing a distribution of data,
         put on your socs!
         DATA ANALYSIS
AP questions will ask you to “comment on
 the distribution”
S: SHAPE? Symmetric, skewed, bimodal
O: OUTLIERS? Any unusual values, gaps
C: CENTER ? Middle of the data
S: SPREAD? Range of data
ALWAYS DESCRIBE IN CONTEXT OF
 DATA
Skewed Right—most data are low
values, a few high
Examples: income, housing prices,
     number of speeding tickets per
driver in a year
Skewed Left: most data are high
values, a few low
Examples: scores on honors level
exam, blood pressure among
overweight patients, prices of auto
insurance for teens
Symmetric—data relatively equal on both
    sides of center
Examples: body temp, pulse, IQ
Bimodal—two peaks in distribution
Uniform: relatively equal
distribution
 Statistical Measures
and Data Distribution
        Mrs. Watkins
        AP Statistics
     Unit 1, Chapters 5,6
MEASURES OF CENTER
Mean: arithmetic average of all data values
    population mean: μx (read “mu”)
     sample mean: x̄ (read “x bar”)
Median: the middle value in a data set
    also referred to as 2nd quartile Q2
     and 50th percentile, P50
Midrange: average of the extremes
        (High + Low) / 2
Mode: the most common value in a data set
 —best for categorical data
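The four measures of center can be sketched in a few lines of Python. The data values here are made up for illustration and do not come from the notes.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical values, for illustration only

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle value (average of the two middle values here)
midrange = (max(data) + min(data)) / 2  # average of the extremes
mode = statistics.mode(data)            # most common value

print(mean, median, midrange, mode)
```

With an even number of values there is no single middle value, so the median is the average of the two middle ones.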
        RESISTANCE
Resistant Measures: measures that are
 NOT affected by extreme data values
Non-resistant Measures: measures
 that ARE affected by extreme data
 values
Mean, Midrange: NON-resistant
Median: resistant
                SHAPE
If the mean > median, then data distribution
       is skewed RIGHT. The mean is in the tail.
If the mean < median, then data distribution
       is skewed LEFT. The mean is in the tail.
If the mean ≈ median, then data distribution
       is approximately SYMMETRIC.
        MEASURES OF
          SPREAD
Range: Maximum – minimum
 This is a single value measure
   Resistant? NO
IQR (Interquartile Range): Q3 - Q1
  This is a single value measure
     Resistant? YES
    5 Number Summary
5 important numbers in data set:
     Min: lowest value
     Q1: first quartile (25th percentile)
     Med: middle (50th percentile)
     Q3: third quartile (75th percentile)
     Max: highest value
Q1, Med, Q3, may not be actual data values
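A possible sketch of a five-number summary in Python. It assumes the convention that Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the overall median when the count is odd; other textbooks and software use slightly different quartile rules. The function name and the data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves, excluding the overall median when the count is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

data = [1, 3, 4, 6, 8, 9, 12, 15]  # hypothetical values
print(five_number_summary(data))
```

For this data set Q1, the median, and Q3 all fall between two data values, which illustrates why they may not be actual data values.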
                BOXPLOT
graphical display of data using 5 number summary
 (if outliers shown, called “modified box plot”)
5 # Summary Law: {60, 68, 74, 85, 94}
5 # Summary Business: {65, 76, 86, 95, 100}
            OUTLIERS
Outliers: unusually large or small data
 values
     Can see on modified box plot
  IQR Test for Outliers
   Calculate (IQR )
      Multiply IQR x (1.5) = constant K
           Q1 - K = outlier lower fence
           Q3 + K = outlier upper fence
If any data values fall outside these
   fences (bounds), they are outliers
Example: IQR Test
A college student looks for a used textbook
on-line and finds the following costs:
      83 94 85 88 78 28 80
Are there outliers in this data set?
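One way to work this example is sketched below in Python. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data (excluding the overall median), which may differ slightly from other quartile rules.

```python
import statistics

prices = [83, 94, 85, 88, 78, 28, 80]  # costs from the example above

data = sorted(prices)
half = len(data) // 2
q1 = statistics.median(data[:half])    # median of the lower half
q3 = statistics.median(data[-half:])   # median of the upper half

iqr = q3 - q1
k = 1.5 * iqr                          # constant K
lower_fence = q1 - k
upper_fence = q3 + k

# Anything outside the fences is flagged as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Under that convention Q1 = 78 and Q3 = 88, so IQR = 10, K = 15, and the fences are 63 and 103. The value 28 falls below the lower fence, so it is an outlier.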
STANDARD DEVIATION
a measure of the average amount of
  deviation from the mean among the data
  values
 Population St. Deviation: σx (read “sigma of x”)
 Sample St. Deviation: sx (read s of x)
We use sx because we usually do not have
 entire population. NOT RESISTANT
            VARIANCE
*the square of the standard deviation
 *what you get before taking square root
     NOT RESISTANT
     Population Variance: σ²
     Sample Variance: s²
This measure not used much in elementary
  statistics but you need to know what it is.
 Formulas for Standard
      Deviation
 Sample:     $s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
 Population: $\sigma_x = \sqrt{\frac{\sum (x - \mu_x)^2}{n}}$
Variance is the number you get before the
square root is taken
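The two formulas can be followed step by step in Python. This is only a sketch: it reuses the used-textbook costs from the IQR example as data, and in practice the data are either a sample or a population, not both.

```python
import math
import statistics

data = [83, 94, 85, 88, 78, 28, 80]  # textbook costs from the IQR example

n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations from the mean

sample_var = ss / (n - 1)         # variance: the value before the square root
sample_sd = math.sqrt(sample_var)

pop_var = ss / n                  # same sum, divided by n instead of n - 1
pop_sd = math.sqrt(pop_var)

# The standard library gives the same results.
print(sample_sd, statistics.stdev(data))
print(pop_sd, statistics.pstdev(data))
```

Dividing by n - 1 instead of n makes the sample standard deviation slightly larger than the population version computed from the same numbers.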
        “Comment on the
          distribution”
You now have numbers to support your
    statements, rather than just graphs.
SHAPE: how is the data distributed?
OUTLIERS: do you have any outliers?
CENTER: where is the middle?
SPREAD: how widely does the data vary?
 Unusual Features: gaps, clusters
ADJUSTMENTS TO DATA SET
What would happen to the statistical
 measures if one very low or very high data
 value was added to the set?
 Mean:
 Standard Deviation:
 Median:
 IQR:
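A small experiment can suggest the answers. The sketch below uses made-up data and a hypothetical helper, and it assumes quartiles are the medians of the lower and upper halves of the sorted data.

```python
import statistics

def summary(values):
    """Return mean, sample standard deviation, median, and IQR."""
    data = sorted(values)
    half = len(data) // 2
    iqr = statistics.median(data[-half:]) - statistics.median(data[:half])
    return {"mean": statistics.mean(data), "sd": statistics.stdev(data),
            "median": statistics.median(data), "iqr": iqr}

base = [10, 12, 13, 15, 16, 18, 20]  # hypothetical values
with_extreme = base + [95]           # one very high value added

before = summary(base)
after = summary(with_extreme)
for name in before:
    print(name, round(before[name], 2), round(after[name], 2))
```

In this run the mean and standard deviation change a great deal, while the median and IQR barely move, which matches the idea of resistant and non-resistant measures.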
    TRANSFORMATIONS TO
           DATA
What would happen to the statistical
 measures if each data value had a
 constant added to or subtracted from it?
 Mean:
 Standard Deviation:
 Median:
 IQR:
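The effect of adding a constant can be checked numerically. This is a sketch with made-up data; the helper name is hypothetical and the quartiles are assumed to be the medians of the lower and upper halves.

```python
import statistics

def measures(values):
    """Return (mean, sample st. deviation, median, IQR)."""
    data = sorted(values)
    half = len(data) // 2
    iqr = statistics.median(data[-half:]) - statistics.median(data[:half])
    return (statistics.mean(data), statistics.stdev(data),
            statistics.median(data), iqr)

base = [10, 12, 13, 15, 16, 18, 20]  # hypothetical values
shifted = [x + 5 for x in base]      # add the constant 5 to every value

print(measures(base))
print(measures(shifted))
```

Here the mean and median each go up by 5, while the standard deviation and IQR stay the same: shifting every value moves the center but not the spread.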
    TRANSFORMATIONS TO
           DATA
What would happen to the statistical
 measures if each data value had a
 constant multiplied or divided by it?
 Mean:
 Standard Deviation:
 Median:
 IQR:
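The same kind of check works for multiplying by a constant. Again this is a sketch with made-up data and a hypothetical helper, assuming quartiles are the medians of the lower and upper halves.

```python
import statistics

def measures(values):
    """Return (mean, sample st. deviation, median, IQR)."""
    data = sorted(values)
    half = len(data) // 2
    iqr = statistics.median(data[-half:]) - statistics.median(data[:half])
    return (statistics.mean(data), statistics.stdev(data),
            statistics.median(data), iqr)

base = [10, 12, 13, 15, 16, 18, 20]  # hypothetical values
scaled = [x * 3 for x in base]       # multiply every value by 3

print(measures(base))
print(measures(scaled))
```

All four measures are multiplied by 3 here. Dividing by a constant is the same as multiplying by its reciprocal, so it rescales center and spread in the same way.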
        MEASURES OF
          POSITION
These give a numerical approximation of
 where a single data value stands
 compared to the whole distribution
Quartiles: mark 25th, 50th, 75th percentiles
Percentiles: mark what percent of data
    are equal to or below a certain value
           Z SCORE
Standardized Score: how a single
 value compares to entire data set
 in terms of position in distribution
z = (individual value – mean) / (st. deviation)
 Population: $z = \frac{x - \mu_x}{\sigma_x}$
 Sample:     $z = \frac{x - \bar{x}}{s_x}$
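As a quick sketch, the z score calculation is a one-line function in Python. The function name is illustrative, and the numbers are borrowed from the stopping-distance example later in these notes (mean 155 feet, st. deviation 5 feet).

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

print(z_score(145, 155, 5))  # -2.0
```

A negative z score means the value is below the mean; a positive one means it is above.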
      NORMAL MODEL
shows how continuous data is distributed
  symmetrically along an interval according
  to empirical rule
Empirical Rule:
  68% of data within 1 st. deviation of μ
  95 % of data within 2 st. deviations of μ
  99.7% of data within 3 st. deviations of μ
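The three percentages can be checked against the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The exact areas are about 0.6827, 0.9545, and 0.9973, which the Empirical Rule rounds to 68%, 95%, and 99.7%.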
       OUTLIER TEST
Using Empirical Rule:
 Data values with |z| > 2 (more than 2 st. deviations
         from the mean) are mild outliers
 Data values with |z| > 3 (more than 3 st. deviations
         from the mean) are extreme outliers
               NORMAL CURVE
a theoretical ideal about how
   traits/characteristics are distributed
Many human traits are approximately normally
 distributed such as height, body temp, IQ,
 pulse
Avoid using “normal” when describing data—say
  “approximately normal or symmetric” unless
  clearly mound-shaped, bell-shaped
       NORMAL CURVE
Normal curve—symmetric, mound-shaped
Area under curve = 1 for whole curve
A z score can be used to establish what % of
  the curve is less or more than the z score,
  and establish probability of a data value
  being in that position.
   Normal Curve Example #1
Studies on car safety report that stopping
distances follow a normal model. Suppose
that one model of car traveling at 62 mph
has a mean stopping distance of 155 feet
with st. dev. = 5 feet.
Draw the model:
   Normal Curve Example #1
a. What proportion of cars will stop in less
   than 145 feet?
b. What proportion of cars will need more
   than 160 feet to stop?
c. What proportion of cars will stop between
   145 and 165 feet?
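One possible way to check the three answers is with `statistics.NormalDist` from the Python standard library, used here as a stand-in for the calculator. The variable names are illustrative.

```python
from statistics import NormalDist

stopping = NormalDist(mu=155, sigma=5)  # stopping distance in feet, from the example

p_less_145 = stopping.cdf(145)                       # a. area below 145 (z = -2)
p_more_160 = 1 - stopping.cdf(160)                   # b. area above 160 (z = +1)
p_between = stopping.cdf(165) - stopping.cdf(145)    # c. area between z = -2 and z = +2

print(round(p_less_145, 4), round(p_more_160, 4), round(p_between, 4))
```

The results are close to what the Empirical Rule predicts: about 2.5% below 145 feet, about 16% above 160 feet, and about 95% between 145 and 165 feet.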
    PERCENTILES USING
      NORMAL CURVE
1. Find a z score(s)
2. Use calculator: normalcdf under DISTR
Looking for area > z score: normalcdf (z, ∞)
Looking for area < z score: normalcdf (−∞, z)
Looking for area between z scores:
                     normalcdf (z1, z2)
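For readers working outside the calculator, a rough Python equivalent of `normalcdf` can be sketched with `statistics.NormalDist`. The function below is a hypothetical helper written for these notes, not the calculator's own code, and the z score 1.5 is just an example value.

```python
import math
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under a normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

z = 1.5  # example z score
area_above = normalcdf(z, math.inf)      # area > z
area_below = normalcdf(-math.inf, z)     # area < z
area_between = normalcdf(-z, z)          # area between two z scores

print(round(area_above, 4), round(area_below, 4), round(area_between, 4))
```

The areas above and below the same z score always add to 1, since the whole curve has area 1.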
   Normal Curve Example #2
• Data from health studies show that the
  distribution of the length of human pregnancies is
  approximately normal with a mean of 270 days
  and st. dev. = 15 days.
• Draw the model:
   Normal Curve Example #2
a. What proportion of pregnancies will last
   more than 280 days?
b. What proportion of pregnancies will last
   less than 236 days?
c. What proportion of pregnancies will last
   between 290 and 310 days?
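These three proportions can be checked with `statistics.NormalDist` from the Python standard library, used here in place of the calculator's normalcdf. The variable names are illustrative.

```python
from statistics import NormalDist

pregnancy = NormalDist(mu=270, sigma=15)  # length in days, from the example

p_more_280 = 1 - pregnancy.cdf(280)                  # a. area above 280
p_less_236 = pregnancy.cdf(236)                      # b. area below 236
p_between = pregnancy.cdf(310) - pregnancy.cdf(290)  # c. area between 290 and 310

print(round(p_more_280, 4), round(p_less_236, 4), round(p_between, 4))
```

The answers come out to roughly 0.25, 0.012, and 0.087.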
  FINDING CUT OFF SCORES
If you are given a percentile or probability,
   and need to determine the “cut off score”
1. Sketch curve to determine where z score is
   located.
2. Determine if you want area above or below this
   percentile
3. Use INVNORM on calculator
       invNorm(area to the left, as a decimal) = z score
4. Use z score formula to solve for x.
   Inverse Norm Example #1
• Data from health studies show that the
  distribution of the length of human pregnancies is
  approximately normal with a mean of 270 days
  and st. dev. = 15 days.
• Find the 90th percentile for human
  pregnancies:
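The cut off steps can be followed in Python with `NormalDist.inv_cdf`, which plays the role of invNorm here. This is a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 270, 15  # days, from the example

# Step 3: z score with 90% of the area below it.
z = NormalDist().inv_cdf(0.90)

# Step 4: solve z = (x - mean) / sd for x.
cutoff = mean + z * sd

# Same answer in one step, using the model directly.
direct = NormalDist(mean, sd).inv_cdf(0.90)

print(round(z, 3), round(cutoff, 1), round(direct, 1))
```

The z score is about 1.28, so the 90th percentile is a little over 289 days.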
   Inverse Norm Example #2
• Data from a standardized test of 3rd grade
  reading ability show that the distribution of
  reading ability is approximately normal
  with an approx. mean of 85 and st. dev. = 6.
• Find the score interval for the middle 70%
  of 3rd grade reading abilities:
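For a middle interval, the leftover area is split evenly between the two tails. A sketch in Python, with `NormalDist.inv_cdf` standing in for invNorm and illustrative variable names:

```python
from statistics import NormalDist

reading = NormalDist(mu=85, sigma=6)  # from the example

middle = 0.70
tail = (1 - middle) / 2               # 15% in each tail

low_cut = reading.inv_cdf(tail)       # 15th percentile
high_cut = reading.inv_cdf(1 - tail)  # 85th percentile

print(round(low_cut, 2), round(high_cut, 2))
```

The two cut off scores are about 78.8 and 91.2, and they sit the same distance on either side of the mean because the normal curve is symmetric.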
   Does the data fit a normal
           model?
1. Check mean and median—how close are
   they, in context of data?
2. Make a NORMAL PROBABILITY PLOT
   on calculator. It should be approx. linear.
3. Make a BOXPLOT on calculator. It
   should be approx. symmetric.
AVOID histograms on calculator to check.
   Mean versus Median Check
Example: the mean spending on a laptop
among college students is $825 and median
spending is $749.
This means that the distribution is likely not
normal as these values are not close
enough to assert symmetry of the
distribution
Normal Probability Plot Check
Boxplot Check
Histogram Check