Quantitative Methods
SESSIONS 1-3
           CASE:
      Catalog Marketing
CONNECT!
Poll
http://timesofindia.indiatimes.com/home/polls
Social Media Analytics
https://socialblade.com/
The world’s most valuable resource is no longer oil, but data
http://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource?fsrc=scn/fb/te/bl/ed/regulatingtheinternetgiantsthedataeconomydemandsanewapproachtoantitrust
http://www.economist.com/news/finance-and-economics/21698656-jacking-up-prices-may-not-be-only-way-balance-supply-and-demand-taxis?fsrc=scn/fb/te/bl/ed/
    Some useful databases
• data.gov.in – The Indian Government’s open data portal. You can check out a few
  visualizations for inspiration here.
• World Bank – Open data from the World Bank. The platform provides several tools,
  such as the Open Data Catalog, world development indices, and education indices.
• RBI – Data available from the Reserve Bank of India.
• Five Thirty Eight Datasets – Datasets used in different stories. Each dataset includes
  the data, a dictionary explaining the data, and the link to the story carried by
  Five Thirty Eight.
COURSE EVALUATION
Specific Assessment Method   Weight
Quiz                         10%
Homework Assignment          10%
Mid-term Exam                25%
End-term Exam                35%
Group Project                20%
                     CASE: Catalog Marketing
Data Types
Scales
Statistic
Data Visualization
Histogram
Bar Chart
Pie Chart
Frequency table
Cross Tabs
SPSS download (14-day trial):
https://www-01.ibm.com/marketing/iwm/iwmdocs/tnd/data/web/en_US/trialprograms/W110742E06714B29.html
Types of Variables
•   Elements are the entities on which data is collected.
•   A variable is a characteristic of interest for the elements.
•   Categorical (qualitative) variables have values that can only be placed into categories,
    such as “yes” and “no.” Qualitative data are labels or names that identify an attribute
    of each element.
      ◦ Summarize them by counting the number, or the proportion, of elements in each category.
      ◦ Even if the categories carry a numeric code, arithmetic operations such as addition
        and subtraction do not give meaningful results.
•   Numerical (quantitative) variables have values that represent quantities (how much or how
    many).
      ◦ Discrete variables arise from a counting process.
      ◦ Continuous variables arise from a measuring process.
Types of Variables
Variables
  Categorical (defined categories)
    Examples: Marital Status, Political Party, Eye Color
  Numerical
    Discrete (counted items)
      Examples: Number of Children, Defects per hour
    Continuous (measured characteristics)
      Examples: Weight, Voltage
Four Levels of
Measurement
• Nominal level – data that is classified into categories and cannot be arranged in any
  particular order.
• Ordinal level – data arranged in some order, but the differences between data values
  cannot be determined or are meaningless.
• Interval level – similar to the ordinal level, with the additional property that
  meaningful amounts of differences between data values can be determined. There is no
  natural zero point.
• Ratio level – the interval level with an inherent zero starting point. Differences and
  ratios are meaningful for this level of measurement.
Data Scales
Nominal
  Basic characteristics: Numbers identify & classify objects
  Common examples: Social Security nos., numbering of football players
  Marketing examples: Brand nos., store types
  Permissible statistics: Percentages, mode (descriptive); chi-square, binomial test (inferential)
Ordinal
  Basic characteristics: Nos. indicate the relative positions of objects but not the
    magnitude of differences between them
  Common examples: Quality rankings, rankings of teams in a tournament
  Marketing examples: Preference rankings, market position, social class
  Permissible statistics: Percentile, median (descriptive); rank-order correlation,
    Friedman ANOVA (inferential)
Interval
  Basic characteristics: Differences between objects can be compared, zero point is arbitrary
  Common examples: Temperature (Fahrenheit, Celsius)
  Marketing examples: Attitudes, opinions, index nos.
  Permissible statistics: Range, mean, standard deviation (descriptive); product-moment
    correlation, t tests, regression (inferential)
Ratio
  Basic characteristics: Zero point is fixed, ratios of scale values can be compared
  Common examples: Length, weight
  Marketing examples: Age, sales, income, costs
  Permissible statistics: Geometric mean, harmonic mean (descriptive); coefficient of
    variation (inferential)
Descriptive Statistics
   A statistic is a value that summarizes the data of a particular variable.
   The central tendency is the extent to which all the data values group around a typical or
    central value.
   The variation is the amount of dispersion or scattering of values.
   The shape is the pattern of the distribution of values from the lowest value to the highest
    value.
                                          Mean
     The arithmetic mean (often just called the “mean”) is the most common
     measure of central tendency
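      ◦ For a sample (data) of size n:
        \[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
        where \bar{X} (pronounced x-bar) is the sample mean, n is the sample size,
        and X_i is the ith observed value.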
   Mean
The population mean is the sum of the values in the population divided by
the population size, N
    \[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N} \]
        Where       μ = population mean
                    N = population size
                    Xi = ith value of the variable X
MEAN
       The most common measure of central tendency
       Mean = sum of values divided by the number of values
       Affected by extreme values (outliers)
   [Figure: two dot plots on an axis from 11 to 20]
   Data 11, 12, 13, 14, 15:  Mean = (11 + 12 + 13 + 14 + 15)/5 = 65/5 = 13
   Data 11, 12, 13, 14, 20:  Mean = (11 + 12 + 13 + 14 + 20)/5 = 70/5 = 14
Median
    In an ordered array, the median is the “middle” number (50% above, 50%
    below)
    [Figure: two dot plots on an axis from 11 to 20; Median = 13 in both]
    Not affected by extreme values
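The contrast between the mean and the median can be checked with a short Python sketch. It assumes the median slide uses the same two five-value data sets as the mean slide (the figure itself is not recoverable), and the variable names are illustrative only.

```python
from statistics import mean, median

# Five-value data sets from the mean example; the second replaces 15 with 20.
# Assumption: the median slide plots these same two data sets.
without_extreme = [11, 12, 13, 14, 15]
with_extreme = [11, 12, 13, 14, 20]

for label, data in (("without extreme value", without_extreme),
                    ("with extreme value", with_extreme)):
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

The mean moves from 13 to 14 when the largest value changes, while the median stays at 13.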
Median
The position of the median when the values are in numerical order (smallest to largest):
    Median position = (n + 1)/2 position in the ordered data
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of the two middle numbers
     Mode
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical (nominal) data
There may be no mode
There may be several modes
  [Figure: two dot plots; one on an axis from 0 to 14 with Mode = 9, one on an axis from 0 to 6 with No Mode]
Measures of Central Tendency:
Review Example
    House Prices:
        $2,000,000
        $  500,000
        $  300,000
        $  100,000
        $  100,000
    Sum $3,000,000
    Mean: $3,000,000/5 = $600,000
    Median: middle value of ranked data = $300,000
    Mode: most frequent value = $100,000
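As a quick check of the review example, the three measures can be computed with Python's standard statistics module. This is a minimal sketch; the list name house_prices is an illustrative choice.

```python
from statistics import mean, median, mode

# House prices from the review example
house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

print("Mean:  ", mean(house_prices))    # sum of values / number of values
print("Median:", median(house_prices))  # middle value of the ranked data
print("Mode:  ", mode(house_prices))    # most frequent value
```

The single $2,000,000 house pulls the mean well above the median, which is why the median is often preferred for prices.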
Measures of Central Tendency:
Which Measure to Choose?
 •   The mean is generally used, unless extreme values (outliers) exist.
 •   The median is often used, since the median is not sensitive to extreme values.
     For example, median home prices may be reported for a region; it is less
     sensitive to outliers.
 •   In some situations it makes sense to report both the mean and the median.
Measures of Central Tendency:
Summary
Central Tendency
  Arithmetic Mean: \[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]
  Median: middle value in the ordered array
  Mode: most frequently observed value
Measures of Variation
  Variation: Range, Variance, Standard Deviation, Coefficient of Variation
  Measures of variation give information on the spread or variability or
  dispersion of the data values.
  [Figure: distributions with the same center and different variation]
Range
   • Simplest measure of variation
   • Difference between the largest and the smallest values:
                     Range = Xlargest – Xsmallest
         Example: [Figure: dot plot on an axis from 0 to 14]
                           Range = 13 - 1 = 12
Why The Range Can Be Misleading
      •   Ignores the way in which data are distributed
          [Figure: two dot plots on an axis from 7 to 12 with differently distributed values; Range = 12 - 7 = 5 in both]
      •   Sensitive to outliers
              1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
                                         Range = 5 - 1 = 4
              1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
                                       Range = 120 - 1 = 119
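The sensitivity of the range to a single outlier can be reproduced with a few lines of Python. This is only a sketch; the helper name value_range is hypothetical.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

# 11 ones, 8 twos, 4 threes and one 4, as listed in the example above
common_values = [1] * 11 + [2] * 8 + [3] * 4 + [4]
without_outlier = common_values + [5]
with_outlier = common_values + [120]

print(value_range(without_outlier))
print(value_range(with_outlier))
```

Changing one value out of 25 moves the range from 4 to 119.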
              Variance
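Average (approximately) of squared deviations of values from the mean
    Population variance:  \[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
    Sample variance:      \[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]
    Where
        \bar{X} = sample mean
        n = sample size; N = population size
        X_i = ith value of the variable X
        μ = population mean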
Standard Deviation
     Most commonly used measure of variation
     Shows variation about the mean
     Is the square root of the variance
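     Has the same units as the original data
     Population standard deviation:  \[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
     Sample standard deviation:      \[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]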
Measures of Variation:
Comparing Standard Deviations
     [Figure: a distribution with a smaller standard deviation compared with one with a larger standard deviation]
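        Example
Sample data (X_i):    10    12    14    15    17   18   18    24
                    n = 8         Mean = \bar{X} = 16
    \[ S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}} \]
    \[ = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} \]
    \[ = \sqrt{\frac{130}{7}} = 4.3095 \]
    S is a measure of the “average” scatter around the mean.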
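The sample standard deviation in this example can be verified step by step. The sketch below uses Python's statistics module, whose stdev function uses the n - 1 denominator; the variable names are illustrative.

```python
from statistics import mean, stdev

sample = [10, 12, 14, 15, 17, 18, 18, 24]

x_bar = mean(sample)                                     # sample mean
squared_deviations = sum((x - x_bar) ** 2 for x in sample)
s = stdev(sample)                                        # sqrt(squared_deviations / (n - 1))

print("mean =", x_bar)
print("sum of squared deviations =", squared_deviations)
print("S =", round(s, 4))
```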
Measures of Variation:
Comparing Standard Deviations
[Figure: three dot plots on an axis from 11 to 21]
Data A:  Mean = 15.5,  S = 3.338
Data B:  Mean = 15.5,  S = 0.926
Data C:  Mean = 15.5,  S = 4.567
Quartiles
Quartile Measures
   Quartiles split the ranked data into 4 segments with an
   equal number of values per segment.
               25%        25%        25%        25%
                     Q1         Q2         Q3
       The first quartile, Q1, is the value for which 25% of the
        values are smaller and 75% are larger.
       Q2 is the same as the median (50% of the values are
        smaller and 50% are larger).
       Only 25% of the values are greater than the third quartile.
            Quartile Measures:
            Locating Quartiles
Find a quartile by determining the value in the
appropriate position in the ranked data, where:
 First quartile position:   Q1 = (n+1)/4   ranked value.
 Second quartile position: Q2 = (n+1)/2    ranked value.
 Third quartile position:   Q3 = 3(n+1)/4 ranked value.
                where n is the number of observed values.
Quartile Measures:
Calculation Rules
When calculating the ranked position use the following rules:
 ◦ If the result is a whole number then it is the ranked position to use.
 ◦ If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.) then average the two corresponding data values.
 ◦ If the result is not a whole number or a fractional half then round the result to the nearest integer to
   find the ranked position.
Quartile Measures:
Locating Quartiles
  Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
       (n = 9)
       Q1 is in the (9+1)/4 = 2.5 position of the ranked data
       so use the value half way between the 2nd and 3rd values,
                                  so    Q1 = 12.5
                 Q1 and Q3 are measures of non-central location
                 Q2 = median, is a measure of central tendency
      Quartile Measures
      Calculating The Quartiles: Example
 Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
  (n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
                           so Q1 = (12+13)/2 = 12.5.
Q2 is in the (9+1)/2 = 5th position of the ranked data,
                           so Q2 = median = 16.
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
                          so Q3 = (18+21)/2 = 19.5.
         Q1 and Q3 are measures of non-central location.
         Q2 = median, is a measure of central tendency.
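The positioning and rounding rules from the slides can be written as a small Python sketch. The function names are hypothetical, and the code follows the slide rules rather than any library default; statistical packages may use other quartile conventions and give slightly different answers.

```python
def value_at_position(ordered, position):
    """Value at a 1-based ranked position, following the slide rules."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:                      # whole number: use that ranked value
        return ordered[whole - 1]
    if fraction == 0.5:                    # fractional half: average the two neighbours
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[round(position) - 1]    # otherwise round to the nearest position


def quartiles(values):
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q2 = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
print("Interquartile range =", q3 - q1)
```

For the nine-value example this reproduces Q1 = 12.5, Q2 = 16 and Q3 = 19.5.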
 Calculating The Interquartile Range
Example:
   X minimum = 12,  Q1 = 30,  Median (Q2) = 45,  Q3 = 57,  X maximum = 70
   (25% of the values fall in each of the four segments)
   Interquartile range = Q3 – Q1 = 57 – 30 = 27
Five Number Summary and
The Boxplot
   The Boxplot: A Graphical display of the data based on the five-number
   summary:
          Xsmallest -- Q1 -- Median -- Q3 -- Xlargest
     Example:
     [Figure: boxplot marking Xsmallest, Q1, Median, Q3 and Xlargest; each of the four segments holds 25% of the data]
  Distribution Shape and
  The Boxplot
[Figure: boxplots for left-skewed, symmetric and right-skewed distributions, each marking Q1, Q2 and Q3]
Measures of Variation:
Summary Characteristics
    •   The more the data are spread out, the greater the range, variance, and
        standard deviation.
    •   The more the data are concentrated, the smaller the range, variance, and
        standard deviation.
    •   If the values are all the same (no variation), all these measures will be
        zero.
    •   None of these measures are ever negative.
Measures of Variation:
The Coefficient of Variation
      Measures relative variation
      Always in percentage (%)
      Shows variation relative to mean
      Can be used to compare the variability of two or more sets of data measured
      in different units
      \[ CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\% \]
  Measures of Variation:
  Comparing Coefficients of Variation
Stock A:
 ◦ Average price last year = $50
 ◦ Standard deviation = $5
 ◦ \[ CV_A = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]
Stock B:
 ◦ Average price last year = $100
 ◦ Standard deviation = $5
 ◦ \[ CV_B = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% \]
Both stocks have the same standard deviation, but stock B is less variable relative to its price.
  Measures of Variation:
  Comparing Coefficients of Variation
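Stock A:
 ◦ Average price last year = $50
 ◦ Standard deviation = $5
 ◦ \[ CV_A = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]
Stock C:
 ◦ Average price last year = $8
 ◦ Standard deviation = $2
 ◦ \[ CV_C = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$2}{\$8} \cdot 100\% = 25\% \]
Stock C has a much smaller standard deviation but a much higher coefficient of variation.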
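The three stock comparisons can be reproduced with a one-line function. This is a minimal sketch; the function name and the dictionary layout are illustrative choices.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (S / X-bar) * 100, expressed as a percentage."""
    return 100 * std_dev / mean


# (average price last year, standard deviation) for each stock, in dollars
stocks = {"A": (50, 5), "B": (100, 5), "C": (8, 2)}

for name, (avg_price, std_dev) in stocks.items():
    cv = coefficient_of_variation(std_dev, avg_price)
    print(f"Stock {name}: CV = {cv:.0f}%")
```

Because the CV is unit-free, it allows stocks at very different price levels to be compared directly.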
Shape of a Distribution
Describes how data are distributed
Two useful shape related statistics are:
◦ Skewness
 ◦ Measures the extent to which data values are not symmetrical
◦ Kurtosis
 ◦ Kurtosis affects the peakedness of the curve of the distribution—that is, how
   sharply the curve rises approaching the center of the distribution
Shape of a Distribution (Skewness)
     Measures the extent to which data is not symmetrical
        Left-Skewed:   Mean < Median,  skewness statistic < 0
        Symmetric:     Mean = Median,  skewness statistic = 0
        Right-Skewed:  Median < Mean,  skewness statistic > 0
Coefficient of Skewness
Summary measure for skewness
    \[ Sk = \frac{3(\mu - Md)}{\sigma} \]
The coefficient of skewness (Sk) compares the mean and the median relative to the
magnitude of the standard deviation.
μ is the mean; Md is the median; Sk is the coefficient of skewness; σ is the standard deviation.
If Sk < 0, the distribution is negatively skewed (skewed to the left).
If Sk = 0, the distribution is symmetric (not skewed). If Sk is close to 0, it’s almost
symmetric
If Sk > 0, the distribution is positively skewed (skewed to the right).
When the data are skewed, as in a right-skewed data set, the mean is dragged in the
direction of the skew.
In these situations, the median is generally considered to
be the best representative of the central location of the
data.
The more skewed the distribution, the greater the
difference between the median and mean, and the greater
the emphasis that should be placed on using the median as
opposed to the mean.
A classic example of a right-skewed distribution
is income (salary), where higher earners give a false
picture of the typical income if it is expressed as a
mean and not a median.
Skewness – An Example
    Following are the earnings per share for a sample of 15
     software companies for the year 2010. The earnings per share
     are arranged from smallest to largest.
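Step 1: Compute the mean
    \[ \bar{X} = \frac{\sum X}{n} = \frac{\$74.26}{15} = \$4.95 \]
Step 2: Compute the standard deviation
    \[ s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{(\$0.09 - \$4.95)^2 + \cdots + (\$16.40 - \$4.95)^2}{15 - 1}} = \$5.22 \]
Step 3: Compute the median
    Median = $3.18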
Sample Skewness
      Step 4: Compute sk
    \[ sk = \frac{3(\bar{X} - M)}{s} = \frac{3(4.95 - 3.18)}{5.22} = 1.017241 \]
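The skewness calculation for the earnings-per-share example can be checked from its summary values. This sketch uses only the mean, median and standard deviation reported in the steps, since the 15 individual values are not listed; the function name is hypothetical.

```python
def coefficient_of_skewness(mean, median, std_dev):
    """Sk = 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


# Summary values from the earnings-per-share example (in dollars)
sk = coefficient_of_skewness(mean=4.95, median=3.18, std_dev=5.22)
print(round(sk, 6))
```

A positive result indicates that the earnings are skewed to the right, consistent with the mean ($4.95) being larger than the median ($3.18).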
Kurtosis
 Kurtosis measures how peaked the histogram is
    \[ \text{kurtosis} = \frac{\sum_{i} (x_i - \bar{x})^4}{n s^4} \]
 The kurtosis of a normal distribution is 3
 Kurtosis characterizes the relative peakedness or flatness or
 tailedness of a distribution compared to the normal distribution
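The kurtosis formula can be sketched in Python as written on the slide. Two assumptions are made here: s is taken to be the sample standard deviation (n - 1 denominator), which the slide does not state, and the small data set is hypothetical. Statistical packages often apply other corrections or report excess kurtosis (kurtosis minus 3), so their output may differ.

```python
from statistics import mean, variance


def kurtosis(values):
    """Sum of fourth-power deviations divided by n * s^4, as in the slide formula."""
    n = len(values)
    x_bar = mean(values)
    s_squared = variance(values)   # assumption: sample variance, so s^4 = s_squared ** 2
    fourth_power_sum = sum((x - x_bar) ** 4 for x in values)
    return fourth_power_sum / (n * s_squared ** 2)


example = [1, 2, 3, 4, 5]          # hypothetical data, evenly spread
k = kurtosis(example)
print(round(k, 3), "excess:", round(k - 3, 3))
```

Evenly spread values give a kurtosis well below 3, that is, a flatter shape than the normal distribution.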
Shape of a Distribution -- Kurtosis
   [Figure: standard symmetric densities compared with the bell-shaped (normal) curve.
    The labels use excess kurtosis, that is, kurtosis minus 3.]
     Sharper peak than bell-shaped (excess kurtosis > 0)
     Bell-shaped (excess kurtosis = 0)
     Flatter than bell-shaped (excess kurtosis < 0)
     https://upload.wikimedia.org/wikipedia/commons/e/e6/Standard_symmetric_pdfs.png
Mean=0; Standard Deviation=1; Skewness=0
     Uniform(min=−√3, max=√3):    kurtosis = 1.8, excess = −1.2
     Normal(μ=0, σ=1):            kurtosis = 3, excess = 0
     Logistic(α=0, β=0.55153):    kurtosis = 4.2, excess = 1.2
Marks   Number of students
10      5
15      10
20      16
25      20
30      14
35      8
40      4
WORKING WITH GROUPED DATA
Calculation of Grouped Mean
  Sometimes data are already grouped, and you are interested in
  calculating summary statistics
  Interval       Frequency (f)   Midpoint (M)    f*M
  20-under 30         6              25          150
  30-under 40        18              35          630
  40-under 50        11              45          495
  50-under 60        11              55          605
  60-under 70         3              65          195
  70-under 80         1              75           75
  Total              50                         2150
    \[ \mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0 \]
 Median of Grouped Data - Example
   Select the class interval with cumulative frequency just greater than N/2
 Class Interval   Frequency   Cumulative Frequency
 20-under 30          6             6
 30-under 40         18            24
 40-under 50         11            35
 50-under 60         11            46
 60-under 70          3            49
 70-under 80          1            50
                  N = 50
    \[ Md = L + \frac{\frac{N}{2} - cf_p}{f_{med}} (W) = 40 + \frac{\frac{50}{2} - 24}{11} (10) = 40.909 \]
L: Lower limit of the median class
cfp : cumulative frequency of the previous class to the median class
fmed : frequency of median class
W: Length of the class interval
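The grouped mean and grouped median can both be computed from the class limits and frequencies. The following is a minimal sketch with hypothetical function names; it assumes equal-width, non-overlapping classes given in ascending order.

```python
# (lower limit, upper limit, frequency) for each class interval
classes = [(20, 30, 6), (30, 40, 18), (40, 50, 11),
           (50, 60, 11), (60, 70, 3), (70, 80, 1)]


def grouped_mean(classes):
    """Sum of f * M divided by the sum of f, where M is the class midpoint."""
    total = sum(f for _, _, f in classes)
    return sum(f * (low + high) / 2 for low, high, f in classes) / total


def grouped_median(classes):
    """Md = L + ((N/2 - cfp) / fmed) * W for the median class."""
    n = sum(f for _, _, f in classes)
    cumulative = 0                       # cfp: cumulative frequency before this class
    for low, high, f in classes:
        if cumulative + f > n / 2:       # first class whose cumulative frequency exceeds N/2
            return low + (n / 2 - cumulative) / f * (high - low)
        cumulative += f
    raise ValueError("no class intervals given")


print(grouped_mean(classes))
print(round(grouped_median(classes), 3))
```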
Mode of Grouped Data
Modal class has the greatest frequency
Mode = midpoint of the modal class
   Class Interval      Frequency
   20-under 30               6
   30-under 40              18
   40-under 50              11
   50-under 60              11
   60-under 70               3
   70-under 80               1
   Modal class: 30-under 40, so Mode = (30 + 40)/2 = 35
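Variance and Standard Deviation
of Grouped Data
    Population:  \[ \sigma^2 = \frac{\sum f (M - \mu)^2}{N} \]     \[ \sigma = \sqrt{\sigma^2} \]
    Sample:      \[ S^2 = \frac{\sum f (M - \bar{X})^2}{n - 1} \]     \[ S = \sqrt{S^2} \]
              M is the mid point.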
Population Variance and Standard
Deviation of Grouped Data
       Class Interval     f     M     fM    M - μ   (M - μ)²   f(M - μ)²
       20-under 30        6    25    150     -18       324        1944
       30-under 40       18    35    630      -8        64        1152
       40-under 50       11    45    495       2         4          44
       50-under 60       11    55    605      12       144        1584
       60-under 70        3    65    195      22       484        1452
       70-under 80        1    75     75      32      1024        1024
       Total             50          2150                         7200
    \[ \sigma^2 = \frac{\sum f (M - \mu)^2}{N} = \frac{7200}{50} = 144 \]
    \[ \sigma = \sqrt{144} = 12 \]
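The table calculation can be reproduced in a few lines. This sketch treats the 50 values as a population, as the slide does, and the variable names are illustrative.

```python
from math import sqrt

frequencies = [6, 18, 11, 11, 3, 1]
midpoints = [25, 35, 45, 55, 65, 75]

n_total = sum(frequencies)
mu = sum(f * m for f, m in zip(frequencies, midpoints)) / n_total

# Sum of f * (M - mu)^2 over all classes
weighted_squares = sum(f * (m - mu) ** 2 for f, m in zip(frequencies, midpoints))

population_variance = weighted_squares / n_total
population_std_dev = sqrt(population_variance)

print(mu, weighted_squares, population_variance, population_std_dev)
```

For a sample rather than a population, the divisor would be n - 1 instead of N.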
     Chebyshev Rule
Regardless of how the data are distributed, at least (1 - 1/k²) x 100% of the
values will fall within k standard deviations of the mean (for k > 1)
 ◦   Examples:
         (1 - 1/2²) x 100% = 75% for k = 2 (μ ± 2σ)
         (1 - 1/3²) x 100% = 88.89% for k = 3 (μ ± 3σ)
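The Chebyshev bound is easy to tabulate for any k greater than 1. A minimal sketch follows; the function name is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the rule applies only for k > 1")
    return (1 - 1 / k ** 2) * 100


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2f}%")
```

Note that this is a guaranteed minimum for any distribution, so the actual percentage is usually higher.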
The Empirical Rule
The empirical rule approximates the variation of data in a bell-shaped
distribution
Approximately 68% of the data in a bell-shaped distribution is within 1
standard deviation of the mean, or μ ± 1σ
The Empirical Rule
Approximately 95% of the data in a bell-shaped distribution lies
within two standard deviations of the mean, or µ ± 2σ
Approximately 99.7% of the data in a bell-shaped distribution
lies within three standard deviations of the mean, or µ ± 3σ
[Figure: bell curves shading 95% of the area within μ ± 2σ and 99.7% within μ ± 3σ]
Using the Empirical Rule
Suppose that the variable Math SAT scores is bell-shaped with a mean of 500 and a standard
 deviation of 90. Then,
  68% of all test takers scored between 410 and 590     (500 ± 90).
  95% of all test takers scored between 320 and 680     (500 ± 180).
  99.7% of all test takers scored between 230 and 770   (500 ± 270).
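The three SAT intervals follow directly from the mean and the standard deviation. The sketch below assumes a bell-shaped distribution, as the example does, and uses a hypothetical function name.

```python
def empirical_rule_intervals(mean, std_dev):
    """Intervals mean ± k * std_dev for k = 1, 2, 3, keyed by approximate coverage."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * std_dev, mean + k * std_dev)
            for k, label in coverage.items()}


intervals = empirical_rule_intervals(mean=500, std_dev=90)
for label, (low, high) in intervals.items():
    print(f"About {label} of scores fall between {low} and {high}")
```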
The Covariance
    The covariance measures the strength of the linear relationship
    between two numerical variables (X & Y).
    The sample covariance:
    \[ \text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]
    Only concerned with the strength of the relationship.
    No causal effect is implied.
Interpreting Covariance
   Covariance between two variables:
   cov(X,Y) > 0        X and Y tend to move in the same direction.
   cov(X,Y) < 0        X and Y tend to move in opposite directions.
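   cov(X,Y) = 0        X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not imply independence).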
   The covariance has a major flaw:
    ◦ It is not possible to determine the relative strength of the relationship from the size of
      the covariance.
Coefficient of Correlation
       Measures the relative strength of the linear relationship between two numerical
       variables.
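       Sample coefficient of correlation:
           \[ r = \frac{\text{cov}(X, Y)}{S_X S_Y} \]
      Where,
           \[ \text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]
           \[ S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
           \[ S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}} \]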
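The covariance and correlation formulas can be sketched in plain Python. The data below are hypothetical and chosen to be perfectly linear, and the function names are illustrative. Rescaling Y shows why the size of the covariance alone says little about strength, while r is unaffected.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Sum of (X_i - X-bar)(Y_i - Y-bar) divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation, the square root of cov(X, X)."""
    return sqrt(sample_covariance(values, values))


def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


xs = [1, 2, 3, 4, 5]                 # hypothetical data
ys = [2, 4, 6, 8, 10]                # exactly 2 * x
ys_rescaled = [100 * y for y in ys]  # same relationship in different units

print(sample_covariance(xs, ys), correlation(xs, ys))
print(sample_covariance(xs, ys_rescaled), correlation(xs, ys_rescaled))
```

The covariance grows by a factor of 100 when the units of Y change, but the coefficient of correlation stays at +1.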
    Scatter Plots of Sample Data with Various
    Coefficients of Correlation
    [Figure: scatter plots of Y against X for r = -1, r = -.6, r = +1, r = +.3 and r = 0]
       Calculate the mean, median, mode, standard deviation, and coefficient of
          variation for a sample of usage times of 50 ATM customers.
Time                                         Frequency
20-25                                        1
25-30                                        7
30-35                                        10
35-40                                        9
40-45                                        9
45-50                                        6
50-55                                        5
55-60                                        3