Business Analytics Using Data Mining: Term 6

Professor Vandith Pamuru
2021-22

DISCLAIMER: This academic course pack contains copyrighted materials that are meant to be downloaded only by authorized users for their course work. Please note that access is available only for the duration of the course. Sharing access with anyone (by copying, forwarding, or other means) is a violation of copyright law and is strictly prohibited.

Table of Contents

S.No   Topic
1      A Predictive Analytics Primer
2      Where predictive analytics is having the biggest impact
3      Screening for Chronic Kidney Disease
4      Link: https://www.predictiveanalyticsworld.com/machinelearningtimes/12-predictive-analytics-screw-ups/2049/
5      Cluster Analysis for Segmentation
6      Link: https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf

* Reference book at LRC: Data Mining for Business Analytics: Concepts, Techniques, and Applications

ANALYTICS

A Predictive Analytics Primer

by Thomas H. Davenport
SEPTEMBER 02, 2014

No one has the ability to capture and analyze data from the future. However, there is a way to predict the future using data from the past. It’s called predictive analytics, and organizations do it every day.

Has your company, for example, developed a customer lifetime value (CLTV) measure? That’s using predictive analytics to determine how much a customer will buy from the company over time. Do you have a “next best offer” or product recommendation capability? That’s an analytical prediction of the product or service that your customer is most likely to buy next. Have you made a forecast of next quarter’s sales? Used digital marketing models to determine what ad to place on what [...]

Predictive analytics are gaining in popularity, but what do you (a manager, not an analyst) really need to know in order to interpret results and make better decisions? How do your data scientists do what they do? By understanding a few basics, you will feel more comfortable working with and communicating with others in your organization about the results and recommendations from predictive analytics. The quantitative analysis isn’t magic, but it is normally done with a lot of past data, a little statistical wizardry, and some important assumptions. Let’s talk about each of these.

The Data: Lack of good data is the most common barrier to organizations seeking to employ predictive analytics. To make predictions about what customers will buy in the future, for example, you need to have good data on who is buying (which may require a loyalty program, or at least a lot of analysis of their credit cards), what they have bought in the past, the attributes of those products (attribute-based predictions are often more accurate than the “people who buy this also buy this” type of model), and perhaps some demographic attributes of the customer (age, gender, residential location, socioeconomic status, etc.). If you have multiple channels or customer touchpoints, you need to make sure that they capture data on customer purchases in the same way your previous channels did.

COPYRIGHT © 2014 HARVARD BUSINESS SCHOOL PUBLISHING CORPORATION. ALL RIGHTS RESERVED.
Reproduced with permission from the Publisher for use only in “Business Analytics using Data Mining [Term 6_PGP]” taught by “Professor Vandith Pamuru” at Indian School of Business-Mohali scheduled on “January 31 – March 03, 2022”.

All in all, it’s a fairly tough job to create a single customer data warehouse with unique customer IDs on everyone, and all past purchases customers have made through all channels. If you’ve already done that, you’ve got an incredible asset for predictive customer analytics.
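
To make the idea of a single customer view concrete, here is a minimal Python sketch of merging purchase feeds from several channels under one customer ID. Everything in it is an illustrative assumption rather than something the article specifies: the names `CustomerRecord` and `build_customer_view`, the three-field feed schema, the ID normalization rule, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerRecord:
    """One row of a hypothetical single-customer-view table."""
    customer_id: str
    purchases: list = field(default_factory=list)  # (channel, product, amount)

    @property
    def total_spend(self) -> float:
        # Sum purchase amounts across every channel for this customer.
        return sum(amount for _, _, amount in self.purchases)


def build_customer_view(channel_feeds: dict) -> dict:
    """Merge per-channel purchase feeds into one record per customer ID.

    channel_feeds maps a channel name to a list of dicts with the keys
    'customer_id', 'product', and 'amount' (an assumed common schema).
    """
    view = {}
    for channel, rows in channel_feeds.items():
        for row in rows:
            # Assumed normalization: trim whitespace and upper-case the ID
            # so the same customer matches across channels.
            cid = row["customer_id"].strip().upper()
            record = view.setdefault(cid, CustomerRecord(cid))
            record.purchases.append(
                (channel, row["product"], float(row["amount"]))
            )
    return view


# Illustrative data only; not taken from the article.
feeds = {
    "store": [{"customer_id": "c001", "product": "kettle", "amount": 30.0}],
    "web": [
        {"customer_id": "C001 ", "product": "toaster", "amount": 45.5},
        {"customer_id": "C002", "product": "kettle", "amount": 30.0},
    ],
}

customer_view = build_customer_view(feeds)
for cid, record in sorted(customer_view.items()):
    print(cid, len(record.purchases), record.total_spend)
```

The sketch only works because every channel reports purchases in the same shape, which is the point the article makes about capturing data the same way across touchpoints. In practice, matching customers across channels is usually much harder than normalizing an ID string.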

The Statistics: Regression analysis in its various forms is the primary tool that organizations use for predictive analytics. It works like this in general: An analyst hypothesizes that a set of independent variables (say, gender, income, visits to a website) are statistically correlated with the purchase of a product for a sample of customers. The analyst performs a regression analysis to see just how correlated each variable is; this usually requires some iteration to find the right combination of variables and the best model. Let’s say that the analyst succeeds and finds that each variable in the model is important in explaining the product purchase, and together the variables explain a lot of variation in the product’s sales. Using that regression equation, the analyst can then use the regression coefficients (the degree to which each variable affects the purchase behavior) to create a score predicting the likelihood of the purchase.

Voila! You have created a predictive model for other customers who weren’t in the sample. All you have to do is compute their score, and offer the product to them if their score exceeds a certain level. It’s quite likely that the high-scoring customers will want to buy the product, assuming the analyst did the statistical work well and that the data were of good quality.
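
As a rough illustration of the scoring step described above, the sketch below fits a small logistic regression on a made-up sample and then scores customers who were not in it. Logistic regression is only one of the "various forms" of regression, chosen here as an assumption because the outcome is buy or not buy. The data, the two variables (income and website visits, both rescaled to 0-1), the function names, and the 0.5 cutoff are all hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number to a 0-1 likelihood."""
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, x):
    """Predicted likelihood of purchase for one customer."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Estimate an intercept and one coefficient per variable
    using plain batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)  # weights[0] is the intercept
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = score(weights, x) - y
            grads[0] += error
            for j, value in enumerate(x):
                grads[j + 1] += error * value
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

# Hypothetical sample: (income, website visits), both rescaled to 0-1.
sample = [(0.2, 0.1), (0.3, 0.2), (0.4, 0.1), (0.3, 0.8),
          (0.5, 0.6), (0.7, 0.5), (0.8, 0.9), (0.9, 0.7)]
purchased = [0, 0, 0, 0, 1, 1, 1, 1]

weights = fit_logistic(sample, purchased)

# Customers who were not in the sample.
prospects = {"prospect_a": (0.9, 0.9), "prospect_b": (0.1, 0.1)}
CUTOFF = 0.5  # illustrative threshold, not a recommendation
offers = [name for name, x in prospects.items() if score(weights, x) >= CUTOFF]
print(offers)
```

In practice an analyst would more likely use a statistics package than hand-written gradient descent; the point here is only that the fitted coefficients turn each customer's variables into a single score that can be compared against a cutoff.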

The Assumptions: That brings us to the other key factor in any predictive model: the assumptions that underlie it. Every model has them, and it’s important to know what they are and monitor whether they are still true. The big assumption in predictive analytics is that the future will continue to be like the past. As Charles Duhigg describes in his book The Power of Habit, people establish strong patterns of behavior that they usually keep up over time. Sometimes, however, they change those behaviors, and the models that were used to predict them may no longer be valid.

What makes assumptions invalid? The most common reason is time. If your model was created several years ago, it may no longer accurately predict current behavior. The greater the elapsed time, the more likely customer behavior has changed. For example, some Netflix predictive models that were built on early Internet users had to be retired because later Internet users were substantially different. The pioneers were more technically focused and relatively young; later users were essentially everyone.

Another reason a predictive model’s assumptions may no longer be valid is if the analyst didn’t include a key variable in the model, and that variable has changed substantially over time. The great (and scary) example here is the financial crisis of 2008-9, caused largely by invalid models predicting how likely mortgage customers were to repay their loans. The models didn’t include the possibility that housing prices might stop rising, and even that they might fall. When they did start falling, it turned out that the models became poor predictors of mortgage repayment. In essence, the belief that housing prices would always rise was a hidden assumption in the models.

Since faulty or obsolete assumptions can clearly bring down whole banks and even (nearly!) whole economies, it’s pretty important that they be carefully examined. Managers should always ask analysts what the key assumptions are, and what would have to happen for them to no longer be valid. And both managers and analysts should continually monitor the world to see if key factors involved in assumptions might have changed over time.
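
One simple way to act on this advice is to track how often the model's predictions still match what customers actually do. The sketch below is a minimal, hypothetical example: the scores, outcomes, cutoff, and the 10-point tolerance are invented for illustration, and a real monitoring process would likely use far more data and more than one metric.

```python
def hit_rate(scores, outcomes, cutoff=0.5):
    """Share of customers whose predicted action (score >= cutoff)
    matched what they actually did (1 = bought, 0 = did not)."""
    hits = sum((s >= cutoff) == bool(o) for s, o in zip(scores, outcomes))
    return hits / len(scores)

def needs_review(baseline_rate, recent_rate, tolerance=0.10):
    """Flag the model when accuracy has dropped by more than `tolerance`."""
    return (baseline_rate - recent_rate) > tolerance

# Hypothetical scores from the same model at two points in time.
scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
outcomes_at_launch = [1, 1, 0, 0, 1, 0]  # behavior when the model was built
outcomes_recent = [0, 1, 1, 0, 0, 0]     # behavior several years later

baseline = hit_rate(scores, outcomes_at_launch)
recent = hit_rate(scores, outcomes_recent)
print(baseline, recent, needs_review(baseline, recent))
```

A falling hit rate does not say which assumption failed, only that the past the model learned from no longer resembles the present closely enough, which is the cue to revisit the assumptions with the analyst.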

With these fundamentals in mind, here are a few good questions to ask your analysts:

• Can you tell me something about the source of data you used in your analysis?
• Are you sure the sample data are representative of the population?
• Are there any outliers in your data distribution? How did they affect the results?
• What assumptions are behind your analysis?
• Are there any conditions that would make your assumptions invalid?
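
The outlier question in the list above can be made concrete with a quick check. The sketch below uses the common 1.5 x IQR rule of thumb (an assumption here; the article does not name a method) on made-up order values, and shows how much a single extreme value moves the average.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical order values in dollars; one customer is very unusual.
order_values = [42, 45, 47, 50, 52, 55, 58, 60, 250]

outliers = iqr_outliers(order_values)
mean_all = statistics.mean(order_values)
mean_trimmed = statistics.mean([v for v in order_values if v not in outliers])
print(outliers, round(mean_all, 1), round(mean_trimmed, 1))
```

Whether a flagged value should be dropped, capped, or kept is a judgment call for the analyst; the useful part for a manager is seeing how much the answer changes either way.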

Even with those cautions, it’s still pretty amazing that we can use analytics to predict the future. All we have to do is gather the right data, build the right type of statistical model, and be careful of our assumptions. Analytical predictions may be harder to generate than those by the late-night television soothsayer Carnac the Magnificent, but they are usually considerably more accurate.

Thomas H. Davenport is the president’s distinguished professor in management and information technology at Babson College, and cofounder of the International Institute for Analytics. He also contributes to the MIT Initiative on the Digital Economy as a fellow, and as a senior advisor to Deloitte Analytics. Author of over a dozen management books, his latest is Only Humans Need Apply: Winners and Losers in the Age of Smart Machines.

Copyright 2014 Harvard Business Publishing. All Rights Reserved. Additional restrictions may apply including the use of this content as assigned course material. Please consult your institution's librarian about any restrictions that might apply under the license with your institution. For more information and teaching resources from Harvard Business Publishing including Harvard Business School Cases, eLearning products, and business simulations please visit hbsp.harvard.edu.
                                     G
                                                          2 [a
                                  a_
/H
                                                                                                                                    /H
                             sik
02
                                                                                                                             du
                        n
ta2
                                                                                                 .ed
                     Ha
                                                                                                                      b.e
                                          up
]isb
                                                                                                                  t]is
                                     _G
[at
                                                                                                            2[a
                               ika
22
                                                                                                       02
                          ns
20
                                                                                                 ta2
                     Ha
                                                                      ta
                                                                   up
                                                                                        up
                                                             _G
                                                                                      _G
                                                        ika
                                                                                ika
                                                   ns
                                                                            ns
                                              Ha
Ha
Reproduced with permission from the Publisher for use only in “Business Analytics using Data Mining [Term 6_PGP]” taught by “Professor Vandith Pamuru” at Indian School of Business-Mohali scheduled on “January 31 – March 03, 2022
ANALYTICS

Where Predictive Analytics Is Having the Biggest Impact

by Jacob LaRiviere, Preston McAfee, Justin Rao, Vijay K. Narayanan and Walter Sun

MAY 25, 2016
The big data revolution is upon us. Firms are scrambling to hire a new brand of analysts dubbed “data
scientists,” and universities have responded to this demand by introducing data science courses into
degrees ranging from computer science to business. Survey-based reports find that firms are
currently spending an estimated $36 billion on storage and infrastructure, and that is expected to
double by 2020.

COPYRIGHT © 2016 HARVARD BUSINESS SCHOOL PUBLISHING CORPORATION. ALL RIGHTS RESERVED.
Once companies are logging and storing detailed data on all their customer engagements and internal
processes, what’s next? Presumably, firms are investing in big data infrastructure because they
believe that it offers a positive return on investment. However, looking at the surveys and consulting
reports, it is unclear what the precise use cases are that will drive this positive ROI from big data.

Our goal in this article is to offer specific, real-world case studies to show how big data has provided
value for companies that have worked with Microsoft’s analytics teams. These cases reveal the
circumstances in which big data predictive analytics are likely to enable novel and high-value
solutions, and the situations where the gains are likely to be minimal.

Predicting demand. The first use case involves predicting demand for consumer products that are in
the “long tail” of consumption. Firms value accurate demand forecasts because inventory is
expensive to keep on shelves and stockouts are detrimental to both short-term revenue and
long-term customer engagement. Aggregated total sales is a poor proxy because firms need to distribute
inventory geographically, necessitating hyperlocal forecasts. The traditional way of solving this
problem is using time-series econometrics with historical sales data. This method works well for
popular products in large regions but tends to fail when data gets thin because random noise
overwhelms the underlying signal.
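The thin-data problem can be made concrete with a small sketch. It assumes, purely for illustration, that unit sales in a period follow a Poisson distribution, so that the standard deviation equals the square root of the mean. The function name `relative_noise` and the sales volumes are hypothetical and do not come from the article.

```python
import math


def relative_noise(expected_units: float) -> float:
    """Noise-to-signal ratio (std / mean) under an assumed Poisson demand model."""
    if expected_units <= 0:
        raise ValueError("expected_units must be positive")
    # For a Poisson count, variance equals the mean, so std / mean = 1 / sqrt(mean).
    return 1.0 / math.sqrt(expected_units)


# Hypothetical monthly volumes, from a national aggregate down to one store.
for label, units in [("national", 10_000), ("regional", 400), ("single store", 4)]:
    print(f"{label:>12}: {units:>6} units, relative noise {relative_noise(units):.0%}")
```

Under this assumption, a product that faces roughly 1% noise in a national aggregate faces roughly 50% noise when it sells four units a month in one store, which is why additional signals matter most in the long tail.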
A big data solution to this problem is to use anonymized and aggregated web search or sentiment
data linked to each store’s location on top of the existing time-series data. Microsoft data scientists
have employed this approach to help a forecasting firm predict auto sales. Building models with web
search data as one of the inputs reduces mean absolute forecast error, a standard measure of
prediction accuracy, for monthly national sales predictions on the order of 40% from baseline for
auto makes with relatively small market shares, compared to traditional time-series models.
Although the gains were smaller for the most popular models at the national level, the relative
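Mean absolute forecast error, and the relative improvement over a baseline, can be computed as in the sketch below. The sales figures and both sets of forecasts are invented numbers chosen only to show the arithmetic; they are not results from the Microsoft case, and the function names are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Average of |actual - forecast| across all periods."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def relative_reduction(baseline_mae: float, new_mae: float) -> float:
    """Fraction by which the new model lowers error relative to the baseline."""
    return (baseline_mae - new_mae) / baseline_mae


# Illustrative monthly unit sales for a small-share auto make (invented values).
actual = [100, 120, 90, 110]
time_series_only = [110, 110, 100, 100]   # baseline forecasts
with_search_input = [106, 114, 96, 104]   # forecasts from a model that also uses search data

baseline_mae = mean_absolute_error(actual, time_series_only)
search_mae = mean_absolute_error(actual, with_search_input)
print(baseline_mae, search_mae, relative_reduction(baseline_mae, search_mae))
```

With these toy inputs the baseline error is 10 units and the search-augmented error is 6 units, a 40% relative reduction, which is the kind of comparison the reported figure summarizes.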
                                             In this case, the big data solution leverages the previously unused data point that people do a
                                                         2 [a
                                 a_
/H
/H
                                             considerable amount of social inquiry and research online before buying a car. The increased
                           sik
02
                                             prediction accuracy, in turn, makes it possible to achieve large increases in operational efficiency
                                                                                                     u
                                                                                                                            du
                      n
                                                                                               .ed
                   Ha
                                                                                                                    b.e
                                         up
]isb
                                             Anonymized web search data has proven to be helpful for other forecasts as well since online activity
                                                                                                                t]is
                                   _G
                                             often is a good leading proxy for purchases and actions of the general public. Having the additional
                                                                                    [at
2[a
                                             data is insufficient on its own. Processing search data and combining it with traditional sources is
                             ika
22
                                             vital in creating a successful prediction: We found that raw search query volume is insufficient in
                                                                                                     02
                         ns
20
                                                                    ta
                                                                 up
up
                                             Being intelligent about which signals to draw from big data requires care, and best practices can be
                                                            _G
_G
                                             case-specific. For example, single queries from a user might be less important than multiple queries
                                                      ika
                                                                               ika
                                                  ns
ns
                                             from a user. Although we used search data in this case study, a firm could just as easily use the
                                             location of users visiting their website or link detailed sales data to a customer’s location.
                                             Improved pricing. Using a single price is economically inefficient because part of the demand curve
                                             that could be profitably served is priced out of the market. As a consequence, firms regularly offer
                                             targeted discounts, promotions, and segment-based pricing to target different consumers.
                                             E-commerce websites have a distinct advantage in pursuing such an approach because they log
                                             detailed information on customer browsing, not just the goods they end up purchasing, and
                                             aggressively adjust prices over time. These price adjustments are a form of experimentation and,
                                             jointly with big data, allow firms to learn more about their customers’ price responsiveness.
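
To make the idea of learning price responsiveness from price adjustments concrete, here is a minimal sketch. It is not the authors' method. It assumes a constant-elasticity demand curve and estimates the elasticity as the ordinary least squares slope of log quantity on log price. The function name `estimate_elasticity` and all price and quantity figures are hypothetical, chosen only for illustration.

```python
import math


def estimate_elasticity(prices, quantities):
    """Return the OLS slope of log(quantity) on log(price).

    Under an assumed constant-elasticity demand curve q = a * p**e,
    this slope is an estimate of the price elasticity e.
    """
    xs = [math.log(p) for p in prices]
    ys = [math.log(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical experiment: five price points and the units sold at each.
# The quantities are synthetic, generated from q = 1000 * p**-1.5.
prices = [8.0, 9.0, 10.0, 11.0, 12.0]
quantities = [1000 * p ** -1.5 for p in prices]

elasticity = estimate_elasticity(prices, quantities)
print(f"Estimated price elasticity: {elasticity:.2f}")
```

An estimate below -1 would suggest that demand is elastic at the tested prices, so a price cut would raise revenue. Real transaction logs would also need controls for seasonality, promotions, and other factors that move demand at the same time as price.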
                                             Offline retailers can mimic e-commerce’s nuanced pricing strategies by tracking consumers through
                                             smartphone connectivity and logging which customers enter the store, what type of goods they look
                                             at, and whether they make a purchase. Machine learning applied to this data can algorithmically
                                             generate customer segments based on price responsiveness and preferences, which generally offers a
                                             Our experience with pricing advertising on the Bing search engine is that using big data can produce
                                             substantial gains by better matching advertisers to consumers. The success of algorithmic targeting
                                             has been well documented and is a key driver of revenue in the online advertising market. Advances in
                                             measurement technology increasingly allow offline firms to benefit from these types of gains through
                                             Predictive maintenance. Smoothly operating supply chains are vital for stable profits. Machine
                                             downtime imposes a cost to firms due to forgone productivity and can be particularly disruptive in
                                             both complex manufacturing supply chains and consumer products. Executives in asset-intensive
                                             industries often state that the primary operational risk to their businesses is unexpected failures of
                                             their assets. A wave of new data generated by the “internet of things” (IoT) can provide real-time
                                             Airlines are particularly interested in predicting mechanical failures in advance so that they can
                                             reduce flight delays or cancellations. Microsoft data scientists from the Cortana Intelligence Suite
                                             team are able to predict the probability of aircraft being delayed or canceled in the future based on
                                             relevant data sources, such as maintenance history and flight route information. A machine-learning
                                             solution based on historical data and applied in real time predicts the type of mechanical issue that
                                             will result in a delay or cancellation of a flight within the next 24 hours, allowing the airlines to take
                                             maintenance actions while the aircraft are being serviced, thus preventing possible delays or
                                             cancellations.
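
As an illustration of predicting a delay probability from maintenance history, the sketch below fits a small logistic regression by gradient descent. This is a swapped-in textbook technique, not the model the Microsoft team used, which the article does not describe. The two features (days since last maintenance divided by 100, and fault codes logged in the past week), the function names, and every data point are hypothetical assumptions.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression weights with batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_delay_probability(weights, bias, x):
    """Return the modeled probability that a flight is delayed or canceled."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical history: [days since last maintenance / 100, recent fault codes]
rows = [[0.1, 0], [0.2, 0], [0.3, 1], [0.4, 0],
        [0.6, 2], [0.7, 3], [0.8, 2], [0.9, 4]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = flight was delayed or canceled

weights, bias = fit_logistic(rows, labels)
low_risk = predict_delay_probability(weights, bias, [0.1, 0])
high_risk = predict_delay_probability(weights, bias, [0.9, 4])
print(f"Recently serviced aircraft: {low_risk:.2f}")
print(f"Overdue aircraft with faults: {high_risk:.2f}")
```

In practice an airline would rank aircraft by this probability and schedule maintenance for the highest-risk ones first. A production system would draw on far richer data and would be validated on flights the model has not seen.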
                                             Similar predictive-maintenance solutions are also built in other industries — for example, tracking
                                             real-time telemetry data to predict the remaining useful life of an aircraft engine, using sensor data to
                                             predict the failure of an ATM cash withdrawal transaction, employing telemetry data to predict the
                                             failure of electric submersible pumps used to extract crude in the oil and gas industry, predicting the
                                             failures of circuit boards at early stages in the manufacturing process, predicting credit defaults, and
                                             forecasting energy demand in hyperlocal regions to predict the overload situations of energy grids.
                                             Machine learning will make supply chains less brittle and reduce the effects of disruptions for many
                                             goods and services.
                                             These cases help highlight a few general principles:
                                             • The value derived from the analytics piece can greatly exceed the cost of the infrastructure. This
                                                 indicates there will be strong growth in big data consulting services and specialized roles within
                                                 firms.
                                             • Big data is less about size and more about introducing fundamentally new information to
                                                 prediction and decision processes. This information matters most when existing data sources are
                                                 insufficient to provide accurate or actionable predictions — for example, due to small sample sizes
                                                 or coarseness of historical sales (small effective regions, niche products, new offerings, etc.).
                                             • The new information is often buried in detailed and relatively unstructured data logs (known as a
                                                 “data lake”), and techniques from computer science are needed to extract insights from it. To
                                                 leverage big data, it is vital to have talented data engineers, statisticians, and behavioral scientists
                                                 working in tandem. “Data scientist” is often used to refer to someone who has these three skills,
                                             Radically new applications. The cases that we’ve discussed concern how big data can be employed to
                                             improve existing processes (e.g., more-precise demand forecasts, better price sensitivity estimates,
                                             better predictions of machine failure). But it also has the potential to be applied in ways that disrupt
                                             existing processes. For example, machine-learning models taking massive data sets as inputs,
                                             coupled with clever designs that account for patient histories, have the potential to revolutionize
                                             how certain diseases are diagnosed and treated. Another example involves matching distributed
                                             electricity generation (e.g., solar panels on roofs) to localized electricity demand, unlocking huge
                                             Predicting demand more accurately, pricing better, and predictive maintenance are the specific
                                             use cases that easily justify large firms’ investments in big data
                                             infrastructure and data science. These uses are likely to drive value of the same order of magnitude as
                                             the investments. The value of radically new applications is challenging to understand ex ante and
                                             speculative by nature. It is reasonable to expect losses for many firms, due to uncertain and higher
                                             Jacob LaRiviere is an economist at Microsoft Technology and Research, an adjunct professor at the University of
                                             Preston McAfee is a corporate vice president and the chief economist at Microsoft.
                                             Justin Rao is an economist at Microsoft Research and an affiliate faculty member at the University of Washington.
                                             Vijay K. Narayanan leads the Algorithms and Data Science Solutions unit of the Data Group at Microsoft.
Gu
t]is
an
                                                                                                                                                                                      an
                                             Walter Sun is the founder of Bing Predicts and a partner data scientist at Microsoft. He is an affiliate faculty member of
                                                                                                                                                                                   2[a
                                                                                                                                                                                   a_
/H
                                                                                                                                                                                  /H
                                             the University of Washington and an adjunct professor at Seattle University.
COPYRIGHT © 2016 HARVARD BUSINESS SCHOOL PUBLISHING CORPORATION. ALL RIGHTS RESERVED.
Reproduced with permission from the Publisher for use only in “Business Analytics using Data Mining [Term 6_PGP]” taught by “Professor Vandith Pamuru” at Indian School of Business-Mohali scheduled on “January 31 – March 03, 2022
Copyright 2016 Harvard Business Publishing. All Rights Reserved. Additional restrictions may apply including the use of this content as assigned course material. Please consult your institution's librarian about any restrictions that might apply under the license with your institution. For more information and teaching resources from Harvard Business Publishing including Harvard Business School Cases, eLearning products, and business simulations please visit hbsp.harvard.edu.
                                                                                                                                                   UV0871
                                             SCREENING FOR CHRONIC KIDNEY DISEASE
Chronic Kidney Disease (CKD) is a progressive condition that results in significant morbidity and mortality. Because of the important role the kidneys play in maintaining homeostasis, CKD can affect almost every body system. Early recognition and intervention are essential to slowing disease progression, maintaining quality of life, and improving outcomes. Family physicians have the opportunity to screen at-risk patients, identify affected patients, and ameliorate the impact of CKD by initiating early therapy and monitoring disease progression.1
The purpose of this case is to create an easy-to-use screening tool to identify patients at risk for CKD. Despite the wide availability and low cost of a test for CKD based on one or more blood samples, studies have shown that many in the at-risk population have not been tested. One reason for this is that awareness of CKD is low. Given the proven benefits of early detection and treatment, the need for some kind of screening tool is clear. Although there is no reason to test everyone, those patients with a high enough probability of having CKD should be tested. The purpose of this case is to see if those high-risk patients can be identified using easily obtainable patient data.
Since 1975, the National Center for Health Statistics of the Centers for Disease Control and Prevention has conducted nationwide surveys of U.S. adults. Using trained personnel, the center collected a wide variety of demographic and health information using direct interviews, examinations, and blood samples. The data set consists of selected information from 8,819 adults 20 years of age or older taken from the 1999–2000 and 2001–2002 surveys. The sample subjects were randomly divided into two pools: a 6,000-case training set and a 2,819-case validation sample.
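The random division described above can be illustrated with a minimal sketch. It assumes subjects are identified only by row index; the function name `split_train_validation` and the fixed seed are hypothetical choices made for illustration, not the procedure the case authors used.

```python
import random

def split_train_validation(n_subjects, n_train, seed=0):
    """Randomly partition subject indices 0..n_subjects-1 into two pools."""
    indices = list(range(n_subjects))
    rng = random.Random(seed)  # seed is an assumption, used only for reproducibility
    rng.shuffle(indices)       # random order, so pool membership is random
    training = sorted(indices[:n_train])
    validation = sorted(indices[n_train:])
    return training, validation

# Pool sizes taken from the case: 8,819 adults, 6,000 of them in the training set.
train_ids, validation_ids = split_train_validation(8819, 6000)
print(len(train_ids), len(validation_ids))  # 6000 2819
```

Keeping the validation pool untouched while the screening rule is built on the training pool is what lets the 2,819 held-out cases give an honest estimate of how the tool would perform on new patients.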
1 Catherine S. Snively, MD, and Cecilia Gutierrez, MD, “Chronic Kidney Disease: Prevention and Treatment of
This case was prepared by Professor Phillip E. Pfeifer and Professor Heejung Bang (Weill Cornell Medical College). It was written as a basis for class discussion rather than to illustrate effective or ineffective handling of an administrative situation. Copyright © 2007 by the University of Virginia Darden School Foundation, Charlottesville, VA. All rights reserved. To order copies, send an e-mail to sales@dardenbusinesspublishing.com. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise) without the permission of the Darden School Foundation.
This document is authorized for use only in Professor Vandith Pamuru's Business Analytics using Data Mining[PGP] at Indian School of Business (ISB) from Dec 2021 to Mar 2022.
                                                                                    -2-                                                            UV0871

A test for CKD was administered to everyone in the study population.² The variable of
interest is CKD, a 0/1 dummy variable indicating whether or not the subject had CKD. Exhibit 1
defines the 34 variables in the data set. Notice that variables in columns A through J are
demographic in nature, K through V were collected during the physical exam, and W through
AH are based, in part, on self-reported health histories.

The Causes of CKD³

The two main causes of chronic kidney disease are diabetes and high blood pressure,
which are responsible for up to two-thirds of the cases. Diabetes happens when your blood sugar
is too high, causing damage to many organs in your body, including the kidneys and heart, as
well as blood vessels, nerves, and eyes. High blood pressure, or hypertension, occurs when the
pressure of your blood against the walls of your blood vessels increases. If uncontrolled, or
poorly controlled, high blood pressure can be a leading cause of heart attacks, strokes, and
chronic kidney disease. Also, chronic kidney disease can cause high blood pressure.

Other conditions that affect the kidneys are:

   •    Glomerulonephritis, a group of diseases that cause inflammation and damage to the
        kidney’s filtering units. These disorders are the third most common type of kidney
        disease.

   •    Inherited diseases, such as polycystic kidney disease, which causes large cysts to form in
        the kidneys and damage the surrounding tissue.

   •    Malformations that occur as a baby develops in its mother’s womb. For example, a
        narrowing may occur that prevents normal outflow of urine and causes urine to flow back
        up to the kidney. This causes infections and may damage the kidneys.

   •    Lupus and other diseases that affect the body’s immune system.

² The test used a formula to estimate glomerular filtration rate based on measured serum creatinine
concentration, age, gender, and race. CKD was defined as estimated filtration rate less than 60 ml/min/1.73 m². For
details, see Heejung Bang, David A. Shoham, Philip J. Klemmer, Ronald J. Falk, Madhu Mazumdar, Debbie
Gipson, Romulo E. Colindres, and Abhijit V. Kshirsagar, “SCreening for Occult Renal Disease (SCORED): A
Simple Prediction Model for Chronic Kidney Disease,” Archives of Internal Medicine, 2007.

³ This section is excerpted from the National Kidney Foundation Web site (www.kidney.org), © 2007, National
Kidney Foundation, Inc., 30 East 33rd Street, New York, NY 10016.

Who Is at Risk?⁴

While anyone at any age can develop chronic kidney disease (CKD), a number of risk
factors have been identified that may lead to possible problems with your kidneys. These
include:

   •    Diabetes. Diabetes is the leading cause of CKD. If you have diabetes, talk with your
        doctor about how to keep your blood glucose as close to normal as possible to ensure
        your diabetes is under control.

   •    Hypertension. Hypertension, also called high blood pressure, is the second-highest cause
        of CKD. Keep your blood pressure under control. A number of effective medications are
        available to help you with this task. Your doctor will help you to determine which
        medication is right for you.

   •    Cardiovascular disease. In addition to hypertension, other diseases of the heart and
        blood vessels may increase your risk for kidney disease. People who have had heart
        attacks or strokes, congestive heart failure, coronary artery disease, or peripheral vascular
        disease need to be monitored carefully for kidney problems.

   •    Family history of kidney disease. Some kidney diseases are genetic. People with a
        mother, father, brother, or sister who has had a kidney disease are more likely to develop
        it.

   •    Age. People 60 years and older are at a higher risk for developing CKD.

   •    Race. People belonging to certain ethnic groups, such as First Nations (Canadian
        aboriginal peoples) and Pacific Islanders, are at a higher risk for developing this disease.

The Challenge

The list of risk factors above reflects the results of several separate studies. What
we want to do is figure out how to combine all the possible risk factors to measure the overall
risk of CKD.

The 34 variables in the data set are all easily obtained by a family physician during
routine checkups. Only the cholesterol measurements and the hemoglobin count (used to help
define anemia) require blood tests. The challenge is to find a way to use the
first 33 variables to predict the 34th. The idea would be to create something very simple (like the
quizzes in popular magazines) that would identify subjects at risk of having
CKD. The high-risk subjects would then be encouraged to have their serum creatinine levels
checked and/or undergo a complete urinalysis. The challenge here is strictly one of prediction.
The variables used need not cause CKD. They need only be indicators of the presence of CKD.

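
To make the idea of a magazine-style quiz concrete, here is a minimal sketch of an additive point score with a cutoff. Everything specific in it is an assumption made for illustration: the factor names, the point values, the cutoff, and the four toy subjects are hypothetical and come neither from the case data nor from any published screening tool. In practice the points would be derived from the training set and checked on the validation sample.

```python
# Hypothetical point values and cutoff, for illustration only.
HYPOTHETICAL_POINTS = {
    "age_60_or_older": 3,
    "hypertension": 2,
    "diabetes": 2,
    "cardiovascular_disease": 1,
    "family_history_ckd": 1,
}
HYPOTHETICAL_CUTOFF = 4

def risk_score(subject):
    """Add up the points for every risk factor the subject reports."""
    return sum(pts for factor, pts in HYPOTHETICAL_POINTS.items() if subject.get(factor))

def flag_high_risk(subject, cutoff=HYPOTHETICAL_CUTOFF):
    """A subject is flagged for follow-up testing when the score reaches the cutoff."""
    return risk_score(subject) >= cutoff

def sensitivity_specificity(subjects, has_ckd, cutoff=HYPOTHETICAL_CUTOFF):
    """Share of CKD cases flagged, and share of non-cases left unflagged.

    Assumes at least one case and one non-case in has_ckd (0/1 values).
    """
    flags = [flag_high_risk(s, cutoff) for s in subjects]
    true_pos = sum(1 for f, y in zip(flags, has_ckd) if f and y)
    true_neg = sum(1 for f, y in zip(flags, has_ckd) if not f and not y)
    positives = sum(has_ckd)
    negatives = len(has_ckd) - positives
    return true_pos / positives, true_neg / negatives

# Four made-up subjects with made-up outcomes.
subjects = [
    {"age_60_or_older": True, "hypertension": True},        # score 5
    {"diabetes": True},                                     # score 2
    {"age_60_or_older": True, "family_history_ckd": True},  # score 4
    {},                                                     # score 0
]
has_ckd = [1, 0, 0, 0]
sens, spec = sensitivity_specificity(subjects, has_ckd)
print(sens, round(spec, 3))
```

Because the task is purely predictive, a score like this is judged only by how well it separates subjects with and without CKD on held-out data, not by whether each factor is causal.
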
⁴ This section was excerpted from the Web site of the government of British Columbia (www.gov.bc.ca) on
18 June 2007, © 2001, Province of British Columbia.

Note also that the study population is not a random sample of U.S.
adults. Our predictions therefore will not apply directly to the U.S. population and should
not be used for actual decision-making.

To get us started, Exhibit 2 reports summary statistics for each of the numerical variables
in the 6,000-subject training set. These statistics are reported separately for those with and
without CKD, along with a t-statistic that tests the equality of the means for the two groups.
Of the 11 numerically scaled variables, age is the most significant predictor of CKD: the average
age of those with CKD is 73, compared to 47 for those without CKD.

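
For readers who want to see how such a statistic is formed, the sketch below computes a two-sample t-statistic for a difference in means. The case does not say which variant Exhibit 2 uses, so the Welch (unequal-variance) form here is an assumption, and the ages are made up for illustration rather than taken from the data set.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """t = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b), with sample variances."""
    n_a, n_b = len(group_a), len(group_b)
    std_err = sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    return (mean(group_a) - mean(group_b)) / std_err

# Made-up ages, for illustration only (not data from the case).
ages_with_ckd = [70, 75, 68, 80, 72]
ages_without_ckd = [45, 50, 38, 52, 47, 41]
t_stat = welch_t(ages_with_ckd, ages_without_ckd)
print(round(t_stat, 2))
```

A large absolute value indicates that the gap between the group means is large relative to the sampling noise, which is the sense in which a variable such as age is called a significant predictor.
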
For categorical variables, a chi-squared test of association is appropriate. Exhibit 3
reports the cross-tabulation counts as well as the calculated chi-squared statistics. Remember, the
degrees of freedom associated with each of these chi-squares depend on the number of categories
taken on by each variable. Remember also that subjects with missing values have been ignored
when constructing Exhibits 2 and 3. The most significant predictor of CKD from among the
categorical variables is hypertension. Of those with hypertension, 15.5% had CKD compared to
2.7% of those without hypertension. It also appears that Hispanics are under-represented in the CKD
population and whites are over-represented, and that those who list “noplace” as
where they get their health care are very unlikely to have CKD.

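
The chi-squared statistic and its degrees of freedom can be computed directly from a cross-tabulation, as in the sketch below. The counts are hypothetical: they were chosen only so that the CKD rates match the 15.5% and 2.7% quoted above, and they are not the counts in Exhibit 3. The function name `chi_squared` is likewise an assumption.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts.

    Assumes every row and column total is positive, so no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # Degrees of freedom depend only on the number of categories of each variable.
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = hypertension yes/no, columns = CKD yes/no.
# 155 of 1,000 is 15.5%; 27 of 1,000 is 2.7%.
table = [[155, 845],
         [27, 973]]
stat, dof = chi_squared(table)
print(round(stat, 1), dof)
```

A two-category variable crossed with the 0/1 CKD indicator gives one degree of freedom; a variable with more categories gives more, so statistics from different rows of Exhibit 3 are not directly comparable without taking this into account.
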
This document is authorized for use only in Professor Vandith Pamuru's Business Analytics using Data Mining[PGP] at Indian School of Business (ISB) from Dec 2021 to Mar 2022.
Exhibit 1

SCREENING FOR CHRONIC KIDNEY DISEASE

Variable Definitions

Col.  Variable      Definition
A     ID            Identification number
B     Age           Age (years)
C     Female        1 if female
D     Racegrp       Self-reported race/ethnic group (white, black, Hispanic, other)
E     Educ          1 if more than high school
F     Unmarried     1 if unmarried
G     Income        1 if household income is above the median
H     CareSource    Self-reported source of medical care (Dr./HMO, clinic, noplace, other)
I     Insured       1 if covered by health insurance
K     Height        Height (cm)
L     BMI
M     Obese         1 if BMI is greater than 30 kg/m²
O     SBP           Systolic blood pressure (max)
U     PVD           Peripheral vascular disease reflected by reduced SBP at the leg relative to the arm
V     Activity      Mostly sit (1); stand or walk a lot (2); lift light loads or climb stairs often (3);
Y     Hypertension  The presence of at least one of four indicators of high blood pressure
AC    Stroke        Self-reported response to "Has a doctor ever told you that you had a stroke?"
AD    CVD           Response to "Has a doctor ever told you that you had angina pectoris, myocardial infarction, or stroke?"
AG    Anemia        or hemoglobin at exam lower than 11 g/dL
Exhibit 2

SCREENING FOR CHRONIC KIDNEY DISEASE

Descriptive Statistics for Numerically Scaled Variables
(training-set data broken out by CKD groups)

                      CKD=0                         CKD=1
Variable    Average  Std Dev  Count     Average  Std Dev  Count     T-stat
Age           47.15    17.90   5536       73.05    11.71    464     -43.56
Weight        79.17    19.60   5432       77.74    19.25    435       1.49
Height       167.25    10.12   5433      165.29    10.41    428       3.77
BMI           28.24     6.22   5377       28.35     5.98    417      -0.36
Waist         96.54    15.24   5365      100.10    14.44    420      -4.85
SBP          124.27    20.14   5352      141.47    25.28    442     -13.94
DBP           71.86    12.24   5318       67.73    14.28    430       5.83
HDL           51.97    15.79   5529       50.08    16.18    463       2.41
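
The T-stat column can be checked from the averages, standard deviations, and counts alone. The sketch below is an illustration, not the case's stated method: it assumes an unequal-variance (Welch) two-sample t-statistic, and the function name `welch_t` is hypothetical. Under that assumption the Age and Weight rows come out close to the published -43.56 and 1.49; small differences are expected because the exhibit values are rounded.

```python
import math


def welch_t(mean0, sd0, n0, mean1, sd1, n1):
    """Unequal-variance (Welch) t-statistic from group summary statistics.

    Assumption: the exhibit does not say which test was used; this form
    is chosen because it reproduces the rows checked below.
    """
    standard_error = math.sqrt(sd0 ** 2 / n0 + sd1 ** 2 / n1)
    return (mean0 - mean1) / standard_error


# Values copied from Exhibit 2: (average, std dev, count) for CKD=0, then CKD=1.
age_t = welch_t(47.15, 17.90, 5536, 73.05, 11.71, 464)
weight_t = welch_t(79.17, 19.60, 5432, 77.74, 19.25, 435)

print(f"Age t-stat:    {age_t:.2f}")
print(f"Weight t-stat: {weight_t:.2f}")
```

A large negative value for Age means the CKD=1 group is much older on average; a value near zero, as for BMI, means the two group averages are hard to tell apart.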
Exhibit 3

SCREENING FOR CHRONIC KIDNEY DISEASE

CrossTabs for Categorical Variables
Training Set Data

                          Variable=0                  Variable=1
Variable            CKD=0  CKD=1  % CKD=1      CKD=0  CKD=1  % CKD=1    Chi-square
Female               2655    210    7.3%*       2881    254    8.1%            1.3
Educ                 3064    308    9.1%        2458    155    5.9%           21.1
Unmarried            3335    227    6.4%        1926    211    9.9%           23.1
Income               2723    293    9.7%        2088    104    4.7%           44.5
Insured              1137     17    1.5%        4329    439    9.2%           78.2
Dyslipidemia         4951    414    7.7%         585     50    7.9%            0.0
PVD                  5379    395    6.8%         157     69   30.5%          171.1
Poor Vision          4932    355    6.7%         277     60   17.8%           57.0
Fam Hypertension     4231    388    8.4%        1305     76    5.5%           12.5
Diabetes             4998    334    6.3%         537    130   19.5%          145.3

*Read: Of the subjects who were not female, 7.3% (210) had CKD. Of the females, 8.1% (254) had CKD.
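
The percentages and chi-square values in this exhibit can be recomputed from the four counts in each row. The sketch below is illustrative only: it assumes a plain Pearson chi-square with no continuity correction (the exhibit does not name the variant), and the helper name `pearson_chi_square` is hypothetical. Under that assumption the Female and PVD rows land close to the published 1.3 and 171.1.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic (no continuity correction) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the variable and CKD status were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Counts copied from Exhibit 3. Rows: variable=0, variable=1. Columns: CKD=0, CKD=1.
female = [[2655, 210], [2881, 254]]
pvd = [[5379, 395], [157, 69]]

# Share of each group with CKD, as in the exhibit's footnote.
rate_not_female = 210 / (2655 + 210)
rate_female = 254 / (2881 + 254)

chi2_female = pearson_chi_square(female)
chi2_pvd = pearson_chi_square(pvd)

print(f"CKD rate, not female: {rate_not_female:.1%}")
print(f"CKD rate, female:     {rate_female:.1%}")
print(f"Chi-square, Female:   {chi2_female:.1f}")
print(f"Chi-square, PVD:      {chi2_pvd:.1f}")
```

A small statistic, as for Female, indicates that CKD rates barely differ between the two groups; a large one, as for PVD, indicates a strong association in the training data.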
This document is authorized for use only in Professor Vandith Pamuru's Business Analytics using Data Mining[PGP] at Indian School of Business (ISB) from Dec 2021 to Mar 2022.
                                                                                                                                                                                UV0745
                                                                                                                                                                       Rev. Mar. 28, 2018
                                                               Cluster Analysis for Segmentation
             Introduction
                 We all understand that consumers are not all alike. This poses a challenge for the development and
             marketing of profitable products and services. Not every offering will be right for every customer, nor will
             every customer be equally responsive to your marketing efforts. Segmentation is a way of organizing
             customers into groups with similar traits, product preferences, or expectations. Once segments are identified,
             marketing messages and in many cases even products can be customized for each segment. The better the
             segment(s) chosen for targeting by a particular organization, the more successful the organization is assumed
             to be in the marketplace. Since its introduction in the late 1950s, market segmentation has become a central
             concept of marketing practice.
                 Segments are constructed on the basis of customers’ (1) demographic characteristics, (2) psychographics,
             (3) desired benefits from products/services, and (4) past-purchase and product-use behaviors. These days,
             most firms possess rich information about customers’ actual purchase behavior and their geodemographic and
             psychographic characteristics. In cases where firms do not have access to detailed information about each
             customer, information from surveys of a representative sample of the customers can be used as the basis for
             segmentation.
             An Example
                 Consider Geico, an auto insurance company. Suppose Geico hypothetically plans to customize its auto
             insurance offerings and needs to understand what its customers view as important from their insurance
             provider. Geico can ask its customers to rate how important the following two attributes are to them:

                         savings on premium
                         having a neighborhood agent
                 The importance of the attributes is measured using a seven-point Likert-type scale, where a rating of one
             represents not important and seven represents very important. Unless every respondent who is surveyed gives
             identical ratings, the data will contain variations that you can use to cluster or group respondents together, and
             such clusters are the segments. Customers are most similar to each other if they are part of
             the same segment and most different from each other if they are part of different segments. By inference,
             This technical note was prepared by Rajkumar Venkatesan, Associate Professor of Business Administration. Copyright © 2007 by the University of
             Virginia Darden School Foundation, Charlottesville, VA. All rights reserved. To order copies, send an email to sales@dardenbusinesspublishing.com. No
             part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means—electronic, mechanical, photocopying,
             recording, or otherwise—without the permission of the Darden School Foundation. Our goal is to publish materials of the highest quality, so please submit any
             errata to editorial@dardenbusinesspublishing.com.
             then, actions taken toward customers in the same segment should lead to similar responses, and actions taken
             toward customers in different segments should lead to different responses.
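
As a rough illustration of this idea, the sketch below compares average distances within and between segments. The six respondents, their ratings on the seven-point scale, and their segment labels are all invented for illustration and are not data from this note; Euclidean distance is an assumed choice of measure.

```python
from itertools import combinations
from math import dist
from statistics import mean

# Hypothetical ratings on the 1-7 scale: (premium savings, neighborhood agent).
ratings = {
    "r1": (7, 1), "r2": (6, 2),  # savings matter, agent does not
    "r3": (2, 7), "r4": (1, 6),  # agent matters, savings do not
    "r5": (7, 7), "r6": (6, 6),  # both matter
}
# Assumed segment labels, supplied by hand for illustration only.
segment = {"r1": "A", "r2": "A", "r3": "B", "r4": "B", "r5": "C", "r6": "C"}

within, between = [], []
for a, b in combinations(ratings, 2):
    d = dist(ratings[a], ratings[b])  # Euclidean distance between two respondents
    if segment[a] == segment[b]:
        within.append(d)
    else:
        between.append(d)

avg_within = mean(within)
avg_between = mean(between)
print(f"average distance within segments:  {avg_within:.2f}")
print(f"average distance between segments: {avg_between:.2f}")
```

If the segments are well formed, the within-segment average should come out clearly smaller than the between-segment average.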
                 Another way of saying this is that the aspects of auto insurance that are important to any given customer
             in one segment will also be important to other customers in that same segment. Furthermore, those aspects
             that are important to that customer will be different from what is important to a customer in a different
             segment. Figure 1 shows what the analysis in this example might look like:
                                                          Figure 1. Segmentation of Geico customers.

             [Figure: two-dimensional map. Vertical axis: Premium Savings, from Not Important (bottom) to Very Important (top).
             Horizontal axis: Agent, from Not Important (left) to Very Important (right). Segment A (49%) lies in the upper left,
             Segment C (15%) in the upper right, and Segment B (36%) in the lower right.]

                 The analysis shows three distinct segments. Nearly half of Geico’s customers (Segment A, 49%) prefer
             savings on their premium, and they do not prefer having a neighborhood agent. Customers who belong to
             Segment B (about 36%) prefer having a neighborhood agent, and premium savings are not important to them.
             Some customers (Segment C, 15%) prefer both savings on their premium and a neighborhood
             agent. This analysis shows that Geico can benefit by adding an offline channel (i.e., developing a network of
             neighborhood agents) to serve Segment B and also charge a higher premium to them for providing this
             convenience. Of course, the caveat is the increased competition with other insurance providers, such as
             Cluster Analysis
                 Cluster analysis is a class of statistical techniques that can be applied to data that exhibit natural
             groupings. Cluster analysis makes no distinction between dependent and independent variables. The entire set
             of interdependent relationships is examined. Cluster analysis sorts through the raw data on customers and
             groups them into clusters. A cluster is a group of relatively homogeneous customers. Customers who belong to
             the same cluster are similar to each other. They are also dissimilar to customers outside the cluster,
             particularly customers in other clusters. The primary input for cluster analysis is a measure of similarity
             between customers, such as correlation coefficients, distance measures, and association coefficients.
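
To show what these three kinds of measures look like in practice, here is a minimal sketch. The customer profiles are hypothetical, and the specific formulas chosen (Euclidean distance, Pearson correlation, and the simple matching coefficient for yes/no data) are common examples of each family, assumed here for illustration rather than taken from this note.

```python
from math import sqrt

def euclidean_distance(x, y):
    """Distance measure: smaller values mean more similar customers."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Correlation coefficient: values near +1 mean similar rating patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def simple_matching(x, y):
    """Association coefficient: share of yes/no attributes on which two customers agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Hypothetical 1-7 importance ratings on five attributes.
cust_1 = [7, 6, 2, 1, 5]
cust_2 = [6, 7, 1, 2, 5]
cust_3 = [1, 2, 7, 6, 3]

# Hypothetical yes/no answers (1 = yes, 0 = no) on four attributes.
flags_1 = [1, 0, 1, 1]
flags_2 = [1, 0, 0, 1]

print("distance 1 vs 2:", euclidean_distance(cust_1, cust_2))
print("distance 1 vs 3:", euclidean_distance(cust_1, cust_3))
print("correlation 1 vs 2:", pearson_correlation(cust_1, cust_2))
print("correlation 1 vs 3:", pearson_correlation(cust_1, cust_3))
print("matching 1 vs 2:", simple_matching(flags_1, flags_2))
```

In this made-up data, customers 1 and 2 are close on every measure, while customer 3 is far from customer 1 and negatively correlated with it.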
                   The following are the basic steps involved in cluster analysis:
                   1. Formulate the problem—select the variables you want to use as the basis for clustering.
                   2. Compute the distance between customers along the selected variables.
                   3. Apply the clustering procedure to the distance measures.
                   4. Decide on the number of clusters.
                   5. Map and interpret clusters—draw conclusions—illustrative techniques like perceptual maps are
22
                                                                                                                                  02
                      useful.

Distance Measures

The main input into any cluster analysis procedure is a measure of distance between individuals who are
being clustered. The objective of a distance measure is to quantify the difference between two individuals on
the variables you are using for the segmentation. A shorter (longer) distance between two individuals would
imply they have similar (dissimilar) preferences on the segmentation variables. Distance between two
individuals is obtained through a measure called Euclidean distance. If two individuals, Joe and Sam, are being
clustered on the basis of n variables, then the Euclidean distance between Joe and Sam is represented as:
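
\[
\text{Euclidean distance} = \sqrt{(x_{\text{Joe},1} - x_{\text{Sam},1})^2 + \cdots + (x_{\text{Joe},n} - x_{\text{Sam},n})^2}
\]

where:

x_{Joe,i} = the value of Joe on variable i (i = 1, ..., n)
x_{Sam,i} = the value of Sam on variable i (i = 1, ..., n)
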
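As an illustration of the formula, here is a minimal Python sketch of the Euclidean distance for any number of variables. The function name `euclidean_distance` and the three-variable ratings are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two individuals rated on the same n variables."""
    if len(a) != len(b):
        raise ValueError("both individuals need a value for every variable")
    # Sum the squared differences variable by variable, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical ratings on n = 3 variables (not taken from the text).
joe = (1, 2, 3)
sam = (4, 6, 3)
print(euclidean_distance(joe, sam))  # sqrt(9 + 16 + 0) = 5.0
```
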
A pairwise distance matrix among individuals who are being clustered can be created using the Euclidean
distance measure. Extending the preceding example, consider three individuals (Joe, Sam, and Sara) who
are being clustered based on their preference for Premium Savings and a Neighborhood Agent. The
importance ratings on these two attributes for Joe, Sam, and Sara are shown in Table 1.

Table 1. Importance ratings.

          Premium Savings    Neighborhood Agent
Joe              4                   7
Sam              3                   4
Sara             5                   3

This document is authorized for use only in Professor Vandith Pamuru's Business Analytics using Data Mining[PGP] at Indian School of Business (ISB) from Dec 2021 to Mar 2022.

The Euclidean distance between Joe and Sam is obtained as:

\[
\text{Euclidean distance (Joe, Sam)} = \sqrt{(4 - 3)^2 + (7 - 4)^2} = \sqrt{10} \approx 3.2
\]

The first term in this Euclidean distance measure is the squared difference between Joe and Sam on the
importance score for Premium Savings, and the second term is the squared difference between them on the
importance score for Neighborhood Agent. The Euclidean distances are then computed for each pairwise
combination of the three individuals being clustered to obtain a pairwise distance matrix. The pairwise
distance matrix for Joe, Sam, and Sara is shown in Table 2.

Table 2. Pairwise distance matrix.

          Joe     Sam     Sara
Joe        0      3.2     4.1
Sam                0      2.2
Sara                       0

The distance between Joe and Sam is 3.2, as shown in Table 2. This pairwise distance matrix is then provided
as an input to a clustering algorithm.
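
The pairwise distance matrix can be reproduced with a short Python sketch. It uses the ratings from Table 1 and the standard-library function `math.dist`; the dictionary layout and variable names are illustrative assumptions, not part of the original note.

```python
import math

# Importance ratings (Premium Savings, Neighborhood Agent) from Table 1.
ratings = {"Joe": (4, 7), "Sam": (3, 4), "Sara": (5, 3)}
names = list(ratings)

# Full symmetric matrix; math.dist (Python 3.8+) is the Euclidean distance.
# Values are rounded to one decimal place, as in Table 2.
matrix = {
    a: {b: round(math.dist(ratings[a], ratings[b]), 1) for b in names}
    for a in names
}

for a in names:
    print(a.ljust(5), [matrix[a][b] for b in names])
```

Table 2 shows only the upper triangle because the matrix is symmetric: the distance from Joe to Sam equals the distance from Sam to Joe.
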

K-means clustering belongs to the nonhierarchical class of clustering algorithms. It is one of the more
popular algorithms used for clustering in practice because of its simplicity and speed. It is considered to be
more robust to different types of variables, is more appropriate for large datasets that are common in
marketing, and is less sensitive to some customers who are outliers (in other words, extremely different from
others).

For K-means clustering, the user has to specify the number of clusters required before the clustering
algorithm is run.

Algorithm

5. Repeat the two previous steps until some convergence criterion is met. Usually the convergence
   criterion is that the assignment of customers to clusters has not changed over multiple iterations.
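
The iteration described in the last step can be sketched in Python as follows. This is a minimal sketch of the standard K-means procedure (Lloyd's algorithm), under the assumption that the two repeated steps are "assign each customer to the nearest centroid" and "recompute each centroid". The function name `k_means`, the choice of the first K points as starting centroids, and the three extra customers are all hypothetical.

```python
import math

def k_means(points, k, max_iter=100):
    """Plain K-means (Lloyd's algorithm) on a list of equal-length tuples."""
    # Assumption: use the first k points as initial centroids so the run is
    # reproducible; practical implementations usually pick them at random.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign each customer to the nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Convergence criterion used here: stop as soon as one full pass
        # leaves every assignment unchanged.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each centroid as the per-dimension mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Joe, Sam, and Sara from the text, followed by three hypothetical customers.
ratings = [(4, 7), (3, 4), (5, 3), (1, 1), (2, 1), (6, 7)]
labels, centers = k_means(ratings, k=2)
print(labels)   # cluster index per customer
print(centers)  # one centroid per cluster
```

Because the result depends on the starting centroids, it is common to rerun the algorithm from several starting points and keep the best solution.
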

A cluster centroid is simply the average of all the points in that cluster. Its coordinates are the arithmetic
mean for each dimension separately over all the points in the cluster. Consider Joe, Sam, and Sara in the
previous example. Let’s represent them based on their importance ratings on Premium Savings and
Neighborhood Agent as: Joe = {4,7}, Sam = {3,4}, Sara = {5,3}. If you assume that they belong to the same
cluster, then the center for their cluster is obtained as:

\[
(z_1, z_2) = \left( \frac{4 + 3 + 5}{3}, \; \frac{7 + 4 + 3}{3} \right) = (4, \; 4.67)
\]

z_1 is measured as the average of the ratings of Joe, Sam, and Sara on Premium Savings. Similarly, z_2 is
measured as the average of their ratings on Neighborhood Agent. Figure 2 provides a visual representation
of K-means clustering.

Figure 2. Visual representation of K-means clustering.
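
The centroid calculation for Joe, Sam, and Sara can be checked with a few lines of Python. The sketch uses `statistics.mean` from the standard library; the variable names are illustrative.

```python
from statistics import mean

# Ratings as (Premium Savings, Neighborhood Agent), from the text.
cluster = [(4, 7), (3, 4), (5, 3)]  # Joe, Sam, Sara

# zip(*cluster) regroups the points by dimension, so each mean is per variable.
centroid = tuple(mean(dim) for dim in zip(*cluster))
z1, z2 = centroid
print(z1, round(z2, 2))  # average Premium Savings, average Neighborhood Agent
```
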

Number of clusters

One of the main issues with K-means clustering is that it does not provide an estimate of the number of
clusters that exist in the data. K-means clustering has to be repeated several times with different “Ks”
(or number of clusters) to determine the number of clusters that is appropriate for the data. A commonly
used approach for making this choice is the elbow criterion.

The elbow criterion states that you should choose a number of clusters so that adding another cluster
does not add sufficient information. The elbow is identified by plotting the ratio of the within-cluster variance
to the between-cluster variance against the number of clusters. The within-cluster variance is an estimate of
the average of the variance in the variables used as a basis for segmentation (Importance Score ratings for
Premium Savings and Neighborhood Agent in the Geico example) among customers who belong to a particular
cluster. The between-cluster variance is an estimate of the variance of the segmentation basis variables between
customers who belong to different segments. The objective of cluster analysis (as mentioned before) is to
minimize the within-cluster variance and maximize the between-cluster variance. Therefore, as the number of
clusters increases, the ratio of the within-cluster variance to the between-cluster variance will keep
decreasing.

But at some point, the marginal gain from adding an additional cluster will drop, giving an angle in the
graph (the elbow). In Figure 3, the elbow is indicated by the circle. The number of clusters chosen should
therefore be 3.
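
The quantity plotted on the vertical axis of an elbow plot can be sketched in Python. The text does not give an exact formula, so this sketch assumes a common sum-of-squares version: the within-cluster term is the squared distance of each customer to its own centroid, and the between-cluster term is the size-weighted squared distance of each centroid to the overall mean. The function names, the three extra customers, and the two hard-coded cluster assignments are hypothetical.

```python
import math

def centroid(points):
    """Per-dimension arithmetic mean of a list of equal-length tuples."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def within_between_ratio(points, labels):
    """Ratio of within-cluster to between-cluster sum of squares.

    Assumes at least two distinct clusters; with a single cluster the
    between-cluster term is zero and the ratio is undefined.
    """
    grand = centroid(points)
    within = between = 0.0
    for c in set(labels):
        members = [p for p, a in zip(points, labels) if a == c]
        center = centroid(members)
        within += sum(math.dist(p, center) ** 2 for p in members)
        between += len(members) * math.dist(center, grand) ** 2
    return within / between

# Joe, Sam, and Sara from the text, followed by three hypothetical customers.
ratings = [(4, 7), (3, 4), (5, 3), (1, 1), (2, 1), (6, 7)]
# Hypothetical assignments standing in for K-means output at K = 2 and K = 3.
labels_k2 = [0, 1, 1, 1, 1, 0]
labels_k3 = [0, 1, 1, 2, 2, 0]

ratio_k2 = within_between_ratio(ratings, labels_k2)
ratio_k3 = within_between_ratio(ratings, labels_k3)
print(round(ratio_k2, 3), round(ratio_k3, 3))  # the ratio falls as K grows
```

Repeating this for a range of K values and plotting the ratios against K gives the kind of curve in which the elbow is read off.
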
Figure 3. Elbow plot for determining number of clusters.

[Elbow plot. Horizontal axis: 1 through 7 (number of clusters, per the caption). Vertical axis: ratio of within-cluster to between-cluster variance, scaled from 0 to 300.]

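As a rough illustration of the quantity on the vertical axis of Figure 3, the sketch below computes a within-to-between ratio for a given assignment of points to clusters. It works with sums of squared distances rather than normalized variances, which is an assumption, since the exact normalization behind the plot is not spelled out here. The function names and the four sample points are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_between_ratio(points, labels):
    """Within-cluster sum of squares divided by between-cluster sum of squares."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    grand = centroid(points)
    within = 0.0
    between = 0.0
    for members in clusters.values():
        center = centroid(members)
        within += sum(sq_dist(p, center) for p in members)
        between += len(members) * sq_dist(center, grand)
    if between == 0:
        raise ValueError("between-cluster sum of squares is zero (e.g., one cluster)")
    return within / between


# Hypothetical data: four customers described by two variables.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
two_clusters = within_between_ratio(points, [0, 0, 1, 1])
three_clusters = within_between_ratio(points, [0, 1, 2, 2])
print(two_clusters, three_clusters)
```

Computing this ratio for each candidate number of clusters and plotting the results gives an elbow plot. With a single cluster the between-cluster sum of squares is zero, so the sketch raises an error rather than returning a ratio.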
The initial assignment of cluster seeds also has a bearing on the final model performance. Some common methods for checking the stability of the results obtained from K-means clustering include:

• Running the algorithm multiple times with different starting values. When random starting points are used, each run begins from a different set of seeds.
• Splitting the data randomly into two halves and running the cluster analysis separately on each half. The results are robust and stable if the number of clusters and the relative sizes of the clusters are similar in both halves.

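The two stability checks listed above can be sketched as follows. This is a minimal pure-Python K-means written for illustration only, not the procedure used by any particular software package. The function names, the number of restarts, and the synthetic three-group data are all assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def kmeans(points, k, rng, max_iter=100):
    """One K-means run seeded with k randomly chosen data points.

    Returns (centers, labels, within-cluster sum of squares).
    """
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = centroid(members)
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, wss


def best_of_restarts(points, k, n_starts=10, seed=0):
    """Run K-means from several random starts; keep the lowest-WSS run."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])


def split_half_shares(points, k, seed=0):
    """Cluster each random half separately; return sorted cluster shares."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    shares = []
    for half in (shuffled[:mid], shuffled[mid:]):
        _, labels, _ = best_of_restarts(half, k, seed=seed)
        shares.append(sorted(labels.count(c) / len(half) for c in range(k)))
    return shares


# Hypothetical data: 40 points scattered around each of three centers.
data_rng = random.Random(42)
blob_centers = [(0, 0), (10, 0), (5, 10)]
points = [(cx + data_rng.uniform(-1, 1), cy + data_rng.uniform(-1, 1))
          for cx, cy in blob_centers for _ in range(40)]

single = best_of_restarts(points, k=3, n_starts=1)
best = best_of_restarts(points, k=3, n_starts=10)
print("WSS, 1 start:", round(single[2], 2), "| best of 10:", round(best[2], 2))
print("Cluster shares per half:", split_half_shares(points, k=3))
```

If the sorted cluster shares of the two halves are close to each other, the solution passes the split-half check described above. Keeping the run with the lowest within-cluster sum of squares is one common way to choose among restarts.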
             Profiling Clusters
Once clusters are identified, describing them in terms of the variables used for clustering, or with additional data such as demographics, helps to customize the marketing strategy for each segment. This process of describing the clusters is called profiling. Figure 1 is an example of such a process. Much cluster-analysis software also reports which cluster each customer belongs to. This information can be used to calculate the means of the profiling variables for each cluster. In the Geico example, it is useful to investigate whether the segments also differ with respect to demographic variables such as age and income. Table 3 shows the distribution of age and income for Segments A, B, and C from Figure 1.

Table 3. Age and income distribution for segments.

Segment    Mean age    Mean income ($)    Age range    Income range ($)
A          21          15,000             16–25        0–25,000
B          45          120,000            33–55        75,000–215,000
C          39          40,000             39–54        24,000–60,000

Mean is the average age and income of the customers belonging to a particular segment. Range gives the minimum and maximum values of age and income for customers in a segment. Whereas the mean is useful for identifying the central tendency of a segment, the range helps in evaluating whether the segments overlap with regard to the profile variable.

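A short sketch of both steps follows: computing the mean and range of each profiling variable per segment, and checking whether segment ranges overlap. The ranges in `TABLE_3_RANGES` are copied from Table 3. The individual customer records, the function names, and the field names are hypothetical and serve only to show the calculation.

```python
from itertools import combinations

# (min, max) ranges copied from Table 3.
TABLE_3_RANGES = {
    "A": {"age": (16, 25), "income": (0, 25_000)},
    "B": {"age": (33, 55), "income": (75_000, 215_000)},
    "C": {"age": (39, 54), "income": (24_000, 60_000)},
}


def profile(customers, variables):
    """Mean and (min, max) of each profiling variable, per segment."""
    by_segment = {}
    for row in customers:
        by_segment.setdefault(row["segment"], []).append(row)
    summary = {}
    for segment, rows in sorted(by_segment.items()):
        summary[segment] = {}
        for var in variables:
            values = [row[var] for row in rows]
            summary[segment][var] = {
                "mean": sum(values) / len(values),
                "range": (min(values), max(values)),
            }
    return summary


def ranges_overlap(r1, r2):
    """True if two closed intervals (min, max) share at least one value."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]


def overlapping_pairs(ranges, variable):
    """Pairs of segments whose ranges overlap on the given variable."""
    return [(s1, s2) for s1, s2 in combinations(sorted(ranges), 2)
            if ranges_overlap(ranges[s1][variable], ranges[s2][variable])]


# Hypothetical customer records with a cluster label attached.
customers = [
    {"segment": "A", "age": 18, "income": 10_000},
    {"segment": "A", "age": 24, "income": 20_000},
    {"segment": "B", "age": 40, "income": 100_000},
    {"segment": "B", "age": 50, "income": 140_000},
]

summary = profile(customers, ["age", "income"])
age_overlaps = overlapping_pairs(TABLE_3_RANGES, "age")
income_overlaps = overlapping_pairs(TABLE_3_RANGES, "income")
print(summary["A"], age_overlaps, income_overlaps)
```

Applied to the Table 3 ranges, the check flags Segments B and C as overlapping on age, and Segments A and C as overlapping on income (only in the 24,000–25,000 band).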
From Table 3, you see that Segment A customers, who prefer high savings on their premium and do not prefer having a neighborhood agent, tend to be younger and have low incomes. These could be college students or recent graduates who are more comfortable transacting online. Customers who belong to Segment B, on the other hand, are older and have higher income levels. It would be interesting to evaluate whether these customers also tend to be married with kids. The security of having a neighborhood agent who can help in case of an accident or emergency is very important to them, and they do not mind paying a higher price for this sense of security. These customers may also be uncomfortable transacting (or providing personal information) online.

Finally, while Segment C customers are as old as Segment B customers, they tend to have lower incomes and do not prefer to have a neighborhood agent (probably because of low disposable incomes). Identifying the segments through these demographic characteristics enables a marketer to target each segment and to customize communications to it. For example, if Geico decides to develop a network of neighborhood agents, it can first focus on neighborhoods (identified through their zip codes) that match the profile of Segment B customers.

             Conclusion
Given a segmentation basis, the K-means clustering algorithm identifies clusters and the customers that belong to each cluster. Management, however, has to carefully select the variables to use for segmentation. Criteria frequently used for evaluating the effectiveness of a segmentation scheme include identifiability, sustainability, accessibility, and actionability.¹ Identifiability refers to the extent to which managers can recognize segments in the marketplace. In the Geico example, the profiling of customers allows you to identify customer segments through their age and income information. PRIZM and ACORN are popular databases that provide geodemographic information that can be used for segmentation as well as profiling. The sustainability criterion is satisfied if the segments represent a large enough portion of the market to ensure profitable customization of the marketing program. The extent to which managers can reach the identified segments through their marketing campaigns is captured by the accessibility criterion. Finally, actionability refers to whether customers in the segment and the marketing mix necessary to satisfy their needs are consistent with the goals and core competencies of the firm. The success of any segmentation process therefore requires

               1 For more details, refer to Wagner Kamakura and Michel Wedel, Market Segmentation: Conceptual and Methodological Foundations, 2nd ed. (Norwell, MA: