Business Analytics Using Data Mining: Term 6
Professor Vandith Pamuru
2021-22

DISCLAIMER: The academic course pack contains copyrighted materials which are meant to be downloaded only by authorized users for their course work. Please note that access is made available only for the duration of the course. Sharing access with anyone (by copying, forwarding, or other means) is a violation of copyright law and is strictly prohibited.

Table of Contents

S.No  Topic
1  A Predictive Analytics Primer
2  Where Predictive Analytics Is Having the Biggest Impact
3  Screening for Chronic Kidney Disease
4  Link: https://www.predictiveanalyticsworld.com/machinelearningtimes/12-predictive-analytics-screw-ups/2049/
5  Cluster Analysis for Segmentation
6  Link: https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf

* Reference book available at the LRC: Data Mining for Business Analytics: Concepts, Techniques, and Applications

ANALYTICS

A Predictive Analytics Primer

by Thomas H. Davenport
SEPTEMBER 02, 2014

No one has the ability to capture and analyze data from the future. However, there is a way to predict the future using data from the past. It’s called predictive analytics, and organizations do it every day.

Has your company, for example, developed a customer lifetime value (CLTV) measure? That’s using predictive analytics to determine how much a customer will buy from the company over time. Do you have a “next best offer” or product recommendation capability? That’s an analytical prediction of the product or service that your customer is most likely to buy next. Have you made a forecast of next quarter’s sales? Used digital marketing models to determine what ad to place on what

Predictive analytics are gaining in popularity, but what do you—a manager, not an analyst—really need to know in order to interpret results and make better decisions? How do your data scientists do what they do? By understanding a few basics, you will feel more comfortable working with and communicating with others in your organization about the results and recommendations from predictive analytics. The quantitative analysis isn’t magic—but it is normally done with a lot of past data, a little statistical wizardry, and some important assumptions. Let’s talk about each of these.

The Data: Lack of good data is the most common barrier to organizations seeking to employ predictive analytics. To make predictions about what customers will buy in the future, for example, you need to have good data on who is buying (which may require a loyalty program, or at least a lot of analysis of their credit cards), what they have bought in the past, the attributes of those products (attribute-based predictions are often more accurate than the “people who buy this also buy this” type of model), and perhaps some demographic attributes of the customer (age, gender, residential location, socioeconomic status, etc.). If you have multiple channels or customer touchpoints, you need to make sure that they capture data on customer purchases in the same way your previous channels did.

COPYRIGHT © 2014 HARVARD BUSINESS SCHOOL PUBLISHING CORPORATION. ALL RIGHTS RESERVED.
Reproduced with permission from the Publisher for use only in “Business Analytics using Data Mining [Term 6_PGP]” taught by “Professor Vandith Pamuru” at Indian School of Business-Mohali scheduled on “January 31 – March 03, 2022”

All in all, it’s a fairly tough job to create a single customer data warehouse with unique customer IDs on everyone, and all past purchases customers have made through all channels. If you’ve already done that, you’ve got an incredible asset for predictive customer analytics.

The Statistics: Regression analysis in its various forms is the primary tool that organizations use for predictive analytics. It works like this in general: An analyst hypothesizes that a set of independent variables (say, gender, income, visits to a website) are statistically correlated with the purchase of a product for a sample of customers. The analyst performs a regression analysis to see just how correlated each variable is; this usually requires some iteration to find the right combination of variables and the best model. Let’s say that the analyst succeeds and finds that each variable in the model is important in explaining the product purchase, and together the variables explain a lot of variation in the product’s sales. Using that regression equation, the analyst can then use the regression coefficients—the degree to which each variable affects the purchase behavior—to create a score predicting the likelihood of the purchase.

Voila! You have created a predictive model for other customers who weren’t in the sample. All you have to do is compute their score, and offer the product to them if their score exceeds a certain level. It’s quite likely that the high scoring customers will want to buy the product—assuming the analyst did the statistical work well and that the data were of good quality.

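The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the article's method: it assumes a logistic regression (one common form of regression for yes/no purchase outcomes), and the coefficient values, the variable names (`is_female`, `income_thousands`, `site_visits`), and the 0.5 cutoff are all hypothetical.

```python
import math

# Hypothetical coefficients. In practice they would come from fitting a
# logistic regression on the sample of customers.
INTERCEPT = -4.0
COEFFICIENTS = {
    "is_female": 0.3,          # 1 if female, 0 otherwise
    "income_thousands": 0.02,  # annual income, in thousands
    "site_visits": 0.25,       # website visits in the last month
}


def purchase_score(customer: dict) -> float:
    """Return a score between 0 and 1 estimating purchase likelihood."""
    linear = INTERCEPT + sum(
        coef * customer[name] for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))  # logistic function


def should_offer(customer: dict, threshold: float = 0.5) -> bool:
    """Offer the product only if the score exceeds the chosen level."""
    return purchase_score(customer) > threshold


# Customers who were not in the original sample (invented values).
new_customers = [
    {"is_female": 1, "income_thousands": 90, "site_visits": 12},
    {"is_female": 0, "income_thousands": 40, "site_visits": 1},
]

for customer in new_customers:
    print(round(purchase_score(customer), 3), should_offer(customer))
```

The cutoff is a business choice rather than a statistical one: lowering it reaches more potential buyers at the cost of more wasted offers.
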
The Assumptions: That brings us to the other key factor in any predictive model—the assumptions that underlie it. Every model has them, and it’s important to know what they are and monitor whether they are still true. The big assumption in predictive analytics is that the future will continue to be like the past. As Charles Duhigg describes in his book The Power of Habit, people establish strong patterns of behavior that they usually keep up over time. Sometimes, however, they change those behaviors, and the models that were used to predict them may no longer be valid.

What makes assumptions invalid? The most common reason is time. If your model was created several years ago, it may no longer accurately predict current behavior. The greater the elapsed time, the more likely customer behavior has changed. Some Netflix predictive models, for example, that were created on early Internet users had to be retired because later Internet users were substantially different. The pioneers were more technically-focused and relatively young; later users were essentially everyone.

Another reason a predictive model’s assumptions may no longer be valid is if the analyst didn’t include a key variable in the model, and that variable has changed substantially over time. The great—and scary—example here is the financial crisis of 2008-9, caused largely by invalid models predicting how likely mortgage customers were to repay their loans. The models didn’t include the possibility that housing prices might stop rising, and even that they might fall. When they did start falling, it turned out that the models became poor predictors of mortgage repayment. In essence, the fact that housing prices would always rise was a hidden assumption in the models.

Since faulty or obsolete assumptions can clearly bring down whole banks and even (nearly!) whole economies, it’s pretty important that they be carefully examined. Managers should always ask analysts what the key assumptions are, and what would have to happen for them to no longer be valid. And both managers and analysts should continually monitor the world to see if key factors involved in assumptions might have changed over time.

With these fundamentals in mind, here are a few good questions to ask your analysts:

• Can you tell me something about the source of data you used in your analysis?
• Are you sure the sample data are representative of the population?
• Are there any outliers in your data distribution? How did they affect the results?
• What assumptions are behind your analysis?
• Are there any conditions that would make your assumptions invalid?

Even with those cautions, it’s still pretty amazing that we can use analytics to predict the future. All we have to do is gather the right data, do the right type of statistical model, and be careful of our assumptions. Analytical predictions may be harder to generate than those by the late-night television soothsayer Carnac the Magnificent, but they are usually considerably more accurate.

Thomas H. Davenport is the president’s distinguished professor in management and information technology at Babson College, and cofounder of the International Institute for Analytics. He also contributes to the MIT Initiative on the Digital Economy as a fellow, and as a senior advisor to Deloitte Analytics. Author of over a dozen management books, his latest is Only Humans Need Apply: Winners and Losers in the Age of Smart Machines.

Copyright 2014 Harvard Business Publishing. All Rights Reserved. Additional restrictions may apply including the use of this content as assigned course material. Please consult your institution's librarian about any restrictions that might apply under the license with your institution. For more information and teaching resources from Harvard Business Publishing including Harvard Business School Cases, eLearning products, and business simulations please visit hbsp.harvard.edu.

ANALYTICS

Where Predictive Analytics Is Having the Biggest Impact

by Jacob LaRiviere, Preston McAfee, Justin Rao, Vijay K. Narayanan and Walter Sun
MAY 25, 2016

The big data revolution is upon us. Firms are scrambling to hire a new brand of analysts dubbed “data scientists,” and universities have responded to this demand by introducing data science courses into degrees ranging from computer science to business. Survey-based reports find that firms are currently spending an estimated $36 billion on storage and infrastructure, and that is expected to double by 2020.

COPYRIGHT © 2016 HARVARD BUSINESS SCHOOL PUBLISHING CORPORATION. ALL RIGHTS RESERVED.

Once companies are logging and storing detailed data on all their customer engagements and internal processes, what’s next? Presumably, firms are investing in big data infrastructure because they believe that it offers a positive return on investment. However, looking at the surveys and consulting reports, it is unclear what the precise use cases are that will drive this positive ROI from big data.

Our goal in this article is to offer specific, real-world case studies to show how big data has provided value for companies that have worked with Microsoft’s analytics teams. These cases reveal the circumstances in which big data predictive analytics are likely to enable novel and high-value solutions, and the situations where the gains are likely to be minimal.

Predicting demand. The first use case involves predicting demand for consumer products that are in the “long tail” of consumption. Firms value accurate demand forecasts because inventory is expensive to keep on shelves and stockouts are detrimental to both short-term revenue and long-term customer engagement. Aggregated total sales is a poor proxy because firms need to distribute inventory geographically, necessitating hyperlocal forecasts. The traditional way of solving this problem is using time-series econometrics with historical sales data. This method works well for popular products in large regions but tends to fail when data gets thin because random noise overwhelms the underlying signal.

A big data solution to this problem is to use anonymized and aggregated web search or sentiment data linked to each store’s location on top of the existing time-series data. Microsoft data scientists have employed this approach to help a forecasting firm predict auto sales. Building models with web search data as one of the inputs reduces mean absolute forecast error, a standard measure of prediction accuracy, for monthly national sales predictions on the order of 40% from baseline for auto makes with relatively small market shares, compared to traditional time-series models. Although the gains were smaller for the most popular models at the national level, the relative

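Mean absolute forecast error, the accuracy measure named above, is simple to compute. The sketch below compares a baseline forecast with one that also uses an extra signal. All sales and forecast numbers are invented for illustration and do not come from the article, and the function and variable names are hypothetical.

```python
def mean_absolute_error(actual, forecast):
    """Average of |actual - forecast| across all periods."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


# Invented monthly unit sales and two competing forecasts.
actual_sales = [100, 120, 90, 110]
baseline_forecast = [110, 100, 100, 100]    # e.g. time series only
augmented_forecast = [105, 110, 95, 105]    # e.g. time series plus search data

mae_baseline = mean_absolute_error(actual_sales, baseline_forecast)
mae_augmented = mean_absolute_error(actual_sales, augmented_forecast)

# Relative reduction in error versus the baseline.
reduction = 1 - mae_augmented / mae_baseline

print(mae_baseline, mae_augmented, f"{reduction:.0%}")
```

A reduction "from baseline" in this sense is always relative to the error of the older model, so the same percentage can correspond to very different absolute improvements.
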
In this case, the big data solution leverages the previously unused data point that people do a considerable amount of social inquiry and research online before buying a car. The increased prediction accuracy, in turn, makes it possible to achieve large increases in operational efficiency

Anonymized web search data has proven to be helpful for other forecasts as well since online activity often is a good leading proxy for purchases and actions of the general public. Having the additional data is insufficient on its own. Processing search data and combining it with traditional sources is vital in creating a successful prediction: We found that raw search query volume is insufficient in

Being intelligent about which signals to draw from big data requires care, and best practices can be case-specific. For example, single queries from a user might be less important than multiple queries from a user. Although we used search data in this case study, a firm could just as easily use the location of users visiting their website or link detailed sales data to a customer’s location.

Improved pricing. Using a single price is economically inefficient because part of the demand curve that could be profitably served is priced out of the market. As a consequence, firms regularly offer targeted discounts, promotions, and segment-based pricing to target different consumers. E-commerce websites have a distinct advantage in pursuing such an approach because they log detailed information on customer browsing, not just the goods they end up purchasing, and aggressively adjust prices over time. These price adjustments are a form of experimentation and, jointly with big data, allow firms to learn more about their customers’ price responsiveness.

Offline retailers can mimic e-commerce’s nuanced pricing strategies by tracking consumers through smartphone connectivity and logging which customers enter the store, what type of goods they look at, and whether they make a purchase. Machine learning applied to this data can algorithmically generate customer segments based on price responsiveness and preferences, which generally offers a

Our experience with pricing advertising on the Bing search engine is that using big data can produce substantial gains by better matching advertisers to consumers. The success of algorithmic targeting has been well documented and is a key driver of revenue in the online advertising market. Advances in measurement technology increasingly allow offline firms to benefit from these types of gains through

Predictive maintenance. Smoothly operating supply chains are vital for stable profits. Machine downtime imposes a cost to firms due to forgone productivity and can be particularly disruptive in both complex manufacturing supply chains and consumer products. Executives in asset-intensive industries often state that the primary operational risk to their businesses is unexpected failures of their assets. A wave of new data generated by the “internet of things” (IoT) can provide real-time

Airlines are particularly interested in predicting mechanical failures in advance so that they can reduce flight delays or cancellations. Microsoft data scientists from the Cortana Intelligence Suite team are able to predict the probability of aircraft being delayed or canceled in the future based on relevant data sources, such as maintenance history and flight route information. A machine-learning solution based on historical data and applied in real time predicts the type of mechanical issue that will result in a delay or cancellation of a flight within the next 24 hours, allowing the airlines to take maintenance actions while the aircraft are being serviced, thus preventing possible delays or cancellations.

Similar predictive-maintenance solutions are also built in other industries — for example, tracking real-time telemetry data to predict the remaining useful life of an aircraft engine, using sensor data to predict the failure of an ATM cash withdrawal transaction, employing telemetry data to predict the failure of electric submersible pumps used to extract crude in the oil and gas industry, predicting the failures of circuit boards at early stages in the manufacturing process, predicting credit defaults, and forecasting energy demand in hyperlocal regions to predict the overload situations of energy grids. Machine learning will make supply chains less brittle and reduce the effects of disruptions for many goods and services.

These cases help highlight a few general principles:

• The value derived from the analytics piece can greatly exceed the cost of the infrastructure. This indicates there will be strong growth in big data consulting services and specialized roles within firms.
• Big data is less about size and more about introducing fundamentally new information to prediction and decision processes. This information matters most when existing data sources are insufficient to provide accurate or actionable predictions — for example, due to small sample sizes or coarseness of historical sales (small effective regions, niche products, new offerings, etc.).
• The new information is often buried in detailed and relatively unstructured data logs (known as a “data lake”), and techniques from computer science are needed to extract insights from it. To leverage big data, it is vital to have talented data engineers, statisticians, and behavioral scientists working in tandem. “Data scientist” is often used to refer to someone who has these three skills,

Radically new applications. The cases that we’ve discussed concern how big data can be employed to improve existing processes (e.g., more-precise demand forecasts, better price sensitivity estimates, better predictions of machine failure). But it also has the potential to be applied in ways that disrupt existing processes. For example, machine-learning models taking massive data sets as inputs, coupled with clever designs that account for patient histories, have the potential to revolutionize how certain diseases are diagnosed and treated. Another example involves matching distributed electricity generation (e.g., solar panels on roofs) to localized electricity demand, unlocking huge

The value described from predicting demand more accurately, better pricing, and predictive maintenance are the specific use cases that easily justify large firms’ investments in big data infrastructure and data science. These uses are likely to drive value of the same order of magnitude as the investments. The value of radically new applications is challenging to understand ex ante and speculative by nature. It is reasonable to expect losses for many firms, due to uncertain and higher

Jacob LaRiviere is an economist at Microsoft Technology and Research, an adjunct professor at the University of

Preston McAfee is a corporate vice president and the chief economist at Microsoft.

Justin Rao is an economist at Microsoft Research and an affiliate faculty member at the University of Washington.

Vijay K. Narayanan leads the Algorithms and Data Science Solutions unit of the Data Group at Microsoft.

Walter Sun is the founder of Bing Predicts and a partner data scientist at Microsoft. He is an affiliate faculty member of the University of Washington and an adjunct professor at Seattle University.

Copyright 2016 Harvard Business Publishing. All Rights Reserved. Additional restrictions may apply including the use of this content as assigned course material. Please consult your institution's librarian about any restrictions that might apply under the license with your institution. For more information and teaching resources from Harvard Business Publishing including Harvard Business School Cases, eLearning products, and business simulations please visit hbsp.harvard.edu.

UV0871

SCREENING FOR CHRONIC KIDNEY DISEASE

Chronic Kidney Disease (CKD) is a progressive condition that results in significant morbidity and mortality. Because of the important role the kidneys play in maintaining homeostasis, CKD can affect almost every body system. Early recognition and intervention are essential to slowing disease progression, maintaining quality of life, and improving outcomes. Family physicians have the opportunity to screen at-risk patients, identify affected patients, and ameliorate the impact of CKD by initiating early therapy and monitoring disease progression.¹

The purpose of this case is to create an easy-to-use screening tool to identify patients at risk for CKD. Despite the wide availability and low cost of a test for CKD based on one or more blood samples, studies have shown that many in the at-risk population have not been tested. One reason for this is that awareness of CKD is low. Given the proven benefits of early detection and treatment, the need for some kind of screening tool is clear. Although there is no reason to test everyone, those patients with a high enough probability of having CKD should be tested. The purpose of this case is to see if those high-risk patients can be identified using easily obtainable patient data.

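The idea of testing only patients with "a high enough probability" can be made concrete by choosing a cutoff and checking what it implies. The sketch below is hypothetical: the predicted probabilities and outcomes are invented, the function name is an assumption, and the case itself does not prescribe a cutoff or a particular model. It reports the share of patients who would be sent for testing and the share of true CKD cases that would be caught (sensitivity).

```python
def screening_summary(probabilities, has_ckd, threshold):
    """Summarize a rule that tests patients whose predicted risk >= threshold."""
    flagged = [p >= threshold for p in probabilities]
    true_positives = sum(
        1 for is_flagged, y in zip(flagged, has_ckd) if is_flagged and y == 1
    )
    total_positives = sum(has_ckd)
    return {
        "share_tested": sum(flagged) / len(flagged),
        "sensitivity": (
            true_positives / total_positives if total_positives else float("nan")
        ),
    }


# Invented predicted probabilities and actual CKD status (1 = has CKD).
predicted = [0.02, 0.10, 0.35, 0.60, 0.05, 0.80, 0.15, 0.40]
actual = [0, 0, 1, 1, 0, 1, 0, 0]

low_cutoff = screening_summary(predicted, actual, threshold=0.30)
high_cutoff = screening_summary(predicted, actual, threshold=0.50)
print(low_cutoff)
print(high_cutoff)
```

Raising the cutoff sends fewer patients for testing but misses more cases, which is the trade-off a screening tool has to balance.
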
Since 1975, the National Center for Health Statistics of the Centers for Disease Control and Prevention has conducted nationwide surveys of U.S. adults. Using trained personnel, the center collected a wide variety of demographic and health information using direct interviews, examinations, and blood samples. The data set consists of selected information from 8,819 adults 20 years of age or older taken from the 1999–2000 and 2001–2002 surveys. The sample subjects were randomly divided into two pools: a 6,000-case training set and a 2,819-case validation sample.

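A random division of this kind can be sketched as follows. Only the counts (8,819 subjects, 6,000 for training, 2,819 for validation) come from the case. The function name, the use of integer subject IDs, and the seed are assumptions made for illustration, and the case does not say how its split was actually generated.

```python
import random


def split_subjects(subject_ids, n_train, seed=0):
    """Randomly divide subjects into a training pool and a validation pool."""
    ids = list(subject_ids)
    if n_train > len(ids):
        raise ValueError("n_train exceeds the number of subjects")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Hypothetical integer IDs standing in for the 8,819 survey subjects.
train_ids, validation_ids = split_subjects(range(8819), n_train=6000)
print(len(train_ids), len(validation_ids))
```

The model is fitted on the training pool only; the validation pool is held back to check how well it performs on subjects it has not seen.
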
¹ Catherine S. Snively, MD, and Cecilia Gutierrez, MD, “Chronic Kidney Disease: Prevention and Treatment of

This case was prepared by Professor Phillip E. Pfeifer and Professor Heejung Bang (Weill Cornell Medical College). It was written as a basis for class discussion rather than to illustrate effective or ineffective handling of an administrative situation. Copyright © 2007 by the University of Virginia Darden School Foundation, Charlottesville, VA. All rights reserved. To order copies, send an e-mail to sales@dardenbusinesspublishing.com. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without the permission of the Darden School Foundation.

This document is authorized for use only in Professor Vandith Pamuru's Business Analytics using Data Mining[PGP] at Indian School of Business (ISB) from Dec 2021 to Mar 2022.

A test for CKD was administered to everyone in the study population.² The variable of interest is CKD, a 0/1 dummy variable indicating whether or not the subject had CKD. Exhibit 1 defines the 34 variables in the data set. Notice that variables in columns A through J are demographic in nature, K through V were collected during the physical exam, and W through AH are based, in part, on self-reported health histories.


The Causes of CKD³

The two main causes of chronic kidney disease are diabetes and high blood pressure,
which are responsible for up to two-thirds of the cases. Diabetes happens when your blood sugar
is too high, causing damage to many organs in your body, including the kidneys and heart, as
well as blood vessels, nerves, and eyes. High blood pressure, or hypertension, occurs when the
pressure of your blood against the walls of your blood vessels increases. If uncontrolled, or
poorly controlled, high blood pressure can be a leading cause of heart attacks, strokes, and
chronic kidney disease. Also, chronic kidney disease can cause high blood pressure.

Other conditions that affect the kidneys are:

• Glomerulonephritis, a group of diseases that cause inflammation and damage to the
  kidney’s filtering units. These disorders are the third most common type of kidney
  disease.
• Inherited diseases, such as polycystic kidney disease, which causes large cysts to form in
  the kidneys and damage the surrounding tissue.
• Malformations that occur as a baby develops in its mother’s womb. For example, a
  narrowing may occur that prevents normal outflow of urine and causes urine to flow back
  up to the kidney. This causes infections and may damage the kidneys.
• Lupus and other diseases that affect the body’s immune system.
•

2 The test used a formula to estimate glomerular filtration rate based on measured serum creatinine
concentration, age, gender, and race. CKD was defined as estimated filtration rate less than 60 ml/min/1.73 m². For
details, see Heejung Bang, David A. Shoham, Philip J. Klemmer, Ronald J. Falk, Madhu Mazumdar, Debbie
Gipson, Romulo E. Colindres, and Abhijit V. Kshirsagar, “SCreening for Occult Renal Disease (SCORED): A
Simple Prediction Model for Chronic Kidney Disease,” Archives of Internal Medicine, 2007.
3 This section is excerpted from the National Kidney Foundation Web site (www.kidney.org), © 2007, National
Kidney Foundation, Inc., 30 East 33rd Street, New York, NY 10016.

Who Is at Risk?⁴

While anyone at any age can develop chronic kidney disease (CKD), a number of risk
factors have been identified that may lead to possible problems with your kidneys. These
include:

• Diabetes. Diabetes is the leading cause of CKD. If you have diabetes, talk with your
  doctor about how to keep your blood glucose as close to normal as possible to ensure
  your diabetes is under control.
• Hypertension. Hypertension, also called high blood pressure, is the second-highest cause
  of CKD. Keep your blood pressure under control. A number of effective medications are
  available to help you with this task. Your doctor will help you to determine which
  medication is right for you.
• Cardiovascular disease. In addition to hypertension, other diseases of the heart and
  blood vessels may increase your risk for kidney disease. People who have had heart
  attacks or strokes, congestive heart failure, coronary artery disease, or peripheral vascular
  disease need to be monitored carefully for kidney problems.
• Family history of kidney disease. Some kidney diseases are genetic. People with a
  mother, father, brother, or sister who has had a kidney disease are more likely to develop
  kidney disease.
• Age. People 60 years and older are at a higher risk for developing CKD.
• Race. People belonging to certain ethnic groups, such as First Nations (Canadian
  aboriginal peoples) and Pacific Islanders, are at a higher risk for developing this disease.

The Challenge

The list of risk factors above is a reflection of the results of several separate studies. What
we want to do is figure out how to combine all the possible risk factors to measure the overall
risk of CKD.

The 34 variables in the data set are all easily obtained by a family physician during
routine checkups. Only the cholesterol measurements and the hemoglobin count (used to help
define anemia) require blood tests. The challenge is to find a way to use the
first 33 variables to predict the 34th. The idea would be to create something very simple (like the
quizzes you see in popular magazines, for example) that would identify subjects at risk of having
CKD. The high-risk subjects would then be encouraged to have their serum creatinine levels
checked and/or undergo a complete urinalysis. The challenge here is strictly one of prediction.
The variables used need not cause CKD. They need only be indicators of the presence of CKD.
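
To make the idea of a magazine-style quiz concrete, here is a minimal sketch of a point-based score and of how its screening accuracy could be checked. The point values, the cutoff, and the three toy records are hypothetical placeholders chosen only for illustration; they are not estimated from the case data. Field names follow Exhibits 1 and 3.

```python
# Hypothetical point values; in practice they would be derived from the training set.
POINTS = {"age_60_plus": 3, "Hypertension": 2, "Diabetes": 2, "PVD": 1}
CUTOFF = 4  # hypothetical: scores at or above this are flagged as high risk


def risk_score(subject):
    """Add up quiz points for one subject (a dict keyed by exhibit variable names)."""
    score = 0
    if subject["Age"] >= 60:
        score += POINTS["age_60_plus"]
    for flag in ("Hypertension", "Diabetes", "PVD"):
        if subject[flag] == 1:
            score += POINTS[flag]
    return score


def sensitivity_specificity(subjects, cutoff=CUTOFF):
    """Share of CKD cases that are flagged, and share of non-cases that are not."""
    tp = fn = tn = fp = 0
    for s in subjects:
        flagged = risk_score(s) >= cutoff
        if s["CKD"] == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    return tp / (tp + fn), tn / (tn + fp)


# Three made-up records, for illustration only.
toy = [
    {"Age": 72, "Hypertension": 1, "Diabetes": 0, "PVD": 0, "CKD": 1},
    {"Age": 35, "Hypertension": 0, "Diabetes": 0, "PVD": 0, "CKD": 0},
    {"Age": 64, "Hypertension": 0, "Diabetes": 0, "PVD": 0, "CKD": 0},
]
scores = [risk_score(s) for s in toy]
sens, spec = sensitivity_specificity(toy)
```

On real data, the same two functions would be run on the validation sample rather than on the records used to choose the points, so that the reported accuracy is not overstated.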

4 This section was excerpted from the Web site of the government of British Columbia on 18 June 2007
(www.gov.bc.ca), © 2001, Province of British Columbia.

It is also important to note that the study population is not a random sample of U.S.
adults. That means that our predictions will not apply directly to the U.S. population and should
not be used for actual decision-making.

To get us started, Exhibit 2 reports summary statistics for the 6,000-subject training set
for each of the numerical variables. These statistics are reported for those with and without CKD.
A t-statistic to test the equality of the means for the two groups is also reported. Of the 11
numerically scaled variables, age is the most significant predictor of CKD, with the average age
of those with CKD being 73 compared to 47 for those without CKD.
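
The t-statistics in Exhibit 2 can be recomputed from the reported averages, standard deviations, and counts. The case does not say which form of the test was used; the unequal-variance (Welch) form below is an assumption, and it reproduces the Age and Weight rows to within rounding of the published inputs.

```python
import math


def welch_t(mean0, sd0, n0, mean1, sd1, n1):
    """Two-sample t-statistic that does not assume equal group variances."""
    standard_error = math.sqrt(sd0**2 / n0 + sd1**2 / n1)
    return (mean0 - mean1) / standard_error


# Age row of Exhibit 2: CKD=0 group first, then CKD=1 group.
t_age = welch_t(47.15, 17.90, 5536, 73.05, 11.71, 464)
# Weight row of Exhibit 2.
t_weight = welch_t(79.17, 19.60, 5432, 77.74, 19.25, 435)
print(round(t_age, 2), round(t_weight, 2))
```

The sign is negative whenever the CKD=1 group has the larger average, which is why age and systolic blood pressure show large negative values in the exhibit.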

For categorical variables, a chi-squared test of association is appropriate. Exhibit 3
reports the cross tabulation counts as well as the calculated chi-squared statistics. Remember, the
degrees of freedom associated with each of these chi-squares depend on the number of categories
taken on by each variable. Remember also that subjects with missing values have been ignored
when constructing Exhibits 2 and 3. The most significant predictor of CKD from among the
categorical variables is hypertension. Of those with hypertension, 15.5% had CKD compared to
2.7% of those without hypertension. It also appears Hispanics are under-represented in the CKD
population and whites are over-represented. It also appears that those who list “noplace” as
where they get their health care are very unlikely to have CKD.
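
For a 0/1 variable, the chi-squared statistic can be recomputed directly from the four counts in Exhibit 3. The sketch below uses the Pearson statistic without a continuity correction; that choice is an assumption, made because it matches the Female and Diabetes rows of the exhibit after rounding.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the variable and CKD were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Argument order: variable=0 (CKD=0, CKD=1), then variable=1 (CKD=0, CKD=1).
chi_female = chi_square_2x2(2655, 210, 2881, 254)
chi_diabetes = chi_square_2x2(4998, 334, 537, 130)
print(round(chi_female, 1), round(chi_diabetes, 1))
```

A 2×2 table has one degree of freedom, so a statistic near 1.3 (Female) indicates little association, while one near 145 (Diabetes) indicates a very strong one.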

Exhibit 1

SCREENING FOR CHRONIC KIDNEY DISEASE

Variable Definitions

Col.  Variable      Definition
A     ID            Identification number
B     Age           Age (years)
C     Female        1 if female
D     Racegrp       Self-reported race/ethnic group (white, black, Hispanic, other)
E     Educ          1 if more than high school
F     Unmarried     1 if unmarried
G     Income        1 if household income is above the median
H     CareSource    Self-reported source of medical care (Dr./HMO, clinic, noplace, other)
I     Insured       1 if covered by health insurance
K     Height        Height (cm)
L     BMI
M     Obese         1 if BMI is greater than 30 kg/m²
O     SBP           Systolic blood pressure (max)
U     PVD           Peripheral vascular disease reflected by reduced SBP at the leg relative to the arm
V     Activity      Mostly sit (1); stand or walk a lot (2); lift light loads or climb stairs often (3);
Y     Hypertension  The presence of at least one of four indicators of high blood pressure
AC    Stroke        Self-reported response to "Has a doctor ever told you that you had a stroke?"
AD    CVD           Response to "Has a doctor ever told you that you had angina pectoris,
                    myocardial infarction, or stroke?"
AG    Anemia        or hemoglobin at exam lower than 11 g/dL

Exhibit 2

SCREENING FOR CHRONIC KIDNEY DISEASE

Descriptive Statistics for Numerically Scaled Variables
(training-set data broken out by CKD groups)

                    CKD=0                        CKD=1
Variable   Average  Std Dev  Count     Average  Std Dev  Count    T-stat
Age          47.15    17.90   5536       73.05    11.71    464    -43.56
Weight       79.17    19.60   5432       77.74    19.25    435      1.49
Height      167.25    10.12   5433      165.29    10.41    428      3.77
BMI          28.24     6.22   5377       28.35     5.98    417     -0.36
Waist        96.54    15.24   5365      100.10    14.44    420     -4.85
SBP         124.27    20.14   5352      141.47    25.28    442    -13.94
DBP          71.86    12.24   5318       67.73    14.28    430      5.83
HDL          51.97    15.79   5529       50.08    16.18    463      2.41

Exhibit 3

SCREENING FOR CHRONIC KIDNEY DISEASE

CrossTabs for Categorical Variables
Training Set Data

                         Variable=0                  Variable=1
Variable           CKD=0  CKD=1  % CKD=1       CKD=0  CKD=1  % CKD=1    Chi-square
Female              2655    210    7.3%*        2881    254    8.1%         1.3
Educ                3064    308    9.1%         2458    155    5.9%        21.1
Unmarried           3335    227    6.4%         1926    211    9.9%        23.1
Income              2723    293    9.7%         2088    104    4.7%        44.5
Insured             1137     17    1.5%         4329    439    9.2%        78.2
Dyslipidemia        4951    414    7.7%          585     50    7.9%         0.0
PVD                 5379    395    6.8%          157     69   30.5%       171.1
Poor Vision         4932    355    6.7%          277     60   17.8%        57.0
Fam Hypertension    4231    388    8.4%         1305     76    5.5%        12.5
Diabetes            4998    334    6.3%          537    130   19.5%       145.3

*Read: Of the subjects who were not female, 7.3% (210) had CKD. Of the females, 8.1% (254) had CKD.
UV0745
Rev. Mar. 28, 2018
Cluster Analysis for Segmentation
Introduction
We all understand that consumers are not all alike. This provides a challenge for the development and
marketing of profitable products and services. Not every offering will be right for every customer, nor will
every customer be equally responsive to your marketing efforts. Segmentation is a way of organizing
customers into groups with similar traits, product preferences, or expectations. Once segments are identified,
marketing messages and in many cases even products can be customized for each segment. The better the
segment(s) chosen for targeting by a particular organization, the more successful the organization is assumed
to be in the marketplace. Since its introduction in the late 1950s, market segmentation has become a central
concept of marketing practice.

Segments are constructed on the basis of customers’ (1) demographic characteristics, (2) psychographics,
(3) desired benefits from products/services, and (4) past-purchase and product-use behaviors. These days,
most firms possess rich information about customers’ actual purchase behavior and their geodemographic and
psychographic characteristics. In cases where firms do not have access to detailed information about each
customer, information from surveys of a representative sample of the customers can be used as the basis for
segmentation.

An Example

Consider Geico, an auto insurance company. Suppose, hypothetically, that Geico plans to customize its auto
insurance offerings and needs to understand what its customers view as important from their insurance
provider. Geico can ask its customers to rate how important the following two attributes are to them:

• savings on premium
• having a neighborhood agent

The importance of the attributes is measured using a seven-point Likert-type scale, where a rating of one
represents not important and seven represents very important. Unless every respondent who is surveyed gives
identical ratings, the data will contain variations that you can use to cluster or group respondents together, and
such clusters are the segments. Customers are most similar to each other if they are part of
the same segment and most different from each other if they are part of different segments. By inference,
then, actions taken toward customers in the same segment should lead to similar responses, and actions taken
toward customers in different segments should lead to different responses.

This technical note was prepared by Rajkumar Venkatesan, Associate Professor of Business Administration. Copyright 2007 by the University of
Virginia Darden School Foundation, Charlottesville, VA. All rights reserved. To order copies, send an email to sales@dardenbusinesspublishing.com. No
part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means (electronic, mechanical, photocopying,
recording, or otherwise) without the permission of the Darden School Foundation. Our goal is to publish materials of the highest quality, so please submit any
errata to editorial@dardenbusinesspublishing.com.

Another way of saying this is that the aspects of auto insurance that are important to any given customer
in one segment will also be important to other customers in that same segment. Furthermore, those aspects
that are important to that customer will be different from what is important to a customer in a different
segment. Figure 1 shows what the analysis in this example might look like:

Figure 1. Segmentation of Geico customers.
[Figure: two-axis map. Vertical axis: Premium Savings, from Not Important to Very Important. Horizontal axis: Agent, from Not Important to Very Important. Segments plotted: Segment A (49%), Segment B (36%), Segment C (15%).]

The analysis shows three distinct segments. The majority of Geico’s customers (Segment A, 49%) prefer
savings on their premium, and they do not prefer having a neighborhood agent. Customers who belong to
Segment B (about 36%) prefer having a neighborhood agent and premium savings is not important to them.
Some customers (Segment C, 15%) prefer both the savings on their premium as well as a neighborhood
agent. This analysis shows that Geico can benefit by adding an offline channel (i.e., developing a network of
neighborhood agents) to serve Segment B and also charge a higher premium to them for providing this
convenience. Of course, the caveat is the increased competition with other insurance providers, such as

Cluster Analysis

Cluster analysis is a class of statistical techniques that can be applied to data that exhibit natural
groupings. Cluster analysis makes no distinction between dependent and independent variables. The entire set
of interdependent relationships is examined. Cluster analysis sorts through the raw data on customers and
groups them into clusters. A cluster is a group of relatively homogeneous customers. Customers who belong to
the same cluster are similar to each other. They are also dissimilar to customers outside the cluster,
particularly customers in other clusters. The primary input for cluster analysis is a measure of similarity
between customers, such as correlation coefficients, distance measures, and association coefficients.

The following are the basic steps involved in cluster analysis:

1. Formulate the problem: select the variables you want to use as the basis for clustering.
2. Compute the distance between customers along the selected variables.
3. Apply the clustering procedure to the distance measures.
4. Decide on the number of clusters.
5. Map and interpret the clusters and draw conclusions; illustrative techniques such as perceptual maps are
   useful.

Distance Measures

The main input into any cluster analysis procedure is a measure of distance between individuals who are
being clustered. The objective of a distance measure is to quantify the difference between two individuals on
the variables you are using for the segmentation. A shorter (longer) distance between two individuals would
imply they have similar (dissimilar) preferences on the segmentation variables. Distance between two
individuals is obtained through a measure called Euclidean distance. If two individuals, Joe and Sam, are being
clustered on the basis of n variables, then the Euclidean distance between Joe and Sam is represented as:

\[
\text{Euclidean distance} = \sqrt{(x_{\text{Joe},1} - x_{\text{Sam},1})^2 + \cdots + (x_{\text{Joe},n} - x_{\text{Sam},n})^2}
\]

where \(x_{\text{Joe},i}\) and \(x_{\text{Sam},i}\) are the values of Joe and Sam on variable \(i\), for \(i = 1, \ldots, n\).

A pairwise distance matrix among individuals who are being clustered can be created using the Euclidean
distance measure. Extending the preceding example, consider three individuals (Joe, Sam, and Sara) who
are being clustered based on their preference for Premium Savings and a Neighborhood Agent. The
importance ratings on these two attributes for Joe, Sam, and Sara are shown in Table 1.

Table 1. Importance ratings.

        Premium Savings   Neighborhood Agent
Joe     4                 7
Sam     3                 4
Sara    5                 3

The Euclidean distance between Joe and Sam is obtained as:

\[
\text{Euclidean distance (Joe, Sam)} = \sqrt{(4 - 3)^2 + (7 - 4)^2} = \sqrt{10} \approx 3.2
\]

The first term in this Euclidean distance measure is the squared difference between Joe and Sam on the
importance score for Premium Savings, and the second term is the squared difference between them on the
importance score for Neighborhood Agent. The Euclidean distances are then computed for each pairwise
combination of the three individuals being clustered to obtain a pairwise distance matrix. The pairwise
distance matrix for Joe, Sam, and Sara is shown in Table 2.

Table 2. Pairwise distance matrix.

        Joe   Sam   Sara
Joe     0     3.2   4.1
Sam           0     2.2
Sara                0

The distance between Joe and Sam is 3.2, as shown in Table 2. This pairwise distance matrix is then provided
as an input to a clustering algorithm.
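
The pairwise distances can be checked with a few lines of code. This is a minimal sketch that uses the ratings from Table 1; the variable and function names are illustrative only.

```python
import math

# Importance ratings from Table 1: (Premium Savings, Neighborhood Agent).
ratings = {"Joe": (4, 7), "Sam": (3, 4), "Sara": (5, 3)}


def euclidean(p, q):
    """Square root of the summed squared differences across all variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


names = list(ratings)
# Upper triangle of the pairwise distance matrix, keyed by (name, name).
distance = {
    (a, b): euclidean(ratings[a], ratings[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, d in distance.items():
    print(pair, round(d, 1))
```

Rounded to one decimal place, the three values agree with Table 2. The same function works unchanged for any number of segmentation variables.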

K-means clustering belongs to the nonhierarchical class of clustering algorithms. It is one of the more
popular algorithms used for clustering in practice because of its simplicity and speed. It is considered to be
more robust to different types of variables, is more appropriate for large datasets that are common in
marketing, and is less sensitive to some customers who are outliers (in other words, extremely different from
others).

For K-means clustering, the user has to specify the number of clusters required before the clustering
begins.

Algorithm
5. Repeat the two previous steps until some convergence criterion is met. Usually the convergence
   criterion is that the assignment of customers to clusters has not changed over multiple iterations.
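
The earlier steps of the algorithm are not legible in this copy, so the sketch below follows the standard textbook form of K-means as an assumption: start from K seed points, assign every customer to the nearest centroid, recompute each centroid as the average of its members, and repeat. It stops as soon as one pass leaves every assignment unchanged, which is a simplification of the convergence rule in step 5. The choice of Joe and Sara as seeds is arbitrary.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Lloyd-style K-means: assign each point to its nearest centroid, then re-average."""
    centroids = [tuple(map(float, s)) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # convergence: no customer changed cluster
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Joe, Sam, and Sara from Table 1, with K = 2 and Joe and Sara as arbitrary seeds.
points = [(4, 7), (3, 4), (5, 3)]
labels, centroids = kmeans(points, seeds=[points[0], points[2]])
print(labels, centroids)
```

With these seeds, Joe ends up alone in one cluster and Sam and Sara share the other. Different seeds can give different results, which is why the choice of starting points matters.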

A cluster centroid is simply the average of all the points in that cluster. Its coordinates are the arithmetic
mean for each dimension separately over all the points in the cluster. Consider Joe, Sam, and Sara in the
previous example. Let’s represent them based on their importance ratings on Premium Savings and
Neighborhood Agent as: Joe = {4,7}, Sam = {3,4}, Sara = {5,3}. If you assume that they belong to the same
cluster, then the center for their cluster is obtained as:

\[
(z_1, z_2) = \left( \frac{4 + 3 + 5}{3},\ \frac{7 + 4 + 3}{3} \right) = (4,\ 4.67)
\]

z1 is measured as the average of the ratings of Joe, Sam, and Sara on Premium Savings. Similarly, z2 is
measured as the average of their ratings on Neighborhood Agent. Figure 2 provides a visual representation
of K-means clustering.

Figure 2. Visual representation of K-means clustering.

Number of clusters

One of the main issues with K-means clustering is that it does not provide an estimate of the number of
clusters that exists in the data. The K-means clustering has to be repeated several times with different “Ks”
(or number of clusters) to determine the number of clusters that is appropriate for the data. A commonly
used approach is the elbow criterion.

The elbow criterion states that you should choose a number of clusters so that adding another cluster
does not add sufficient information. The elbow is identified by plotting the ratio of the within-cluster variance to
the between-cluster variance against the number of clusters. The within-cluster variance is an estimate of the average
of the variance in the variables used as a basis for segmentation (Importance Score ratings for Premium
Savings and Neighborhood Agent in the Geico example) among customers who belong to a particular cluster.
The between-cluster variance is an estimate of the variance of the segmentation basis variables between
customers who belong to different segments. The objective of cluster analysis (as mentioned before) is to
minimize the within-cluster variance and maximize the between-cluster variance. Therefore, as the number of
clusters increases, the ratio of the within-cluster variance to the between-cluster variance will keep
decreasing.

But at some point, the marginal gain from adding an additional cluster will drop, giving an angle in the
graph (the elbow). In Figure 3, the elbow is indicated by the circle. The number of clusters chosen should
therefore be 3.

Figure 3. Elbow plot for determining number of clusters.
[Elbow plot. Vertical axis: ratio of within-cluster to between-cluster variance (0 to 300). Horizontal axis: number of clusters (1 to 7).]

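
The quantity plotted in an elbow chart can be computed for any assignment of customers to clusters. The sketch below uses sums of squared distances as the measure of within-cluster and between-cluster variation, which is an assumption, since the note does not give exact formulas. The six points and the two labelings are made up for illustration.

```python
def within_between_ratio(points, labels):
    """Within-cluster sum of squares divided by between-cluster sum of squares."""
    def mean(rows):
        return tuple(sum(col) / len(rows) for col in zip(*rows))

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    grand = mean(points)
    within = between = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        centroid = mean(members)
        within += sum(sq_dist(p, centroid) for p in members)
        between += len(members) * sq_dist(centroid, grand)
    return within / between  # undefined for a single cluster (between is 0)


# Made-up ratings forming three visible groups.
points = [(1, 1), (1, 2), (6, 6), (6, 7), (1, 6), (1, 7)]
ratio_k2 = within_between_ratio(points, [0, 0, 1, 1, 0, 0])
ratio_k3 = within_between_ratio(points, [0, 0, 1, 1, 2, 2])
print(round(ratio_k2, 3), round(ratio_k3, 3))
```

Moving from two clusters to three cuts the ratio sharply here because the third cluster matches a real group in the data. Repeating the calculation for each K and plotting the results gives the elbow chart.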
It should also be noted that the initial assignment of cluster seeds has a bearing on the final model
performance. Some common methods for ensuring the stability of the results obtained from K-means
clustering include:

• Running the algorithm multiple times with different starting values. When using random starting
  points, running the algorithm multiple times will ensure a different starting point each time.
• Splitting the data randomly into two halves and running the cluster analysis separately on each half.
  The results are robust and stable if the number of clusters and the size of different clusters are similar
  in both halves.

Profiling Clusters

Once clusters are identified, describing them in terms of the variables used for clustering,
or with additional data such as demographics, helps to customize marketing strategy for each segment. This
process of describing the clusters is called profiling. Figure 1 is an example of such a process. A good deal of
cluster-analysis software also provides information on which cluster a customer belongs to. This information
can be used to calculate the means of the profiling variables for each cluster. In the Geico example, it is useful
to investigate whether the segments also differ with respect to demographic variables such as age and income.
In Table 3, consider the distribution of age and income for Segments A, B, and C as provided in Figure 1.

Table 3. Age and income distribution for segments.

            Mean                  Range
Segment   Age   Income ($)     Age     Income ($)
A         21    15,000         16–25   0–25,000
B         45    120,000        33–55   75,000–215,000
C         39    40,000         39–54   24,000–60,000

Mean represents the averages of age and income of customers belonging to a particular segment. Range
represents the minimum and maximum values of age and income for customers in a segment. Whereas the
mean is useful for identifying the central tendency of a segment, the range helps in evaluating whether the
segments overlap with regard to the profile variable.
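
Once each customer carries a segment label, a profile table of this kind is a simple group-by calculation. The sketch below is illustrative only: the four customer records are made up and are not the data behind Table 3.

```python
from collections import defaultdict

# Made-up customers for illustration: (segment, age, income in $).
customers = [
    ("A", 19, 12000), ("A", 23, 18000),
    ("B", 41, 110000), ("B", 49, 130000),
]


def profile(rows):
    """Mean and (min, max) range of age and income for each segment."""
    by_segment = defaultdict(list)
    for segment, age, income in rows:
        by_segment[segment].append((age, income))
    summary = {}
    for segment, values in by_segment.items():
        ages, incomes = zip(*values)
        summary[segment] = {
            "mean_age": sum(ages) / len(ages),
            "mean_income": sum(incomes) / len(incomes),
            "age_range": (min(ages), max(ages)),
            "income_range": (min(incomes), max(incomes)),
        }
    return summary


summary = profile(customers)
```

Comparing the ranges across segments shows at a glance whether two segments overlap on a profiling variable, which the means alone cannot reveal.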

From Table 3, you see that Segment A customers who prefer high savings on their premium and do not
prefer having a neighborhood agent tend to be younger and have low income. These could probably be
college students or recent graduates who are more comfortable with transacting online. Customers who
belong to Segment B, on the other hand, are older and have higher income levels. It would be interesting to
evaluate if these customers also tend to be married with kids. The security of having a neighborhood agent
who can help in case of an accident or emergency is very important to them, and they do not mind paying a
higher price for this sense of security. These customers may also not be comfortable in transacting (or
providing personal information) online.

Finally, while Segment C customers are as old as Segment B customers, they tend to have lower incomes
and do not prefer to have a neighborhood agent (probably because of low disposable incomes). Identification
of the segments through these demographic characteristics enables a marketer to target as well as customize
communications to each segment. For example, if Geico decides to develop a network of neighborhood
agents, it can first focus on neighborhoods (identified through their zip codes) that match the profile of
Segment B customers.

Conclusion

Given a segmentation basis, the K-means clustering algorithm identifies clusters and the customers
that belong to each cluster. Management, however, has to carefully select the variables to use for
segmentation. Criteria frequently used for evaluating the effectiveness of a segmentation scheme include
identifiability, sustainability, accessibility, and actionability.¹ Identifiability refers to the extent that managers can
recognize segments in the marketplace. In the Geico example, the profiling of customers allows you to
identify customer segments through their age and income information. PRIZM and ACORN are popular
databases that provide geodemographic information that can be used for segmentation as well as profiling.
The sustainability criterion is satisfied if the segments represent a large enough portion of the market to ensure
profitable customization of the marketing program. The extent to which managers can reach the identified
segments through their marketing campaigns is captured by the accessibility criterion. Finally, actionability refers
to whether customers in the segment and the marketing mix necessary to satisfy their needs are consistent
with the goals and core competencies of the firm. The success of any segmentation process therefore requires

1 For more details, refer to Wagner Kamakura and Michel Wedel, Market Segmentation: Conceptual and Methodological Foundations, 2nd ed. (Norwell, MA: