Unit II. Methods and Techniques For Data Analytics

• 2.1 Storytelling & Tukey’s Exploratory Data Analysis
• 2.2 Types of data
• 2.3 Popular Data Visualization tools
• 2.4 Analysing digital data using statistics and machine learning
Storytelling & Tukey’s Exploratory Data Analysis
• As the name indicates, exploratory research is mainly used to
explore insights into the general research problem.
• It is used to find the relevant variables for framing the
theoretical model.
• Decision alternatives are also explored through EDA.
• If the researcher knows nothing about
the problem, EDA is used to explore its different dimensions
so that a better understanding of the research
framework can be developed.
• Even if the researcher has background information about
the problem, he or she has to conduct exploratory
research to gather current and relevant information.
• Exploratory research is used to identify and define the key
research variables.
• Exploratory research is also helpful in formulating the
hypothesis.
• John W. Tukey, who coined the phrase exploratory data
analysis (EDA), made remarkable contributions to the physical
and social sciences.
• In data analysis, his groundbreaking
contributions included the fast Fourier transform algorithm
and EDA.
• He reenergized descriptive statistics through EDA and
changed the language and paradigm of statistics in doing so.
• Exploratory data analysis (EDA) is used by data scientists to
analyze and investigate data sets and summarize their main
characteristics, often employing data visualization methods. It
helps determine how best to manipulate data sources to get
the answers you need, making it easier for data scientists to
discover patterns, spot anomalies, test a hypothesis, or check
assumptions.
Exploratory data analysis tools
• Clustering and dimension reduction techniques, which help
create graphical displays of high-dimensional data containing
many variables.
• Univariate visualization of each field in the raw dataset, with
summary statistics.
• Bivariate visualizations and summary statistics that allow you to
assess the relationship between each variable in the dataset
and the target variable you’re looking at.
• Multivariate visualizations, for mapping and understanding
interactions between different fields in the data.
• K-means clustering is a clustering method in unsupervised
learning where data points are assigned to K groups. K-means
clustering is commonly used in market segmentation, pattern
recognition, and image compression.
• Predictive models, such as linear regression, use statistics and
data to predict outcomes.
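The K-means idea listed above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming one-dimensional data and hand-picked starting centroids; the function name `kmeans_1d` and the sample order values are hypothetical and not taken from the slides.

```python
from statistics import mean

def kmeans_1d(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points; repeat a fixed number of times."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical order values that form two visible groups (K = 2).
order_values = [10, 12, 11, 13, 50, 52, 51, 49]
centers, groups = kmeans_1d(order_values, centroids=[10, 50])
print(centers, groups)
```

In practice a library implementation with random restarts would normally be preferred; the sketch only shows the assign-and-update loop.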
Types of exploratory data analysis
• Univariate non-graphical. This is the simplest form of data
analysis, where the data being analyzed consists of just one
variable. Since it’s a single variable, it doesn’t deal with causes
or relationships. The main purpose of univariate analysis is to
describe the data and find patterns that exist within it.
• Univariate graphical. Non-graphical methods don’t provide a
full picture of the data. Graphical methods are therefore
required. Common types of univariate graphics include:
• Stem-and-leaf plots, which show all data values and the shape
of the distribution.
• Histograms, a bar plot in which each bar represents the
frequency (count) or proportion (count/total count) of cases
for a range of values.
• Box plots, which graphically depict the five-number summary
of minimum, first quartile, median, third quartile, and
maximum.
• Multivariate non-graphical: Multivariate data arises from more
than one variable. Multivariate non-graphical EDA techniques
generally show the relationship between two or more variables
of the data through cross-tabulation or statistics.
• Multivariate graphical: Multivariate graphical EDA uses graphics to
display relationships between two or more sets of data. The
most used graphic is a grouped bar plot or bar chart with each
group representing one level of one of the variables and each
bar within a group representing the levels of the other variable.
• Other common types of multivariate graphics include:
• Scatter plot, which plots data points on a horizontal
and a vertical axis to show how one variable relates
to another.
• Multivariate chart, which is a graphical representation of the
relationships between factors and a response.
• Run chart, which is a line graph of data plotted over time.
• Bubble chart, which is a data visualization that displays
multiple circles (bubbles) in a two-dimensional plot.
• Heat map, which is a graphical representation of data where
values are depicted by color.
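The five-number summary behind a box plot and the bin counts behind a histogram, both described in the univariate items above, can be computed as follows. This is a sketch using only the Python standard library; the helper names and the `page_views` sample are illustrative assumptions, and quartile conventions differ between tools (the "inclusive" method is used here).

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [k*w, (k+1)*w),
    keyed by the lower edge of the bin."""
    return Counter((v // bin_width) * bin_width for v in values)

# Hypothetical page views per session.
page_views = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(page_views)
bins = histogram_counts(page_views, bin_width=5)
print(summary, dict(bins))
```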
Storytelling Is Important for Analysis
• The importance of narrating a compelling story about what
the data can tell you is critical to analysis. When telling a story
with data, answer the business question posed and
demonstrate insights and knowledge derived from data that
includes a solid set of relevant and useful outcomes-focused
recommendations.
• The analyst must ensure that analysis is presented in the most
humanistic way possible focused on organizational behavior,
motivation, and human emotion.
• Thus, instead of blasting up data only with numbers and slides
with charts and graphs, make sure to weave a narrative
through the data.
• Do not make the mistake of presenting only data and
visualizations.
• Tell stories with the data.
Techniques when socializing analysis in the form
of story-based narratives:
• Identify why the analysis has occurred and why the story
you are about to tell is important: Businesspeople are
incredibly busy and require context for reporting and analysis.
Explain why they should care.

• Indicate the business challenge you want to discuss and the
cost of not fixing it: By clearly stating what business issue
necessitates the analysis and framing the recommendations,
you can eliminate confusion.
• Identify any forewarnings: If there are any errors, omissions,
caveats, or things to discuss, clearly indicate them in advance.
• Depersonalize the analysis by using fictional characters
to help humanize the data you are reporting: Using
fictional characters helps to depersonalize analysis and
lowers political risk.
• Creating fictional scenarios and abstracting concepts
when narrating an analysis helps to eliminate the risk of
offending a specific stakeholder or group.


• Use pictures: They are worth a thousand words. Save valuable
time by using charts, graphs, trendlines, and other data
visualization techniques. When possible, use illustrations to
communicate concepts.
• Don’t use overly complex, wonky vocabulary: Esoteric and
scientific vocabulary is best left within the analytics team. No
one will be impressed if you use words such as “stochastic.”
• Try to make the communication and presentation of analysis
as simple as possible, so that it is easily understood and acted
upon by your stakeholders.
• Identify what is required: Clearly indicate what you think
needs to be done in written language using action-oriented
verbs and descriptive nouns.
• Say what you think and what you want to do. Set expectations
as early as possible for cost and new resources needed to
support analytics.
• Identify the cost of inaction. Clearly indicate the financial
impact of doing nothing—and compare it against the cost of
doing something.
• It may help to present comparable costs from other
alternatives.
• Conclude with a series of recommendations that tie to value
generation (either reduced cost or increased revenue):
Although you may not be an expert at the same level as the
person requesting the analysis, analysts should express their
ideas and perspective on the data and business situation.
• Recommendations should be made that are clearly and
directly based on data analysis—and these recommendations
must withstand the scrutiny and questions.
Tukey’s Exploratory Data Analysis Is an
Important Concept in Digital Analytics
• Exploratory data analysis (EDA) is more of a mindset for analysis
rather than an explicit set of techniques and methods; Tukey’s
philosophy on data was one that favored observation and
visualization and the careful application of technique to make
sense of data.
• EDA is not about fitting data into your analytical model; rather, it is
about fitting a model to your data.
• As a result, Tukey and EDA created interest in non-Gaussian and
nonparametric techniques in which the shape of data indicates it is
not normally distributed and may have a fat head with a long tail.
Tenets that are useful to consider for digital
analytics, which involve the following:
• Visually examining data to understand patterns and trends:
Raw data should be examined to learn the trends and
patterns over time and between dimensions and concepts in
the data. Visual examination can help frame what analytical
methods are possible to apply to your digital data.
• Using the best possible methods to gain insights into not just
the data, but also what the data says:
• Tukey espouses getting beyond the data and its details and
understanding what the data says in the context of answering
your questions. This approach is integral to digital analytics.
• Identifying the best performing variables and model for the
data:
• Digital analytics is filled with so much big data, but how do you
know what is the right big and small data to use for solving a
business problem?
• EDA helps ascertain what variables are influential and important.
• Detecting anomalous and suspicious outlier data:
• Digital data has outliers and anomalies, which may be important,
highly relevant, and meaningful to the business, or just random
noise that can be ignored and possibly excluded from
analytical consideration.
• Testing hypotheses and assumptions: Digital analytics
emphasizes using insights derived from data
to create hypotheses and test hypothesis-driven changes
within digital experiences.
• The idea of using data to test hypotheses and assumptions is
crucial to EDA.
• Finding and applying the best possible model to fit the data:
• Predictive modeling and analysis requires an EDA approach
that is focused more on the data than on the model.
• Tukey’s principle helps simplify the creation of digital
analysis; it emphasizes the visual exploration of the data as
the first step in the process of analysis, instead of first
determining the statistical method to apply to the data and
fitting the data to it.
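For the tenet on detecting anomalous and suspicious outlier data, one widely used rule of thumb is Tukey's fences: flag any value more than 1.5 interquartile ranges below the first quartile or above the third. The sketch below assumes that rule; the function name and the `daily_visits` figures are hypothetical.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical daily visit counts with one suspicious spike.
daily_visits = [102, 98, 105, 110, 97, 101, 99, 480, 103]
flagged = tukey_outliers(daily_visits)
print(flagged)
```

Whether a flagged value is meaningful to the business or just noise is still a judgment the analyst has to make; the rule only points at candidates.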
Data Types and Scales
• Structured and Unstructured Data:
• Data at the micro level can be classified as structured or
unstructured data.
• Structured data means that the data is described in matrix
form with labeled rows and columns.
• Any data that is not originally in matrix form with rows
and columns is unstructured data.
Cross-Sectional, Time Series, and Panel Data
• This classification is based on the type of data collected.
• Cross-sectional data:
Types of data
• Structured data can be either numeric or alphanumeric and
may follow different scales of measurement.
• Nominal Scale (Qualitative Data)
• Ordinal Scale
• Interval Scale
• Ratio Scale
Nominal Scale (Qualitative Data)
• It refers to variables that are basically names (qualitative
data), also known as categorical variables.
• For example, marital status: single (1), married (2), divorced (3).
• It is usual to assign a numerical code to represent a nominal
variable.
Ordinal Scale
• An ordinal scale applies to a variable whose value is
taken from an ordered set, recorded in order
of magnitude.
• In many surveys a Likert scale is used.
• A Likert scale is finite (usually a 5-point scale), and the data
collector defines the order of preference.
• 5-point Likert scale: 1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good,
and 5 = Excellent.
Interval Scale
• An interval scale corresponds to a variable whose value is
chosen from an interval set.
• Variables such as temperature measured in centigrade or
intelligence quotient (IQ) score are examples of the interval
scale.
Ratio Scale
• Any variable for which ratios can be computed and are
meaningful is said to be on a ratio scale.
• Most variables come under this type:
• Demand for a product
• Market share of a brand
• Sales
• Salary
• If Ms Hawai Sundari’s salary is Rs. 40000 per month and Ms Dawai
Sundari’s salary is Rs. 90000 per month, then we can interpret
that Dawai Sundari earns 2.25 times the salary of Hawai
Sundari.
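The scale of measurement determines which summaries are meaningful. The sketch below pairs each of the four scales with a summary that suits it: mode for nominal, median for ordinal, mean for interval, and a ratio for ratio-scale data. The salary figures come from the example above; the other sample values are made up for illustration.

```python
from statistics import mean, median, mode

# Nominal: categories only, so the mode is the meaningful summary.
marital_status = ["single", "married", "married", "divorced"]
most_common_status = mode(marital_status)

# Ordinal: Likert codes are ordered, so the median is meaningful.
ratings = [1, 3, 3, 4, 5]
typical_rating = median(ratings)

# Interval: differences are meaningful, so the mean can be used,
# but 30 degrees C is not "1.5 times as hot" as 20 degrees C.
temps_c = [20.0, 25.0, 30.0]
average_temp = mean(temps_c)

# Ratio: a true zero exists, so ratios are meaningful.
salaries = {"Hawai": 40000, "Dawai": 90000}
salary_ratio = salaries["Dawai"] / salaries["Hawai"]

print(most_common_status, typical_rating, average_temp, salary_ratio)
```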
Popular Data Visualization tools
• Tableau
• Looker
• Zoho Analytics
• Sisense
• IBM Cognos Analytics
• Qlik Sense
• Domo
• Microsoft Power BI
• Klipfolio
• SAP Analytics Cloud
Tableau
• Tableau is a data visualization tool that can be used by data
analysts, scientists, statisticians, etc. to visualize data and
reach clear conclusions from the analysis. Tableau is
popular because it can take in data and produce the required
visualization output in a very short time. It does this
while emphasizing security, with a commitment
to handle security issues as soon as they arise or are found by
users.

• https://www.youtube.com/watch?v=YfE9jBq0
Looker
• Looker is a data visualization tool that can go in depth
into the data and analyze it to obtain useful insights. It provides
real-time dashboards of the data for more in-depth analysis
so that businesses can make instant decisions based on the
data visualizations obtained. Looker also provides connections
with Redshift, Snowflake, and BigQuery, as well as more than 50
supported SQL dialects, so you can connect to multiple
databases without any issues.
• https://www.youtube.com/watch?v=8Pzmrcu63oY
Zoho Analytics
• Zoho Analytics is business intelligence and data analytics
software that can help you create good-looking data
visualizations based on your data in a few minutes. You can
obtain data from multiple sources and mesh it together to
create multidimensional data visualizations that allow you to
view your business data across departments.
• Zoho Analytics allows you to share or publish your reports with
your colleagues and add comments or engage in conversations
as required.
• https://www.youtube.com/watch?v=Pc72RNNtXzc
Sisense
• Sisense is a business intelligence-based data visualization
system and it provides various tools that allow data analysts
to simplify complex data and obtain insights for their
organization and outsiders. Sisense believes that eventually,
every company will be a data-driven company and every
product will be related to data in some way.
• https://www.youtube.com/watch?v=6N3mkTWI5R4
IBM Cognos Analytics
• IBM Cognos Analytics is an Artificial Intelligence-based
business intelligence platform that supports data analytics
among other things. You can visualize as well as analyze your
data and share actionable insights with anyone in your
organization. Even if you have limited or no knowledge about
data analytics, you can use IBM Cognos Analytics easily as it
interprets the data for you and presents you with actionable
insights in plain language.
• https://www.youtube.com/watch?v=CMn-65yUM4U
Qlik Sense
• Qlik Sense is a data visualization platform that helps
companies to become data-driven enterprises by providing an
associative data analytics engine, sophisticated Artificial
Intelligence system, and scalable multi-cloud architecture that
allows you to deploy any combination of SaaS, on-premises or
a private cloud.
• https://www.youtube.com/watch?v=sd84bsRWSLY
Domo
• Domo is a business intelligence platform that contains multiple
data visualization tools and provides a consolidated place
where you can perform data analysis and then create
interactive data visualizations that allow other people to
easily understand your data conclusions. You can combine
cards, text, and images in the Domo dashboard so that you
can guide other people through the data while telling a data
story as they go.
• https://www.youtube.com/watch?v=S3VW8FC47io
Microsoft Power BI
• Microsoft Power BI is a Data Visualization platform focused on
creating a data-driven business intelligence culture in all
companies today. To fulfill this, it offers self-service analytics
tools that can be used to analyze, aggregate, and share the
data in a meaningful fashion.
• https://www.youtube.com/watch?v=yKTSLffVGbk
Klipfolio
• Klipfolio is a Canadian business intelligence company that
provides one of the best data visualization tools. You can
access your data from hundreds of different data sources like
spreadsheets, databases, files, and web services applications
by using connectors. Klipfolio also allows you to create custom
drag-and-drop data visualizations wherein you can choose
from different options like charts, graphs, scatter plots, etc.
• https://www.youtube.com/watch?v=sw7qApKnS8U
SAP Analytics Cloud
• SAP Analytics Cloud uses business intelligence and data
analytics capabilities to help you evaluate your data and
create visualizations in order to predict business outcomes. It
also provides you with the latest modeling tools that help you
by alerting you of possible errors in the data and categorizing
different data measures and dimensions. SAP Analytics Cloud
also suggests Smart Transformations to the data that lead to
enhanced visualizations.
• https://www.youtube.com/watch?v=eGGZ33fzxK0
Machine learning
• Machine learning algorithms are a part of artificial intelligence
(AI) that imitates the human learning process.
• Humans learn to perform a task through multiple experiences.
• Similarly, machine learning algorithms usually develop
multiple models, and each model is equivalent to an
experience.
• Human learning falls into two groups: knowledge acquisition and
skill refinement.
• Knowledge acquisition is similar to learning concepts in physics.
• Skill refinement is similar to learning to play the piano or ride a
bike.
• Machine learning algorithms imitate both the knowledge
acquisition and skill refinement processes.
• These are classified into the following four categories:

1. Supervised Learning Algorithms

2. Unsupervised Learning Algorithms

3. Reinforcement Learning Algorithms

4. Evolutionary Learning Algorithms


Supervised Learning Algorithms

• When the training data set has both predictors (input) and
outcome (output) variables, we use supervised learning
algorithms.
• Learning is supervised in the sense that both the predictors and the
outcome are available for the model to use.
• Techniques such as regression, logistic regression, decision
tree learning, and random forest are examples of supervised
learning algorithms.
• https://www.youtube.com/watch?v=Bo5dJT1QlHc
Unsupervised Learning Algorithms

• When the training data has only predictor (input) variables, but
not the outcome variable, we use unsupervised learning
algorithms.
• Techniques such as K-means clustering and hierarchical
clustering are examples of unsupervised learning algorithms.
• https://www.youtube.com/watch?v=4oB0fuOLWIY
Reinforcement Learning Algorithms

• In many cases the input and output variables are uncertain
(predictive keyboard / spell check). These algorithms are also
used in sequential decision-making scenarios.
• Techniques such as dynamic programming and Markov
decision processes are examples of reinforcement learning
algorithms.
• https://www.youtube.com/watch?v=qhRNvCVVJaA
Evolutionary Learning Algorithms
• These are algorithms that imitate natural processes, such as
biological evolution and the foraging behavior of animals.
• They are most frequently used to solve prescriptive analytics
problems.
• Techniques such as genetic algorithms and ant colony
optimization belong to this category.
• https://www.youtube.com/watch?v=cxweR4i0ejA
• https://www.youtube.com/watch?v=qiKW1qX97qA
• Machine learning is the science of designing algorithms that
learn on their own from data and adapt without human
correction.
• As we feed data to these algorithms, they build their own
logic and, as a result, create solutions relevant to aspects of
our world as diverse as fraud detection, web searches, tumor
classification, and price prediction.
• Machine learning constitutes model-building automation for
data analysis. When we assign machines tasks like
classification, clustering, and anomaly detection — tasks at
the core of data analysis — we are employing machine
learning.
• We can design self-improving learning algorithms that take
data as input and offer statistical inferences. Without relying
on hard-coded programming, the algorithms make decisions
whenever they detect a change in pattern.
Machine-Learning Algorithms for Data Analysis
• Six well-known machine-learning algorithms used in data
analysis
• Clustering
• Decision-tree learning
• Ensemble learning
• Support-vector machine
• Linear regression
• Logistic regression
• Clustering: an unsupervised learning algorithm that looks for
patterns among input values and groups them accordingly.
• Decision-tree learning: These learning algorithms take a single
data set and progressively divide it into smaller groups by creating
rules to differentiate the features they observe. Eventually, they
create sets small enough to be described by a specific label.
• Ensemble learning: Ensemble learning combines the predictions of
several models on the premise that, taken
together, the predictions are likely to be distributed around the
right answer. Their average will likely be closer to the mark than any
single guess.
• Support-vector machine: SVM algorithms classify data by finding
the boundary that best separates the classes, but it’s not always
possible to separate classes with a straight line in a 2D graph.
To resolve this, you can use a kernel: a function that maps data to
higher dimensions, where the classes can be separated.
• Linear regression: This is a modeling method suited to
forecasting and finding relationships between variables in data
analysis.
• Logistic regression: While linear regression algorithms look
for relationships between variables that are continuous by
nature, logistic regression is suited to classifying categorical
data.
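The decision-tree idea of "creating rules" to divide a data set can be illustrated with the smallest possible tree, a single split (often called a decision stump). The sketch below tries every midpoint between neighbouring values and keeps the threshold that misclassifies the fewest points. The function name and the sample data are hypothetical, and real decision-tree learners use impurity measures and many splits rather than this simplified error count.

```python
def best_threshold(xs, labels):
    """Find the threshold t for the rule "x > t means class 1"
    that misclassifies the fewest training points."""
    pairs = sorted(zip(xs, labels))
    best_t, best_errors = None, len(pairs) + 1
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        t = (a + b) / 2  # candidate split between two neighbours
        errors = sum((x > t) != bool(y) for x, y in pairs)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# Hypothetical data: minutes on site and whether the visitor purchased.
minutes_on_site = [1, 2, 3, 8, 9, 10]
purchased = [0, 0, 0, 1, 1, 1]
threshold, errors = best_threshold(minutes_on_site, purchased)
print(threshold, errors)
```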

• https://forms.gle/GMXVfb9vVEtePoTs6
Statistics
• Digital data analytics is exploratory , observational , visual and
mathematical
• Common data analysis methods are widely used in
organizations.
• Quantitative techniques are applied judiciously to data to answer
business questions.
• Certain techniques exist for understanding data:
• Correlation, to check the relationship between two or more
variables
• Regression analysis, to determine whether certain data can predict
other data
• Distribution and assessment of probability
• Hypothesis testing, to create the best-fitting model for predictive
power
Correlating
• The statistics adage is that “Correlation is not causation,”
which is certainly true. Correlation does, however, indicate
association between variables.
• The analyst’s job is thus to establish whether observed associations in
data reflect true dependence, whether they are relevant to the
business questions, and ultimately whether the variable(s) cause
the relationship calculated.
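A common way to quantify the association discussed above is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it from first principles; the `visits` and `orders` series are invented for illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly visits and orders.
visits = [100, 200, 300, 400, 500]
orders = [12, 19, 33, 38, 52]
r = pearson_r(visits, orders)
print(round(r, 3))  # close to 1: a strong positive association
```

A value near 1 shows that the two series move together; it does not show that visits cause orders.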
Regressing Data: Linear, Logistic, and So On
• The phrase regression analysis means applying a
mathematical method for understanding the relationship
between two or more variables.
• In more formal vocabulary, a regression analysis attempts to
identify the impact of one or more independent variables on a
dependent variable.
• Analytics professionals and the people who ask for analytical
deliverables often talk about regression, regression analysis,
the best fitting line, and ways to describe determining or
predicting the impact of one or more factors on a single or
multiple other factors.
• Impact of marketing program on sales
• In digital analytics, the regression analysis is used to
determine the impact of one or more factors on another
factor.
• Single and Multiple Linear Regression
• A simple linear regression is used when an analyst
hypothesizes that there is a relationship between the
movements of two variables in which the movements of one
variable impact, either positively or negatively, the movements
of another variable.
• Multiple linear regression and other forms of regression
where the dependent variable—that is, the variable for which
you are predicting—is predicted based on more than one
independent variable are used in digital analytics.
Understanding the marketing mix and how different
marketing channels impact response is often modeled using
multiple logistic regression.
• Logistic Regression
• Logistic regression enables predicting a categorical variable
based on several independent (predictor) variables.
• The output of a logistic regression is binomial if only two
answers are possible or multinomial if more than two answers
are possible.
• A 0 or 1 may be the results of binomial logistic regression,
whereas an output of “yes, no, or maybe” may be the output
of a multinomial logistic regression.
• Probability and Distributions
• The shape of data and observing shape can help an analyst
understand the data and the type of analytical methods to
use on the data.
• After all, the way an analyst applies a method to a normal
distribution versus a non-normal distribution is different.
• Probability simply stated is the study of random events. In
analytics, you use statistics and math to model and
understand probability of all sorts of things.
• In digital analytics, you are concerned about probabilities
related to whether a person will buy, visit again, or have a
deeper and more engaging experience and so on.
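For the simple linear regression described earlier in this section, the best-fitting line can be found with ordinary least squares. The sketch below is a minimal version for one independent variable; the function name and the `ad_spend` and `sales` figures are assumptions chosen so that the data lie exactly on a line.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical marketing spend (independent) and sales (dependent);
# these points lie exactly on sales = 10 + 2 * spend.
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]
slope, intercept = fit_line(ad_spend, sales)
predicted = intercept + slope * 6  # forecast for a spend of 6
print(slope, intercept, predicted)
```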
A digital analyst should be familiar with the
following concepts:
• Modeling probability and conditionality
• Building a model requires selecting (and often in analytics,
creating/collecting) accurate data, the dimension, and the
measures that can create your predictor variables.
• Central to the tendency to create models is statistical aptitude
and an understanding of measures, probability, and
conditionality.
• Measuring random variables
• A random variable is a type of data in which the value isn’t
fixed; it keeps changing based on conditions. In digital
analytics, most variables, whether continuous or discrete, are
random.
• Understanding binomial distributions and hypothesis testing
• A common way to test for statistical significance is to use the
binomial distribution when the outcome has exactly two possible
values (such as yes or no, heads or tails).
• This type of testing evaluates the null hypothesis using
Z and t tables and p-values. The tests are either one-tailed or
two-tailed.
• If you want to understand outcomes with more than two possible
values, you would use a multinomial test and go beyond simple
hypothesis testing to, perhaps, chi-square tests.
• Learning from the sample mean
• The sample mean helps you understand the distribution and
is subject, of course, to the central limit theorem, which
indicates that the larger the sample, the more closely
the distribution of the sample mean will approximate normal.
• Thus, when modeling data, the sample mean and the related
measures of standard deviation and variance can help you
understand the relationship between variables, especially
with smaller data sets.
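The binomial test mentioned above can be computed exactly for small samples. The sketch below gives a one-tailed p-value: the probability of seeing at least the observed number of successes if the null hypothesis (a success rate of `p_null`) were true. The function name and the 9-out-of-10 example are illustrative.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-tailed p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical result: 9 of 10 visitors preferred the new page.
# Null hypothesis: visitors have no preference (p = 0.5).
p_value = binomial_p_value(9, 10)
print(round(p_value, 4))
```

A small p-value means the result would be unusual under the null hypothesis; the threshold for "small" is chosen before the test is run.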
Experimenting and Sampling Data

• Experimenting with digital analytics means changing one
element of the digital experience for a sample of visitors and
comparing the behavior and outcomes of those visitors to a
control group that received the expected digital experience.
• The goal of experimentation is to test hypotheses, validate
ideas, and better understand the audience/customer. In reality,
though, digital is not biology, and it is often impossible to hold
all elements of digital behavior equal and change just one
thing.
• Thus, experimenting in digital means controlled
experimentation. A controlled experiment is an experiment
that uses statistics to validate the probability that a sample is
as close as possible to identical to the control group.
• Although the boundaries of a controlled experiment may be
perceived as less rigorous than a true experiment in which
only one variable changes, that’s not actually true because
controlled experiments, when performed correctly, use the
scientific method and are statistically valid.
Population
• The aggregate group of people on which the controlled
experiment is performed or for which already-collected data is
analyzed.
• The population is divided into at least two groups: the control
group and the test group.
• The control group does not receive the test, whereas the test
group, of course, does.
Sampling method
• The way you select the people, customers, visitors, and so on
for your experiment is important. It depends on whether
you want to understand a static population or a process,
because different sampling methods are required. Sampling is
important because a poorly or sloppily sampled group can
give you poor results from experimentation.
• Random sample
• Stratified sampling
• Systematic sampling
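The three sampling methods listed above can be sketched with the standard `random` module. The helper names, the fixed seed, and the `visitors` records are assumptions for illustration; the stratified version draws an equal number from each stratum, which is only one of several possible allocation schemes.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Every member has the same chance of being selected."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, step, start=0):
    """Take every `step`-th member, beginning at index `start`."""
    return population[start::step]

def stratified_sample(population, key, per_stratum, seed=0):
    """Group members by `key`, then sample within each group."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical visitor records split across two device types.
visitors = [{"id": i, "device": "mobile" if i % 2 else "desktop"}
            for i in range(20)]
random_group = simple_random_sample(visitors, 5)
every_fourth = systematic_sample(visitors, step=4)
by_device = stratified_sample(visitors, key=lambda v: v["device"],
                              per_stratum=2)
print(len(random_group), len(every_fourth), len(by_device))
```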
Expected error
• When analyzing the results of experiments,
you need to go into your experiment with an idea of
the expected amount of error you are willing to tolerate.
• There are various types of errors (such as type 1 and type 2).
Confidence intervals and confidence levels are applied to
understand and limit expected error (or variability by chance)
to an acceptable level that meets your business needs.
• Independent variables
• The variables you set, hold fixed, or observe in the population
or its subgroups are the independent variables; they are the
inputs used to explain the outcome.
• Dependent variables
• The predicted variables that are the outcome of the data
analysis.
Confidence intervals
• Commonly stated at 95 percent or 99 percent. Other times
they could be as low as 50 percent. The meaning of a
confidence interval is often said to be that “99 percent of
the population will do X or has Y,” but that interpretation
is incorrect.
• A better way to think of confidence intervals in digital analysis
is that, were you to perform the same analysis again and again on
different samples, the resulting intervals would contain the true
population value about 99 percent of the time.
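As an illustration of a confidence interval in a digital setting, the sketch below computes an interval for a conversion rate using the normal approximation. The function name and the figures (50 conversions from 1,000 visitors) are hypothetical, and the approximation is only reasonable when the sample is large and the rate is not too close to 0 or 1.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, level=0.95):
    """Normal-approximation interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical experiment: 50 conversions out of 1,000 visitors.
low, high = proportion_confidence_interval(50, 1000)
print(round(low, 4), round(high, 4))
```

Raising the confidence level from 95 to 99 percent widens the interval, because more certainty requires a larger margin.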
Significance testing
• Involves calculating how likely the observed outcome would be if
only chance were at work. Often expressed as a level between 10
percent and 0.01 percent, the significance test enables you to
judge whether the results could have been caused by error or
chance.
• Done right, analysts can say that their result was significant
at the 99 percent level, meaning that, if there were no real effect,
a result at least this extreme would be expected by chance only
about 1 time in 100.
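A typical significance test in a controlled digital experiment compares the conversion rate of the control group with that of the test group. The sketch below uses a two-sided, two-proportion z-test; the function name and the visitor counts are assumptions, and other tests (for example chi-square) could be used instead.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical experiment: control converts 100 of 2,000 visitors,
# the test variant converts 150 of 2,000.
p_value = two_proportion_p_value(100, 2000, 150, 2000)
significant_at_99 = p_value < 0.01
print(round(p_value, 4), significant_at_99)
```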
Comparisons of data over time

• Such as Year over Year, Week over Week, and Day over Day
are helpful for understanding data movements positively and
negatively over time. Outlier comparisons need to be
investigated.
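A period-over-period comparison reduces to a percent change between two values. The sketch below shows the calculation for a year-over-year case; the function name and the visit totals are hypothetical.

```python
def percent_change(current, previous):
    """Change from `previous` to `current`, as a percentage of `previous`."""
    return (current - previous) / previous * 100

# Hypothetical annual visit totals.
visits_by_year = {2022: 40000, 2023: 50000}
yoy = percent_change(visits_by_year[2023], visits_by_year[2022])
print(yoy)  # positive means growth, negative means decline
```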
Inferences
• Inferences are what is made as a result of analysis: the
logical conclusions (the insights) derived by using statistical
techniques and analytical methods.
• The result of an inference is a recommendation and insight
about the sampled population.
Attribution: Determining the Business Impact
and Profit
• During the last few years, the concept of attribution has become
important within digital analytics.
• In digital analytics, attribution is the activity and process for
establishing the origin of the people who visited a digital experience.
Attribution is a rich area explored by data scientists worldwide.
• The roots of attribution for digital analytics come from traditional
web and site analytics where business people, primarily marketing,
wanted to understand the reach (that is, the number of people), the
frequency, and the monetary impact of marketing programs and
campaigns.
• Going back even further, the idea of attribution has roots in
financial management and measurement.
• Attribution enables an analyst to identify from data that an
absolute number of visits or visitors came from a particular
source, such as paid search, display advertising, or an email
campaign.
• By understanding the sources that send people who convert
(and thus create economic value), business people can then
fine-tune and optimize their work to produce the best
financial result.
• Attribution in digital analytics includes the click but also goes
beyond the click.
• Interactions with digital experiences that may not require a
click (think of a touch-enabled smart device) can be attributed
—as can exposures to events, types of content, or advertising
(as in the case of view-thru conversion).
• Fairly common attribution models are listed below, where a
click describes what could also be an interaction or exposure:
First click (or interaction or exposure)
• Frames the attribution calculation on the first click a person
made prior to conversion or creating economic value.
• In first-click attribution, if you first visit a site through Google
Paid Search and then visit the site again through organic search,
the credit for the purchase in this model would be associated
with Google Paid Search, not organic search, because the first
click that led the person to the site was Google Paid Search.
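First-click attribution can be sketched as a tally over conversion paths. This is a minimal illustration: the function name, the representation of a path as an ordered list of source names, and the sample paths are all assumptions.

```python
from collections import Counter


def first_click_credit(conversion_paths):
    """Give each conversion's full credit to the first source in its path."""
    credit = Counter()
    for path in conversion_paths:
        if path:  # skip empty paths, which have no source to credit
            credit[path[0]] += 1
    return credit


if __name__ == "__main__":
    paths = [  # hypothetical conversion paths, oldest touch first
        ["paid search", "organic search"],
        ["email", "direct"],
        ["paid search"],
    ]
    print(first_click_credit(paths))
```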
Last click (or interaction or exposure)
• Last click is probably the most common form of attribution
because it is supported by most analytics tools and is the
easiest to understand.
• Last click is the opposite of first click. In the scenario
from the previous bullet, in which the person first came from
paid search and last came from organic search before the
purchase, organic search would receive the attribution credit.
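Last-click attribution mirrors first click by crediting the final source in each path. The sketch below assumes a path is an ordered list of source names; the function name and sample paths are hypothetical.

```python
from collections import Counter


def last_click_credit(conversion_paths):
    """Give each conversion's full credit to the last source in its path."""
    credit = Counter()
    for path in conversion_paths:
        if path:  # skip empty paths, which have no source to credit
            credit[path[-1]] += 1
    return credit


if __name__ == "__main__":
    paths = [  # hypothetical conversion paths, oldest touch first
        ["paid search", "organic search"],
        ["email", "direct"],
        ["paid search"],
    ]
    print(last_click_credit(paths))
```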
Last nondirect click (or interaction or
exposure)
• In cases in which a person has come to a digital experience
during the look-back window from more than two sources (such
as direct, organic search, and via an email campaign), this
model would give the attribution credit to the last nondirect
source of traffic whatever that may be.
• In the case in which the first visit was from organic search, the
second visit from paid search, and the final visit where the
conversion occurred came from direct traffic, this model would
give attribution credit to paid search, the last nondirect click.
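The last nondirect rule amounts to walking the path backward until a source other than direct is found. In this sketch the function name and the "direct" label are assumptions, and falling back to direct when every touch is direct is an illustrative choice, not something the original specifies.

```python
def last_nondirect_source(path, direct_label="direct"):
    """Return the most recent source in the path that is not direct traffic."""
    for source in reversed(path):
        if source != direct_label:
            return source
    # Assumed fallback: every touch was direct (or the path is empty).
    return direct_label if path else None


if __name__ == "__main__":
    # Hypothetical path within the look-back window, oldest touch first.
    print(last_nondirect_source(["organic search", "paid search", "direct"]))
```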
Last N click
• Where N is a digital channel, such as search, mobile, or video.
• This attribution model is used to validate the impact of certain
sources of traffic by assigning the credit for attribution to the
last source as defined by your business.
• For example, in the case in which the first visit was from
display, the second visit from paid search, and the final visit
with the conversion occurred from a link in a video, this model
would give attribution credit to the source you define, such as
paid search.
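One reading of Last N click is that credit goes to the business-defined channel whenever it appears in the path. The sketch below follows that reading; the function name is hypothetical, and falling back to the plain last click when the channel is absent is an assumption, not something the original states.

```python
def last_n_click(path, channel):
    """Credit the business-defined channel if it appears in the path."""
    if channel in path:
        return channel
    # Assumed fallback: use the ordinary last click when the channel is absent.
    return path[-1] if path else None


if __name__ == "__main__":
    # Hypothetical path: display, then paid search, then a link in a video.
    print(last_n_click(["display", "paid search", "video"], channel="paid search"))
```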
Linear
• This model allocates an identical amount of credit to every
source of attribution.
• All observations are collected and aggregated to create a
linear attribution based on equal weightings. For
example, if a person came to a site via four different
interactions, each interaction would be
given 25 percent credit.
• By tallying all interaction types and averaging all credits given
to those types, linear attribution can be identified.
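The equal split can be sketched as follows. The function name and the four sample sources are hypothetical; a source that appears more than once in a path accumulates one share per appearance, which is an assumption about how repeats are handled.

```python
from collections import defaultdict


def linear_credit(path):
    """Split one conversion's credit equally across every touch in the path."""
    if not path:
        return {}
    share = 1.0 / len(path)
    credit = defaultdict(float)
    for source in path:
        credit[source] += share  # repeated sources accumulate their shares
    return dict(credit)


if __name__ == "__main__":
    # Hypothetical path with four different interactions.
    print(linear_credit(["email", "display", "paid search", "organic search"]))
```

Averaging these per-conversion shares across all conversions gives the overall linear attribution for each source.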
Time decay, time lapse, and latent
• The time decay, time lapse, and latent attribution models are
similar and often considered almost or totally identical. In this
case, more credit is given to the sources closer in time to the
conversion event.
• For example, if a visitor came from five different clicks (paid,
organic, display, online ads, and direct), this model might give
70 percent of the credit to the last click, 20 percent to the
second to last click, and so on. The weightings in time decay
attribution can be customized.
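Time decay can be sketched with weights that shrink for each step away from the conversion. The geometric weighting and the decay factor of 0.5 below are assumptions chosen for illustration; the function name is hypothetical, and a business could substitute its own weights, such as the 70 and 20 percent split mentioned above.

```python
def time_decay_credit(path, decay=0.5):
    """Weight each touch by decay ** (steps before conversion), then normalize."""
    if not path:
        return {}
    n = len(path)
    # The last touch gets weight 1, the one before it gets decay, and so on.
    raw = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    credit = {}
    for source, weight in zip(path, raw):
        credit[source] = credit.get(source, 0.0) + weight / total
    return credit


if __name__ == "__main__":
    # Hypothetical five-click path, oldest touch first.
    path = ["paid", "organic", "display", "online ads", "direct"]
    print(time_decay_credit(path))
```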
Construct-based, such as position
• Popularized in paid search where geo-spatial position on-
screen or on-device is important to the revenue model, the
concept of attribution from a particular construct is
important, such as position.
• Construct-based attribution can be used to determine the
impact of screen location and design, and for cost-per-click,
the bid and position help identify the attribution.
Event-based click (or interaction or exposure)
• This model provides attribution credit to specific, custom-
defined events that may be a click, interaction, exposure, or
another concept related to the movement of people into a
digital experience.
• In this case, it is possible to associate the impact of behavioral
events on the credit given to sources.
Rules-based click (or interaction or exposure)
• In this type of attribution, rules are created by the business
and assigned to clicks, interactions, exposures, and events
related to the revenue being generated or another metric.
The rules assigned can change the look-back window, the
weightings of the sources given credit, and other rules as
developed for your business context.
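A rules-based model can be sketched as a look-back filter followed by business-defined weights. Everything specific here is an assumption for illustration: the function name, the representation of a touch as a (source, days before conversion) pair, the 30-day window, and the sample weights.

```python
def rules_based_credit(touches, lookback_days=30, source_weights=None,
                       default_weight=1.0):
    """Apply a look-back window and per-source weights, then normalize to 1."""
    source_weights = source_weights or {}
    raw = {}
    for source, days_before in touches:
        if days_before > lookback_days:
            continue  # rule 1: ignore touches outside the look-back window
        # Rule 2: weight the source as the business has defined it.
        weight = source_weights.get(source, default_weight)
        raw[source] = raw.get(source, 0.0) + weight
    total = sum(raw.values())
    if total == 0:
        return {}
    return {source: weight / total for source, weight in raw.items()}


if __name__ == "__main__":
    # Hypothetical touches: (source, days before the conversion).
    touches = [("email", 40), ("paid search", 10), ("display", 2)]
    print(rules_based_credit(touches, lookback_days=30,
                             source_weights={"paid search": 2.0}))
```

Changing `lookback_days` or `source_weights` changes which sources receive credit and how much, which is the sense in which the rules reflect the business context.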
Algorithmic
• This catch-all phrase for attribution modeling references
those created using machine learning, data mining, and
statistical methods—some of which, like regression, were
reviewed earlier in this chapter.
• Algorithmic attribution is often the label given to a
model that is proprietary within a closed, commercial
system.
• The underlying algorithm is the trade secret or the
intellectual property; hence, algorithmic attribution,
though real, is often a term used to describe models in
which the details may not be explained to the users.
