
Data Analytics - Intermediate

Foundation Skills Academy


Index

1. Quick Recap of Basics

2. Data Engineer’s Ecosystem

3. Data Extraction & Storage

4. Languages for Data Professionals

5. Data Analysis – Regression Analysis

6. Marketing Mix Modeling


Data Analytics Basics Recap
Skills Required for Data Analytics

To excel in the field of Data Analytics, you need a synthesis of technical skills, domain skills and soft skills

Technical skills
• Data Analytics is not only about analyzing data but also about understanding how to collect, organize and store data
• Understanding important programming languages such as SQL, Python and R
• Forming a strong base in statistics and mathematics for advanced analytics
• Understanding Machine Learning algorithms to identify data patterns and solve business cases

Domain skills - Bridge the gap between raw data and recommendations, interpret results accurately, collaborate with SMEs

Soft skills
• Communication skills to explain your data-driven insights and market trends to businesses and convince them to
make smart business decisions. The art of storytelling is important.
• Visualization skills for effective reporting and dashboarding
Data Engineer’s Ecosystem
Data Engineer’s Ecosystem

• A data engineer’s ecosystem includes the infrastructure, tools, frameworks, and processes for extracting data from
disparate sources, architecting and managing data pipelines for transformation, integration, and storage of data.

• Data comes in a wide-ranging variety of file formats being collected from a variety of data sources, ranging from relational
and non-relational databases to APIs, web services, data streams, social platforms, and sensor devices.

• There are two main types of Data Repositories - Transactional and Analytical

• Transactional systems, also known as Online Transaction Processing (or OLTP) systems, are designed to store
high-volume day-to-day operational data e.g. online banking transactions, ATM transactions, and airline bookings.

• Analytical systems, also known as Online Analytical Processing (OLAP) systems, are optimized for complex data
analytics. These include relational and non-relational databases, data warehouses, data marts, data lakes.

• Data Integration tools combine data from disparate sources using processes such as the Extract-Transform-and-Load.
The ecosystem includes languages such as query languages, programming languages, and shell and scripting languages.

• BI and Reporting tools are used to collect data from multiple data sources and present them in a visual format, such as
interactive dashboards. These are drag and drop products that do not require the users to know any programming.
Data Extraction

• Typically, internal applications use relational databases such as SQL Server, Oracle, MySQL, DB2, to store data in a
structured way.

• External to the organization, there are other publicly and privately available datasets. For example, government
organizations releasing demographic and economic datasets. There are companies that sell specific data, for example,
Point-of-Sale data or Financial data, or Weather data.

• Many data providers and websites provide APIs (Application Program Interfaces) and Web Services. APIs and Web
Services typically listen for incoming requests and return data in plain text, XML, HTML, JSON, or media files e.g. X and
Facebook provide APIs to source data from tweets/posts, Stock Market APIs for pulling share and commodity prices.

• Web scraping makes it possible to download specific data from web pages based on defined parameters. Web scrapers
can extract text, contact information, images, videos, product items etc. which can be used to provide price comparisons;
extracting data on various forums and communities; collecting training and testing datasets for machine learning models.

• Data streams are another widely used source for aggregating constant streams of data flowing from sources such as
instruments, IoT devices, and applications, GPS data from cars, computer programs, websites, and social media posts.

• RSS (or Really Simple Syndication) feeds, are another popular data source used for capturing updated data from online
forums and news sites where data is refreshed on an ongoing basis.
Data Storage

• There are different types of databases. Several factors influence the choice of database, such as the data type and structure,
querying mechanisms, latency requirements, transaction speeds, and intended use of the data.
• Relational databases are optimized for data operations and querying involving many tables and much larger data volumes.
• Non-relational databases store data in a schema-less or free-form fashion. NoSQL is widely used for processing this data.
• A data warehouse works as a central repository that merges information coming from disparate sources and consolidates it
into one comprehensive database for analytics and business intelligence.
• A data mart is a sub-section of the data warehouse, built specifically for a particular business function or purpose.
• A data lake exists as a repository of raw data in its native format, straight from the source, to be transformed.
• Big Data Stores include distributed computational and storage infrastructure to store, scale, and process large data sets.
Big Data Processing Technologies - Hadoop, Hive, Spark
• Hadoop is a collection of tools that provides distributed storage and processing of big data.
• In a Hadoop distributed system, a node is a single computer, and a collection of nodes forms a cluster. Hadoop can scale up
from a single node to any number of nodes, each offering local storage and computation.
• HDFS provides scalable big data storage by partitioning files over multiple nodes. It splits large files across multiple
computers, allowing parallel access and processing. It also replicates file blocks on different nodes to prevent data loss.
• Hive is an open-source data warehouse software for reading, writing, and managing large data set files that are stored
directly in either HDFS or other data storage systems such as Apache HBase.
• Spark is a general-purpose data processing engine designed to extract and process large volumes of data for a wide range
of applications, including Interactive Analytics, Streams Processing, Machine Learning, Data Integration, and ETL. It takes
advantage of in-memory processing to significantly increase the speed of computations, spilling to disk only when
memory is constrained.
Languages for Data Professionals
Languages for Data Professionals

• There are hundreds of programming languages out there, built for diverse purposes. These can be categorized as – query
languages, programming languages, and shell scripting.
• Query languages are designed for accessing and manipulating data in a database; for example, SQL
• Programming languages are designed for developing applications and controlling application behavior; for example,
Python, R, and Java
• Shell and Scripting languages, such as Unix/Linux Shell, and PowerShell, are ideal for repetitive and time-consuming
operational tasks
Languages for Data Professionals - SQL

• SQL, or Structured Query Language, is a querying language designed for accessing and manipulating information from,
mostly, though not exclusively, relational databases.
• Using SQL, we can write a set of instructions to perform operations such as inserting, updating, and deleting records in a
database; creating new databases, tables, and views; and writing stored procedures, which are sets of instructions
saved and called for later use.
• Here are some advantages of using SQL,
• SQL is portable and can be used independent of the platform. It can be used for querying data in a wide variety of
databases and data repositories, although each vendor may have some variations and special extensions.
• It has a simple syntax that is similar to the English language. Its syntax allows developers to write programs with
fewer lines than some of the other programming languages using basic keywords such as select, insert, into, and
update.
• It can retrieve large amounts of data quickly and efficiently.
• It runs on an interpreter system, which means code can be executed as soon as it is written, making prototyping quick.
• Due to its large user community and the sheer volume of documentation accumulated over the years, it continues to
provide a uniform platform, worldwide, to all its users.
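
For illustration, here is a minimal sketch of running SQL from R using the DBI and RSQLite packages (assuming both are installed); the table and column names are hypothetical:

library(DBI)

# Open an in-memory SQLite database (hypothetical example data)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE sales (region TEXT, amount REAL)")
dbExecute(con, "INSERT INTO sales VALUES ('North', 1200.50), ('South', 980.00)")

# Query and aggregate with basic SQL keywords: SELECT, FROM, GROUP BY
result <- dbGetQuery(con, "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
print(result)
dbDisconnect(con)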
Languages for Data Professionals – Python, R

Python
• Python is a widely-used open-source, general-purpose, high-level programming language.
• Its syntax allows programmers to express their concepts in fewer lines of code compared to older languages
• Python is perceived as one of the easiest languages to learn and has a large developer community making it ideal for
beginning programmers.
• Python provides libraries like NumPy and Pandas, which ease data tasks through fast, vectorized operations. It has built-in
functions for almost all of the frequently used operations.
• Python supports multiple programming paradigms, such as object-oriented, imperative, functional, and procedural, making it
suitable for a wide variety of use cases.
R
• R is a language and environment for statistical programming which includes statistical computing and graphics.
• It has many easy-to-use packages, such as ggplot2 and caret, for tasks like visualization and modeling.
• R applications are ideal for visualizing your data in beautiful graphics.
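
As a small illustration of R's visualization strengths, the sketch below plots hypothetical data with ggplot2 (assumed installed); the variable names are invented for the example:

library(ggplot2)

# Hypothetical data: advertising spend vs. sales
df <- data.frame(spend = c(10, 20, 30, 40, 50),
                 sales = c(25, 41, 58, 70, 89))

ggplot(df, aes(x = spend, y = sales)) +
  geom_point() +                              # scatter of raw observations
  geom_smooth(method = "lm", se = FALSE) +    # fitted regression line
  labs(title = "Sales vs. Advertising Spend", x = "Spend", y = "Sales")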
Languages for Data Professionals - Unix

• A Unix/Linux shell script is a computer program written for the UNIX shell.


• It is a series of UNIX commands written in a plain text file to accomplish a specific task.
• Writing a shell script is fast and easy. It is most useful for repetitive tasks that may be time-consuming to execute by
typing one line at a time.
• Typical operations performed by shell scripts include: file manipulation, program execution, system administration tasks such
as disk backups and evaluating system logs, installation scripts for complex programs, and running batches
Regression Analysis
Data Analysis – Regression Analysis
Regression analysis is a statistical method for understanding and quantifying the relationship between two or more variables.
It helps a business estimate one dependent variable based on the values of one or more independent variables.

Dependent Variable: The dependent variable is essentially the "outcome" you’re trying to understand or predict. It’s the
focus of your study, whether you’re looking at quarterly sales figures, customer satisfaction ratings, or any other key result.

Independent Variable: Independent variables are the "factors" that might influence or cause changes in the dependent
variable. These are the variables you manipulate or observe to see their impact on your outcome of interest. For example, if
you adjust the price of a product, that price change is an independent variable that could affect sales figures.

Output of Regression Analysis

Data Analysis – Types of Regression Analysis

Simple Linear regression: Simple linear regression is used when a single independent variable predicts a dependent
variable. The linear regression formula is represented as Y = a + bX, where Y is the dependent variable, X is the independent
variable, a is the intercept (the value of Y when X = 0), and b is the slope, also called the coefficient (the change in Y for a unit change in X).

Business Application: It's frequently used to identify how a change in one variable will affect another. For example, predicting
sales based on advertising expenditure or estimating employee productivity based on hours worked.

Multiple Linear regression: Multiple regression extends linear regression by considering multiple independent variables to
predict the dependent variable. The relationship is represented as Y = a + b₁X₁ + b₂X₂ + ... + bₙXₙ

Business Application: Businesses use it to understand how multiple factors influence outcomes. For instance, predicting
home prices based on features like square footage, number of bedrooms, and neighborhood.
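
As a minimal sketch in R, using a hypothetical housing dataset (the numbers below are invented for illustration):

# Hypothetical home prices (in $1,000s) with two features
homes <- data.frame(
  price    = c(300, 350, 420, 500, 610),
  sqft     = c(1200, 1500, 1800, 2200, 2600),
  bedrooms = c(2, 3, 3, 4, 4)
)

# Fit Y = a + b1*X1 + b2*X2 with lm()
model <- lm(price ~ sqft + bedrooms, data = homes)
summary(model)  # coefficients estimate each feature's effect on price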

Non-Linear regression: It is used in cases where the relationship between the dependent and independent variables is
nonlinear. The model can take various forms depending on the specific problem. It is generally represented as Y = f(X, θ)
where θ represents the parameters of the nonlinear function f.
Data Analysis – Types of Regression Analysis

Examples of Nonlinear regression,

Logistic Regression: Logistic regression is used when the dependent variable is binary (two possible outcomes). It models
the probability of a particular outcome occurring.

Business Application: In business, logistic regression is employed for tasks like predicting customer churn (yes/no), whether
a customer will purchase a product (yes/no), or whether a loan applicant will default on a loan (yes/no).
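
A minimal sketch in R, using glm() with a binomial family on hypothetical churn data:

# Hypothetical customers: churned (1 = yes) and tenure in months
churn_df <- data.frame(
  churned = c(0, 0, 1, 1, 0, 1, 0, 0),
  tenure  = c(24, 36, 3, 6, 8, 2, 30, 5)
)

# Logistic regression models the probability of churn
churn_model <- glm(churned ~ tenure, data = churn_df, family = binomial)

# Predicted churn probability for a customer with 12 months of tenure
predict(churn_model, newdata = data.frame(tenure = 12), type = "response")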

Polynomial Regression: Polynomial regression is used when the relationship between the independent and dependent
variables follows a polynomial curve and is not linear.

Business Application: It can be used to model more complex relationships in data, such as predicting the growth of a plant
based on time and other environmental factors.
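
A minimal sketch in R, fitting a quadratic with lm() and poly() on hypothetical plant-growth data:

# Hypothetical plant heights measured over eight days
growth <- data.frame(day = 1:8,
                     height = c(2, 5, 9, 15, 22, 30, 41, 52))

# Fit a second-degree polynomial: height ~ day + day^2
poly_fit <- lm(height ~ poly(day, 2), data = growth)
summary(poly_fit)  # check the fit of the curved relationship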

Exponential Regression: Exponential regression is a type of nonlinear regression that fits an exponential function to the
data. The general form of an exponential regression model is Y = a · e^(bX).

Power Regression: Power regression is a type of nonlinear regression that fits a power function to the data. The general
form of a power regression model is Y = a · X^b.
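
A minimal sketch in R, fitting the exponential form Y = a · e^(bX) with nls() on hypothetical data (the starting values are rough guesses to initialize the optimizer):

# Hypothetical data that roughly doubles at each step
df <- data.frame(x = 1:6,
                 y = c(2.1, 3.9, 7.8, 16.2, 31.5, 64.0))

# Nonlinear least squares; 'start' seeds the iterative fit
fit <- nls(y ~ a * exp(b * x), data = df,
           start = list(a = 1, b = 0.5))
coef(fit)  # estimated a and b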
Data Analysis – Importance of Regression Analysis

Predictive Modeling: Regression analysis is commonly used for predictive modeling. By examining historical data and
identifying relationships between variables, businesses can make informed predictions about sales, demand, etc.

Identifying Key Drivers: Regression analysis can help identify which independent variables significantly impact the
dependent variable. For example, it can determine which marketing channels or advertising strategies influence sales the most.

Optimizing Decision Making: Whether it's optimizing pricing strategies, production processes, or marketing campaigns,
regression can help companies allocate resources efficiently and achieve better outcomes.

Risk Assessment: Businesses are exposed to various risks, such as economic fluctuations, market changes, and
competitive pressures. Regression analysis-powered risk assessment techniques can be used to assess how changes in
independent variables may affect business performance.

Performance Evaluation: Regression analysis can evaluate the effectiveness of different initiatives and strategies. For
instance, it can assess the impact of employee training on productivity or the relationship between customer satisfaction
and repeat purchases.

Market Research: In market research, regression analysis can be used to understand consumer behavior and
preferences. By examining demographics, pricing, and product features, businesses can tailor their products and
marketing efforts to specific target audiences.
How to Perform Regression Analysis?

Data collection and preparation: Gather and clean data, ensuring it meets assumptions like linearity and independence.

Appropriate regression model: Choose the correct type of regression (linear, polynomial etc.) based on data and objective

Data analysis and interpretation: Analyze results, assess model accuracy, and interpret coefficients to draw conclusions.

Model evaluation and validation: Test the model's performance using metrics like R-squared and mean squared error.

• p values and coefficients in regression analysis work together to show which relationships in the model are statistically
significant and the nature of those relationships.

• The linear regression coefficients describe the mathematical relationship between each independent variable and the
dependent variable. The p values for the coefficients indicate whether these relationships are statistically significant.

• After fitting a regression model, check the residual plots to be sure that you have unbiased estimates.

• R-squared is a goodness-of-fit measure for linear regression models. This statistic indicates the percentage of the
variance in the dependent variable that the independent variables explain collectively. R-squared measures the strength
of the relationship between your model and the dependent variable on a 0 – 100% scale. For example, an R-squared of
60% reveals that 60% of the variability observed in the target variable is explained by the regression model. Generally, a
higher R-squared indicates more variability is explained by the model.
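
As a minimal sketch, R-squared can be computed by hand in R from the residual and total sums of squares (hypothetical data):

# Hypothetical data with a roughly linear relationship
x <- c(1, 2, 3, 4, 5)
y <- c(2.2, 3.9, 6.1, 8.0, 9.8)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
1 - rss / tss                  # matches summary(fit)$r.squared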

Using software tools: Use Python or R to perform regression analysis efficiently.


Simple Linear Regression Example

You are a social researcher interested in the relationship between income and happiness. You survey 500 people whose
incomes range from $15k to $75k and ask them to rank their happiness on a scale from 1 to 10.

Your independent variable (income) and dependent variable (happiness) are both quantitative, so you can do a regression
analysis to see if there is a linear relationship between them.

R code for simple linear regression:

income.happiness.lm <- lm(happiness ~ income, data = income.data)

This code takes the data you have collected (data = income.data) and calculates the effect that the independent variable
income has on the dependent variable happiness using the equation for the linear model: lm().

To view the results of the model, you can use the summary() function in R:

summary(income.happiness.lm)
Simple Linear Regression Example

Results of the Model:

This output table first repeats the formula that was used to generate the results ('Call'), then summarizes the model
residuals ('Residuals'), which give an idea of how well the model fits the real data.

Next is the 'Coefficients' table. The first row gives the estimate of the y-intercept, and the second row gives the regression
coefficient of the model.

happiness = 0.20 + 0.71*income ± 0.018

The number in the table (0.713) tells us that for every one unit increase in income (where one unit of income = $10,000)
there is a corresponding 0.71 unit increase in reported happiness (where happiness is measured on a scale of 1 to 10).
Simple Linear Regression Example

Results of the Model:

The Std. Error column shows how much variation there is in our estimate of the relationship between income and
happiness.

The t value column displays the test statistic. The larger the test statistic, the less likely it is that our results occurred by
chance.

The Pr(>|t|) column shows the p value. The p-value indicates whether the independent variable has a significant
influence. p-values smaller than 0.05 (or sometimes 0.001) are considered significant.

Because the p value is so low (p < 0.001), we can conclude that income has a statistically significant effect on happiness.
Simple Linear Regression Example
Residual Plots
A residual is a measure of how far away a point is vertically from the regression line. Simply put, it is the error between a
predicted value and the observed actual value.

The most important assumption of a linear regression model is that the errors are independent and normally distributed.

A few characteristics of a good residual plot are as follows:

• It has a high density of points close to the origin and a low density of points away from the origin
• It is symmetric about the origin
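
A minimal sketch of a residual plot in R, reusing the income/happiness model fitted earlier (assuming income.data is loaded and income.happiness.lm exists):

# Residuals vs. fitted values; a good plot scatters evenly around zero
plot(fitted(income.happiness.lm), residuals(income.happiness.lm),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero error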
Assumptions of Regression Analysis

Linearity: The relationship between the independent and dependent variables is linear.

Sample representativeness: The sample is representative of the population.

Normally distributed errors: The errors are normally distributed.

Homoscedasticity: The variance of the errors (residuals) remains constant across all levels of the independent
variable(s). Put simply, it signifies that the dispersion of residuals stays consistent, enhancing the accuracy and legitimacy
of regression predictions.

No multicollinearity: When independent variables are highly correlated, it becomes challenging to determine their individual
impact on the dependent variable.

No outliers: There are no outliers in the data.

No endogeneity: There is no relationship between the errors and the independent variables that can lead to biased
results.
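
As one example of checking these assumptions, the sketch below screens for multicollinearity with variance inflation factors, using vif() from the car package (assumed installed) on simulated data:

library(car)  # provides vif()

set.seed(1)
d <- data.frame(y = rnorm(50), x1 = rnorm(50))
d$x2 <- d$x1 + rnorm(50, sd = 0.1)  # x2 is nearly a copy of x1

# Rule of thumb: VIF values above roughly 5-10 flag multicollinearity
vif(lm(y ~ x1 + x2, data = d))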
Marketing Mix Modeling
Marketing Mix Modeling

Marketing Mix Modeling (MMM) is a technique that quantifies the impact of several marketing inputs on Sales or
Market Share. The purpose of using MMM is to understand how much each marketing input contributes to sales, and how
much to spend on each marketing input. Specifically, here are some ways MMM helps businesses thrive:

Optimizing marketing spend: helps businesses understand which marketing activities contribute most effectively to
achieving business objectives.

Budget allocation: after analyzing the ROI of various marketing channels and tactics, businesses can make more
informed decisions about where to allocate their marketing budget for the greatest yield.

Forecasting and planning: businesses can simulate the impact of changes in marketing strategies or external factors and
use these insights to anticipate potential outcomes and adjust their plans accordingly.

Understanding customer behavior: helps businesses understand how different customer segments respond to various
marketing stimuli, enabling more targeted and effective marketing strategies.

Continuous improvement: monitoring key performance metrics and analyzing trends enables businesses to identify
opportunities for optimization, test new strategies, and adapt to changing market conditions, ensuring that their marketing
efforts remain effective and competitive.
Marketing Mix Modeling - Regression

MMM uses regression techniques, and the resulting analysis is used to extract key insights.

The dependent variable could be Sales or Market Share. The independent variables usually used are distribution, price,
TV spend, outdoor campaign spend, digital spend, newspaper and magazine spend, below-the-line promotional
spend, and consumer promotions information.

An equation is formed between the dependent variable and the predictors. This equation could be linear or non-linear
depending on the relationship between the dependent variable and the various marketing inputs.

In Market Mix Modeling, sales are divided into 2 components:


a. Base Sales: Base Sales is what marketers get if they do not do any advertising. It is sales due to brand equity built
over the years. Base Sales are usually fixed unless there is some change in economic or environmental factors.
b. Incremental Sales: Sales generated by marketing activities like TV advertisements, print advertisements, digital
spend, promotions etc. Total incremental sales are split into sales from each input to calculate each input's contribution to total sales.
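
A minimal sketch of this decomposition in R, on hypothetical weekly data (all figures invented for illustration):

# Hypothetical weekly sales and media spend (in $k)
mmm <- data.frame(
  sales   = c(120, 135, 150, 160, 145, 170),
  tv      = c(10, 15, 20, 25, 18, 30),
  digital = c(5, 6, 8, 9, 7, 11)
)

fit <- lm(sales ~ tv + digital, data = mmm)
coef(fit)["(Intercept)"]        # rough proxy for base sales
coef(fit)["tv"] * mean(mmm$tv)  # average incremental sales driven by TV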
Marketing Mix Modeling – Simple Linear Regression

• The goal of simple regression analysis is to find the line (i.e. the intercept and slope) that best fits the data points in
this chart, minimizing the distance between each point and its projection on the line.

• The difference between the original position of a point and its projection on the line, represented by the orange dot,
is the error term. The smaller the value of the error term, the better the regression line estimates the relationship
between x and y.
Marketing Mix Modeling – Multiple Linear Regression

This is a more representative setting, as simple linear regression is hardly used in real-life MMM projects; it is too
simplistic to capture the complexity of consumer behavior and the media landscape.

In a typical marketing mix modeling project, multiple variables impact the sales performance. To be able to measure the
impact of those variables on sales or any other chosen KPI, the analyst needs to build a robust model which accounts for
all the variables influencing the movement of sales.

The model takes the form Sales = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε, where:

• x₁, x₂, ..., xₖ are the independent variables influencing sales.

• Each term βᵢxᵢ represents the contribution of the variable xᵢ to sales: i.e. how much sales are driven
by that variable (its incremental impact).
Marketing Mix Modeling – Contribution Chart

A contribution chart visually represents different marketing tactics’ impact on sales. It shows how much each tactic
contributes to the total sales and highlights the most effective ones.

The chart can be used to benchmark performance, compare campaigns over time, and plan for future initiatives.
Contribution charts provide an easy-to-understand overview of where a marketer’s efforts should be focused to maximize
ROI and optimize campaign performance.
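
Continuing the hypothetical model sketched on the previous slide (the fit and mmm objects), a simple contribution breakdown can be drawn in R as:

# Average contribution of each channel = coefficient x average spend
contrib <- coef(fit)[-1] * colMeans(mmm[c("tv", "digital")])
barplot(contrib, main = "Average contribution to sales by channel",
        ylab = "Incremental sales")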
Limitations of Marketing Mix Modeling

Data quality: Poor data quality or incomplete datasets can lead to inaccurate results.

Model complexity: As models become more complex, they require more resources and expertise to develop and
maintain.

External factors: Since MMM relies on historical data, external changes or competitors’ actions may invalidate
assumptions made in the model and reduce its accuracy.

Cost: Marketing mix modeling is relatively expensive to implement and maintain. It also requires dedicated staff and
resources to develop, interpret, and act on the results.

Time lag: Due to its reliance on past data, MMM may not accurately predict what the future holds. The longer the time lag
between when a decision is made and when it is implemented, the less accurate the model will be.

Granularity: In order to provide more detailed insights, MMM models must often take into account multiple layers of
granular data points, which can require advanced analytics capabilities that are cost-prohibitive for some organizations.
Foundation Skills Academy

Thank You
Data Analysis – Advanced Techniques (Statistics)

• Regression Analysis: This entails analyzing the relationship between one or more independent variables and a
dependent variable. The independent variables are used to explain the dependent variable, showing how changes in
the independent variables influence the dependent variable.

• Factor Analysis: This entails taking a complex dataset with many variables and reducing the variables to a small
number. The goal of this maneuver is to attempt to discover hidden trends that would otherwise have been more
difficult to see.

• Cohort Analysis: This is the process of breaking a data set into groups of similar data, often into a customer
demographic. This allows data analysts and other users of data analytics to further dive into the numbers relating to a
specific subset of data.

• Monte Carlo Simulations: Models the probability of different outcomes happening. They're often used for risk
mitigation and loss prevention. These simulations incorporate multiple values and variables and often have greater
forecasting capabilities than other data analytics approaches (a short sketch follows this list).

• Time Series Analysis: Tracks data over time and solidifies the relationship between the value of a data point and the
occurrence of the data point. This data analysis technique is usually used to spot cyclical trends or to project financial
forecasts.
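
A minimal sketch of a Monte Carlo simulation in R, estimating next-quarter revenue under hypothetical demand and price uncertainty:

set.seed(42)
n <- 10000

units <- rnorm(n, mean = 5000, sd = 800)  # uncertain demand
price <- runif(n, min = 9, max = 11)      # uncertain unit price
revenue <- units * price

# Downside, median, and upside scenarios
quantile(revenue, c(0.05, 0.5, 0.95))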
Factor Analysis

• Factor analysis is a technique used to reduce a large number of variables to a smaller number of factors. It works
on the basis that multiple separate, observable variables correlate with each other because they are all associated with
an underlying construct.

• Let’s imagine you want to get to know your customers better, so you send out a rather long survey comprising one
hundred questions. Some of the questions relate to how they feel about your company and product; for example,
“Would you recommend us to a friend?” and “How would you rate the overall customer experience?” Other questions
ask things like “What is your yearly household income?” and “How much are you willing to spend on skincare each
month?”

• Once the survey has been completed, you end up with a large dataset that essentially tells you one hundred different things
about each customer. Instead of looking at each of these responses (or variables) individually, you can use factor
analysis to group them into factors that belong together.

• In this example, factor analysis works by finding survey items that are strongly correlated. This is known as covariance.
So, if there’s a strong positive correlation between household income and how much they’re willing to spend on
skincare each month (i.e. as one increases, so does the other), these items may be grouped together.

• In the end, you have a smaller number of factors rather than hundreds of individual variables. These factors are then
taken forward for further analysis.
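
A minimal sketch in R with base factanal(), using simulated survey responses in place of real data; the latent variable and item names are invented for illustration:

set.seed(1)
latent <- rnorm(200)  # hidden construct driving two of the items

survey <- data.frame(
  q1 = latent + rnorm(200, sd = 0.5),  # correlated with q2 via 'latent'
  q2 = latent + rnorm(200, sd = 0.5),
  q3 = rnorm(200),                     # unrelated items
  q4 = rnorm(200)
)

# Loadings show which survey items group together on the factor
factanal(survey, factors = 1)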
