Correlation and Regression
Correlation
The correlation coefficient (a value between -1 and +1) tells you how strongly two
variables are related to each other. We can use the CORREL function or the Analysis
Toolpak add-in in Excel to find the correlation coefficient between two variables.
- A correlation coefficient of +1 indicates a perfect positive correlation. As variable X
increases, variable Y increases. As variable X decreases, variable Y decreases.
- A correlation coefficient of -1 indicates a
perfect negative correlation. As variable X
increases, variable Z decreases. As variable X
decreases, variable Z increases.
                             A correlation coefficient near 0 indicates no
                             correlation.
1. On the Data tab, in the Analysis group, click
Data Analysis.
 2. Select Correlation and click OK.
3. For example, select the range A1:C6 as the
Input Range.
 4. Check Labels in first row.
 5. Select cell A8 as the Output Range.
 6. Click OK.
Result.
 Conclusion: variables A and C are positively correlated (0.91). Variables A
 and B are not correlated (0.19). Variables B and C are also not correlated
 (0.11) . You can verify these conclusions by looking at the graph.
Data Analysis Toolpak
The Data Analysis ToolPak is an Excel add-in
that provides data analysis tools for financial,
statistical, and engineering data analysis
Click the File tab, click Options, and then click
the Add-Ins category.
Select Analysis ToolPak and click on the Go
button.
Check Analysis ToolPak and click on OK.
On the Data tab, in the Analysis group, you can
now click on Data Analysis.
Run Regression Analysis
In Excel, we use regression analysis to estimate the relationships between two or more
variables. There are two basic terms that you need to be familiar with:
The Dependent Variable is the factor you are trying to predict.
The Independent Variable is the factor that might influence the dependent variable.
Consider the following data where we have a
number of COVID cases and masks sold in a
particular month.
Go to the Data tab > Analysis group > Data
analysis.
Select Regression and click OK.
The following argument window will open.
Select the Input Y Range as the number of masks sold and Input X Range as COVID cases. Check
the residuals and click OK.
You will get the following output:
                    Interpret Regression Analysis Output
Let us now understand the meaning of each of the terms in the output. We will divide the
output into four major parts for our understanding.
Summary Output
The summary output tells you how well the calculated
linear regression equation fits your data source.
The Multiple R is the Correlation Coefficient that measures the
strength of a linear relationship between two variables. The
larger the absolute value, the stronger is the relationship.
1 means a strong positive relationship
-1 means a strong negative relationship
0 means no relationship at all
R Square signifies the Coefficient of Determination, which shows the goodness of fit. It shows
how many points fall on the regression line. In our example, the value of R square is 0.96,
which is an excellent fit. In other words, 96% of the dependent variables (y-values) are
explained by the independent variables (x-values).
Adjusted R Square is the modified version of R square that adjusts for predictors that are not
significant to the regression model.
  Standard Error is another goodness-of-fit measure that shows the precision of your
  regression analysis.
ANOVA
ANOVA stands for Analysis of Variance. It gives
information about the levels of variability
within your regression model.
 Df is the number of degrees of freedom associated with the sources of variance.
 SS is the sum of squares. The smaller the Residual SS viz a viz the Total SS, the better the
 fitment of your model with the data.
 MS is the mean square.
 F is the F statistic or F-test for the null hypothesis. It is very effectively used to test the
 overall model significance.
 Significance F is the P-value of F.
                              Regression Graph In Excel
You can quickly visualize the relationship between the two variables by creating a graph. To create
linear regression graph, follow these steps:
Select the two variable columns of your data, including the headers.
Go to Insert tab > Charts group > Scatter Plot.
   Now to add the trend line, right-click on any
   point and select Add Trend line.
Example:
Excel produces the following Summary Output (rounded to 3 decimal places).
R Square
R Square equals 0.962, which is a very good fit. 96% of the variation in Quantity Sold is explained
by the independent variables Price and Advertising. The closer to 1, the better the regression
line (read on) fits the data.
                                                         Significance F and P-values
                                                       To check if your results are reliable
                                                       (statistically   significant),    look     at
                                                       Significance F (0.001). If this value is less
                                                       than 0.05, you're OK. If Significance F is
                                                       greater than 0.05, it's probably better to
                                                       stop using this set of independent
                                                       variables. Delete a variable with a high P-
                                                       value (greater than 0.05) and rerun the
                                                       regression until Significance F drops below
                                                       0.05.
                                                       Most or all P-values should be below
                                                       below 0.05. In our example this is the
                                                       case. (0.000, 0.001 and 0.005).
  Coefficients
  The regression line is: y = Quantity Sold = 8536.214 -835.722 * Price + 0.592 *
  Advertising. In other words, for each unit increase in price, Quantity Sold decreases with
  835.722 units. For each unit increase in Advertising, Quantity Sold increases with 0.592
  units. This is valuable information.
  You can also use these coefficients to do a forecast. For example, if price equals $4 and
  Advertising equals $3000, you might be able to achieve a Quantity Sold of 8536.214 -
  835.722 * 4 + 0.592 * 3000 = 6970.
Residuals
The residuals show you how far away the actual data points are fom the predicted data points
(using the equation). For example, the first data point equals 8500. Using the equation, the
predicted data point equals 8536.214 -835.722 * 2 + 0.592 * 2800 = 8523.009, giving a residual
of 8500 - 8523.009 = -23.009.
                                              You can also create a scatter plot of these
                                              residuals.