Data Analysis for Outlier Detection

This document discusses exploratory data analysis and outlier detection techniques. It begins with an overview of exploratory data analysis and then describes several visualization and mathematical techniques for identifying outliers, including box plots, scatter plots, z-scores, and IQR analysis. Specific examples are provided using the Boston Housing dataset to identify outliers. Finally, the document discusses various methods for preprocessing outliers, such as imputation, trimming, capping, and discretization.

Uploaded by

devashreereddy

Exploratory Data Analysis:


Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their
main characteristics, often with visual methods. A statistical model can be used or not, but
primarily EDA is for seeing what the data can tell us beyond the formal modelling or
hypothesis testing task.

Outlier Detection using different visualization techniques:

In statistics, an outlier is an observation point that is
distant from other observations. A data science project
starts with the collection of data, and that is when outliers
are first introduced to the population, although you will not
know about them at all during the collection phase. An
outlier can be the result of a mistake during data collection,
or it can simply be an indication of variance in your data.
We will follow two types of analysis to find outliers:
univariate (outlier analysis on one variable) and
multivariate (outlier analysis on two or more variables).

Discover outliers with visualization tools


Box plot-
A box plot is a method for graphically depicting groups of
numerical data through their quartiles. Box plots may also
have lines extending from the boxes (whiskers) indicating
variability outside the upper and lower quartiles, hence the
terms box-and-whisker plot and box-and-whisker diagram.
Outliers may be plotted as individual points. The definition
above suggests that an outlier will be plotted as an
individual point in the box plot, while the rest of the
population will be grouped together and displayed as the box.

# load_boston was removed in scikit-learn 1.2, so this needs an older release
from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names

# create the dataframe
boston_df = pd.DataFrame(boston.data, columns=columns)
boston_df.head()

BOX PLOT PLOTTING:

import seaborn as sns

sns.boxplot(x=boston_df['DIS'])

The plot above shows three points between 10 and 12. These are
outliers, as they are not included in the box of the other
observations, i.e. they are nowhere near the quartiles.

Here we analysed a univariate outlier, i.e. we used only the DIS
column to check for outliers. But we can do multivariate outlier
analysis too. Can we do multivariate analysis with a box plot?
It depends: if you have categorical values, you can use them with
any continuous variable and do multivariate outlier analysis.
As we do not have categorical values in our Boston Housing
dataset, we set the box plot aside for multivariate outlier
analysis.

Scatter plot-
A scatter plot is a type of plot or mathematical diagram
using Cartesian coordinates to display values for typically
two variables for a set of data. The data are displayed as
a collection of points, each having the value of one
variable determining the position on the horizontal axis
and the value of the other variable determining the
position on the vertical axis.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16,8))

ax.scatter(boston_df['INDUS'], boston_df['TAX'])

ax.set_xlabel('Proportion of non-retail business acres per town')

ax.set_ylabel('Full-value property-tax rate per $10,000')

plt.show()
Looking at the plot above, we can see that most of the data points
lie in the bottom-left corner, but there are points that are far
from the population, such as those in the top-right corner.

Discover outliers with mathematical function


Z-Score-
The Z-score is the signed number of standard deviations
by which the value of an observation or data point is
above the mean value of what is being observed or
measured.
Mathematical formula for calculating Z-Score:

z = (x − μ) / σ

where x is the observation, μ is the mean and σ is the standard
deviation of the group of data points.

The intuition behind the Z-score is to describe any data point by
its relationship with the standard deviation and mean of the group
of data points. Computing Z-scores rescales the data so that its
mean is 0 and its standard deviation is 1, the same values as for
the standard normal distribution. The shape of the distribution
itself does not change.

How the Z-score helps in detecting outliers:

While calculating the Z-score we re-scale and center the
data and look for data points that are too far from zero.
Data points that are far from zero are treated as outliers.
In most cases a threshold of 3 is used: if the Z-score is
greater than 3 or less than -3, the data point is identified
as an outlier.
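
As a minimal sketch of the rule described above, the snippet below applies the Z-score threshold to a small made-up sample (the values are hypothetical and are not taken from the Boston Housing data). It uses the population standard deviation, which is NumPy's default.

```python
import numpy as np

# Hypothetical sample: 20 ordinary readings plus one extreme value.
values = np.array([10.0, 11.0, 9.0, 10.5, 9.5] * 4 + [60.0])

# Z-score: signed distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

threshold = 3  # a common default; the right cut-off depends on the use case
outlier_idx = np.where(np.abs(z_scores) > threshold)[0]

print(outlier_idx)          # positions of the flagged values
print(values[outlier_idx])  # the flagged values themselves
```

Only the last reading is flagged here. Note that in very small samples a single extreme value inflates the standard deviation enough that its own Z-score may never reach 3.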

from scipy import stats
import numpy as np

# convert to a NumPy array so that z can be indexed by position below
z = np.abs(stats.zscore(boston_df.to_numpy()))
print(z)

Let us use a threshold of 3 here, although in practice the threshold
is mostly a business requirement.

threshold = 3

print(np.where(z > threshold))

In this output, the first array holds the row indices and the second
array the column indices of the outliers.

print(z[55][1])

3.375038763517309

So the data point at row index 55, column index 1 is an outlier.

IQR score -

A box plot uses the IQR method to display the data and its
outliers (the shape of the data), but in order to get a list of
the identified outliers we need to use the mathematical
formula and retrieve the outlier data.

The interquartile range (IQR), also called the midspread,
middle 50%, or technically H-spread, is a measure of
statistical dispersion, equal to the difference between the
75th and 25th percentiles, or between the upper and lower
quartiles: IQR = Q3 − Q1.

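# per-column quartiles and interquartile range
Q1 = boston_df.quantile(0.25)
Q3 = boston_df.quantile(0.75)
IQR = Q3 - Q1
print((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR)))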

Working with Outliers:

Correcting, Removing

Z-Score
We can remove or filter the outliers to get clean data. This
takes just one line of code, as we have already calculated the
Z-score.

boston_df_o = boston_df[(z < 3).all(axis=1)]

IQR score
In the same way, using the per-column quartiles Q1 and Q3 and
their difference IQR, we can keep only the rows that have no
value outside the 1.5 * IQR boundaries.

boston_df_out = boston_df[~((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR))).any(axis=1)]
boston_df_out.shape

Extreme Value Analysis:

The most basic form of outlier detection is extreme value
analysis. The key to this method is to determine the
statistical tails of the underlying distribution of the
variable and find the values at the extreme end of the tails.

In the case of a Gaussian distribution, the outliers lie
outside the mean plus or minus 3 times the standard
deviation of the variable.
If the variable is not normally distributed (not a Gaussian
distribution), a general approach is to calculate the
quantiles and then the inter-quartile range.

IQR (inter-quartile range) = 75th quantile − 25th quantile

An outlier lies outside the following upper and lower
boundaries:
Upper Boundary = 75th quantile + (IQR * 1.5)
Lower Boundary = 25th quantile − (IQR * 1.5)

Or, for extreme cases:

Upper Boundary = 75th quantile + (IQR * 3)
Lower Boundary = 25th quantile − (IQR * 3)

If a data point is above the upper boundary or below the
lower boundary, it can be considered an outlier.
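
To make the two sets of boundaries concrete, here is a small sketch on a made-up array. The name `incomes`, its values and the helper `iqr_bounds` are all hypothetical and serve only to illustrate the 1.5 and 3 multipliers.

```python
import numpy as np

# Hypothetical incomes (in thousands); not taken from any real dataset.
incomes = np.array([32, 35, 38, 40, 41, 43, 45, 47, 50, 52, 55, 80, 400], dtype=float)

q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1

def iqr_bounds(q1, q3, iqr, k):
    """Return (lower, upper) boundaries for a given IQR multiplier k."""
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(q1, q3, iqr, 1.5)          # ordinary outliers
lower_ext, upper_ext = iqr_bounds(q1, q3, iqr, 3.0)  # extreme outliers

outliers = incomes[(incomes < lower) | (incomes > upper)]
extreme_outliers = incomes[(incomes < lower_ext) | (incomes > upper_ext)]

print(outliers)
print(extreme_outliers)
```

With these values, 80 falls outside the 1.5 * IQR boundary but inside the 3 * IQR boundary, while 400 falls outside both.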

Code:

First, let's calculate the inter-quartile range for our
dataset,
IQR = data.annual_inc.quantile(0.75) - data.annual_inc.quantile(0.25)

Using the IQR, we calculate the upper boundaries using the
formulas mentioned above,
upper_limit = data.annual_inc.quantile(0.75) + (IQR * 1.5)
upper_limit_extreme = data.annual_inc.quantile(0.75) + (IQR * 3)
upper_limit, upper_limit_extreme

Now, let's see the ratio of data points above the upper limit
and the extreme upper limit, i.e. the outliers.
total = float(data.shape[0])  # np.float was removed in NumPy 1.24
print('Total borrowers: {}'.format(data.annual_inc.shape[0] / total))
print('Borrowers that earn > 178k: {}'.format(data[data.annual_inc > 178000].shape[0] / total))
print('Borrowers that earn > 256k: {}'.format(data[data.annual_inc > 256000].shape[0] / total))

We can see that about 5% of the data is above the upper
limit and 1% of the data is above the extreme upper limit.

Methods to Pre-Process Outliers:

1. Mean/Median or random Imputation
2. Trimming
3. Top, Bottom and Zero Coding
4. Discretization

1. Mean/Median or random Imputation

If we have reason to believe that the outliers are due to
mechanical error or problems during measurement, then the
outliers are similar in nature to missing data, and any
method used for missing data imputation can be used to
replace them.
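
One possible way to do this is sketched below: values flagged by the 1.5 * IQR rule are replaced with the median of the remaining values. The `readings` array is made up, and both the IQR rule for flagging and the median as replacement are assumptions; any other imputation method could be substituted.

```python
import numpy as np

# Hypothetical sensor readings with two faulty measurements (98.6 and -40.0).
readings = np.array([20.1, 19.8, 20.4, 20.0, 98.6, 19.9, 20.2, -40.0, 20.3])

q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1
is_outlier = (readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)

# Median of the values that were NOT flagged, so the outliers do not bias it.
replacement = np.median(readings[~is_outlier])

# Keep ordinary values as they are and overwrite only the flagged ones.
imputed = np.where(is_outlier, replacement, readings)
print(imputed)
```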

2. Trimming:
In this method, we discard the outliers completely. That is,
we eliminate the data points that are considered outliers.
In situations where you won't be removing a large number
of values from the dataset, trimming is a good and fast
approach.
index = data[(data['annual_inc'] >= 256000)].index
data.drop(index, inplace=True)

Here we use the pandas drop method to remove all the
records at or above the extreme upper limit that we found
using extreme value analysis.

3. Top / bottom / zero Coding:

Top coding means capping the maximum of the
distribution at an arbitrarily set value. A top-coded variable
is one for which data points above an upper bound are
censored. By implementing top coding, the outlier is
capped at a certain maximum value and looks like many
other observations.

Bottom coding is analogous, but on the left side of the
distribution. That is, all values below a certain threshold
are capped to that threshold. If the threshold is zero, then
it is known as zero-coding. For example, variables
like “age” or “earnings” cannot have negative
values, so it is reasonable to cap the lowest value at zero.

Here we cap the data points with values greater than
256000 at 256000.
data.loc[data.annual_inc>256000,'annual_inc'] = 256000
data.annual_inc.max()
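
Top coding and zero coding can also be applied in one step. The sketch below uses NumPy's `clip` on a made-up `earnings` array; the 256000 cap is reused from the text above, and the sample values are hypothetical.

```python
import numpy as np

# Hypothetical earnings, including an impossible negative value and a very large one.
earnings = np.array([-500.0, 12_000.0, 48_000.0, 95_000.0, 310_000.0])

# Zero coding on the left (nothing below 0), top coding on the right (nothing above 256000).
capped = np.clip(earnings, 0, 256_000)

print(capped)
```

Values inside the two limits are left unchanged; only the tails are moved to the limits.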

4. Discretization:

Discretization is the process of transforming continuous
variables into discrete variables by creating a set of
contiguous intervals that span the range of the variable's
values. Thus, the outlier observations no longer differ
from the rest of the values at the tails of the distribution,
as they are now all together in the same interval/bucket.

There are several approaches to transforming continuous
variables into discrete ones. This process is also known
as binning, with each interval being a bin.

Discretization methods
• Equal width binning
• Equal frequency binning
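
The two binning methods can be compared with a short sketch. The sample `values` and the choice of five bins are assumptions made for illustration; the bin edges are computed by hand with NumPy rather than with any particular library helper.

```python
import numpy as np

# Hypothetical values with one outlier (100).
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
n_bins = 5

# Equal width: every bin covers the same span of the value range.
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
width_bins = np.digitize(values, width_edges[1:-1])

# Equal frequency: edges are quantiles, so every bin holds about the same count.
freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
freq_bins = np.digitize(values, freq_edges[1:-1])

print(np.bincount(width_bins, minlength=n_bins))  # counts per equal-width bin
print(np.bincount(freq_bins, minlength=n_bins))   # counts per equal-frequency bin
```

With equal-width bins the outlier stretches the range, so nine of the ten values land in the first bin. With equal-frequency bins each bin receives two values and the outlier simply shares the last bin with its neighbour.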
Detecting Missing values:
Depending on the data source, missing data are identified
differently. Pandas always identifies missing values as NaN.
Source: Data cleaning with Python and pandas: detecting missing values,
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b

Standard Missing values: blank space and NA

Using the isnull() method, we can confirm that both the
blank entry and “NA” are recognized as missing
values.

Non-Standard Missing Values

Missing values can also be entered in other formats, such as
"n/a", "na" or "--". An easy way to detect these formats is to
put them in a list. Then, when we import the data, pandas will
recognize them right away. Here's an example of how we would do
that.

# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("property data.csv", na_values=missing_values)

# Looking at the NUM_BEDROOMS column
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
Unexpected Missing Values
If a feature is expected to be a string but an entry holds a
numeric value, then technically this is also a missing value.
Example:

1. Loop through the OWN_OCCUPIED column
2. Try to turn the entry into an integer
3. If the entry can be changed into an integer, enter a missing
value
4. If the entry can't be changed into an integer, we know it's a string, so
keep going

Example:
# Detecting numbers
cnt = 0
for row in df['OWN_OCCUPIED']:
    try:
        int(row)
        df.loc[cnt, 'OWN_OCCUPIED'] = np.nan
    except ValueError:
        pass
    cnt += 1

If the value can be changed to an integer, we change the
entry to a missing value using NumPy's np.nan.

On the other hand, if it can't be changed to an integer,
we pass and keep going.
Summarizing Missing Values
After we've cleaned the missing values, we will probably
want to summarize them. For instance, we might want to
look at the total number of missing values for each feature.
# Total missing values for each feature
print(df.isnull().sum())
Out:
ST_NUM 2
ST_NAME 0
OWN_OCCUPIED 2
NUM_BEDROOMS 4

# Any missing values?
print(df.isnull().values.any())
Out:
True

We might also want the total count of missing values.

# Total number of missing values
print(df.isnull().sum().sum())
Handling Missing values:
# Replace missing values with a number
# (assigning the result back avoids chained in-place modification)
df['ST_NUM'] = df['ST_NUM'].fillna(125)

# Location based replacement
df.loc[2, 'ST_NUM'] = 125

# Replace using median
median = df['NUM_BEDROOMS'].median()
df['NUM_BEDROOMS'] = df['NUM_BEDROOMS'].fillna(median)

Data transformation
Data transformation predominantly deals with
normalizing (also known as scaling) data, handling
skewness and aggregating attributes.
Normalization
Normalization or scaling refers to bringing all the columns
into the same range. We will discuss the two most common
normalization techniques.

1. Min-Max
2. Z score

Min-Max normalization:

It is a simple way of scaling the values in a column: each value
is rescaled relative to the column's minimum and maximum so that
the result lies between 0 and 1. Here is the formula:

x' = (x − min) / (max − min)

Z score normalization:

Now, let us see what Z score normalization is. In Z score
normalization, we perform the following mathematical
transformation:

z = (x − mean) / standard deviation

Min-Max squeezes all values into a fixed range, so a few extreme
values compress the remaining values into a narrow band. When
there are outliers in the data that are important and we
don't want to lose their impact, we go with Z score
normalization.
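
As a rough illustration, assuming the usual definitions of the two transformations (rescaling by the column's minimum and maximum, and rescaling by its mean and standard deviation), both can be written in a line of NumPy each. The `column` values are made up.

```python
import numpy as np

# Hypothetical column with one large value.
column = np.array([2.0, 4.0, 6.0, 8.0, 50.0])

# Min-Max: map the smallest value to 0 and the largest to 1.
min_max = (column - column.min()) / (column.max() - column.min())

# Z score: subtract the mean and divide by the standard deviation.
z_score = (column - column.mean()) / column.std()

print(min_max)
print(z_score)
```

After Min-Max scaling the four ordinary values all fall below 0.13 because the single large value defines the top of the range. After Z score scaling the column has mean 0 and standard deviation 1, with no fixed upper or lower limit.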

Skewness of data:
According to Wikipedia, “In probability
theory and statistics, skewness is a measure of the
asymmetry of the probability distribution of a real-valued
random variable about its mean.”

Skewness basically describes how far the shape of a distribution
of values departs from the symmetric shape of a normal
distribution.

As a rule of thumb, if the skewness value lies above +1 or below
-1, the data is highly skewed. If it lies between +0.5 and +1 or
between -1 and -0.5, it is moderately skewed. If it lies between
-0.5 and +0.5, the data is approximately symmetric, and a value
of 0 indicates no skew.
Once we know the skewness level, we should know
whether it is positively skewed or negatively skewed.
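
A skewness value can be computed directly from the data. The sketch below assumes the moment-based (Fisher-Pearson) coefficient, which is one common definition; the helper name `skewness` and the three sample arrays are hypothetical.

```python
import numpy as np

def skewness(x):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

symmetric = np.array([1, 2, 3, 4, 5])
right_tailed = np.array([1, 1, 2, 2, 3, 20])  # long tail on the right
left_tailed = -right_tailed                   # mirror image, tail on the left

print(skewness(symmetric))     # zero for perfectly symmetric data
print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
```

The sign tells the direction of the tail, and the magnitude can be compared against the rule-of-thumb ranges.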

Positively skewed data:

If the tail is on the right, as in the second image in the
figure, it is right-skewed data. It is also called positively
skewed data.

Common transformations of this data include square
root, cube root, and log.

Cube root transformation:

The cube root transformation involves converting
x to x^(1/3). This is a fairly strong transformation with a
substantial effect on distribution shape, but it is weaker than
the logarithm. It can be applied to negative and zero
values too.

Square root transformation:

Applied to non-negative values only. Hence, check the values
of the column before applying it.

Logarithm transformation:

The logarithm, x to log base 10 of x, or x to log base e of x
(ln x), or x to log base 2 of x, is a strong transformation
that is defined for positive values only
and can be used to reduce right skewness.
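
The relative strength of these transformations can be checked on simulated data. The sketch below draws a right-skewed sample from a log-normal distribution (an assumption chosen only because it is strictly positive and clearly right-skewed) and measures the moment-based skewness before and after each transformation.

```python
import numpy as np

def skewness(x):
    """Moment-based skewness of a 1-D array."""
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

transformed = {
    "original": right_skewed,
    "square root": np.sqrt(right_skewed),
    "cube root": np.cbrt(right_skewed),
    "log": np.log(right_skewed),
}

for name, data in transformed.items():
    print(f"{name:12s} skewness = {skewness(data):.2f}")
```

For this sample the skewness shrinks at each step from the original data to the square root, the cube root and finally the logarithm, which matches the ordering of strength described above.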

Negatively skewed data:

If the tail is to the left of the data, it is called left-skewed
data. It is also called negatively skewed data.

Common transformations include the square. Cube root and
logarithmic transformations can also be used, but because they
reduce right skewness they are applied to left-skewed data only
after reflecting the values (for example, subtracting each value
from the maximum plus one).

We will discuss the square transformation, as the others
have already been discussed.

Square transformation:

The square, x to x², has a moderate effect on distribution
shape and can be used to reduce left skewness.

Another method of handling skewness is finding outliers
and possibly removing them.

Feature Scaling:
When should you perform feature scaling and
mean normalization on the given data? What are
the advantages of these techniques?
A few advantages of normalizing the data are as follows:

1. It makes your training faster.
2. It reduces the chance of getting stuck in local optima.
3. It gives you a better error surface shape.
4. Weight decay and Bayesian optimization can be done more conveniently.

However, there are a few algorithms, such as Logistic Regression (without
regularization) and Decision Trees, whose predictions are not affected by
the scaling of input data.

Let me answer this from a general ML perspective and not only for neural
networks. When you collect data and extract features, the data is often
collected on different scales. For example, the age of employees in a
company may be between 21 and 70 years, the size of the houses they live in
may be 500-5000 sq ft and their salaries may range from $30000 to $80000.
In this situation, if you use a simple Euclidean metric, the age feature
will play almost no role because it is several orders of magnitude smaller
than the other features. However, it may contain important information that
is useful for the task. Here, you may want to normalize the features
independently to the same scale, say [0,1], so that they contribute equally
when computing the distance. However, normalization may also result in a
loss of information, so you need to be sure about this aspect as well. Most
of the time, it helps when the objective function you are optimizing
computes some sort of distance or squared distance.

The above example can be extended to any ML algorithm. By
normalization, we are trying to reduce the impact of large-valued features
extracted on a different scale and to allow small-valued features to
contribute equally in optimizing an objective function.
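
The effect on a Euclidean distance can be seen with three made-up employees. The feature ranges are the ones quoted above; the individual records and the helper `distance` are hypothetical.

```python
import numpy as np

# Columns: age (years), house size (sq ft), salary ($).
people = np.array([
    [25.0, 1500.0, 50_000.0],
    [65.0, 1500.0, 50_000.0],  # same as the first person, but 40 years older
    [25.0, 1500.0, 51_000.0],  # same as the first person, but earns $1,000 more
])

# Feature ranges taken from the example in the text.
feature_min = np.array([21.0, 500.0, 30_000.0])
feature_max = np.array([70.0, 5000.0, 80_000.0])

def distance(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_age_gap = distance(people[0], people[1])
raw_salary_gap = distance(people[0], people[2])

# Min-max scale every feature to [0, 1] independently.
scaled = (people - feature_min) / (feature_max - feature_min)
scaled_age_gap = distance(scaled[0], scaled[1])
scaled_salary_gap = distance(scaled[0], scaled[2])

print(raw_age_gap, raw_salary_gap)
print(scaled_age_gap, scaled_salary_gap)
```

On the raw data a $1,000 salary difference looks far larger than a 40-year age difference. After scaling the ordering is reversed, because the age gap covers most of its range while the salary gap covers only a small part of its range.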

The difference between scaling and normalization is that in scaling you're changing
the range of your data, while in normalization you're changing the shape of the
distribution of your data. Let's talk a little more in depth about each of these
options.

Scaling

This means that you're transforming your data so that it fits within a specific scale,
like 0-100 or 0-1. You want to scale data when you're using methods based on
measures of how far apart data points are, like support vector machines (SVM) or
k-nearest neighbors (KNN). With these algorithms, a change of "1" in any numeric
feature is given the same importance.

For example, you might be looking at the prices of some products in both Yen
and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your
prices, methods like SVM or KNN will consider a difference in price of 1 Yen as
important as a difference of 1 US Dollar! This clearly doesn't fit with our intuitions
of the world. With currency, you can convert between currencies. But what
if you're looking at something like height and weight? It's not entirely clear how
many pounds should equal one inch (or how many kilograms should equal one
meter).

By scaling your variables, you can help compare different variables on equal
footing.

Scaling Example:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.preprocessing import minmax_scaling  # minmax_scaling is taken from mlxtend here

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])

# plot both together to compare (histplot replaces the deprecated distplot)
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, ax=ax[0], kde=True)
ax[0].set_title("Original Data")
sns.histplot(np.ravel(scaled_data), ax=ax[1], kde=True)
ax[1].set_title("Scaled data")
Normalization
Scaling just changes the range of your data. Normalization is a more radical
transformation. The point of normalization is to change your observations so that
they can be described as a normal distribution.

Normal distribution: Also known as the "bell curve", this is a specific statistical
distribution where roughly equal numbers of observations fall above and below the mean,
the mean and the median are the same, and there are more observations closer
to the mean. The normal distribution is also known as the Gaussian distribution.

In general, you'll only want to normalize your data if you're going to be using a
machine learning or statistics technique that assumes your data is normally
distributed. Some examples of these include t-tests, ANOVAs, linear regression,
linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method
with "Gaussian" in the name probably assumes normality.)

The method we're using to normalize here is called the Box-Cox Transformation.

Normalization example:
from scipy import stats

# normalize the exponential data with boxcox
# (boxcox returns the transformed data and the fitted lambda)
normalized_data = stats.boxcox(original_data)

# plot both together to compare (histplot replaces the deprecated distplot)
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, ax=ax[0], kde=True)
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], ax=ax[1], kde=True)
ax[1].set_title("Normalized data")
Feature Scaling or Standardization: It is a step of data preprocessing which is
applied to the independent variables or features of the data. It basically helps to
normalise the data within a particular range. Sometimes, it also helps in speeding
up the calculations in an algorithm.

Package Used:
sklearn.preprocessing
Import:
from sklearn.preprocessing import StandardScaler
Formula used in Backend:
z = (x − mean) / standard deviation
Standardisation replaces the values by their Z scores.
The fit method computes the mean and standard deviation of each feature;
transform (or fit_transform, which combines both steps) then applies the scaling.
Examples of Algorithms where Feature Scaling matters

1. K-Means uses the Euclidean distance measure, so feature scaling
matters here.
2. K-Nearest-Neighbours also requires feature scaling.
3. Principal Component Analysis (PCA) tries to find the directions of
maximum variance, so feature scaling is required here too.
4. Gradient Descent: calculation speed increases, as the Theta calculation
becomes faster after feature scaling.

Note: Naive Bayes, Linear Discriminant Analysis, and tree-based models are
not affected by feature scaling.
In short, algorithms that rely on neither distances, variances nor
gradient-based optimization are largely unaffected by feature scaling.

In a nutshell, tree-based methods (decision trees, random forests, gradient
boosting) do not need scaling or centering at all (they are invariant to
monotonic transformations of the data).

Mean centering is not necessary when the learning is based on pairwise
distances (Gaussian kernel SVMs, k Nearest Neighbours…), but scaling
usually improves performance.

The other methods (penalized regressions, for example) work better with
centered and scaled data.

Regression:

Regression techniques vary from Linear Regression to SVR and
Random Forest Regression.

In this part, you will understand and learn how to implement the following
Machine Learning Regression models:

1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Support Vector Regression (SVR)
5. Decision Tree Regression
6. Random Forest Regression
Linear Regression

Linear regression involves using data to calculate a line that
best fits that data, and then using that line to predict scores on one variable from
another. Prediction is simply the process of estimating scores of the outcome (or
dependent) variable based on the scores of the predictor (or independent)
variable. To generate the regression line, we look for a line of best fit. The line
that best explains the relationship between the independent and dependent
variable(s) is said to be the best-fit line. The difference between the observed
value and the predicted value gives the error. Linear regression gives an equation of
the following form:

Y = m0 + m1x1 + m2x2 + m3x3 + … + mnxn

where Y is the dependent variable and the x's are the independent variables. The
right-hand side of this equation is also known as the hypothesis function, h(x).

Line of Best Fit

The purpose of the line of best fit is that the predicted values should be as close as
possible to the actual or observed values. This means that the main objective in
determining the line of best fit is to “minimize” the difference between the predicted
values and the observed values. These differences are called “errors” or “residuals”.
There are 3 ways to calculate the “error”:
• Sum of all errors: ∑(Y − h(X)). This may result in the
cancellation of positive and negative errors, so it is not a correct metric to
use.
• Sum of the absolute values of all errors: ∑|Y − h(X)|
• Sum of the squares of all errors: ∑(Y − h(X))²

The line of best fit for 1 feature can be represented as:

Y = bx + c

where Y is the score or outcome variable we are trying to predict, b is the
regression coefficient or slope, and c is the Y intercept or the regression constant.
This is linear regression with 1 variable.
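
For the one-feature case, the slope and intercept that minimize the sum of squared errors have a well-known closed form. The sketch below applies it to five made-up points (the `x` and `y` values are hypothetical) and also shows why the plain sum of errors is not a useful metric.

```python
import numpy as np

# Hypothetical observations that lie roughly on a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares estimates for Y = b*x + c.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - b * x.mean()

predictions = b * x + c
errors = y - predictions

print("slope b =", b, "intercept c =", c)
print("sum of errors          =", errors.sum())          # cancels to about zero
print("sum of absolute errors =", np.abs(errors).sum())
print("sum of squared errors  =", np.sum(errors ** 2))
```

The positive and negative errors of the fitted line cancel out almost exactly, which is why the absolute or squared errors are used instead.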

Sum of Squared Errors

• Squaring the difference between the actual value and the
predicted value penalizes larger errors more heavily. Hence minimizing the sum of
squared errors improves the quality of the regression line.
• This method of fitting the
line so that there is minimal difference between the observations and the
line is called the method of least squares.
• The baseline model refers to the line
that predicts each value as the average of the data points.
• SSE, or Sum of
Squared Errors, is the total of all the squared errors. It is a measure of the
quality of the regression line. SSE is sensitive to the number of input data points.
• SST is the Total Sum of Squares: it is the SSE for the baseline model.

Regression Metrics

Mean Absolute Error: One way to measure error is to use the absolute error, i.e. the
distance of the prediction from the true value. The mean absolute error takes the
total absolute error over all examples and averages it over the number
of data points. By adding up the absolute values of the errors of a model, we
avoid errors above the true values cancelling out errors below them, and get
an overall error metric to evaluate the model on.
Mean Squared Error: Mean squared error is the most common metric to measure
model performance. In contrast with absolute error, the residual error (the
difference between the predicted and the true value) is squared. Some benefits of
squaring the residual error are that the error terms are positive, it emphasizes larger
errors over smaller errors, and it is differentiable. Being differentiable allows us to
use calculus to find minimum or maximum values, which is often more
computationally efficient.

R-Squared: It is called the coefficient of determination. The values of R² typically
range from 0 to 1, and it measures how much of the total variation in Y is explained
by the variation in X. A model with an R² of 0 is no better than a model that always
predicts the mean of the target variable, whereas a model with an R² of 1
perfectly predicts the target variable. Any value between 0 and 1 indicates what
proportion of the variance of the target variable can be explained by the
features using this model. A model can have a negative R² as well, which indicates
that the model is worse than one that always predicts the mean of the target
variable.
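
The three metrics can be computed by hand in a few lines. The sketch below uses made-up true and predicted values, and it assumes the usual definition of R² as one minus the ratio of the model's sum of squared errors to that of the mean-predicting baseline.

```python
import numpy as np

# Hypothetical true values and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error

def r_squared(y_true, y_pred):
    """1 - SSE / SST, where SST is the SSE of always predicting the mean."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - sse / sst

baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
poor = np.array([9.0, 7.0, 5.0, 3.0])           # predictions in reversed order

print("MAE:", mae)
print("MSE:", mse)
print("R2 of the model:   ", r_squared(y_true, y_pred))
print("R2 of the baseline:", r_squared(y_true, baseline))
print("R2 of a poor model:", r_squared(y_true, poor))
```

The baseline scores exactly 0, and the reversed predictions score below 0, illustrating that R² is not bounded below by 0.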

Cost Function

The error of the regression model is expressed as a cost function:

J(θ0, θ1) = (1 / 2m) ∑ (h(x_i) − y_i)², summed over the m training examples

It is similar to the sum of squared errors. The factor 1/m means that we are
calculating the average. The factor ½ is used to simplify the mathematics. This
function is minimized to reduce errors in prediction. Minimizing this function means
that we get the values of θ0 and θ1 that give, on average, the minimal deviation of
the predictions h(x) from y when we use those parameters in our hypothesis function.

Inside Cost Function

Let us assume θ0 is 0 (our hypothesis passes through the origin). We now need
the value of θ1 for which the cost function is minimum. To find it, plot J(θ1) vs
θ1.

With both θ0 and θ1, the plot becomes more complex. We now need the values of θ0
and θ1 for which the cost function is minimum. To find them, plot J(θ1, θ0) vs θ1
and θ0.

Gradient Descent

The process of minimizing the cost function can be achieved
by the gradient descent algorithm. The steps are:
1. Start with an initial guess of the coefficients.
2. Keep changing the coefficients a little bit to try to reduce the cost
function J(θ0,θ1).
3. Each time the parameters are changed, the direction (the negative gradient)
is chosen that reduces J(θ0,θ1) the most.
4. Repeat.
5. Keep going until no
improvement is made.
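
The steps above can be sketched for a hypothesis of the form h(x) = θ0 + θ1x. The snippet assumes the usual squared-error cost with the factors 1/m and ½ mentioned in the text; the data, the learning rate of 0.05 and the fixed number of iterations are illustrative choices, not prescribed values.

```python
import numpy as np

# Noiseless hypothetical data generated from theta0 = 1 and theta1 = 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

def cost(theta0, theta1):
    """Average squared error with the conventional factor of 1/2."""
    m = len(x)
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

theta0, theta1 = 0.0, 0.0  # step 1: initial guess
learning_rate = 0.05

for _ in range(5000):      # steps 2 to 5, with a fixed iteration budget
    residual = theta0 + theta1 * x - y
    grad0 = residual.mean()        # partial derivative of the cost w.r.t. theta0
    grad1 = (residual * x).mean()  # partial derivative of the cost w.r.t. theta1
    theta0 -= learning_rate * grad0  # move against the gradient
    theta1 -= learning_rate * grad1

print(theta0, theta1, cost(theta0, theta1))
```

A fixed iteration count is used here for simplicity. In practice the loop is often stopped once the cost no longer improves by more than a small tolerance, which corresponds to the last step above.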
Polynomial Regression

Instead of finding a best-fit “line” for the given data
points, we can also try to find the best-fit “curve”. This is polynomial
regression. The equation in the case of a second-order polynomial is:
Y = θ0 + θ1x + θ2x² (quadratic regression)
and for a third-order polynomial:
Y = θ0 + θ1x + θ2x² + θ3x³ (cubic regression)
When we use higher-order powers in our
regression model, we say that we are increasing the “complexity” of the model.
The greater the complexity of the model, the better it will “fit” the given training
data.

Overfitting and Underfitting

So, should we always choose a “complex” model
with higher-order polynomials to fit the data set? No: such
a model may give very wrong predictions on test data. It fits the training
data well but fails to estimate the real relationship among the variables beyond the
training set. This is known as “over-fitting”. Similarly, we can have underfitting,
which occurs when our model neither fits the training data nor generalizes to new
data.
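
One way to see this is to fit polynomials of increasing degree to a small noisy sample and compare the error on the training points with the error on separate test points. Everything in the sketch below is an assumption made for illustration: the sine-shaped `true_curve`, the noise level, the sample sizes and the degrees 1, 3 and 9.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

def true_curve(x):
    """Hypothetical underlying relationship."""
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 12)
y_train = true_curve(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = true_curve(x_test) + rng.normal(scale=0.2, size=x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The training error can only go down as the degree increases, since each higher-degree model contains the lower-degree ones. The test error is the number to watch: a straight line is expected to underfit this curve, while a very high degree may start to follow the noise in the twelve training points.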
Bias and Variance

Bias: Bias occurs when a model has enough data but is
not complex enough to capture the underlying relationships (or patterns). As a
result, the model consistently and systematically misrepresents the data, leading
to low accuracy in prediction. This is known as underfitting. Simply put, bias
occurs when we have an inadequate model. (Pays too little attention to the data;
does the same thing over and over again; high error on the training set.)

Variance: When training a model, we typically use a limited number of samples from a
larger population. If we repeatedly train a model with randomly selected subsets
of data, we would expect its predictions to differ based on the specific
examples given to it. Here variance is a measure of how much the predictions
vary for any given test sample. (Pays too much attention to the data; high error on
the test set.)
• Some variance is normal, but too much variance indicates that the
model is unable to generalize its predictions to the larger population. High
sensitivity to the training set is also known as overfitting, and generally occurs
when either the model is too complex or we do not have enough data to
support it.
• We can typically reduce the variability of a model's predictions and
increase precision by training on more data. If more data is unavailable, we can
also control variance by limiting our model's complexity.

Adjusted R-Squared

• R-squared will increase or remain constant if we add
new predictors to our model. So R-squared alone cannot tell us whether increasing
the complexity of the model is making it more accurate.
• We “adjust” the R-squared formula to include the number of predictors in the
model. The adjusted R-squared only increases if the new term improves the model
more than would be expected by chance:

Adjusted R² = 1 − (1 − R²)(N − 1) / (N − p − 1)

where R² = sample R-squared, p = number of predictors and N = total sample size.
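
As a small worked example, the function below assumes the commonly used formula Adjusted R² = 1 − (1 − R²)(N − 1) / (N − p − 1), with R², p and N as defined above. The function name and the numbers passed to it are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared for a model with the given sample size and predictor count."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical comparison: seven extra predictors raise R-squared only slightly.
small_model = adjusted_r2(0.800, n_samples=50, n_predictors=3)
large_model = adjusted_r2(0.805, n_samples=50, n_predictors=10)

print(small_model)
print(large_model)
```

Although the larger model has the higher plain R², its adjusted value is lower, which signals that the extra predictors did not add enough to justify themselves.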

Assumptions of Linear Regression:

Linearity (can be checked using a scatter plot); remove outliers.

Homoscedasticity (the variance of the errors is constant): no cone-shaped pattern
in the data.

Multivariate normality (normality of the error distribution):
when we plot a histogram of the residuals, they should be normally distributed,
so that the majority of the values are close to zero
(the expected mean error of the regression model is zero).

Independence of errors (no autocorrelation in the errors).

Lack of multicollinearity.

Multiple linear regression: no need for feature scaling.

Two common problems are:

Overfitting

Multicollinearity
Overfitting:

The model performs well on training data but not on test data.

The solution is to reduce the number of parameters,

e.g. by regularization, which introduces mathematical constraints that favour
regression lines that are simpler and have fewer terms.

Multicollinearity:

When we add more input variables, some of them may be related to one
another.

Ex: increasing backyard space increases grid space.

Then we are not sure which property has more influence on the output.

Sol:

We can use regularization techniques to deal with multicollinear parameters, or
simply use trial and error.

Correlation: the change in one variable associated with a unit change in
another variable.

It shows the strength of the connection between variables.

Correlation analysis:
Degree of association: correlation coefficient “r”

It is sometimes called the Pearson correlation coefficient, and it is a measure
of linear association.

“r” varies from -1 to +1.

If a curved line is needed to explain the relationship, then other methods
have to be used.

Calculating the Pearson Correlation
Pandas: corr()

pearsoncorr = SuicideRate.corr(method='pearson')

Heat maps can be used to visualize the correlations.

Pearson correlation coefficients measure only linear relationships. Spearman
correlation coefficients measure only monotonic relationships. So a meaningful
relationship can exist even if the correlation coefficients are 0. Examine a
scatterplot to determine the form of the relationship.

Spearman correlation is often used to evaluate relationships involving ordinal
variables. For example, you might use a Spearman correlation to evaluate whether
the order in which employees complete a test exercise is related to the number of
months they have been employed.
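
The difference between the two coefficients can be seen on synthetic data. In the sketch below, Spearman's coefficient is computed as the Pearson correlation of the ranks; the `ranks` helper is a simplified assumption that does not handle tied values, so it is applied only to data without ties.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient r."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(values):
    """Rank from 1..n; simplified helper that assumes there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = np.linspace(-3, 3, 61)
monotonic = np.exp(x)  # always increasing, but strongly curved
u_shaped = x ** 2      # clear relationship, neither linear nor monotonic

print("monotonic: pearson =", pearson(x, monotonic), "spearman =", spearman(x, monotonic))
print("u-shaped:  pearson =", pearson(x, u_shaped))
```

For the exponential curve, Spearman reports a perfect monotonic relationship while Pearson is noticeably below 1. For the U-shaped curve, Pearson is essentially 0 even though y is completely determined by x, which is why a scatterplot should always be examined.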

A scatter plot can be used.

Polynomial regression:

First, transform the x set into polynomial features of some chosen degree.

Then fit a linear regressor on these transformed features.

Variance helps in feature selection:

The variance of a feature gives an indication of how much it can
influence the response variable. If the variance is very low, the
feature is almost constant and is unlikely to have much impact on
the response. A high variance alone, however, does not guarantee
an impact.
