Predictive Modelling/Analytics
Predictive Analytics
• Predictive Analytics: Use statistics and modeling techniques to make
predictions about future outcomes
• Types: Regression and Classification
• Common types: Linear regression, Logistic regression, Decision trees,
Random forest, Neural networks, Time series forecasting, K Nearest
Neighbors (KNN), Naïve Bayes, Clustering
Regression
• Regression: Predict a dependent variable (target) using independent
variable(s) (features/predictors)
• y: Dependent variable (the target we want to predict)
• X: Independent variable(s) (the features we know)
• We know X and we want to predict y
• Linear regression: y is numeric
• Logistic regression: y is categorical
Regression and Classification
• Regression: Numeric target, Classification: Class target
Regression examples                  | Classification examples
How many page views will we get?     | Is this a fraudulent transaction?
What will be the amount of loss?     | Whose face is in this picture?
What will be the blood sugar level?  | Which product is the best fit for the customer?
Confusion About Regression
• Despite its name, logistic regression is a classification technique, not a regression technique
• Predictive analytics splits by the type of target:
  • Numeric target -> Regression (e.g. Linear Regression)
  • Categorical target -> Classification (e.g. Logistic Regression, Decision Trees, Naïve Bayes’ Classifier)
Main Types of Regression
• Simple Linear Regression: Relationship between one independent
variable and the dependent variable is linear (Taxi mileage -> Bill
amount)
• Multiple Linear Regression: More than one independent variable –
sometimes loosely called multivariate analysis (Years of Work
Experience + Education Level -> Salary)
• Logistic Regression: Binary classification problems (Customer Tenure
+ Monthly Subscription Cost + Customer Complaints -> Churn?)
Linear Regression Example:
Salary (y) = 3,00,000 + 1,00,000x
Here, the intercept b0 = 3,00,000 and the slope b1 = 1,00,000 (salary rises by 1 lakh per year of experience)
Experience (Years) | Salary (Rupees in Lakhs)
0                  | 3
1                  | 4
2                  | 5
3                  | 6
4                  | 7
5                  | 8
6                  | 9
7                  | 10
8                  | 11
9                  | 12
10                 | 13
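A minimal sketch fitting the experience-salary table above with scikit-learn; since the data is exactly linear, it should recover the equation's intercept and slope (3 and 1, in lakhs):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Experience (years) -> salary (rupees in lakhs), from the table above
X = np.arange(0, 11).reshape(-1, 1)   # 0, 1, ..., 10 years
y = np.arange(3, 14)                  # 3, 4, ..., 13 lakhs

model = LinearRegression().fit(X, y)
print(model.intercept_)       # b0 = 3.0 (lakhs)
print(model.coef_)            # b1 = [1.0] (lakh per year of experience)
print(model.predict([[12]]))  # predicted salary for 12 years: [15.]
```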
Simple and Multiple Linear Regression
• Suppose our line satisfies the equation y = 10 + 0.5x
Least Squared Error
x  | y (Actual) | y (Line) | Error (Line − Actual) | Squared Error
10 | 10         | 15       | 5                     | 25
20 | 25         | 20       | -5                    | 25
30 | 20         | 25       | 5                     | 25
35 | 30         | 27.5     | -2.5                  | 6.25
40 | 40         | 30       | -10                   | 100
50 | 15         | 35       | 20                    | 400
60 | 40         | 40       | 0                     | 0
65 | 30         | 42.5     | 12.5                  | 156.25
70 | 50         | 45       | -5                    | 25
80 | 40         | 50       | 10                    | 100
Sum of Squared Error (SSE): 862.5
Least Squared Error
• Repeat this process with different lines and calculate Sum of Squared
Error (SSE)
• Line of best fit = Line where we get the smallest SSE
• Problem: We cannot try every possible line by hand
• Solution: Use machine learning to find the line with the smallest SSE (a quick sketch below)
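A quick sketch of the hand calculation above: compute the SSE for the candidate line y = 10 + 0.5x on the table's data, then compare candidate lines by their SSE:

```python
import numpy as np

# Data points from the table above
x = np.array([10, 20, 30, 35, 40, 50, 60, 65, 70, 80])
y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40])

def sse(b0, b1):
    """Sum of squared errors for the candidate line y = b0 + b1*x."""
    errors = (b0 + b1 * x) - y
    return np.sum(errors ** 2)

print(sse(10, 0.5))   # 862.5, matching the table
# A second candidate line; the line of best fit is the one with the smallest SSE
print(sse(12, 0.4))   # 722.0
```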
Interpreting Results
• R-squared (R²)
  • A percentage that tells us how much of the variance in the data is explained by our model
  • Example: R-squared = 90.27% when predicting a person’s weight from height means the model explains 90.27% of the variance in weight
  • The higher the better (calculation on a later slide)
• Mean Absolute Error (MAE)
  • By how many units, on average, the model prediction differs from the actual values
  • Not impacted by large outliers
  • The lower the better
• Mean Squared Error (MSE)
  • Amplifies outliers and does not tell us the direction of the errors, because of squaring
  • The lower the better
• Root Mean Squared Error (RMSE)
  • Square root of MSE, so it is in the units of the target variable – easier to interpret
  • Impacted by large outliers, since it is based on MSE
  • The lower the better
• MAE, MSE, and RMSE describe how well the model predicts; R-squared describes how much variance the model explains
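The deck defers the formulae to a later slide; for reference, these metrics follow the standard definitions (with $y_i$ the actual value, $\hat{y}_i$ the prediction, and $\bar{y}$ the mean of the actual values):

```latex
\mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\right| \qquad
\mathrm{MSE}  = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}

R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
```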
R-Squared Calculation
• The predictions below follow the line y = 12 + 0.4x; the mean of y is 30

x  | y  | prediction | (y − prediction)² | mean(y) | (y − mean)²
10 | 10 | 16         | 36                | 30      | 400
20 | 25 | 20         | 25                | 30      | 25
30 | 20 | 24         | 16                | 30      | 100
35 | 30 | 26         | 16                | 30      | 0
40 | 40 | 28         | 144               | 30      | 100
50 | 15 | 32         | 289               | 30      | 225
60 | 40 | 36         | 16                | 30      | 100
65 | 30 | 38         | 64                | 30      | 0
70 | 50 | 40         | 100               | 30      | 400
80 | 40 | 44         | 16                | 30      | 100
Total                | 722               |         | 1450

• R² = 1 − 722 / 1450 ≈ 0.502, i.e. the model explains about 50.2% of the variance
Evaluation Metrics for Linear Regression
• The error metrics (MAE, MSE, RMSE) are also called loss functions, because we want to minimize them
• In the height-weight example:
  • R-squared: 90 … good
  • MAE: 8 pounds
  • MSE: 101
  • RMSE: 10 pounds (the average weight in the dataset is 161 pounds, so RMSE is about 6% of the average, which is good)
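A minimal sketch of computing these four metrics with scikit-learn; the height-weight numbers here are made up for illustration (they are not the dataset behind the slide's figures):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical heights (inches) -> weights (pounds)
heights = np.array([60, 62, 65, 68, 70, 72, 74]).reshape(-1, 1)
weights = np.array([115, 125, 140, 155, 165, 180, 190])

model = LinearRegression().fit(heights, weights)
pred = model.predict(heights)

mae = mean_absolute_error(weights, pred)   # average error, in pounds
mse = mean_squared_error(weights, pred)    # squared units, amplifies outliers
rmse = np.sqrt(mse)                        # back in pounds, easier to interpret
r2 = r2_score(weights, pred)               # fraction of variance explained

print(f"MAE={mae:.1f} lb, MSE={mse:.1f}, RMSE={rmse:.1f} lb, R2={r2:.2%}")
```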
Linear Regression: Python Implementations
LinearRegression()                      | OLS()
Part of scikit-learn                    | Part of statsmodels
Mainly used for machine learning tasks  | Mainly used for statistical analysis
Focus on predictive modelling           | Focus on understanding relationships between variables
Less focus on statistical details       | Provides detailed statistics
Includes intercept by default           | Need to add intercept using sm.add_constant()
Focus on metrics such as MSE, R², etc.  | Focus on coefficients, t-statistics (e.g. is there a linear relationship between a feature and the predicted variable?), p-values, CI*
*Example: Predictor (feature) X1 with coefficient (beta) = 2.5, standard error (SE) = 0.5, t-statistic = 5, p-value = 0.0001
t-statistic = 2.5 / 0.5 = 5 … the estimated coefficient of X1 is 5 SEs away from 0
p-value = 0.0001 … there is only a 0.01% chance of observing such a t-statistic if the true coefficient of X1 = 0
Since p-value < 0.05, we reject H0 (H0: X1 has no linear relationship with y) … X1 significantly affects y
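A minimal sketch contrasting the two APIs on the same toy data; the feature name X1 matches the footnote example, but the numbers are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature X1 predicting y
X = pd.DataFrame({"X1": [1, 2, 3, 4, 5, 6, 7, 8]})
y = np.array([3.1, 5.2, 6.8, 9.1, 11.2, 12.8, 15.1, 17.0])

# scikit-learn: intercept included by default, focus on prediction
sk_model = LinearRegression().fit(X, y)
print(sk_model.intercept_, sk_model.coef_, sk_model.score(X, y))  # b0, b1, R²

# statsmodels: add the intercept column explicitly, focus on inference
X_const = sm.add_constant(X)        # adds a column of 1s for b0
sm_model = sm.OLS(y, X_const).fit()
print(sm_model.summary())           # coefficients, t-statistics, p-values, CIs
```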
Scaling and Encoding
Scaling and Encoding
• Predictive analytics: Use one or more features (e.g. years of
experience) to predict a label (salary)
• Problem: Features are on different scales (e.g. age and income) or are
categorical (e.g. gender)
• Solution: Scaling and Encoding
• Scaling: Converting numeric features to a common scale
• Example: Age (0-100) and income ($0-$1 million)
• Encoding: Converting categorical variables to a numeric scale
• Example: Gender values of Male and Female
Scaling: Bring Numeric Data on a Common Scale
• Scaling: Putting all the features on the same ruler/scale
• Two common approaches:
  • Standardization/Normalization: Subtract the mean and divide by the standard deviation
  • Min-Max Scaling: Scale to a specific range based on the minimum and maximum values
Standardized Scaling (Normalized Scaling)
Age | Income   | Scaled Age | Scaled Income
25  | 35,000   | -1.22      | -1.08
40  | 50,000   | -0.26      | -0.43
55  | 70,000   | 0.70       | 0.43
68  | 1,00,000 | 1.54       | 1.73
32  | 45,000   | -0.77      | -0.65

Mean: 44 (Age), 60,000 (Income); SD: 15.61 (Age), 23,128.91 (Income)
Min-Max Scaling
• Called min-max because both the minimum and the maximum (via the range) get used

Age | Income   | Age − Minimum | Income − Minimum | Scaled Age | Scaled Income
25  | 35,000   | 0             | 0                | 0          | 0
40  | 50,000   | 15            | 15,000           | 0.34       | 0.23
55  | 70,000   | 30            | 35,000           | 0.69       | 0.53
68  | 1,00,000 | 43            | 65,000           | 1          | 1
32  | 45,000   | 7             | 10,000           | 0.16       | 0.15

Minimum: 25 (Age), 35,000 (Income); Range: 68 − 25 = 43 (Age), 1,00,000 − 35,000 = 65,000 (Income)
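A minimal sketch reproducing both scalings on the slide's age-income table with scikit-learn (note that StandardScaler uses the population standard deviation, so the income column may differ from the slide's hand calculation in the last decimal):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Age-income data from the tables above
X = np.array([[25, 35_000],
              [40, 50_000],
              [55, 70_000],
              [68, 100_000],
              [32, 45_000]], dtype=float)

# Standardization: (x - mean) / std, computed per column
print(StandardScaler().fit_transform(X).round(2))

# Min-max scaling: (x - min) / (max - min), maps each column to [0, 1]
print(MinMaxScaler().fit_transform(X).round(2))
```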
Working with Non-Numeric Features – Encoding
• Encoding: Transform categorical data into a numeric form
• Example: Passenger Class with Business (20% of travellers), Economy Plus (30%), and Economy (50%)
• Three common approaches (code sketch after the one-hot table below):
  • One-Hot Encoding: Business = 100, Economy Plus = 010, Economy = 001
  • Label Encoding: Business = 0, Economy Plus = 1, Economy = 2
  • Frequency Encoding: Business = 0.20, Economy Plus = 0.30, Economy = 0.50
One-Hot Encoding: Columns
• One-Hot encoding adds a column per category
• Example: Dataset before and after One-Hot encoding
Before One-Hot encoding:
Computer | OS
PC-01    | Windows
PC-02    | Linux
PC-03    | Linux
PC-04    | Linux
PC-05    | Windows
PC-06    | Mac

After One-Hot encoding:
Computer | OHE (Windows) | OHE (Linux) | OHE (Mac)
PC-01    | 1             | 0           | 0
PC-02    | 0             | 1           | 0
PC-03    | 0             | 1           | 0
PC-04    | 0             | 1           | 0
PC-05    | 1             | 0           | 0
PC-06    | 0             | 0           | 1
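A minimal sketch of all three encodings in pandas, applied to the slide's computer-OS table (the label mapping below is chosen by hand; the passenger-class example would work the same way):

```python
import pandas as pd

df = pd.DataFrame({
    "Computer": ["PC-01", "PC-02", "PC-03", "PC-04", "PC-05", "PC-06"],
    "OS": ["Windows", "Linux", "Linux", "Linux", "Windows", "Mac"],
})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["OS"], prefix="OHE").astype(int)
print(df[["Computer"]].join(one_hot))

# Label encoding: map each category to an integer
df["OS_label"] = df["OS"].map({"Windows": 0, "Linux": 1, "Mac": 2})

# Frequency encoding: map each category to its relative frequency
df["OS_freq"] = df["OS"].map(df["OS"].value_counts(normalize=True))
print(df)
```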
Logistic Regression
• Logistic regression: Classification technique to predict the probability
of a binary (true/false) outcome
• Forms an S-shaped curve between 0 and 1
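A minimal sketch of a binary classifier with scikit-learn; the single feature (customer tenure, in months) and the churn labels are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: customer tenure in months; label: 1 = churned, 0 = stayed
X = np.array([[1], [3], [5], [8], [12], [18], [24], [36]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LogisticRegression().fit(X, y)

# The S-shaped curve in action: predicted probability for a new tenure value
print(clf.predict_proba([[10]]))  # [[P(stay), P(churn)]], between 0 and 1
print(clf.predict([[10]]))        # class whose probability is >= 0.5
```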
Confusion Matrix
• Test to check if patients have a disease (H0: Patient does not have a disease)
n = 165     | Predicted: NO           | Predicted: YES
Actual: NO  | TN = 50                 | FP = 10 (Type I Error)
Actual: YES | FN = 5 (Type II Error)  | TP = 100

• True Positive (TP): Prediction: Disease; Reality: Disease
• True Negative (TN): Prediction: No Disease; Reality: No Disease
• False Positive (FP): Prediction: Disease; Reality: No Disease
• False Negative (FN): Prediction: No Disease; Reality: Disease
Confusion Matrix - Exercise
• Out of 1000 emails, 800 non-spams were classified correctly, 20 were
incorrectly classified as spam, 40 were incorrectly classified as
non-spam, and the remaining spams were identified correctly
• Write the null hypothesis and create a confusion matrix
• H0: Email is not spam
n = 1000    | Predicted: NO           | Predicted: YES
Actual: NO  | TN = 800                | FP = 20 (Type I Error)
Actual: YES | FN = 40 (Type II Error) | TP = 140
Metrics Derived from Confusion Matrix
• Accuracy: Overall correctness of the model's predictions
• Precision (Positive Predictive Value): Accuracy of positive predictions
• Recall (Sensitivity or True Positive Rate): Ability of the model to
identify all positive instances
• Specificity (True Negative Rate): Ability of the model to identify all
negative instances
• F1 Score: Harmonic mean of precision and recall and provides a
balance between the two metrics
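The standard formulas, applied as a sketch to the spam-exercise counts above (TP = 140, TN = 800, FP = 20, FN = 40):

```python
# Confusion-matrix counts from the spam exercise
TP, TN, FP, FN = 140, 800, 20, 40

accuracy = (TP + TN) / (TP + TN + FP + FN)   # overall correctness
precision = TP / (TP + FP)                   # accuracy of positive predictions
recall = TP / (TP + FN)                      # sensitivity / true positive rate
specificity = TN / (TN + FP)                 # true negative rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy={accuracy:.3f}, Precision={precision:.3f}, "
      f"Recall={recall:.3f}, Specificity={specificity:.3f}, F1={f1:.3f}")
# Accuracy=0.940, Precision=0.875, Recall=0.778, Specificity=0.976, F1=0.824
```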
Metrics for Our Example
• Using the disease-test confusion matrix (TP = 100, TN = 50, FP = 10, FN = 5; n = 165):
  • Accuracy = (100 + 50) / 165 ≈ 0.91
  • Precision = 100 / (100 + 10) ≈ 0.91
  • Recall = 100 / (100 + 5) ≈ 0.95
  • Specificity = 50 / (50 + 10) ≈ 0.83
  • F1 Score = 2 × (0.91 × 0.95) / (0.91 + 0.95) ≈ 0.93