Predictive Modelling/Analytics
Predictive Analytics
• Predictive Analytics: Use statistics and modeling techniques to make
predictions about future outcomes
• Types: Regression and Classification
• Common types: Linear regression, Logistic regression, Decision trees,
Random forest, Neural networks, Time series forecasting, K Nearest
Neighbors (KNN), Naïve Bayes, Clustering
Regression
• Regression: Predict a dependent variable (target) using independent
variable(s) (features/predictors)
• y: Dependent variable (the target we want to predict)
• X: Independent variable(s) (the features we know)
• We know X and we want to predict y
• Linear regression: y is numeric
• Logistic regression: y is categorical
Regression and Classification
• Regression: Numeric target, Classification: Class target
Regression examples                  | Classification examples
How many page views will we get?     | Is this a fraudulent transaction?
What will be the amount of loss?     | Whose face is in this picture?
What will be the blood sugar level?  | Which product is the best fit for the customer?
Confusion About Regression
• Despite its name, logistic regression is a classification technique, not a regression technique
• Predictive analytics splits by the type of target:
  • Numeric target -> Regression (e.g. Linear Regression)
  • Categorical target -> Classification (e.g. Logistic Regression, Decision Trees, Naïve Bayes’ Classifier)
Main Types of Regression
• Simple Linear Regression: Relationship between one independent
variable and the dependent variable is linear (Taxi mileage -> Bill
amount)
• Multiple Linear Regression: More than one independent variable –
sometimes loosely called multivariate analysis (Years of Work
Experience + Education Level -> Salary)
• Logistic Regression: Binary classification problems (Customer Tenure
+ Monthly Subscription Cost + Customer Complaints -> Churn?)
Linear Regression Example:
Salary (y) = 3,00,000 + 1,00,000x
Here, the intercept b0 = 3,00,000 and the slope b1 = 1,00,000 (salary rises by 1 lakh per year of experience)
Experience (Years) | Salary (Rupees in Lakhs)
0                  | 3
1                  | 4
2                  | 5
3                  | 6
4                  | 7
5                  | 8
6                  | 9
7                  | 10
8                  | 11
9                  | 12
10                 | 13
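A minimal sketch fitting the experience-salary table above with scikit-learn; since the data is exactly linear, it should recover the equation's intercept and slope (3 and 1, in lakhs):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Experience (years) -> salary (rupees in lakhs), from the table above
X = np.arange(0, 11).reshape(-1, 1)   # 0, 1, ..., 10 years
y = np.arange(3, 14)                  # 3, 4, ..., 13 lakhs

model = LinearRegression().fit(X, y)
print(model.intercept_)       # b0 = 3.0 (lakhs)
print(model.coef_)            # b1 = [1.0] (lakh per year of experience)
print(model.predict([[12]]))  # predicted salary for 12 years: [15.]
```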
Simple and Multiple Linear Regression
• Suppose our line satisfies the equation y = 10 + 0.5x
Least Squared Error
x  | y (Actual) | y (Line) | Error (Line − Actual) | Squared Error
10 | 10         | 15       | 5                     | 25
20 | 25         | 20       | -5                    | 25
30 | 20         | 25       | 5                     | 25
35 | 30         | 27.5     | -2.5                  | 6.25
40 | 40         | 30       | -10                   | 100
50 | 15         | 35       | 20                    | 400
60 | 40         | 40       | 0                     | 0
65 | 30         | 42.5     | 12.5                  | 156.25
70 | 50         | 45       | -5                    | 25
80 | 40         | 50       | 10                    | 100
Sum of Squared Error (SSE): 862.5
Least Squared Error
• Repeat this process with different lines and calculate Sum of Squared
Error (SSE)
• Line of best fit = Line where we get the smallest SSE
• Problem: We cannot try every possible line by hand
• Solution: Use machine learning to find the line with the smallest SSE (a quick sketch below)
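A quick sketch of the hand calculation above: compute the SSE for the candidate line y = 10 + 0.5x on the table's data, then compare candidate lines by their SSE:

```python
import numpy as np

# Data points from the table above
x = np.array([10, 20, 30, 35, 40, 50, 60, 65, 70, 80])
y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40])

def sse(b0, b1):
    """Sum of squared errors for the candidate line y = b0 + b1*x."""
    errors = (b0 + b1 * x) - y
    return np.sum(errors ** 2)

print(sse(10, 0.5))   # 862.5, matching the table
# A second candidate line; the line of best fit is the one with the smallest SSE
print(sse(12, 0.4))   # 722.0
```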
Interpreting Results
• R-squared (R²)
  • A percentage that tells us how much of the variance in the data is explained by our model
  • Example: R-squared = 90.27% when predicting a person’s weight from height means the model explains 90.27% of the variance in weight
  • The higher the better (calculation on a later slide)
• Mean Absolute Error (MAE)
  • By how many units, on average, the model prediction differs from the actual values
  • Not impacted by large outliers
  • The lower the better
• Mean Squared Error (MSE)
  • Amplifies outliers and does not tell us the direction of the errors, because of squaring
  • The lower the better
• Root Mean Squared Error (RMSE)
  • Square root of MSE, so it is in the units of the target variable – easier to interpret
  • Impacted by large outliers, since it is based on MSE
  • The lower the better
• MAE, MSE, and RMSE describe how well the model predicts; R-squared describes how much variance the model explains
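The deck defers the formulae to a later slide; for reference, these metrics follow the standard definitions (with $y_i$ the actual value, $\hat{y}_i$ the prediction, and $\bar{y}$ the mean of the actual values):

```latex
\mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\right| \qquad
\mathrm{MSE}  = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}

R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
```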
R-Squared Calculation
• The predictions below follow the line y = 12 + 0.4x; the mean of y is 30

x  | y  | prediction | (y − prediction)² | mean(y) | (y − mean)²
10 | 10 | 16         | 36                | 30      | 400
20 | 25 | 20         | 25                | 30      | 25
30 | 20 | 24         | 16                | 30      | 100
35 | 30 | 26         | 16                | 30      | 0
40 | 40 | 28         | 144               | 30      | 100
50 | 15 | 32         | 289               | 30      | 225
60 | 40 | 36         | 16                | 30      | 100
65 | 30 | 38         | 64                | 30      | 0
70 | 50 | 40         | 100               | 30      | 400
80 | 40 | 44         | 16                | 30      | 100
Total                | 722               |         | 1450

• R² = 1 − 722 / 1450 ≈ 0.502, i.e. the model explains about 50.2% of the variance
Evaluation Metrics for Linear Regression
• The error metrics (MAE, MSE, RMSE) are also called loss functions, because we want to minimize them
• In the height-weight example:
  • R-squared: 90 … good
  • MAE: 8 pounds
  • MSE: 101
  • RMSE: 10 pounds (the average weight in the dataset is 161 pounds, so RMSE is about 6% of the average, which is good)
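A minimal sketch of computing these four metrics with scikit-learn; the height-weight numbers here are made up for illustration (they are not the dataset behind the slide's figures):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical heights (inches) -> weights (pounds)
heights = np.array([60, 62, 65, 68, 70, 72, 74]).reshape(-1, 1)
weights = np.array([115, 125, 140, 155, 165, 180, 190])

model = LinearRegression().fit(heights, weights)
pred = model.predict(heights)

mae = mean_absolute_error(weights, pred)   # average error, in pounds
mse = mean_squared_error(weights, pred)    # squared units, amplifies outliers
rmse = np.sqrt(mse)                        # back in pounds, easier to interpret
r2 = r2_score(weights, pred)               # fraction of variance explained

print(f"MAE={mae:.1f} lb, MSE={mse:.1f}, RMSE={rmse:.1f} lb, R2={r2:.2%}")
```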
Linear Regression: Python Implementations
LinearRegression()                      | OLS()
Part of scikit-learn                    | Part of statsmodels
Mainly used for machine learning tasks  | Mainly used for statistical analysis
Focus on predictive modelling           | Focus on understanding relationships between variables
Less focus on statistical details       | Provides detailed statistics
Includes intercept by default           | Need to add intercept using sm.add_constant()
Focus on metrics such as MSE, R², etc.  | Focus on coefficients, t-statistics (e.g. is there a linear relationship between a feature and the predicted variable?), p-values, CI*
*Example: Predictor (feature) X1 with coefficient (beta) = 2.5, standard error (SE) = 0.5, t-statistic = 5, p-value = 0.0001
t-statistic = 2.5 / 0.5 = 5 … the estimated coefficient of X1 is 5 SEs away from 0
p-value = 0.0001 … there is only a 0.01% chance of observing such a t-statistic if the true coefficient of X1 = 0
Since p-value < 0.05, we reject H0 (H0: X1 has no linear relationship with y) … X1 significantly affects y
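A minimal sketch contrasting the two APIs on the same toy data; the feature name X1 matches the footnote example, but the numbers are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature X1 predicting y
X = pd.DataFrame({"X1": [1, 2, 3, 4, 5, 6, 7, 8]})
y = np.array([3.1, 5.2, 6.8, 9.1, 11.2, 12.8, 15.1, 17.0])

# scikit-learn: intercept included by default, focus on prediction
sk_model = LinearRegression().fit(X, y)
print(sk_model.intercept_, sk_model.coef_, sk_model.score(X, y))  # b0, b1, R²

# statsmodels: add the intercept column explicitly, focus on inference
X_const = sm.add_constant(X)        # adds a column of 1s for b0
sm_model = sm.OLS(y, X_const).fit()
print(sm_model.summary())           # coefficients, t-statistics, p-values, CIs
```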
Scaling and Encoding
Scaling and Encoding
• Predictive analytics: Use one or more features (e.g. years of
experience) to predict a label (salary)
• Problem: Features are on different scales (e.g. age and income) or are
categorical (e.g. gender)
• Solution: Scaling and Encoding
• Scaling: Converting numeric features to a common scale
• Example: Age (0-100) and income ($0-$1 million)
• Encoding: Converting categorical variables to a numeric scale
• Example: Gender values of Male and Female
Scaling: Bring Numeric Data on a Common Scale
• Scaling: Putting all the features on the same ruler/scale
• Two common approaches:
  • Standardization/Normalization: Subtract the mean and divide by the standard deviation
  • Min-Max Scaling: Scale to a specific range based on the minimum and maximum values
Standardized Scaling (Normalized Scaling)
Age | Income   | Scaled Age | Scaled Income
25  | 35,000   | -1.22      | -1.08
40  | 50,000   | -0.26      | -0.43
55  | 70,000   | 0.70       | 0.43
68  | 1,00,000 | 1.54       | 1.73
32  | 45,000   | -0.77      | -0.65

Mean: 44 (Age), 60,000 (Income); SD: 15.61 (Age), 23,128.91 (Income)
Min-Max Scaling
• Called min-max because both the minimum and the maximum (via the range) get used

Age | Income   | Age − Minimum | Income − Minimum | Scaled Age | Scaled Income
25  | 35,000   | 0             | 0                | 0          | 0
40  | 50,000   | 15            | 15,000           | 0.34       | 0.23
55  | 70,000   | 30            | 35,000           | 0.69       | 0.53
68  | 1,00,000 | 43            | 65,000           | 1          | 1
32  | 45,000   | 7             | 10,000           | 0.16       | 0.15

Minimum: 25 (Age), 35,000 (Income); Range: 68 − 25 = 43 (Age), 1,00,000 − 35,000 = 65,000 (Income)
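A minimal sketch reproducing both scalings on the slide's age-income table with scikit-learn (note that StandardScaler uses the population standard deviation, so the income column may differ from the slide's hand calculation in the last decimal):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Age-income data from the tables above
X = np.array([[25, 35_000],
              [40, 50_000],
              [55, 70_000],
              [68, 100_000],
              [32, 45_000]], dtype=float)

# Standardization: (x - mean) / std, computed per column
print(StandardScaler().fit_transform(X).round(2))

# Min-max scaling: (x - min) / (max - min), maps each column to [0, 1]
print(MinMaxScaler().fit_transform(X).round(2))
```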
Working with Non-Numeric Features – Encoding
• Encoding: Transform categorical data into a numeric form
• Example: Passenger Class with Business (20% of travellers), Economy Plus (30%), and Economy (50%)
• Three common approaches (code sketch after the one-hot table below):
  • One-Hot Encoding: Business = 100, Economy Plus = 010, Economy = 001
  • Label Encoding: Business = 0, Economy Plus = 1, Economy = 2
  • Frequency Encoding: Business = 0.20, Economy Plus = 0.30, Economy = 0.50
One-Hot Encoding: Columns
• One-Hot encoding adds a column per category
• Example: Dataset before and after One-Hot encoding
Before One-Hot encoding:
Computer | OS
PC-01    | Windows
PC-02    | Linux
PC-03    | Linux
PC-04    | Linux
PC-05    | Windows
PC-06    | Mac

After One-Hot encoding:
Computer | OHE (Windows) | OHE (Linux) | OHE (Mac)
PC-01    | 1             | 0           | 0
PC-02    | 0             | 1           | 0
PC-03    | 0             | 1           | 0
PC-04    | 0             | 1           | 0
PC-05    | 1             | 0           | 0
PC-06    | 0             | 0           | 1
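A minimal sketch of all three encodings in pandas, applied to the slide's computer-OS table (the label mapping below is chosen by hand; the passenger-class example would work the same way):

```python
import pandas as pd

df = pd.DataFrame({
    "Computer": ["PC-01", "PC-02", "PC-03", "PC-04", "PC-05", "PC-06"],
    "OS": ["Windows", "Linux", "Linux", "Linux", "Windows", "Mac"],
})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["OS"], prefix="OHE").astype(int)
print(df[["Computer"]].join(one_hot))

# Label encoding: map each category to an integer
df["OS_label"] = df["OS"].map({"Windows": 0, "Linux": 1, "Mac": 2})

# Frequency encoding: map each category to its relative frequency
df["OS_freq"] = df["OS"].map(df["OS"].value_counts(normalize=True))
print(df)
```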
Logistic Regression
• Logistic regression: Classification technique to predict the probability
of a binary (true/false) outcome
• Forms an S-shaped curve between 0 and 1
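A minimal sketch of a binary classifier with scikit-learn; the single feature (customer tenure, in months) and the churn labels are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: customer tenure in months; label: 1 = churned, 0 = stayed
X = np.array([[1], [3], [5], [8], [12], [18], [24], [36]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LogisticRegression().fit(X, y)

# The S-shaped curve in action: predicted probability for a new tenure value
print(clf.predict_proba([[10]]))  # [[P(stay), P(churn)]], between 0 and 1
print(clf.predict([[10]]))        # class whose probability is >= 0.5
```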
Confusion Matrix
• Test to check if patients have a disease (H0: Patient does not have a disease)
n = 165     | Predicted: NO           | Predicted: YES
Actual: NO  | TN = 50                 | FP = 10 (Type I Error)
Actual: YES | FN = 5 (Type II Error)  | TP = 100

• True Positive (TP): Prediction: Disease; Reality: Disease
• True Negative (TN): Prediction: No Disease; Reality: No Disease
• False Positive (FP): Prediction: Disease; Reality: No Disease
• False Negative (FN): Prediction: No Disease; Reality: Disease
Confusion Matrix - Exercise
• Out of 1000 emails, 800 non-spams were classified correctly, 20 were
incorrectly classified as spam, 40 were incorrectly classified as
non-spam, and the remaining spams were identified correctly
• Write the null hypothesis and create a confusion matrix
• H0: Email is not spam
n = 1000    | Predicted: NO           | Predicted: YES
Actual: NO  | TN = 800                | FP = 20 (Type I Error)
Actual: YES | FN = 40 (Type II Error) | TP = 140
Metrics Derived from Confusion Matrix
• Accuracy: Overall correctness of the model's predictions
• Precision (Positive Predictive Value): Accuracy of positive predictions
• Recall (Sensitivity or True Positive Rate): Ability of the model to
identify all positive instances
• Specificity (True Negative Rate): Ability of the model to identify all
negative instances
• F1 Score: Harmonic mean of precision and recall and provides a
balance between the two metrics
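The standard formulas, applied as a sketch to the spam-exercise counts above (TP = 140, TN = 800, FP = 20, FN = 40):

```python
# Confusion-matrix counts from the spam exercise
TP, TN, FP, FN = 140, 800, 20, 40

accuracy = (TP + TN) / (TP + TN + FP + FN)   # overall correctness
precision = TP / (TP + FP)                   # accuracy of positive predictions
recall = TP / (TP + FN)                      # sensitivity / true positive rate
specificity = TN / (TN + FP)                 # true negative rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy={accuracy:.3f}, Precision={precision:.3f}, "
      f"Recall={recall:.3f}, Specificity={specificity:.3f}, F1={f1:.3f}")
# Accuracy=0.940, Precision=0.875, Recall=0.778, Specificity=0.976, F1=0.824
```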
Metrics for Our Example
• Using the disease-test confusion matrix (TP = 100, TN = 50, FP = 10, FN = 5; n = 165):
  • Accuracy = (100 + 50) / 165 ≈ 0.91
  • Precision = 100 / (100 + 10) ≈ 0.91
  • Recall = 100 / (100 + 5) ≈ 0.95
  • Specificity = 50 / (50 + 10) ≈ 0.83
  • F1 Score = 2 × (0.91 × 0.95) / (0.91 + 0.95) ≈ 0.93