Predictive Analytics For Data Science Unit 1
Unit-1
1.1 Introduction to predictive analytics
1.2 Business Analytics
1.3 Business Analytics Types
1.4 Applications of Predictive Analytics
1.5 Analytics Techniques
1.6 Business Analytics Tools
1.7 Business Analytics Models
1.8 Data Pre-processing and Model Tuning
1.9 Data Transformations
1.10 Dealing with Missing Values
1.11 Removing Predictors
1.12 Adding Predictors
1.13 Binning Predictors
1.14 Overfitting
1.15 Model Tuning
1.16 Data Splitting
1.17 Resampling Techniques
1.4 Applications of Predictive Analytics
1. Analytical Customer Relationship Management (CRM):
Predictive analytics helps businesses build a complete view of each customer by analyzing
data across departments. It supports marketing, sales, and customer service by predicting
buying habits, identifying high-demand products, and flagging potential issues before
customers churn. Analytical CRM applies these insights throughout the customer lifecycle,
from acquisition to win-back.
2. Clinical Decision Support Systems:
In healthcare, predictive analytics is used to identify patients at risk for conditions such as
heart disease or diabetes. It powers decision support systems that assist clinicians in making
informed, timely medical decisions at the point of care, improving patient outcomes.
3. Collection Analytics:
Financial institutions use predictive analytics to identify which delinquent customers are most
likely to repay. This allows better targeting of collection efforts, reducing costs and
increasing recovery rates by optimizing collection strategies for each customer.
4. Cross-Sell:
Organizations use predictive analytics to find patterns in customer behavior, enabling them to
offer additional, relevant products to existing customers. This increases profitability per
customer and strengthens long-term relationships.
5. Customer Retention:
Predictive analytics helps detect early signs of customer churn by analyzing past behavior,
service usage, and spending. Companies can act proactively with personalized offers to retain
customers and combat silent attrition—where usage decreases slowly over time.
6. Direct Marketing:
Predictive analytics identifies the best product offers, timing, and communication channels
for each customer. It lowers marketing costs by targeting only the most promising leads with
customized campaigns.
7. Fraud Detection:
Businesses apply predictive models to flag fraudulent activities like identity theft, fake
transactions, and false claims. These models assign risk scores to accounts or activities,
allowing early detection and prevention of fraud across industries, including finance,
insurance, and tax.
8. Portfolio, Product, or Economy-Level Prediction:
Predictive analytics can forecast trends not just for consumers but also for products,
companies, industries, or entire economies. Techniques such as time series analysis help
organizations plan for future demand, inventory, or market changes.
9. Risk Management:
Tools such as the Capital Asset Pricing Model (CAPM) and probabilistic risk assessment
(PRA) use predictive analytics to forecast financial returns and project risks. These methods
support informed decision-making by simulating possible outcomes, improving both
short-term and long-term strategic planning.
10. Underwriting:
Businesses assess future risk by analyzing historical data on claims, payments, or defaults.
Predictive analytics helps determine appropriate pricing for insurance premiums, loans, or
credit approvals, streamlining decision-making and reducing exposure to financial loss.
1.5 Analytics Techniques
1. Linear Regression:
Linear regression models the relationship between a dependent variable and one or
more independent variables using a linear equation. It aims to minimize the difference
between observed and predicted values using the ordinary least squares (OLS)
method, making it suitable for continuous outcomes.
2. Discrete Choice Models:
These models are used when the dependent variable is categorical rather than
continuous. They include logistic, multinomial logit, and probit models, which are
more appropriate than linear regression when dealing with binary or multi-category
outcomes.
3. Logistic Regression:
This technique models binary outcomes (e.g., yes/no) by estimating the probability of
an event occurring. The log-odds (logit) transformation maps these probabilities onto a
continuous scale for modeling, and model performance is evaluated using goodness-of-fit
tests. (See the R sketches after this list of techniques.)
4. Multinomial Logistic Regression:
Used when the outcome has more than two unordered categories (e.g., red, blue,
green), this model helps avoid loss of information that can result from collapsing
variables into binary form. It’s ideal for complex multi-class classification tasks.
5. Probit Regression:
Probit models assume a normal distribution of the error term, making them useful for
modeling binary or proportional outcomes, especially in economics. While similar to
logistic regression, probit models are preferred when normality is assumed in the
latent variable.
6. Time Series Models:
Time series models handle data indexed over time by capturing trends, seasonality,
and autocorrelation. Techniques like ARMA, ARIMA, ARCH, and GARCH help
forecast future values and are essential in finance and economics.
7. Survival (Duration) Analysis:
This technique is used to analyze time-to-event data, especially when the event hasn't
occurred for all subjects (censoring). It models hazard rates and survival probabilities,
often using distributions like Weibull or exponential, and is widely used in medicine
and engineering.
8. Classification and Regression Trees (CART):
CART builds decision trees by splitting data based on variables that best differentiate
outcomes. It’s a non-parametric approach used for both classification and regression,
with extensions like Random Forests enhancing performance.
9. Multivariate Adaptive Regression Splines (MARS):
MARS is a flexible, non-parametric regression method that uses piecewise linear
regressions and basis functions to model complex relationships. It overfits initially
and prunes to find an optimal model, suitable for high-dimensional data.
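The following R sketches illustrate a few of the techniques above on standard built-in
datasets (mtcars, iris, AirPassengers, and the survival package's lung data). They are
minimal, hedged examples chosen for illustration, not the specific analyses described in
the text.
# 1. Linear regression (OLS): predict fuel efficiency from weight and horsepower
lmFit <- lm(mpg ~ wt + hp, data = mtcars)
summary(lmFit)                               # coefficients, R-squared, residual error

# 3. Logistic regression: model a binary outcome (transmission type) via the logit link
logitFit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
head(predict(logitFit, type = "response"))   # predicted probabilities

# 6. Time series: seasonal ARIMA model for the monthly AirPassengers series
tsFit <- arima(log(AirPassengers), order = c(0, 1, 1),
               seasonal = list(order = c(0, 1, 1), period = 12))
predict(tsFit, n.ahead = 12)$pred            # 12-month-ahead forecast (log scale)

# 7. Survival analysis: Kaplan-Meier curves for the lung cancer data
library(survival)
survFit <- survfit(Surv(time, status) ~ sex, data = lung)
summary(survFit)$table

# 8. Classification tree (CART): split iris measurements to predict species
library(rpart)
treeFit <- rpart(Species ~ ., data = iris, method = "class")
print(treeFit)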
1.6 Business Analytics Tools
1. scikit-learn
A powerful Python library built on NumPy and SciPy, offering simple and efficient
tools for data mining and machine learning. It supports a wide range of supervised
and unsupervised learning algorithms, making it ideal for both beginners and experts.
2. KNIME
An open-source data analytics platform that enables visual workflows for data
manipulation, analysis, and modeling. It integrates with various programming
languages and tools, making it versatile for advanced analytics.
3. Orange
A Python-based data visualization and machine learning tool that features an
interactive, drag-and-drop interface. It is excellent for teaching, prototyping, and
performing advanced analytics without heavy coding.
4. R
A statistical programming language widely used for data analysis, visualization, and
predictive modeling. It boasts thousands of packages and strong community support
for various analytics tasks.
5. RapidMiner
Although it offers a commercial version, its open-source edition allows visual
workflow development for data preparation, modeling, and validation. It is known for
its easy-to-use interface and wide range of integrated tools.
6. Weka
A Java-based suite of machine learning tools, ideal for educational and research
purposes. It includes a variety of algorithms for classification, regression, clustering,
and visualization.
7. IBM SPSS Statistics & Modeler
SPSS provides advanced statistical analysis and Modeler adds machine learning and
text analytics. Both are popular in academic and enterprise settings for data-driven
decision-making.
8. MATLAB
A high-performance language for numerical computing, MATLAB is used for
algorithm development and data analysis. It supports a wide range of toolboxes for
predictive modeling and machine learning.
9. SAP
SAP integrates predictive analytics within its business intelligence and ERP systems.
It offers both automated and expert-driven modeling tools.
10. SAS & SAS Enterprise Miner
SAS is a comprehensive suite for data analysis, with Enterprise Miner offering
advanced modeling and visualization capabilities. It's a favorite in finance, healthcare,
and government sectors.
11. STATISTICA
A predictive analytics and data mining software developed by TIBCO. It offers visual
workflows, advanced modeling, and integration with R and Python.
1.7 Business Analytics Models
Predictive Models estimate the likelihood of future outcomes based on known data
patterns. They analyze the relationship between a unit’s attributes and its behaviour using a
training sample. These models are commonly used in real-time applications like fraud
detection or credit scoring. Predictive models are especially useful when applied to new,
unseen data (out-of-sample) to forecast behaviour. They also support advanced simulations,
such as predicting authorship or analyzing crime scenes.
Purpose:
Forecasting: Predicting future trends, such as sales, stock prices, or weather patterns.
Risk Assessment: Identifying potential risks, like fraud, credit risk, or cybersecurity threats.
Resource Optimization: Predicting resource needs, such as staffing, inventory, or energy
consumption.
Customer Behavior Analysis: Understanding customer preferences, predicting purchase
patterns, and personalizing marketing efforts.
Inputs:
Historical Data: Past data relevant to the prediction, such as sales figures, customer
demographics, or website traffic.
Statistical Data: Statistical information about the data, such as mean, median, standard
deviation, and correlation.
Machine Learning Algorithms: Algorithms like regression, classification, or neural networks
are used to analyze the data and make predictions.
Features: Specific attributes or variables within the data that are used as inputs to the model.
Outputs:
Predicted Values: The numerical or categorical values predicted by the model, such as sales
forecasts, risk scores, or customer segments.
Probabilities: The likelihood of a certain event occurring, like the probability of a customer
defaulting on a loan.
Recommendations: Suggested actions based on the predictions, like recommending products
to a customer or optimizing a marketing campaign.
Descriptive Models focus on identifying patterns and relationships within data to group or
classify subjects, like customers or products. Unlike predictive models, they don't forecast
behaviour but instead reveal hidden structures, such as customer segmentation based on
preferences. These models are widely used in marketing and customer analytics. They help
understand the "what" and "why" behind behaviours, not just the "what might happen."
Descriptive modeling also serves as a foundation for building more complex simulations.
Descriptive analytics is typically used to produce correlations, cross-tabulations, frequencies,
and similar summaries. These techniques identify similarities in the data and uncover existing
patterns. Another application of descriptive analysis is to discover meaningful subgroups
within the larger body of available data. Descriptive analytics emphasizes the summarization
and transformation of data into meaningful information for reporting and monitoring.
Examples of descriptive data mining include clustering, association rule mining, and anomaly
detection. Clustering involves grouping similar objects together, while association rule
mining involves identifying relationships between different items in a dataset. Anomaly
detection involves identifying unusual patterns or outliers in the data.
Purpose:
Understanding and Explanation: Descriptive models help in understanding the structure,
behavior, and relationships within a system.
Documentation and Communication: They can be used to document processes, facilitate
communication among stakeholders, and train new users.
Analysis and Improvement: By visualizing the system, descriptive models can help identify
areas for improvement or optimization.
Inputs:
Data: Descriptive models rely on data about the system being modeled. This can include
information about components, their attributes, relationships, and interactions.
Knowledge: Domain expertise and understanding of the system's context are crucial for
creating accurate and meaningful descriptive models.
Outputs:
Descriptions: These can be textual descriptions, diagrams, flowcharts, or other visual
representations of the system's components and their relationships.
Summaries: Descriptive models can produce summaries of key characteristics, patterns, or
trends within the data.
Visualizations: Charts, graphs, and other visual aids help in understanding the system's
behavior and relationships.
Documentation: Descriptive models can be used to create documentation for processes,
systems, or architectures.
Decision Models connect all parts of a decision-making process: inputs (data), decisions,
and their possible outcomes. They are designed to optimize decisions by balancing multiple
objectives (e.g., maximizing profit while minimizing risk). These models use results from
predictive models to inform action. Businesses rely on decision models to automate and
improve complex decision-making through rule-based systems. Their strength lies in
simulating and evaluating the impact of different decisions in varying scenarios.
Decision models are structured approaches to making choices by defining inputs, processing
them, and producing outputs that guide actions. They serve to simplify complex decisions,
clarify decision-making processes, and improve the consistency and quality of outcomes.
Inputs can be predetermined factors, time-varying factors, data, decision variables, or
uncontrollable variables. The output is typically a recommended action, optimal offer, or a set
of evaluated options.
Purpose:
Simplify complex decisions: Decision models break down complex problems into
manageable components, making it easier to understand and analyze different options.
Improve decision quality: By providing a structured framework, decision models help to
ensure that decisions are based on relevant information and criteria.
Increase consistency: Decision models can be used repeatedly to make similar decisions,
leading to more consistent outcomes over time.
Facilitate collaboration: Decision models can be shared and understood by multiple
stakeholders, promoting better communication and collaboration in the decision-making
process.
Inputs:
Predetermined factors: These are factors that are fixed or relatively stable, such as initial
policies, application constraints, or available technology.
Time-varying factors: These are factors that change over time, such as market conditions,
customer behavior, or competitor actions.
Data: This includes relevant information about the situation, such as costs, benefits, risks, or
performance metrics.
Decision variables: These are factors that are under the control of the decision-maker, such as
the amount of resources to allocate, the price to charge, or the target market to pursue.
Uncontrollable variables: These are factors that affect the decision but are outside the control
of the decision-maker, such as economic conditions, weather patterns, or government
regulations.
Outputs:
Recommended actions: Decision models can suggest specific courses of action based on the
inputs and the model's logic.
Optimal offers: In a marketing or sales context, decision models can identify the most
appropriate offers to present to customers based on their profiles and behavior.
Evaluated options: Decision models can rank or score different options based on their
potential outcomes, helping the decision-maker to choose the best alternative.
Decision rules: These are specific rules that are triggered by certain inputs and produce
corresponding outputs, guiding the decision-making process.
Visualization of the decision-making process: Decision models can visually represent the
steps involved in making a decision, making it easier to understand the logic behind the
outcome.
Data cleaning is the process of detecting and rectifying faults or inconsistencies in a dataset
by removing or modifying records so that the data meets the quality standards required for
analysis. It is an essential step in data preprocessing because it determines how the data can
be used in subsequent modeling.
The importance of data cleaning lies in the following factors:
Improved data quality: Cleaning the data reduces errors, inconsistencies, and missing values,
making the data more accurate and reliable for analysis.
Better decision-making: Consistent, clean data gives organizations comprehensive and
up-to-date information, reducing the risk of decisions being based on outdated or incomplete
data.
Increased efficiency: High-quality data is easier to analyze, model, and report on; clean data
avoids much of the time and effort otherwise spent handling poor data quality.
Compliance and regulatory requirements: Industries and regulatory authorities set standards
for data quality, and data cleaning helps organizations conform to these standards and avoid
penalties and legal risks.
Navigating Common Data Quality Issues in Analysis and Interpretation
Data quality issues can arise from many sources, including human error, technical input
failures, and problems when merging data. Common types of data quality problems include:
Missing values: Absent or incomplete information can prevent sound conclusions and may
lead to biased results.
Duplicate data: Duplicate or repeated records can introduce inconsistent values within the
dataset and skew results.
Incorrect data types: Fields containing values of the wrong data type (for instance, strings
stored in a numeric field) can hamper analysis and cause inaccuracies.
Outliers and anomalies: Outliers are observations whose values are unusually high or low
compared to the rest of the dataset; they can distort analyses and statistical results.
Inconsistent formats: Discrepancies such as differing date formats or capitalization can
create challenges when combining data.
Spelling and typographical errors: Because many analyses depend on text fields,
misspellings and typos often cause values to be misinterpreted or categorized incorrectly.
Common Data Cleaning Tasks
Data cleaning involves several key tasks, each aimed at addressing specific issues within a
dataset. Here are some of the most common tasks involved in data cleaning:
1. Handling Missing Data
Missing data is a common problem in datasets. Strategies to handle missing data include:
Removing Records: Deleting rows with missing values if they are relatively few and
insignificant.
Imputing Values: Replacing missing values with estimated ones, such as the mean,
median, or mode of the dataset.
Using Algorithms: Employing advanced techniques like regression or machine
learning models to predict and fill in missing values.
2. Removing Duplicates
Duplicates can skew analyses and lead to inaccurate results. Identifying and removing
duplicate records ensures that each data point is unique and accurately represented.
3. Correcting Inaccuracies
Data entry errors, such as typos or incorrect values, need to be identified and corrected. This
can involve cross-referencing with other data sources or using validation rules to ensure data
accuracy.
4. Standardizing Formats
Data may be entered in various formats, making it difficult to analyze. Standardizing formats,
such as dates, addresses, and phone numbers, ensures consistency and makes the data easier
to work with.
5. Dealing with Outliers
Outliers can distort analyses and lead to misleading results. Identifying and addressing
outliers, either by removing them or transforming the data, helps maintain the integrity of the
dataset. (A short R sketch of some of these cleaning tasks follows below.)
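A small, hedged R sketch of several of these cleaning tasks on an invented data frame (the
values and column names are purely illustrative; dplyr is assumed):
# Illustrative data with a duplicate row, inconsistent text, and a missing value
library(dplyr)

raw <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  city  = c("Pune", "pune ", "pune ", "Mumbai", "Delhi"),
  price = c(10.5, 12.0, 12.0, NA, 11.0)
)

clean <- raw %>%
  distinct() %>%                                  # remove duplicate records
  mutate(city = tolower(trimws(city))) %>%        # standardize text formats
  mutate(price = ifelse(is.na(price),
                        median(price, na.rm = TRUE),
                        price))                   # impute missing values with the median
clean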
1. Ignore the tuple: Remove the data points with missing values.
If the number of cases with missing values is very small (for example, less than 5% of the
data), those records can simply be dropped from the analysis.
2. Fill in the missing values manually:
This approach is generally not recommended: when the dataset is large, filling in missing
values manually is time-consuming and usually infeasible.
3. Use a global constant to fill in the missing values:
Replace all missing attribute values by the same constant such as a label like “Unknown” or
−∞.
If missing values are replaced by, say, “Unknown,” then the mining program or the machine
learning algorithm may mistakenly think that they form an interesting concept, since they all
have a value in common—that of “Unknown.” Hence, although this method is simple, it is
not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing
value:
Measures of central tendency indicate the "middle" value of a data distribution (for example,
the mean, median, and mode).
If the attribute is numeric:
1. For normal (symmetric) data distributions, the mean can be used.
2. For skewed data distributions, the median should be used.
If the attribute is categorical, replace the missing values with the mode (the most frequent
category). (See the R sketch after this list of strategies.)
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, we may replace the missing
value with the mean income value for customers in the same credit risk category as that of the
given tuple.
If the data distribution for a given class is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value:
This may be determined with regression, inference-based tools using a Bayesian formalism,
or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
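A hedged R sketch of strategies 3 and 4 above (global constant and mean/median/mode
imputation) on a small illustrative data frame (column names are invented for the example):
# Sketch: simple missing-value imputation strategies
df <- data.frame(
  income = c(52000, 61000, NA, 47000, NA, 58000),
  city   = c("Pune", NA, "Mumbai", "Pune", "Delhi", NA),
  stringsAsFactors = FALSE
)

# Strategy 3: fill with a global constant
df$city_const <- ifelse(is.na(df$city), "Unknown", df$city)

# Strategy 4a: numeric attribute -> mean (symmetric) or median (skewed)
df$income_mean   <- ifelse(is.na(df$income), mean(df$income, na.rm = TRUE), df$income)
df$income_median <- ifelse(is.na(df$income), median(df$income, na.rm = TRUE), df$income)

# Strategy 4b: categorical attribute -> mode (most frequent value)
modeCity <- names(sort(table(df$city), decreasing = TRUE))[1]
df$city_mode <- ifelse(is.na(df$city), modeCity, df$city)

df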
"What is noise?" Noise is a random error or variance in a measured variable. Basic statistical
description techniques (e.g., boxplots and scatter plots) and methods of data visualization can
be used to identify outliers, which may represent noise.
Given a numeric attribute such as, say, price, how can we “smooth” out the data to remove
the noise? Let’s look at the following data smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that
is, the values around it. The sorted values are distributed into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
The example below illustrates some binning techniques. In this example, the data for price are first
sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three
values). In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced
by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value. In general, the larger the width, the greater the effect of the smoothing.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
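A short R sketch of equal-frequency binning and smoothing by bin means for the price
values above (a hand-rolled illustration rather than a library function):
# Sketch: smoothing by bin means with equal-frequency bins of size 3
price <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)   # already sorted

binSize <- 3
binIds  <- rep(seq_along(price), each = binSize, length.out = length(price))
bins    <- split(price, binIds)

# replace every value in a bin by that bin's mean
smoothedMeans <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
smoothedMeans   # 9 9 9 22 22 22 29 29 29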
Regression: Data smoothing can also be done by regression, a technique that conforms data
values to a function. Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other. Multiple linear regression is
an extension of linear regression, where more than two attributes are involved and the data
are fit to a multidimensional surface.
Outlier analysis: Outliers may be detected by clustering, for example, where similar values
are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of
clusters may be considered outliers.
# Centering and scaling example: standardize the depth variable (z-score)
library(dplyr)
library(ggplot2)
diamonds %>%
  mutate(scaled_depth = (depth - mean(depth)) / sd(depth)) %>%
  ggplot(aes(x = scaled_depth)) +
  geom_histogram(binwidth = .5)
# Encoding a categorical variable as numeric: 1 if the worker has health insurance
library(ISLR)
Wage %>%
  mutate(numeric_insurance = ifelse(health_ins == "1. Yes", 1, 0)) %>%
  head()
Skewed Data
In many data analyses, variables have a skewed distribution over their range. In the last
section we saw one way of defining skew using quartiles and the median. Variables with
skewed distributions can be hard to incorporate into some modeling procedures, especially in
the presence of other variables that are not skewed. In such cases, applying a transformation
to reduce skew can improve the performance of models.
A common measure is the sample skewness statistic,
skewness = Σ(x − x̄)³ / [(n − 1) · v^(3/2)],  where v = Σ(x − x̄)² / (n − 1),
x is the predictor variable, n is the number of values, and x̄ is the sample mean of the
predictor.
Skewed data may also arise when measuring multiplicative processes, which is very common
in physical or biochemical processes. In such cases, interpretation of the data may be more
intuitive after a transformation, such as a log transformation.
library(nycflights13)  # the flights data; departure delays are strongly right-skewed
flights %>% ggplot(aes(x = dep_delay)) + geom_histogram(binwidth = 30)
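As a hedged follow-up sketch, a log transformation of the positive departure delays produces
a much more symmetric distribution (the choice of the log transform here is illustrative):
# Reduce right skew by log-transforming the positive delays
library(dplyr)
library(ggplot2)
library(nycflights13)

flights %>%
  filter(dep_delay > 0) %>%                 # the log is only defined for positive values
  mutate(log_delay = log(dep_delay)) %>%
  ggplot(aes(x = log_delay)) +
  geom_histogram(binwidth = 0.25)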
Data reduction techniques are another class of predictor transformations. These methods
reduce the data by generating a smaller set of predictors that seek to capture a majority of the
information in the original variables. In this way, fewer variables can be used that provide
reasonable fidelity to the original data. For most data reduction techniques, the new
predictors are functions of the original predictors; therefore, all the original predictors are still
needed to create the surrogate variables. This class of methods is often called signal
extraction or feature extraction techniques.
PCA is a widely used data reduction technique that seeks to find linear combinations of the
predictors, called principal components (PCs), which capture the maximum possible
variance in the data. The first principal component (PC1) is defined as the linear
combination of the predictors that explains the greatest amount of variability among all
possible linear combinations. Subsequent components (PC2, PC3, …) are constructed in such
a way that they capture the maximum remaining variability while being uncorrelated with all
previously derived components.
1. Preprocessing
o Predictors are often on different scales or skewed.
o Solution: Transform skewed predictors, then center and scale before applying
PCA.
2. Limitations
o PCA is unsupervised, meaning it ignores the response variable.
o If predictor variance does not align with predictive relationships, PCA may not
improve modeling.
o Alternative: Use Partial Least Squares (PLS) when the response variable
must be considered.
3. Choosing Number of Components
o Use a scree plot (variance explained vs component number).
o Select the number of components at the elbow point, where variance
contribution levels off.
o Cross-validation can also determine the optimal number of PCs.
In the cell segmentation data, PC1 explains 14%, PC2 explains 12.6%, and PC3 explains
9.4% of the variance.
Together, the first three PCs account for about 36% of total variance.
After PC4, the variance contribution drops sharply.
Even with 4 PCs, only 42.4% of the total data information is captured, highlighting
that many predictors still contain unexplained variability.
Scatter plots of the first three PCs, with points colored by class (segmentation quality), can
provide insight into how well the classes separate in the reduced space.
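A minimal R sketch of this PCA workflow on the segmentation training predictors, mirroring
the computing section later in this unit (centering and scaling are applied before PCA):
# Sketch: PCA on the segmentation training predictors (centered and scaled)
library(AppliedPredictiveModeling)

data(segmentationOriginal)
segTrain <- subset(segmentationOriginal, Case == "Train")

# drop identifier columns and the binary "Status" columns
segTrain <- segTrain[, !(names(segTrain) %in% c("Cell", "Class", "Case"))]
segTrain <- segTrain[, -grep("Status", names(segTrain))]

pcaObject <- prcomp(segTrain, center = TRUE, scale. = TRUE)

# percent of total variance explained by each component (basis for a scree plot)
percentVariance <- pcaObject$sdev^2 / sum(pcaObject$sdev^2) * 100
head(percentVariance)
plot(percentVariance, type = "b",
     xlab = "Component", ylab = "Percent of total variance")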
Removing predictors before building a model can be beneficial for several reasons. First,
reducing the number of predictors decreases computational time and simplifies the overall
model. Second, when two predictors are highly correlated, they essentially capture the same
underlying information. In such cases, removing one of them usually does not affect model
performance and can even result in a simpler and more interpretable model. Third, certain
predictors with poor statistical properties can negatively impact model performance or
stability. By eliminating these problematic variables, the quality of the model can improve
significantly.
A common issue arises with zero variance predictors, which are variables that take on only
a single unique value across all samples. These predictors provide no useful information. For
some models, such as tree-based methods, zero variance predictors are harmless because they
will never be used for splitting. However, in models like linear regression, these variables can
cause serious computational errors. Since they do not contribute any information, they can be
safely discarded.
Similarly, near-zero variance predictors can also cause problems. These are variables that
contain very few unique values, often with one value dominating most of the data. For
instance, in a text mining example, after removing common stop words, a keyword count
variable may only appear in a small number of documents. Suppose in a dataset of 531
documents, a particular keyword is absent in 523 documents, appears twice in six documents,
three times in one document, and six times in one document. In this case, 98% of the data
corresponds to zero occurrences, while only a handful of documents contain the keyword.
This makes the predictor highly imbalanced and prone to exerting undue influence on the
model. Furthermore, if resampling is used, there is a risk that some resampled datasets may
contain no occurrences of the keyword at all, leaving the predictor with a single unique value.
To diagnose near-zero variance predictors, two criteria are often used. First, the fraction of
unique values relative to the total sample size should be examined. If this fraction is very
small (for example, below 10%), the predictor may be problematic. Second, the ratio of the
most frequent value to the second most frequent value should be considered. If this ratio is
very large (for example, greater than 20), it indicates a strong imbalance. In the document
example, the ratio of zero occurrences (523) to two occurrences (6) is 87, which clearly
signals an imbalance.
Therefore, when both conditions are met and the model being used is sensitive to such issues,
it is often advantageous to remove the near-zero variance predictor to improve model
performance and stability.
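A short, hedged R sketch of these diagnostics using caret's nearZeroVar (applied here to the
segmentation training data for illustration):
# Sketch: diagnosing near-zero variance predictors with caret
library(caret)
library(AppliedPredictiveModeling)

data(segmentationOriginal)
segTrain <- subset(segmentationOriginal, Case == "Train")

# freqRatio     = most frequent value / second most frequent value
# percentUnique = 100 * (number of unique values / number of samples)
nzvInfo <- nearZeroVar(segTrain, saveMetrics = TRUE)
head(nzvInfo)

# column indices of predictors flagged as near-zero variance
nzvCols <- nearZeroVar(segTrain)
filtered <- if (length(nzvCols) > 0) segTrain[, -nzvCols] else segTrain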
Between-Predictor Correlations
Collinearity occurs when two predictor variables are strongly correlated with each other,
while multicollinearity refers to the case where multiple predictors have strong
interrelationships. For instance, in cell segmentation data, several predictors measure cell
size, such as perimeter, width, and length, while others describe cell morphology, like
roughness.
A correlation matrix helps visualize predictor relationships, where colors indicate the strength
and direction of correlations. Clustering techniques group correlated predictors, often forming
“blocks” of related variables. When many predictors exist, Principal Component Analysis
(PCA) can summarize correlations, with loadings showing which predictors contribute to
each component.
Highly correlated predictors are undesirable because they add redundancy, increase model
complexity, and can cause instability or poor performance in linear regression. The Variance
Inflation Factor (VIF) is commonly used to detect multicollinearity, though it has limitations,
such as being specific to linear models and not identifying which predictors to remove.
A simpler heuristic approach involves eliminating predictors with high correlations until all
pairwise correlations fall below a chosen threshold. The process is:
1. Calculate the correlation matrix of the predictors.
2. Find the two predictors with the largest absolute pairwise correlation (call them A and B).
3. Compute the average absolute correlation between A and the other predictors; do the same
for B.
4. Remove whichever of A or B has the larger average correlation.
5. Repeat steps 2 through 4 until no absolute pairwise correlation exceeds the threshold.
For example, if a threshold of 0.75 is set, predictors with correlations above 0.75 are
gradually removed. In the segmentation dataset, this method recommended removing 43
predictors.
Feature extraction methods such as PCA can also reduce the effects of collinearity by
creating new surrogate predictors. However, this makes interpretation more difficult and,
since PCA is unsupervised, the new predictors may not necessarily relate well to the outcome
variable.
When a predictor is categorical (e.g., gender, race), it is usually broken down into smaller,
more specific variables. For example, in the credit scoring dataset (Sect. 4.5), a predictor
indicates the amount of money in an applicant’s savings account. These savings values are
grouped into categories, including an “unknown” group.
For each group, the number of applicants and the corresponding dummy-variable encoding
can be tabulated. A dummy variable is a binary (0/1) indicator for each
category. Usually, one dummy variable is created for each category. However, in practice,
only (number of categories – 1) dummy variables are needed, since the last one can be
inferred from the others.
If a model includes an intercept term (like linear regression), including all dummy variables
creates a problem: they always sum to one, making them redundant with the intercept. To
avoid numerical issues, one dummy variable is dropped. But in models that are insensitive to
this issue, using all dummy variables may improve interpretation.
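A hedged R sketch of dummy-variable creation with caret's dummyVars (the savings groups
and data frame below are hypothetical stand-ins for the credit scoring example):
# Sketch: converting a categorical predictor into dummy variables
library(caret)

# hypothetical applicant data with a grouped savings-account predictor
applicants <- data.frame(
  savings = factor(c("<100", "100-500", "500-1000", ">1000", "unknown")),
  amount  = c(2000, 1500, 3200, 5000, 900)
)

# fullRank = TRUE drops one level, giving (categories - 1) dummy variables,
# which avoids redundancy with an intercept term
dummyInfo <- dummyVars(~ savings + amount, data = applicants, fullRank = TRUE)
predict(dummyInfo, newdata = applicants)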
Many modern models automatically capture nonlinear relationships between predictors and
outcomes. However, simpler models do not, unless the user explicitly specifies nonlinear
terms.
For example, logistic regression typically creates linear classification boundaries. With only
linear terms for two predictors A and B, the decision boundary is a straight line; when a
quadratic term (B²) is added, the boundary becomes curved.
Adding nonlinear terms (like squares or interactions) allows logistic regression to model
more complex decision boundaries without resorting to very complex techniques, which may
risk overfitting.
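A minimal sketch of adding a quadratic term to a logistic regression formula in R (the
simulated data and the variable names A, B, y are purely illustrative):
# Sketch: logistic regression with an added quadratic term
# simulated two-class data where the true boundary is curved in B
set.seed(1)
n <- 200
dat <- data.frame(A = rnorm(n), B = rnorm(n))
dat$y <- factor(ifelse(dat$A + dat$B^2 + rnorm(n, sd = 0.5) > 1, "yes", "no"))

# linear terms only: a straight-line class boundary
fitLinear <- glm(y ~ A + B, data = dat, family = binomial)

# adding B^2 lets the model form a curved boundary
fitQuad <- glm(y ~ A + B + I(B^2), data = dat, family = binomial)

summary(fitQuad)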
Additionally, Forina et al. (2009) proposed another approach: calculating class centroids (the
center of predictor values for each class). For each predictor, the distance to each class
centroid is computed and added as new predictors in the model. This technique enhances
classification by incorporating richer information.
Binning predictors refers to the process of converting continuous variables into categories or
groups before analysis. For example, in diagnosing Systemic Inflammatory Response
Syndrome (SIRS), clinical measurements such as temperature, heart rate, respiratory rate, and
white blood cell count are categorized into ranges to simplify diagnosis. The main advantages
of this approach are that it creates simple decision rules that are easy to interpret, the modeler
does not need to know the exact relationship between predictors and outcomes, and it can
improve survey response rates since people often find it easier to answer questions in ranges
rather than exact values.
However, manual binning of predictors has several drawbacks. First, it reduces model
performance because modern predictive models are capable of learning complex relationships
that are lost when data is simplified into bins. Second, it causes a loss of precision in
predictions because fewer categories limit the number of possible outcomes, leading to
oversimplified results. Third, research has shown that binning can increase the rate of false
positives by incorrectly identifying noise variables as important. Overall, although binning
may improve interpretability, it usually comes at the cost of predictive accuracy, which can
even be unethical in sensitive fields such as medical diagnosis where accuracy is critical.
Key distinction: binning improves interpretability but typically sacrifices predictive accuracy.
COMPUTING
1. Data Source
o Raw data: segmentationOriginal from AppliedPredictiveModeling package.
o Use only training samples (Case == "Train").
o Remove ID columns (Cell, Class, Case) and unnecessary "Status" columns.
2. Checking Skewness
o Use skewness() from e1071 package to detect skewed predictors.
3. Transformations
o Use Box-Cox transformation (BoxCoxTrans in caret) to normalize skewed
features.
o Apply multiple transformations (BoxCox, center, scale, pca) using preProcess().
4. Principal Component Analysis (PCA)
o Use prcomp() for PCA.
o Extract variance explained and loadings.
o preProcess(method = "pca") can also apply PCA automatically.
5. Filtering Predictors
o Remove near-zero variance predictors: nearZeroVar().
o Remove highly correlated predictors: findCorrelation() with a cutoff (e.g.,
0.75).
o Visualize correlation matrix with corrplot().
6. Handling Missing Values
o Use preProcess() (KNN/bagged tree imputation) or impute.knn() from impute.
7. Creating Dummy Variables
o Use dummyVars() in caret to encode categorical variables.
o Interaction terms can be added in formulas (e.g., Mileage:Type).
🖥️Sample R Program
# Load libraries
library(AppliedPredictiveModeling)
library(caret)
library(e1071)
library(corrplot)
# Load data and keep only the training samples
data(segmentationOriginal)
segData <- subset(segmentationOriginal, Case == "Train")

# Drop the ID columns and the binary "Status" columns before computing correlations
segData <- segData[, !(names(segData) %in% c("Cell", "Class", "Case"))]
segData <- segData[, -grep("Status", names(segData))]

# Correlation filtering: visualize, then remove predictors correlated above 0.75
correlations <- cor(segData)
corrplot(correlations, order = "hclust")
highCorr <- findCorrelation(correlations, cutoff = 0.75)
filteredSegData <- segData[, -highCorr]
Problem of Over-Fitting
Overfitting and underfitting are two fundamental problems encountered when training
machine learning models, both indicating an imbalance in model complexity relative to the
underlying data patterns.
1.14 Overfitting
Overfitting occurs when a model learns the training data too well, including its noise and
random fluctuations, rather than the underlying generalizable patterns. This results in
excellent performance on the training data but poor performance on new, unseen data. It is
analogous to memorizing answers for a specific exam without understanding the subject
matter, leading to failure on a different exam.
Causes of Overfitting:
Excessive model complexity: Using a model that is too intricate for the given data,
allowing it to capture noise.
Insufficient training data: Not having enough data to represent the true underlying
patterns, causing the model to learn specific instances.
Underfitting:
Underfitting occurs when a model is too simple to capture the underlying patterns and
relationships within the training data. This results in poor performance on both the training
data and new, unseen data. It is like trying to explain a complex phenomenon with a highly
simplified theory, missing crucial details.
Causes of Underfitting:
Insufficient model complexity: Using a model that is too simple to represent the
complexity of the data.
Lack of relevant features: Not including enough informative features in the data to
allow the model to learn effectively.
Insufficient training: Not training the model for enough iterations or with enough
data for it to learn the patterns. [1, 2]
1.15 Model Tuning
Random search selects hyperparameter values randomly from a defined distribution instead
of testing all possible combinations (as grid search does). It is faster and more efficient than
grid search and often finds good solutions quickly, but it does not guarantee the absolute best
configuration, since many possibilities are skipped.
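As a hedged illustration (not from the source), caret's train() supports random hyperparameter
search via trainControl(search = "random"); the model, data, and settings below are chosen
only for demonstration:
# Sketch: random hyperparameter search with caret
library(caret)

ctrl <- trainControl(method = "cv", number = 5, search = "random")

set.seed(42)
fit <- train(Species ~ ., data = iris,
             method = "svmRadial",    # tunes sigma and C; requires the kernlab package
             trControl = ctrl,
             tuneLength = 8)          # number of random hyperparameter combinations tried

fit$bestTune                          # best settings found by the random search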
Model Tuning – The process of finding the best hyperparameters (e.g., learning rate,
layers, batch size) before training.
Model Training – The process where the algorithm learns from data by adjusting
weights and biases to minimize the loss function using optimization methods like
gradient descent.
Training seeks convergence (minimized loss), which may reach a local minimum
rather than the global one, but that is often sufficient for good performance.
Learning Rate – Controls how fast weights are updated; high = faster but risk of
overshooting, low = stable but slower.
Learning Rate Decay – Gradually reduces the learning rate over time to improve
convergence.
Epochs – Number of times the entire training dataset is passed through the model.
Batch Size – Number of training samples processed before weights are updated.
Momentum – Adds inertia to updates; high momentum speeds convergence but may
skip minima, low momentum slows progress.
Number of Hidden Layers – More layers → higher capacity for complex tasks,
fewer layers → simpler, faster models.
Nodes per Layer – More nodes widen a layer, capturing complex relationships but
increasing computation.
Activation Function – Defines how nodes fire (e.g., ReLU, sigmoid, tanh) and helps
the network learn non-linear patterns.
1.16 Data Splitting
Now that we have outlined the general procedure for finding optimal tuning parameters, we
turn to discussing the heart of the process: data splitting.
A few of the common steps in model building are:
• Pre-processing the predictor data
• Estimating model parameters
• Selecting predictors for the model
• Evaluating model performance
• Fine tuning class prediction rules (via ROC curves, etc.)
Given a fixed amount of data, the modeler must decide how to “spend” their data points to
accommodate these activities.
Purpose
To evaluate the model honestly, we must use data that was not used for training or
tuning.
Ensures the model’s performance estimate is unbiased.
🔹 Training vs Test Set
Training set: Builds the model.
Test set: Estimates final model performance.
If data is small, using a test set may not be efficient.
🔹 Pitfall: Single Test Set
Validation on a single test set can be unreliable and lead to high variance.
Resampling methods are often preferred.
🔹 Sampling Strategies
1. Simple Random Sampling
o Splits data randomly.
o Might cause class imbalance in train/test split.
2. Stratified Sampling
o Ensures outcome class distribution is preserved in both sets.
o Useful in classification problems with imbalanced classes.
3. Maximum Dissimilarity Sampling
o Selects test samples that are most dissimilar from training data.
o Ensures test set samples represent diverse areas of predictor space.
o In this approach, samples far from the existing training data are chosen for the test
set (a code sketch follows this list).
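A hedged R sketch of a stratified split with caret's createDataPartition, plus an optional
maximum dissimilarity step with maxDissim; the two-class example data set used here is
illustrative (it ships with the AppliedPredictiveModeling package), and maxDissim may
require the proxy package:
# Sketch: stratified training/test split with caret
library(caret)
library(AppliedPredictiveModeling)

data(twoClassData)   # provides 'predictors' (data frame) and 'classes' (factor)

set.seed(1)
# createDataPartition performs stratified random sampling on the outcome
trainRows <- createDataPartition(classes, p = 0.8, list = FALSE)

trainPredictors <- predictors[trainRows, ]
trainClasses    <- classes[trainRows]
testPredictors  <- predictors[-trainRows, ]
testClasses     <- classes[-trainRows]

# Maximum dissimilarity sampling: starting from a small base set, repeatedly add
# the pool sample that is most dissimilar to the points already selected
start   <- sample(seq_len(nrow(trainPredictors)), 5)
base    <- trainPredictors[start, ]
pool    <- trainPredictors[-start, ]
newSamp <- maxDissim(base, pool, n = 20)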