Predictive Analytics for Data Science

Unit-1
1.1 Introduction to predictive analytics
1.2 Business Analytics
1.3 Business Analytics Types
1.4 Applications of Predictive Analytics
1.5 Analytics Techniques
1.6 Business Analytics Tools
1.7 Business Analytics Models
1.8 Data Pre-processing and Model Tuning
1.8 Data Transformations
1.9 Dealing with Missing Values
1.10 Removing Predictors
1.11 Adding Predictors
1.12 Binning predictors
1.14 Overfitting
1.15 Model Tuning
1.16 Data Splitting
1.17 Resampling Techniques

1.1 Introduction to predictive analytics


Predictive analytics—sometimes used synonymously with predictive modelling—
encompasses a variety of statistical techniques from modeling, machine learning, and data
mining that analyse current and historical facts to make predictions about future, or otherwise
unknown, events (Nyce, 2007) (Eckerson, 2007).
In business, predictive models exploit patterns found in historical and transactional data to
identify risks and opportunities. Models capture relationships among many factors to allow
assessment of risk or potential associated with a particular set of conditions, guiding decision
making for candidate transactions.
Predictive analytics is used in actuarial science (Conz, 2008), marketing (Fletcher, 2011),
financial services (Korn, 2011), insurance, telecommunications (Barkin, 2011), retail (Das &
Vidyashankar, 2006), travel (McDonald, 2010), healthcare (Stevenson, 2011),
pharmaceuticals (McKay, 2009) and other fields.
One of the most well-known applications is credit scoring (Nyce, 2007), which is used
throughout financial services. Scoring models process a customer’s credit history, loan
application, customer data, etc., in order to rank-order individuals by their likelihood of
making future credit payments on time. A well-known example is the FICO® Score.
1.2 Business Analytics
Business analytics (BA) refers to the skills, technologies, and practices for continuous iterative
exploration and investigation of past business performance to gain insight and drive business
planning. Business analytics makes extensive use of statistical analysis, including descriptive
and predictive modeling, and fact-based management to drive decision making. In recent
years, prescriptive modeling has also taken a role in BA. It is therefore closely related to
management science and operations research. Business analytics can answer questions such as:
Why is this happening? What if these trends continue? What will happen next (that is,
prediction)? What is the best that can happen (that is, optimization)?
Benefits of Business Analytics:
Business analytics has numerous benefits, regardless of the size of your company or the
industry in which it works. One of the greatest benefits is that it allows your company to
prepare for the unexpected. Business analytics can forecast future patterns in an
organization's sales, earnings, and other vital indicators by modelling current trends.
Businesses can now notice changes that may occur annually, seasonally, or on any scale,
giving them the opportunity to prepare and plan ahead. To prepare for a slow season, you
may need to cut back on spending or engage in fresh marketing strategies. Larger
organisations may find it easier to estimate order volume and waste with business analytics.
Your company can also use business analytics to test new marketing tactics. You can better
analyse the impact of your advertising campaigns on different audiences and demographics
since business analytics gives data about customer behaviour. If you discover that a customer
is less likely to return, you can also explore delivering targeted deals to reclaim their
business.
Challenges of Business Analytics
For beginners, you'll have the most success with business analytics if everyone in your
organization is on board with its adoption and implementation. It will always require top leadership buy-in and
a defined corporate vision. It can be tough to get everyone in high management to agree on a
business analytics plan, so make sure to pitch business analytics as a complement to existing
strategies. This should also include specific, quantifiable objectives to assist people who are
hesitant to accept business analytics benefits. In addition to executive ownership, business
analytics necessitates IT involvement, i.e., the appropriate IT infrastructure and tools to
handle the data. For business analytics to be genuinely successful, business and IT teams
must collaborate. Make sure you have the necessary project management software in place to
apply predictive models and take an agile approach while you're at it.
It's critical to stay committed to the end outcome during the early months of an analytics
project. Stay committed, even if the cost of analytics software is significant and the return on
investment isn't instant. Over time, the analytical models will improve, and predictions will
become more accurate. A company that fails to make it through the investment period is
likely to abandon the concept altogether.
Components of Business Analytics
 Data Aggregation
o Data must first be obtained, sorted, and filtered, either through volunteered data or
transactional records before it can be analysed.
 Data Mining
o To detect trends and establish links, data mining for business analytics filters
through enormous datasets using databases, statistics, and machine learning.
 Association and Sequence Identification
o The identification of predictable activities that are carried out in conjunction with
other acts or in a sequential order
 Text Mining
o For qualitative and quantitative analysis, examines and organises big, unstructured
text databases.
 Forecasting
o Analyses historical data from a given time period in order to create educated
predictions about future occurrences or behaviours.
 Predictive Analytics
o Predictive business analytics employs a number of statistical techniques to build
models that extract data from datasets, discover patterns, and provide a score for a
variety of organisational outcomes.
 Optimization
o Businesses can use simulation tools to test out best-case scenarios once patterns
have been discovered and predictions have been made.
 Data Visualization
o It provides visual representations of data, such as charts and graphs, to make data
analysis simple and rapid.

1.3 Business Analytics Types


Business analytics comprises descriptive, predictive, and prescriptive analytics (with
diagnostic analytics often treated as a fourth type); these are generally understood as
descriptive modeling, predictive modeling, and prescriptive modeling.
1.3.1 Descriptive analytics
Descriptive models quantify relationships in data in a way that is often used to
classify customers or prospects into groups. Unlike predictive models that focus on
predicting a single customer behavior (such as credit risk), descriptive models identify
many different relationships between customers or products. Descriptive analytics
provides simple summaries about the sample audience and about the observations that
have been made. Such summaries may be either quantitative, i.e. summary statistics,
or visual, i.e. simple-to-understand graphs. These summaries may either form the
basis of the initial description of the data as part of a more extensive statistical
analysis, or they may be sufficient in and of themselves for a particular investigation.
 Key Characteristics:
 Answers the question: “What happened?”
 Uses historical data
 Provides insights into past performance or behavior
 Helps in reporting and monitoring
 Techniques:
 Data aggregation
 Data mining (Association, Classification, Clustering)
 Data visualization (e.g., charts, dashboards)
 Basic statistics (mean, median, variance); see the sketch at the end of this subsection
 Tools
 Excel: For basic data analysis and visualization.
 Tableau: For advanced data visualization and dashboard creation.
 Power BI: For interactive data visualization and business intelligence.
 SQL: For querying and managing databases.
 Applications
 Business Reporting: Generating regular reports on sales, revenue, and other key
performance indicators (KPIs).
 Customer Segmentation: Analyzing customer data to identify different segments
and their characteristics.
 Market Analysis: Understanding market trends and consumer behavior.
 Operational Efficiency: Monitoring and improving business processes.
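
To make the descriptive techniques listed above concrete, here is a minimal R sketch of data
aggregation and basic statistics. It assumes the dplyr and ggplot2 packages and uses the
built-in diamonds dataset purely for illustration:

library(dplyr)
library(ggplot2)   # provides the diamonds dataset
data(diamonds)

# Aggregate the data and compute basic statistics of price within each cut
diamonds %>%
  group_by(cut) %>%
  summarize(
    n            = n(),
    mean_price   = mean(price),
    median_price = median(price),
    var_price    = var(price)
  )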

1.3.2 Predictive analytics


Predictive analytics encompasses a variety of statistical techniques from modeling,
machine learning, and data mining that analyze current and historical facts to make
predictions about future, or otherwise unknown, events. Predictive models are models
of the relation between the specific performance of a unit in a sample and one or more
known attributes or features of the unit. The objective of the model is to assess the
likelihood that a similar unit in a different sample will exhibit the specific
performance. This category encompasses models that are in many areas, such as
marketing, where they seek out subtle data patterns to answer questions about
customer performance, such as fraud detection models.
 Techniques
o Regression Analysis: Modeling the relationship between dependent and
independent variables.
o Time Series Analysis: Analyzing data points collected or recorded at specific
time intervals.
o Machine Learning: Using algorithms to learn from data and make
predictions.
o Classification and Clustering: Grouping data into categories or clusters
based on similarities.
 Tools
o R: For statistical computing and graphics.
o Python: For machine learning and data analysis libraries like scikit-learn and
TensorFlow.
o SAS: For advanced analytics, business intelligence, and data management.
o IBM SPSS: For statistical analysis and predictive modeling.
 Applications
o Risk Management: Predicting potential risks and their impact on business
operations.
o Customer Retention: Identifying customers at risk of churning and
developing retention strategies.
o Sales Forecasting: Estimating future sales based on historical data.
o Healthcare: Predicting disease outbreaks and patient outcomes
1.3.3 Prescriptive analytics
Prescriptive analytics not only anticipates what will happen and when it will happen,
but also why it will happen. Further, prescriptive analytics suggests decision options
on how to take advantage of a future opportunity or mitigate a future risk and shows
the implication of each decision option. Prescriptive analytics can continually take in
new data to re-predict and re-prescribe, thus automatically improving prediction
accuracy and prescribing better decision options. Prescriptive analytics ingests hybrid
data, a combination of structured (numbers, categories) and unstructured data (videos,
images, sounds, texts), and business rules to predict what lies ahead and to prescribe
how to take advantage of this predicted future without compromising other priorities.
 Techniques
 Optimization: Finding the best solution from a set of feasible options.
 Simulation: Modeling complex systems to evaluate different scenarios.
 Decision Analysis: Assessing and comparing different decision options.
 Machine Learning: Using algorithms to learn from data and make
recommendations.
 Tools
 Gurobi: For mathematical optimization.
 IBM ILOG CPLEX: For optimization and decision support.
 AnyLogic: For simulation modeling.
 MATLAB: For numerical computing and optimization.
 Applications
 Supply Chain Optimization: Improving inventory management and logistics.
 Revenue Management: Setting optimal pricing strategies.
 Healthcare: Recommending personalized treatment plans.
 Finance: Optimizing investment portfolios and risk management strategies
1.3.4 Diagnostic Analytics
Diagnostic analytics is a type of analytics that focuses on identifying the root causes
of problems or anomalies in data. It helps to answer questions like "Why did this
happen?" or "What's causing this issue?"
 Techniques
 Data mining: Use techniques like clustering, decision trees, and regression
analysis.
 Correlation analysis: Examine relationships between variables.
 Drill-down analysis: Analyze data at a detailed level to identify specific causes.
 Tools
 Statistical software: R, Python, SAS, SPSS, and JMP.
 Data visualization tools: Tableau, Power BI, D3.js, and Matplotlib.
 Data mining software: Weka, RapidMiner, and KNIME.
 Applications
 Troubleshooting: Identify causes of problems in business processes or systems.
 Quality control: Analyze defects or issues in products or services.
 Performance optimization: Identify areas for improvement in business
performance.
 Benefits
 Improved problem-solving: Diagnostic analysis helps identify root causes.
 Data-driven decision-making: Provides insights that inform decision-making.
 Increased efficiency: Helps organizations optimize processes and improve
performance.

1.4 Applications of Predictive Analytics


Predictive analytics has become a vital tool across industries, enabling organizations to make
smarter, data-driven decisions. By analyzing historical data and identifying patterns, it
empowers businesses to anticipate future behaviors and outcomes. The applications of
predictive analytics span various domains—from marketing and finance to healthcare and
operations—making it an essential part of modern strategic planning.

1. Analytical Customer Relationship Management (CRM):

Predictive analytics helps businesses build a complete view of each customer by analyzing
data across departments. It supports marketing, sales, and customer service by predicting
buying habits, identifying high-demand products, and flagging potential issues before
customers churn. CRM uses these insights throughout the customer lifecycle—from
acquisition to win-back.

2. Clinical Decision Support Systems:

In healthcare, predictive analytics is used to identify patients at risk for conditions like heart
disease or diabetes. It powers decision support systems that assist clinicians in making
informed, timely medical decisions at the point of care, improving patient outcomes.

3. Collection Analytics:

Financial institutions use predictive analytics to identify which delinquent customers are most
likely to repay. This allows better targeting of collection efforts, reducing costs and
increasing recovery rates by optimizing collection strategies for each customer.

4. Cross-Sell:

Organizations use predictive analytics to find patterns in customer behavior, enabling them to
offer additional, relevant products to existing customers. This increases profitability per
customer and strengthens long-term relationships.

5. Customer Retention:

Predictive analytics helps detect early signs of customer churn by analyzing past behavior,
service usage, and spending. Companies can act proactively with personalized offers to retain
customers and combat silent attrition—where usage decreases slowly over time.

6. Direct Marketing:
Predictive analytics identifies the best product offers, timing, and communication channels
for each customer. It lowers marketing costs by targeting only the most promising leads with
customized campaigns.

7. Fraud Detection:
Businesses apply predictive models to flag fraudulent activities like identity theft, fake
transactions, and false claims. These models assign risk scores to accounts or activities,
allowing early detection and prevention of fraud across industries, including finance,
insurance, and tax.

8. Portfolio, Product, or Economy-Level Prediction:

Predictive analytics can forecast trends not just for consumers but also for products,
companies, industries, or entire economies. Techniques like time series analysis help
organizations plan for future demand, inventory, or market changes.

9. Risk Management:

Approaches such as the capital asset pricing model (CAPM) and probabilistic risk assessment
(PRA) use predictive analytics to forecast financial returns and project risks. These methods
support informed decision-making by simulating possible outcomes, improving both
short-term and long-term strategic planning.

10. Underwriting:

Businesses assess future risk by analyzing historical data on claims, payments, or defaults.
Predictive analytics helps determine appropriate pricing for insurance premiums, loans, or
credit approvals, streamlining decision-making and reducing exposure to financial loss.

1.5 Analytics Techniques


Regression Techniques

1. Linear Regression:
Linear regression models the relationship between a dependent variable and one or
more independent variables using a linear equation. It aims to minimize the difference
between observed and predicted values using the ordinary least squares (OLS)
method, making it suitable for continuous outcomes (a short R sketch for linear and
logistic regression follows this list).
2. Discrete Choice Models:
These models are used when the dependent variable is categorical rather than
continuous. They include logistic, multinomial logit, and probit models, which are
more appropriate than linear regression when dealing with binary or multi-category
outcomes.
3. Logistic Regression:
This technique models binary outcomes (e.g., yes/no) by estimating the probability of
an event occurring. It transforms the binary dependent variable into a continuous scale
for analysis and evaluates model performance using goodness-of-fit tests.
4. Multinomial Logistic Regression:
Used when the outcome has more than two unordered categories (e.g., red, blue,
green), this model helps avoid loss of information that can result from collapsing
variables into binary form. It’s ideal for complex multi-class classification tasks.
5. Probit Regression:
Probit models assume a normal distribution of the error term, making them useful for
modeling binary or proportional outcomes, especially in economics. While similar to
logistic regression, probit models are preferred when normality is assumed in the
latent variable.
6. Time Series Models:
Time series models handle data indexed over time by capturing trends, seasonality,
and autocorrelation. Techniques like ARMA, ARIMA, ARCH, and GARCH help
forecast future values and are essential in finance and economics.
7. Survival (Duration) Analysis:
This technique is used to analyze time-to-event data, especially when the event hasn't
occurred for all subjects (censoring). It models hazard rates and survival probabilities,
often using distributions like Weibull or exponential, and is widely used in medicine
and engineering.
8. Classification and Regression Trees (CART):
CART builds decision trees by splitting data based on variables that best differentiate
outcomes. It’s a non-parametric approach used for both classification and regression,
with extensions like Random Forests enhancing performance.
9. Multivariate Adaptive Regression Splines (MARS):
MARS is a flexible, non-parametric regression method that uses piecewise linear
regressions and basis functions to model complex relationships. It deliberately overfits at
first and then prunes terms back to find an optimal model, making it suitable for
high-dimensional data.
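
As referenced in the linear regression item above, here is a minimal R sketch of linear and
logistic regression using base R and the built-in mtcars dataset (chosen purely for
illustration):

data(mtcars)

# Linear regression (OLS): model a continuous outcome (fuel efficiency)
lin_fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(lin_fit)

# Logistic regression: model a binary outcome (transmission type, coded 0/1)
log_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
head(predict(log_fit, type = "response"))   # predicted probabilities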

Machine Learning Techniques

10. Neural Networks:


Neural networks model complex, non-linear relationships using layers of
interconnected nodes. They learn from training data and are widely used in prediction,
classification, and control tasks across diverse fields.
11. Multilayer Perceptron (MLP):
MLP is a type of neural network with one or more hidden layers that uses
backpropagation to learn. It’s effective for modeling non-linear relationships between
inputs and outputs.
12. Radial Basis Functions (RBF):
RBF networks use distance-based activation functions, typically Gaussian, in the
hidden layer. They are efficient for interpolation and avoid local minima issues found
in some other neural networks.
13. Support Vector Machines (SVM):
SVMs classify or regress data by finding an optimal separating hyperplane, often
using kernel tricks for non-linear data. They are powerful for tasks requiring high
accuracy and work well in high-dimensional spaces.
14. Naïve Bayes:
This classification technique assumes independence between predictors and applies
Bayes' theorem to estimate class probabilities. It’s simple, scalable, and effective,
especially with high-dimensional data.
15. k-Nearest Neighbors (k-NN):
k-NN classifies new instances by finding the majority class among the ‘k’ closest data
points in the training set. It’s a simple yet powerful non-parametric method that adapts
well to new data without strong assumptions (a short R sketch follows this list).
16. Geospatial Predictive Modeling:
This technique uses spatial data and environmental variables to predict the likelihood
of event occurrences in geographic space. It’s valuable in fields like urban planning,
archaeology, and environmental science.
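
As noted in the k-NN item above, a minimal sketch follows. It assumes the class package and
uses the built-in iris data purely for illustration:

library(class)
data(iris)

set.seed(1)
idx   <- sample(nrow(iris), 100)              # simple train/test split
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# Classify each test flower by the majority class among its k = 5 nearest neighbours
pred <- knn(train, test, cl = iris$Species[idx], k = 5)
table(predicted = pred, actual = iris$Species[-idx])   # confusion matrix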

1.6 Business Analytics Tools

1. scikit-learn
A powerful Python library built on NumPy and SciPy, offering simple and efficient
tools for data mining and machine learning. It supports a wide range of supervised
and unsupervised learning algorithms, making it ideal for both beginners and experts.
2. KNIME
An open-source data analytics platform that enables visual workflows for data
manipulation, analysis, and modeling. It integrates with various programming
languages and tools, making it versatile for advanced analytics.
3. Orange
A Python-based data visualization and machine learning tool that features an
interactive, drag-and-drop interface. It is excellent for teaching, prototyping, and
performing advanced analytics without heavy coding.
4. R
A statistical programming language widely used for data analysis, visualization, and
predictive modeling. It boasts thousands of packages and strong community support
for various analytics tasks.
5. RapidMiner
Although it offers a commercial version, its open-source edition allows visual
workflow development for data preparation, modeling, and validation. It is known for
its easy-to-use interface and wide range of integrated tools.
6. Weka
A Java-based suite of machine learning tools, ideal for educational and research
purposes. It includes a variety of algorithms for classification, regression, clustering,
and visualization.
7. IBM SPSS Statistics & Modeler
SPSS provides advanced statistical analysis and Modeler adds machine learning and
text analytics. Both are popular in academic and enterprise settings for data-driven
decision-making.
8. MATLAB
A high-performance language for numerical computing, MATLAB is used for
algorithm development and data analysis. It supports a wide range of toolboxes for
predictive modeling and machine learning.
9. SAP
SAP integrates predictive analytics within its business intelligence and ERP systems.
It offers both automated and expert-driven modeling tools.
10. SAS & SAS Enterprise Miner
SAS is a comprehensive suite for data analysis, with Enterprise Miner offering
advanced modeling and visualization capabilities. It's a favorite in finance, healthcare,
and government sectors.
11. STATISTICA
A predictive analytics and data mining software developed by TIBCO. It offers visual
workflows, advanced modeling, and integration with R and Python.
1.7 Business Analytics Models

 Predictive Models estimate the likelihood of future outcomes based on known data
patterns. They analyze the relationship between a unit’s attributes and its behaviour using a
training sample. These models are commonly used in real-time applications like fraud
detection or credit scoring. Predictive models are especially useful when applied to new,
unseen data (out-of-sample) to forecast behaviour. They also support advanced simulations,
such as predicting authorship or analyzing crime scenes.

What is predictive modelling?


Predictive modelling is a process used in data science to create a mathematical model that
predicts an outcome based on input data. It involves using statistical algorithms and machine
learning techniques to analyze historical data and make predictions about future or unknown
events.
In predictive modelling, the goal is to build a model that can accurately predict the target
variable (the outcome we want to predict) based on one or more input variables (features).
The model is trained on a dataset that includes both the input variables and the known
outcome, allowing it to learn the relationships between the input variables and the target
variable.
Once the model is trained, it can be used to make predictions on new data where the target
variable is unknown. The accuracy of the predictions can be evaluated using various metrics,
such as accuracy, precision, recall, and F1 score, depending on the nature of the problem.
Predictive modelling is used in a wide range of applications, including sales forecasting,
risk assessment, fraud detection, and healthcare. It can help businesses make informed
decisions, optimize processes, and improve outcomes based on data-driven insights.
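
As a minimal illustration of evaluating a trained model's predictions, the common metrics
mentioned above can be computed from a confusion matrix in base R. The labels below are toy
values used only to show the calculations:

# Toy predicted vs. actual class labels (1 = positive, 0 = negative)
actual    <- factor(c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0))
predicted <- factor(c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0))

cm <- table(predicted, actual)                  # rows = predicted, columns = actual
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["1", "1"] / sum(cm["1", ])      # TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm[, "1"])      # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)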

Purpose:
Forecasting: Predicting future trends, such as sales, stock prices, or weather patterns.
Risk Assessment: Identifying potential risks, like fraud, credit risk, or cybersecurity threats.
Resource Optimization: Predicting resource needs, such as staffing, inventory, or energy
consumption.
Customer Behavior Analysis: Understanding customer preferences, predicting purchase
patterns, and personalizing marketing efforts.

Inputs:
Historical Data: Past data relevant to the prediction, such as sales figures, customer
demographics, or website traffic.
Statistical Data: Statistical information about the data, such as mean, median, standard
deviation, and correlation.
Machine Learning Algorithms: Algorithms like regression, classification, or neural networks
are used to analyze the data and make predictions.
Features: Specific attributes or variables within the data that are used as inputs to the model.

Outputs:
Predicted Values: The numerical or categorical values predicted by the model, such as sales
forecasts, risk scores, or customer segments.
Probabilities: The likelihood of a certain event occurring, like the probability of a customer
defaulting on a loan.
Recommendations: Suggested actions based on the predictions, like recommending products
to a customer or optimizing a marketing campaign.
 Descriptive Models focus on identifying patterns and relationships within data to group or
classify subjects, like customers or products. Unlike predictive models, they don't forecast
behaviour but instead reveal hidden structures, such as customer segmentation based on
preferences. These models are widely used in marketing and customer analytics. They help
understand the "what" and "why" behind behaviours, not just the "what might happen."
Descriptive modeling also serves as a foundation for building more complex simulations.

Descriptive modeling is typically used to produce correlations, cross-tabulations, frequencies,
and similar summaries. These techniques are used to determine similarities in the data and to
find existing patterns. Another application of descriptive analysis is to identify meaningful
subgroups within the bulk of the available data. This type of analytics emphasizes the
summarization and transformation of data into meaningful information for reporting and
monitoring.

Examples of descriptive data mining include clustering, association rule mining, and anomaly
detection. Clustering involves grouping similar objects together, while association rule
mining involves identifying relationships between different items in a dataset. Anomaly
detection involves identifying unusual patterns or outliers in the data.

Purpose:
Understanding and Explanation: Descriptive models help in understanding the structure,
behavior, and relationships within a system.
Documentation and Communication: They can be used to document processes, facilitate
communication among stakeholders, and train new users.
Analysis and Improvement: By visualizing the system, descriptive models can help identify
areas for improvement or optimization.

Inputs:
Data: Descriptive models rely on data about the system being modeled. This can include
information about components, their attributes, relationships, and interactions.
Knowledge: Domain expertise and understanding of the system's context are crucial for
creating accurate and meaningful descriptive models.

Outputs:
Descriptions: These can be textual descriptions, diagrams, flowcharts, or other visual
representations of the system's components and their relationships.
Summaries: Descriptive models can produce summaries of key characteristics, patterns, or
trends within the data.
Visualizations: Charts, graphs, and other visual aids help in understanding the system's
behavior and relationships.
Documentation: Descriptive models can be used to create documentation for processes,
systems, or architectures.

 Decision Models connect all parts of a decision-making process: inputs (data), decisions,
and their possible outcomes. They are designed to optimize decisions by balancing multiple
objectives (e.g., maximizing profit while minimizing risk). These models use results from
predictive models to inform action. Businesses rely on decision models to automate and
improve complex decision-making through rule-based systems. Their strength lies in
simulating and evaluating the impact of different decisions in varying scenarios.
Decision models are structured approaches to making choices by defining inputs, processing
them, and producing outputs that guide actions. They serve to simplify complex decisions,
clarify decision-making processes, and improve the consistency and quality of outcomes.
Inputs can be predetermined factors, time-varying factors, data, decision variables, or
uncontrollable variables. The output is typically a recommended action, optimal offer, or a set
of evaluated options.

Purpose:
Simplify complex decisions: Decision models break down complex problems into
manageable components, making it easier to understand and analyze different options.
Improve decision quality: By providing a structured framework, decision models help to
ensure that decisions are based on relevant information and criteria.
Increase consistency: Decision models can be used repeatedly to make similar decisions,
leading to more consistent outcomes over time.
Facilitate collaboration: Decision models can be shared and understood by multiple
stakeholders, promoting better communication and collaboration in the decision-making
process.

Inputs:
Predetermined factors: These are factors that are fixed or relatively stable, such as initial
policies, application constraints, or available technology.
Time-varying factors: These are factors that change over time, such as market conditions,
customer behavior, or competitor actions.
Data: This includes relevant information about the situation, such as costs, benefits, risks, or
performance metrics.
Decision variables: These are factors that are under the control of the decision-maker, such as
the amount of resources to allocate, the price to charge, or the target market to pursue.
Uncontrollable variables: These are factors that affect the decision but are outside the control
of the decision-maker, such as economic conditions, weather patterns, or government
regulations.

Outputs:
Recommended actions: Decision models can suggest specific courses of action based on the
inputs and the model's logic.
Optimal offers: In a marketing or sales context, decision models can identify the most
appropriate offers to present to customers based on their profiles and behavior.
Evaluated options: Decision models can rank or score different options based on their
potential outcomes, helping the decision-maker to choose the best alternative.
Decision rules: These are specific rules that are triggered by certain inputs and produce
corresponding outputs, guiding the decision-making process.
Visualization of the decision-making process: Decision models can visually represent the
steps involved in making a decision, making it easier to understand the logic behind the
outcome.

1.8 Data Pre-processing and Model Tuning


Data transformations: individual predictors, multiple predictors; dealing with missing values;
removing, adding, and binning predictors; computing; model tuning; data splitting;
resampling.
1.8 Data Pre-processing
Data pre-processing is the process of transforming raw data into a clean, structured, and
usable format for analysis and machine learning. It involves cleaning, transforming, and
organizing data to improve its quality and make it suitable for algorithms. This crucial step
helps in addressing issues like missing values, noise, and inconsistencies, ultimately leading
to more accurate and efficient results.
Key aspects of data pre-processing include:
 Data Cleaning: This involves handling missing values (imputation or removal), removing
duplicates, and addressing outliers.
 Data Transformation: This includes scaling numerical features, encoding categorical
variables, and normalizing data.
 Data Reduction: Techniques like dimensionality reduction can be applied to reduce the
number of features while preserving important information.
 Data Integration: Combining data from multiple sources into a unified dataset.
 Feature Engineering: Creating new features from existing ones to improve model
performance.

Why is data preprocessing important?


 Improved Data Quality: Addressing missing values, inconsistencies, and noise ensures
that the data is reliable and accurate.
 Enhanced Model Performance:
Preprocessed data allows machine learning models to train more effectively and achieve
higher accuracy.
 More Efficient Analysis:
Clean and structured data facilitates faster and more efficient analysis.
 Better Insights:
Preprocessing helps reveal hidden patterns and relationships in the data, leading to more
meaningful insights.

What is Data Cleaning?

Data cleaning is the process of detecting and rectifying faults or inconsistencies in a dataset
by removing or modifying them so that the data meets the quality standards required for
analysis. It is an essential activity in data preprocessing because it determines how the data
can be used and processed in subsequent modeling steps.
The importance of data cleaning lies in the following factors:
 Improved data quality: Cleaning the data reduces errors, inconsistencies, and missing values,
making the data more accurate and reliable for analysis.
 Better decision-making: Consistent, clean data gives organizations comprehensive and
up-to-date information, reducing the risk of decisions based on outdated or incomplete data.
 Increased efficiency: High-quality data is easier to analyze, model, and report on, and clean
data avoids much of the time and effort that goes into handling poor data quality.
 Compliance and regulatory requirements: Industries and regulatory authorities set standards
on data quality; data cleaning helps organizations conform to these standards and avoid
penalties and legal risks.
Navigating Common Data Quality Issues in Analysis and Interpretation

Data quality issues can have various origins, including human error, technical input failures,
and problems that arise when merging data. Some common data quality issues include:
 Missing values: Missing or unrecorded information can prevent sound conclusions and may
lead to biased results.
 Duplicate data: Duplicate records can introduce conflicting values within the dataset and
produce skewed results.
 Incorrect data types: Fields containing values of the wrong data type (for instance, string
values stored in a numeric field) can hamper analysis and cause inaccuracies.
 Outliers and anomalies: Outliers are observations whose values are unusually high or low
compared with the rest of the dataset; they can distort analyses and statistical results.
 Inconsistent formats: Discrepancies such as differing date formats or capitalization can
present challenges when bringing data together.
 Spelling and typographical errors: Analyses that depend on text fields can misinterpret or
miscategorize records when keys are misspelled or contain typos.
Common Data Cleaning Tasks
Data cleaning involves several key tasks, each aimed at addressing specific issues within a
dataset. Here are some of the most common tasks involved in data cleaning:
1. Handling Missing Data
Missing data is a common problem in datasets. Strategies to handle missing data include:
 Removing Records: Deleting rows with missing values if they are relatively few and
insignificant.
 Imputing Values: Replacing missing values with estimated ones, such as the mean,
median, or mode of the dataset.
 Using Algorithms: Employing advanced techniques like regression or machine
learning models to predict and fill in missing values.
2. Removing Duplicates
Duplicates can skew analyses and lead to inaccurate results. Identifying and removing
duplicate records ensures that each data point is unique and accurately represented.
3. Correcting Inaccuracies
Data entry errors, such as typos or incorrect values, need to be identified and corrected. This
can involve cross-referencing with other data sources or using validation rules to ensure data
accuracy.
4. Standardizing Formats
Data may be entered in various formats, making it difficult to analyze. Standardizing formats,
such as dates, addresses, and phone numbers, ensures consistency and makes the data easier
to work with.
5. Dealing with Outliers
Outliers can distort analyses and lead to misleading results. Identifying and addressing
outliers, either by removing them or transforming the data, helps maintain the integrity of the
dataset.
1. Ignore the tuple: Remove the data points with missing values
 If the number of cases with missing values is extremely small (say, less than 5%), you may
drop or omit those records from the analysis.
2. Fill in the missing values manually:
 This approach is rarely advisable: when the dataset is large, filling in missing values
manually is time-consuming and not feasible.
3. Use a global constant to fill in the missing values:
 Replace all missing attribute values by the same constant such as a label like “Unknown” or
−∞.
 If missing values are replaced by, say, “Unknown,” then the mining program or the machine
learning algorithm may mistakenly think that they form an interesting concept, since they all
have a value in common—that of “Unknown.” Hence, although this method is simple, it is
not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing
value:
 Measures of central tendency indicate the “middle” value of data distribution(For example:
Mean, Median, and Mode).
 If the attribute is numeric:
1. For normal (symmetric) data distributions, the mean can be used.
2. For skewed data distributions, the median should be used.
 If the attribute is categorical, replace the missing values of that variable with its mode.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple:
 For example, if classifying customers according to credit risk, we may replace the missing
value with the mean income value for customers in the same credit risk category as that of the
given tuple.
 If the data distribution for a given class is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value:
 This may be determined with regression, inference-based tools using a Bayesian formalism,
or decision tree induction.
 For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
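A minimal R sketch of strategies 1 and 4 above, using a small hypothetical data frame (the
column names and values are illustrative only):

df <- data.frame(income = c(50, 60, NA, 55, 300),
                 risk   = c("low", "low", "high", NA, "low"))

# Strategy 1: ignore the tuple (drop rows containing any missing value)
complete_rows <- na.omit(df)

# Strategy 4: fill numeric attributes with the median, categorical with the mode
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)
mode_risk <- names(which.max(table(df$risk)))   # most frequent category
df$risk[is.na(df$risk)] <- mode_risk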
What is noise? Noise is a random error or variance in a measured variable. Basic statistical
description techniques (e.g., boxplots and scatter plots) and methods of data visualization
can be used to identify outliers, which may represent noise.
Given a numeric attribute such as, say, price, how can we “smooth” out the data to remove
the noise? Let’s look at the following data smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that
is, the values around it. The sorted values are distributed into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
Figure 3.2 illustrates some binning techniques. In this example, the data for price are first
sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three
values). In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced
by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value. In general, the larger the width, the greater the effect of the smoothing.

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
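
A minimal R sketch of the equal-frequency binning and smoothing by bin means shown above,
using base R only:

price <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
bins  <- split(price, rep(1:3, each = 3))        # three equal-frequency bins of size 3

# Smoothing by bin means: replace every value in a bin by that bin's mean
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
smoothed   # 9 9 9 22 22 22 29 29 29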
Regression: Data smoothing can also be done by regression, a technique that conforms data
values to a function. Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other. Multiple linear regression is
an extension of linear regression, where more than two attributes are involved and the data
are fit to a multidimensional surface. Regression is further described in Section 3.4.5.
Outlier analysis: Outliers may be detected by clustering, for example, where similar values
are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of
clusters may be considered outliers.

1.8 Data Transformations


Data Transformation of predictor variables may be needed for several reasons. Some
modeling techniques may have strict requirements, such as the predictors having a common
scale. In other cases, creating a good model may be difficult due to specific characteristics of
the data (e.g., outliers). Here we discuss centering, scaling, and skewness transformations.
Having a sense of how the data are distributed, from visual or quantitative summaries, we can
consider transformations of variables to ease both the interpretation of data analyses and the
application of statistical and machine learning models to a dataset.

1.8.1 Data Transformations for Individual Predictors


Centering and scaling
A very common and important transformation is to scale data to a common unit-less scale.
Informally, you can think of this as transforming variables from whatever units they are
measured (e.g., diamond depth percentage) into “standard deviations away from the mean”
units (actually called standard units, or z-scores). Given data x = x1, x2, …, xn, the
transformation applied to obtain the centered and scaled variable z is

z_i = (x_i − x̄) / sd(x)

where x̄ is the mean of x, and sd(x) is its standard deviation.


library(ggplot2)
library(dplyr)   # needed for %>% and mutate()
data(diamonds)

diamonds %>%
  mutate(scaled_depth = (depth - mean(depth)) / sd(depth)) %>%
  ggplot(aes(x = scaled_depth)) +
  geom_histogram(binwidth = 0.5)

Treating categorical variables as numeric


Many modeling algorithms work strictly on numeric measurements. For example, we will see
methods to predict some variable given values for other variables such as linear regression or
support vector machines, that are strictly defined for numeric measurements. In this case, we
would need to transform categorical variables into something that we can treat as numeric.
We will see more of this in later sections of the course but let’s see a couple of important
guidelines for binary variables (categorical variables that only take two values, e.g.,
health_insurance).
One option is to encode one value of the variable as 1 and the other as 0. For instance:
library(ISLR)
library(tidyverse)
data(Wage)

Wage %>%
  mutate(numeric_insurance = ifelse(health_ins == "1. Yes", 1, 0)) %>%
  head()
Skewed Data
In many data analyses, variables will have a skewed distribution over their range. In the last
section we saw one way of defining skew using quartiles and the median. Variables with skewed
distributions can be hard to incorporate into some modeling procedures, especially in the
presence of other variables that are not skewed. In this case, applying a transformation to
reduce skew will improve the performance of models. A common quantitative summary is the
sample skewness statistic,

skewness = Σ(x_i − x̄)³ / ((n − 1) · v^(3/2)),   with   v = Σ(x_i − x̄)² / (n − 1),

where x is the predictor variable, n is the number of values, and x̄ is the sample mean of the
predictor.
Also, skewed data may arise when measuring multiplicative processes. This is very common
in physical or biochemical processes. In this case, interpretation of the data may be more
intuitive after a transformation.
library(nycflights13)   # provides the flights dataset used below
flights %>% ggplot(aes(x = dep_delay)) + geom_histogram(binwidth = 30)

## Warning: Removed 8255 rows containing non-finite values (stat_bin).
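
The departure-delay distribution above is strongly right-skewed. As a minimal sketch of a
skew-reducing transformation (continuing with the same nycflights13 and tidyverse packages),
a log transform of the positive delays yields a much more symmetric distribution:

library(nycflights13)
library(tidyverse)

flights %>%
  filter(!is.na(dep_delay), dep_delay > 0) %>%   # keep positive delays for the log
  mutate(log_delay = log(dep_delay)) %>%
  ggplot(aes(x = log_delay)) +
  geom_histogram(binwidth = 0.25)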

1.8.2 Data Transformation in Multiple Predictors


1. Transformations to Resolve Outliers
These transformations act on groups of predictors, typically the entire set under
consideration. Of primary importance are methods to resolve outliers and reduce the
dimension of the data.
Outliers are generally defined as data points that lie unusually far from the main distribution
of the data. While formal statistical definitions exist under certain assumptions, in practice
they are often identified visually. Before treating values as outliers, it is essential to confirm
that the measurements are valid (e.g., ensuring physiological values like blood pressure
remain positive) and that no data entry or recording errors are present. Caution is especially
important with small datasets, where apparent outliers may actually reflect skewness in the
distribution rather than invalid samples. In some cases, unusual points may represent an
under-sampled but meaningful subpopulation—for example, a distinct group within the
population being studied. If these data points form a valid “cluster,” they should not be
hastily removed, as they may hold valuable information.
Certain predictive models are inherently resistant to outliers. For instance:
 Tree-based models split data into regions using logical rules, so extreme points
rarely dominate the model.
 Support Vector Machines (SVMs) often ignore distant samples when forming the
decision boundary, reducing the influence of outliers.
For models that are more sensitive to outliers, transformations can help. One such approach is
the spatial sign transformation (Serneels et al., 2006). This technique projects each sample
onto the surface of a multidimensional sphere, ensuring all points lie at the same distance
from the origin. Practically, each data point is divided by its squared norm, after centering
and scaling the predictors. Unlike simple centering or scaling, this is a group transformation
—removing predictors after applying it can cause issues.
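
A minimal sketch of the spatial sign transformation, assuming the caret package (which
provides a spatialSign() helper) and using a few mtcars columns purely for illustration:

library(caret)

x    <- as.matrix(mtcars[, c("mpg", "wt", "hp")])   # illustrative numeric predictors
x_cs <- scale(x, center = TRUE, scale = TRUE)       # center and scale first
x_ss <- spatialSign(x_cs)                           # project each sample onto the unit sphere
head(x_ss)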

Data reduction techniques are another class of predictor transformations. These methods
reduce the data by generating a smaller set of predictors that seek to capture a majority of the
information in the original variables. In this way, fewer variables can be used that provide
reasonable fidelity to the original data. For most data reduction techniques, the new
predictors are functions of the original predictors; therefore, all the original predictors are still
needed to create the surrogate variables. This class of methods is often called signal
extraction or feature extraction techniques.

PCA Principle Component Analysis:

PCA is a widely used data reduction technique that seeks to find linear combinations of the
predictors, called principal components (PCs), which capture the maximum possible
variance in the data. The first principal component (PC1) is defined as the linear
combination of the predictors that explains the greatest amount of variability among all
possible linear combinations. Subsequent components (PC2, PC3, …) are constructed in such
a way that they capture the maximum remaining variability while being uncorrelated with all
previously derived components.

Mathematically, the j-th principal component can be expressed as a linear combination of the
predictors,

PC_j = a_j1 × (Predictor 1) + a_j2 × (Predictor 2) + … + a_jP × (Predictor P),

where the coefficients a_j1, a_j2, …, a_jP are known as component weights (loadings). These
weights indicate which predictors are most important for each principal component.
Example Illustration (Figure 3.5)
In the segmentation dataset, two highly correlated predictors—average pixel intensity and
entropy of intensity values—carry redundant information. PCA transforms them into two
new uncorrelated components:
 PC1 explains 97% of the variability,
 PC2 explains 3% of the variability.
Thus, PC1 alone is sufficient for representing most of the information.
Key Advantages of PCA
1. PCA produces uncorrelated components, improving model stability and
performance in algorithms that prefer low-correlation predictors.
2. It helps reduce dimensionality by summarizing multiple correlated predictors into
fewer independent components.
However, PCA must be applied with caution. Since it only considers variance in predictors
and not the response variable, it may capture patterns irrelevant to the modeling goal.
Moreover, because PCA emphasizes predictors with larger scales (e.g., income in dollars vs
height in feet), it is essential to center and scale variables to avoid dominance by
measurement units.
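
A minimal sketch of PCA in base R, centering and scaling the predictors so that no variable
dominates because of its measurement units (the mtcars columns are illustrative only):

x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]   # illustrative predictors

pca <- prcomp(x, center = TRUE, scale. = TRUE)
summary(pca)          # proportion of variance explained by each component
head(pca$x[, 1:2])    # scores on the first two principal components (PC1, PC2)
pca$rotation          # loadings: contribution of each predictor to each PC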
Practical Considerations in PCA

1. Preprocessing
o Predictors are often on different scales or skewed.
o Solution: Transform skewed predictors, then center and scale before applying
PCA.
2. Limitations
o PCA is unsupervised, meaning it ignores the response variable.
o If predictor variance does not align with predictive relationships, PCA may not
improve modeling.
o Alternative: Use Partial Least Squares (PLS) when the response variable
must be considered.
3. Choosing Number of Components
o Use a scree plot (variance explained vs component number).
o Select the number of components at the elbow point, where variance
contribution levels off.
o Cross-validation can also determine the optimal number of PCs.

Using PCA for Data Exploration


1. Visual Inspection
o Scatter plots of the first few PCs can reveal clusters, class separation, or
outliers when observations are colored by class labels.
2. Interpretation
o If PCA captures sufficient information, the plots may show clear separation of
classes.
o If classes overlap heavily, PCA offers weak separation.
3. Scaling Issues
o Higher PCs explain less variance and have smaller ranges.
o Example: In Figure 3.5, PC1 ranges from −3.7 to 3.4, while PC2 ranges only
from −1 to 1.1.
o Different axis scales can lead to over-interpretation of low-variance PCs.
4. Preprocessing Before PCA
o Skewed predictors can distort results.
o Applying transformations (e.g., Box–Cox), followed by centering and scaling,
ensures fair contribution of all predictors.
o For example, in the segmentation dataset, 44 predictors were transformed prior
to PCA.

Variance Explained by PCs (Figure 3.6)

 PC1 explains 14%, PC2 explains 12.6%, and PC3 explains 9.4% of the variance.
 Together, the first three PCs account for about 36% of total variance.
 After PC4, the variance contribution drops sharply.
 Even with 4 PCs, only 42.4% of the total data information is captured, highlighting
that many predictors still contain unexplained variability.

PCA Scatter Plot Matrix (Figure 3.7)

Scatter plots of the first three PCs, with points colored by class (segmentation quality),
provide insights into separation:

 Some separation between classes is visible in PC1 vs PC2.


 The distribution of well-segmented cells is mostly contained within the poorly-
segmented cells, showing weak separation.
 No strong outliers are detected, though a few points fall outside the main cluster.
 This indicates that cell types are not easily separated by PCA alone.
PCA Loadings (Figure 3.8)
 Loadings are coefficients that show how much each predictor contributes to a
component.
o Loadings close to 0 indicate little contribution.
o Large positive or negative loadings indicate strong influence.
 In the segmentation dataset:
o PC1 is strongly influenced by channel 1 (cell body) predictors.
o PC3 is mainly associated with channel 3 (actin & tubulin) predictors.
o Channel 3 contributes little to PC1, while the cell body plays a minor role in
PC3.
 Importantly, predictors that explain more variance (e.g., cell body) are not always the
best predictors of the response variable (segmentation quality).

1.9 Dealing with Missing Values


1. Nature of Missing Data
 Structurally missing: values that are impossible by definition (e.g., number of
children a man has given birth to).
 Randomly missing: values that were not measured or not recorded.
 Informative missingness: when the pattern of missing values relates to the outcome.
o Example: Patients dropping out of a study due to side effects of a drug.
o Example: Online ratings—people with extreme opinions (love/hate) are more
likely to rate.
2. Difference: Missing vs. Censored Data
 Missing data: no information is available.
 Censored data: the exact value is unknown, but partial information exists.
o Example:
 Lab tests below detection limit → known to be less than threshold.
 Movie rental duration if the customer hasn’t returned the movie → at
least as long as the current duration.
 Treatment:
o In traditional inference models → censoring is modeled formally.
o In predictive modeling → censored values are often treated as missing or
replaced by thresholds/randomized within limits.
3. Handling Missing Data
 Removing missing data:
o Works for large datasets if missingness is not informative.
o Risky for small datasets → loss of valuable information.
 Using models that handle missing values directly:
o Example: Tree-based methods (discussed in Chapter 8).
 Imputation (filling in missing values):
o Builds a model to predict missing values from other predictors.
o Acts as “a model within a model.”
o Must be incorporated into resampling to get fair estimates of performance.
4. Imputation Approaches
1. K-Nearest Neighbors (KNN) Imputation
o For a missing value, find the k most similar samples in training data.
o Take the average of their values to impute.
o Pros: Imputed values remain within observed range.
o Cons: Needs full training set at prediction time; requires choosing k and
distance metric.
o Robust across parameter settings and missing proportions.
2. Regression-Based Imputation
o Use a correlated predictor to estimate the missing value.
o Example: Cell perimeter (missing) predicted from cell fiber length (high
correlation = 0.99).

5. Example: Cell Segmentation Dataset (Figure 3.9)


 Experiment:
o 50 cell perimeter values in the test set were set as missing.
o Two imputation methods were applied:
1. 5NN model → correlation between imputed & real values = 0.91
2. Linear regression model (using cell fiber length) → correlation = 0.85
 Visualization (Fig. 3.9):
o X-axis: original (true) values (centered & scaled).
o Y-axis: imputed values.
o Dotted line = perfect agreement.
o 5NN plot: points closely aligned to diagonal → better accuracy.
o Linear model plot: more spread around diagonal → slightly less accurate.
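A minimal sketch of KNN imputation with caret's preProcess() is shown below; the small data frame and its missing values are fabricated purely for illustration.

library(caret)

# Fabricated numeric data with one predictor partially missing
set.seed(123)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$x3 <- 0.9 * toy$x1 + rnorm(100, sd = 0.1)   # x3 is highly correlated with x1
toy$x3[sample(100, 10)] <- NA                   # introduce 10 missing values

# "knnImpute" centers and scales the data, then fills each missing value
# using the k nearest samples (k = 5 by default)
imputeModel <- preProcess(toy, method = "knnImpute")
toyImputed  <- predict(imputeModel, toy)
sum(is.na(toyImputed$x3))                       # 0 after imputation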
1.10 Removing Predictors

Removing predictors before building a model can be beneficial for several reasons. First,
reducing the number of predictors decreases computational time and simplifies the overall
model. Second, when two predictors are highly correlated, they essentially capture the same
underlying information. In such cases, removing one of them usually does not affect model
performance and can even result in a simpler and more interpretable model. Third, certain
predictors with poor statistical properties can negatively impact model performance or
stability. By eliminating these problematic variables, the quality of the model can improve
significantly.

A common issue arises with zero variance predictors, which are variables that take on only
a single unique value across all samples. These predictors provide no useful information. For
some models, such as tree-based methods, zero variance predictors are harmless because they
will never be used for splitting. However, in models like linear regression, these variables can
cause serious computational errors. Since they do not contribute any information, they can be
safely discarded.

Similarly, near-zero variance predictors can also cause problems. These are variables that
contain very few unique values, often with one value dominating most of the data. For
instance, in a text mining example, after removing common stop words, a keyword count
variable may only appear in a small number of documents. Suppose in a dataset of 531
documents, a particular keyword is absent in 523 documents, appears twice in six documents,
three times in one document, and six times in one document. In this case, 98% of the data
corresponds to zero occurrences, while only a handful of documents contain the keyword.
This makes the predictor highly imbalanced and prone to exerting undue influence on the
model. Furthermore, if resampling is used, there is a risk that some resampled datasets may
contain no occurrences of the keyword at all, leaving the predictor with a single unique value.

To diagnose near-zero variance predictors, two criteria are often used. First, the fraction of
unique values relative to the total sample size should be examined. If this fraction is very
small (for example, below 10%), the predictor may be problematic. Second, the ratio of the
most frequent value to the second most frequent value should be considered. If this ratio is
very large (for example, greater than 20), it indicates a strong imbalance. In the document
example, the ratio of zero occurrences (523) to two occurrences (6) is 87, which clearly
signals an imbalance.

Therefore, when both conditions are met and the model being used is sensitive to such issues,
it is often advantageous to remove the near-zero variance predictor to improve model
performance and stability.
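Both diagnostics can be computed directly, or obtained from caret's nearZeroVar(); the sketch below reuses the keyword-count example from the text.

library(caret)

# Keyword counts: 523 zeros, six 2s, one 3, one 6 (531 documents in total)
keyword <- c(rep(0, 523), rep(2, 6), 3, 6)

# Criterion 1: percentage of unique values relative to the sample size
length(unique(keyword)) / length(keyword) * 100   # about 0.75% -- very low

# Criterion 2: frequency ratio of the most common to the second most common value
523 / 6                                           # about 87 -- highly imbalanced

# nearZeroVar() applies both rules (defaults: freqCut = 95/5, uniqueCut = 10)
nearZeroVar(data.frame(keyword), saveMetrics = TRUE)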

Between-Predictor Correlations

Collinearity occurs when two predictor variables are strongly correlated with each other,
while multicollinearity refers to the case where multiple predictors have strong
interrelationships. For instance, in cell segmentation data, several predictors measure cell
size, such as perimeter, width, and length, while others describe cell morphology, like
roughness.
A correlation matrix helps visualize predictor relationships, where colors indicate the strength
and direction of correlations. Clustering techniques group correlated predictors, often forming
“blocks” of related variables. When many predictors exist, Principal Component Analysis
(PCA) can summarize correlations, with loadings showing which predictors contribute to
each component.

Highly correlated predictors are undesirable because they add redundancy, increase model
complexity, and can cause instability or poor performance in linear regression. The Variance
Inflation Factor (VIF) is commonly used to detect multicollinearity, though it has limitations,
such as being specific to linear models and not identifying which predictors to remove.

A simpler heuristic approach involves eliminating predictors with high correlations until all
pairwise correlations fall below a chosen threshold. The process is:

1. Compute the correlation matrix.


2. Identify the pair of predictors with the largest absolute correlation.
3. Compute the average correlation of each predictor with the others.
4. Remove the predictor with the higher average correlation.
5. Repeat until all correlations are below the threshold.

For example, if a threshold of 0.75 is set, predictors with correlations above 0.75 are
gradually removed. In the segmentation dataset, this method recommended removing 43
predictors.
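caret's findCorrelation() implements this heuristic; the sketch below assumes segData holds the numeric segmentation predictors prepared as in the Computing section later in this unit.

library(caret)

correlations <- cor(segData)                       # pairwise correlation matrix
highCorr <- findCorrelation(correlations, cutoff = 0.75)
length(highCorr)                                   # number of predictors flagged for removal
filteredSegData <- segData[, -highCorr]            # drop them before modeling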

Feature extraction methods such as PCA can also reduce the effects of collinearity by
creating new surrogate predictors. However, this makes interpretation more difficult and,
since PCA is unsupervised, the new predictors may not necessarily relate well to the outcome
variable.

1.11 Adding Predictors

When a predictor is categorical (e.g., gender, race), it is usually decomposed into a set of more specific binary variables. For example, in the credit scoring dataset (Sect. 4.5), a predictor
indicates the amount of money in an applicant’s savings account. These savings values are
grouped into categories, including an “unknown” group.

Table 3.2 shows the groups, the number of applicants in each group, and how they are
converted into dummy variables. A dummy variable is a binary (0/1) indicator for each
category. Usually, one dummy variable is created for each category. However, in practice,
only (number of categories – 1) dummy variables are needed, since the last one can be
inferred from the others.

If a model includes an intercept term (like linear regression), including all dummy variables
creates a problem: they always sum to one, making them redundant with the intercept. To
avoid numerical issues, one dummy variable is dropped. But in models that are insensitive to
this issue, using all dummy variables may improve interpretation.
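The sketch below contrasts the full set of dummy variables with the (categories – 1) encoding, using a made-up savings-group factor.

library(caret)

savings <- data.frame(group = factor(c("low", "medium", "high", "unknown")))

# One dummy variable per category (each row sums to one)
allDummies <- dummyVars(~ group, data = savings)
predict(allDummies, savings)

# Full-rank encoding drops one category, avoiding redundancy with an intercept
lessOne <- dummyVars(~ group, data = savings, fullRank = TRUE)
predict(lessOne, savings)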

Many modern models automatically capture nonlinear relationships between predictors and
outcomes. However, simpler models do not, unless the user explicitly specifies nonlinear
terms.

For example, logistic regression typically creates linear classification boundaries. Figure
3.11 illustrates this:

 The left panel shows boundaries using only linear terms for predictors A and B.
 The right panel shows boundaries when a quadratic term (B²) is added.

Adding nonlinear terms (like squares or interactions) allows logistic regression to model
more complex decision boundaries without resorting to very complex techniques, which may
risk overfitting.
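For instance, a quadratic term can be added to a logistic regression formula as in this sketch, where df is a hypothetical data frame holding a binary factor Class and numeric predictors A and B.

# Linear class boundary: only linear terms for A and B
linearFit <- glm(Class ~ A + B, data = df, family = binomial)

# Curved class boundary: add a quadratic term for B (as in the right panel of Fig. 3.11)
quadFit   <- glm(Class ~ A + B + I(B^2), data = df, family = binomial)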

Additionally, Forina et al. (2009) proposed another approach: calculating class centroids (the center of the predictor values for each class). For each sample, the distance to each class centroid is computed, and these distances are added as new predictors in the model. This technique can enhance classification by incorporating richer information.

1.12 Binning predictors

Binning predictors refers to the process of converting continuous variables into categories or
groups before analysis. For example, in diagnosing Systemic Inflammatory Response
Syndrome (SIRS), clinical measurements such as temperature, heart rate, respiratory rate, and
white blood cell count are categorized into ranges to simplify diagnosis. The main advantages
of this approach are that it creates simple decision rules that are easy to interpret, the modeler
does not need to know the exact relationship between predictors and outcomes, and it can
improve survey response rates since people often find it easier to answer questions in ranges
rather than exact values.

However, manual binning of predictors has several drawbacks. First, it reduces model
performance because modern predictive models are capable of learning complex relationships
that are lost when data is simplified into bins. Second, it causes a loss of precision in
predictions because fewer categories limit the number of possible outcomes, leading to
oversimplified results. Third, research has shown that binning can increase the rate of false
positives by incorrectly identifying noise variables as important. Overall, although binning
may improve interpretability, it usually comes at the cost of predictive accuracy, which can
even be unethical in sensitive fields such as medical diagnosis where accuracy is critical.

Issues with Manual Binning

1. Loss of model performance – modern models can capture complex predictor–


outcome relationships better than pre-binned data.
2. Loss of precision – fewer categories = limited combinations = oversimplified
predictions.
3. Higher false positives – research (Austin & Brunner, 2004) shows that binning can
incorrectly identify noise variables as important.
4. Trade-off between interpretability and accuracy – simple bins may look
interpretable but significantly reduce predictive accuracy.
5. Ethical concerns – in sensitive domains (like medical diagnosis), reducing predictive
accuracy for the sake of simplicity can be harmful.

Key Distinction

 Manual Binning (not recommended): Arbitrarily categorizing predictors before


model building.
 Model-Based Binning (recommended in some cases):
o Methods like Decision Trees and MARS (Multivariate Adaptive
Regression Splines) automatically create cut points.
o They use all predictors simultaneously, optimize for accuracy, and are
statistically sound.
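As a small illustration of model-based binning, a depth-one decision tree chooses its own cut point on a continuous predictor; the iris data is used here only as a stand-in example.

library(rpart)

# A single-split tree "bins" Petal.Length, but the cut point is chosen
# from the data to maximize class purity rather than being set by hand
treeFit <- rpart(Species ~ Petal.Length, data = iris,
                 control = rpart.control(maxdepth = 1))
treeFit                       # prints the data-driven split, e.g., Petal.Length < 2.45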

COMPUTING

1. Data Source
o Raw data: segmentationOriginal from AppliedPredictiveModeling package.
o Use only training samples (Case == "Train").
o Remove ID columns (Cell, Class, Case) and unnecessary "Status" columns.
2. Checking Skewness
o Use skewness() from e1071 package to detect skewed predictors.
3. Transformations
o Use Box-Cox transformation (BoxCoxTrans in caret) to normalize skewed
features.
o Apply multiple transformations (BoxCox, center, scale, pca) using preProcess().
4. Principal Component Analysis (PCA)
o Use prcomp() for PCA.
o Extract variance explained and loadings.
o preProcess(method = "pca") can also apply PCA automatically.
5. Filtering Predictors
o Remove near-zero variance predictors: nearZeroVar().
o Remove highly correlated predictors: findCorrelation() with a cutoff (e.g.,
0.75).
o Visualize correlation matrix with corrplot().
6. Handling Missing Values
o Use preProcess() (KNN/bagged tree imputation) or impute.knn() from impute.
7. Creating Dummy Variables
o Use dummyVars() in caret to encode categorical variables.
o Interaction terms can be added in formulas (e.g., Mileage:Type).

🖥️Sample R Program
# Load libraries
library(AppliedPredictiveModeling)
library(caret)
library(e1071)
library(corrplot)

# Load data
data(segmentationOriginal)
segData <- subset(segmentationOriginal, Case == "Train")

# Save ID fields separately and remove them


cellID <- segData$Cell
class <- segData$Class
case <- segData$Case
segData <- segData[, -(1:3)]

# Remove binary "Status" columns


statusColNum <- grep("Status", names(segData))
segData <- segData[, -statusColNum]

# --- Skewness check ---


skewValues <- apply(segData, 2, skewness)
head(skewValues)

# Example Box-Cox transformation


Ch1AreaTrans <- BoxCoxTrans(segData$AreaCh1)
predict(Ch1AreaTrans, head(segData$AreaCh1))

# --- PCA transformation ---


pcaObject <- prcomp(segData, center = TRUE, scale. = TRUE)
percentVariance <- pcaObject$sdev^2 / sum(pcaObject$sdev^2) * 100
head(percentVariance)

# --- Preprocess: BoxCox + center + scale + PCA ---


trans <- preProcess(segData, method = c("BoxCox", "center", "scale", "pca"))
transformed <- predict(trans, segData)
head(transformed[, 1:5])

# --- Filtering ---


# Near-zero variance predictors
nzv <- nearZeroVar(segData)
print(nzv)

# Correlation filtering
correlations <- cor(segData)
corrplot(correlations, order = "hclust")
highCorr <- findCorrelation(correlations, cutoff = 0.75)
filteredSegData <- segData[, -highCorr]

# --- Dummy variables example (car data) ---


data("cars", package = "caret") # Alternative: carSubset dataset
# Suppose we use Mileage + Type
simpleMod <- dummyVars(~Mileage + Type, data = carSubset, levelsOnly = TRUE)
predict(simpleMod, head(carSubset))

✅ This code covers:

 Loading and cleaning data


 Skewness + Box-Cox
 PCA with preProcess
 Filtering (NZV + correlations)
 Dummy variable creation

Problem of Over-Fitting
Overfitting and underfitting are two fundamental problems encountered when training
machine learning models, both indicating an imbalance in model complexity relative to the
underlying data patterns.

1.14 Overfitting
Overfitting occurs when a model learns the training data too well, including its noise and
random fluctuations, rather than the underlying generalizable patterns. This results in
excellent performance on the training data but poor performance on new, unseen data. It is
analogous to memorizing answers for a specific exam without understanding the subject
matter, leading to failure on a different exam.
Causes of Overfitting:

 Excessive model complexity: Using a model that is too intricate for the given data,
allowing it to capture noise.
 Insufficient training data: Not having enough data to represent the true underlying
patterns, causing the model to learn specific instances.

Underfitting:
Underfitting occurs when a model is too simple to capture the underlying patterns and
relationships within the training data. This results in poor performance on both the training
data and new, unseen data. It is like trying to explain a complex phenomenon with a highly
simplified theory, missing crucial details.
Causes of Underfitting:
 Insufficient model complexity: Using a model that is too simple to represent the
complexity of the data.
 Lack of relevant features: Not including enough informative features in the data to
allow the model to learn effectively.
 Insufficient training: Not training the model for enough iterations or with enough
data for it to learn the patterns. [1, 2]

Addressing Overfitting and Underfitting:


The goal is to find a "just right" model complexity that balances bias (underfitting) and
variance (overfitting), allowing the model to generalize well to new data. Strategies
include:
 Regularization: Techniques like L1/L2 regularization to penalize overly complex
models.
 Cross-validation: Using a validation set to monitor performance on unseen data
during training and detect overfitting.
 Feature engineering: Adding or removing features to improve model learning.
 Ensemble methods: Combining multiple models to reduce variance (overfitting).
 Adjusting model complexity: Choosing a more or less complex model architecture
as needed.

1.15 Model Tuning


Model tuning optimizes a machine learning model’s hyperparameters to obtain the best predictive performance. The process involves adjusting these settings until the optimal set of hyperparameter values is found, resulting in improved accuracy, generation quality, and other performance metrics. Because model tuning identifies a model’s optimal hyperparameters, it is also known as hyperparameter optimization or, alternatively, hyperparameter tuning.
What are hyperparameters?
Hyperparameters are configuration settings chosen before training that control how a model
learns (e.g., learning rate, batch size) or its structure (e.g., number of layers). They cannot be
learned from data and must be tuned for good model performance.

How does model tuning work?


Model tuning is the process of finding the best hyperparameter settings for a model. For
simple models, this can be done manually, but for complex models like transformers with
many possible combinations, automated search methods are used to efficiently explore the
most promising configurations.

Model tuning methods


 Grid search
 Random search
 Bayesian optimization
 Hyperband
Grid search is a brute-force hyperparameter tuning method where the model is trained and
validated on every possible combination of hyperparameters within a predefined search
space. It ensures comprehensive coverage but is slow and computationally expensive,
making it less practical for large models or wide parameter ranges.

Random search selects hyperparameter values randomly from a defined distribution instead
of testing all possible combinations. It is faster and more efficient than grid search and often
finds good solutions quickly, but it does not guarantee the absolute best configuration
since many possibilities are skipped.

Bayesian optimization tunes hyperparameters by learning from previous trials. It builds a


probabilistic model (surrogate function) of the objective function and uses it to predict
promising hyperparameter values. Over time, it becomes more efficient by focusing on
high-performing areas and avoiding poor ones. This method, also called Sequential Model-
Based Optimization (SMBO), is more resource-efficient than grid or random search.

Hyperband speeds up hyperparameter tuning by combining random search with early


stopping. It allocates resources to many configurations at first, then repeatedly eliminates the
worst-performing half (“successive halving”), focusing only on the most promising ones.
This makes it faster and more efficient at finding the best configuration compared to testing
all options.
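A minimal grid-search sketch using caret's train() is shown below; the iris data and a k-nearest neighbors model are used purely for illustration.

library(caret)

set.seed(100)
knnGrid <- expand.grid(k = seq(1, 21, by = 2))        # candidate values of k
ctrl    <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

knnTuned <- train(Species ~ ., data = iris,
                  method     = "knn",
                  preProcess = c("center", "scale"),
                  tuneGrid   = knnGrid,
                  trControl  = ctrl)
knnTuned$bestTune        # value of k with the best resampled accuracy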

Model Tuning vs. Model Training (Summary):

 Model Tuning – The process of finding the best hyperparameters (e.g., learning rate,
layers, batch size) before training.
 Model Training – The process where the algorithm learns from data by adjusting
weights and biases to minimize the loss function using optimization methods like
gradient descent.
 Training seeks convergence (minimized loss), which may reach a local minimum
rather than the global one, but that is often sufficient for good performance.

Common hyperparameters in neural networks (like LLMs):

 Learning Rate – Controls how fast weights are updated; high = faster but risk of
overshooting, low = stable but slower.
 Learning Rate Decay – Gradually reduces the learning rate over time to improve
convergence.
 Epochs – Number of times the entire training dataset is passed through the model.
 Batch Size – Number of training samples processed before weights are updated.
 Momentum – Adds inertia to updates; high momentum speeds convergence but may
skip minima, low momentum slows progress.
 Number of Hidden Layers – More layers → higher capacity for complex tasks,
fewer layers → simpler, faster models.
 Nodes per Layer – More nodes widen a layer, capturing complex relationships but
increasing computation.
 Activation Function – Defines how nodes fire (e.g., ReLU, sigmoid, tanh) and helps
the network learn non-linear patterns.
1.16 Data Splitting
Now that we have outlined the general procedure for finding optimal tuning parameters, we
turn to discussing the heart of the process: data splitting.
A few of the common steps in model building are:
• Pre-processing the predictor data
• Estimating model parameters
• Selecting predictors for the model
• Evaluating model performance
• Fine tuning class prediction rules (via ROC curves, etc.)
Given a fixed amount of data, the modeler must decide how to “spend” their data points to
accommodate these activities.
Purpose
 To evaluate the model honestly, we must use data that was not used for training or
tuning.
 Ensures the model’s performance estimate is unbiased.
🔹 Training vs Test Set
 Training set: Builds the model.
 Test set: Estimates final model performance.
 If the dataset is small, holding out a separate test set may be an inefficient use of the data.
🔹 Pitfall: Single Test Set
 Validation on a single test set can be unreliable and lead to high variance.
 Resampling methods are often preferred.
🔹 Sampling Strategies
1. Simple Random Sampling
o Splits data randomly.
o Might cause class imbalance in train/test split.
2. Stratified Sampling
o Ensures outcome class distribution is preserved in both sets.
o Useful in classification problems with imbalanced classes.
3. Maximum Dissimilarity Sampling
o Selects test samples that are most dissimilar from training data.
o Ensures test set samples represent diverse areas of predictor space.
o Fig. 4.5 shows how distant samples are selected for the test set.
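A stratified split can be produced with caret's createDataPartition(), as in this sketch (again using iris only for illustration).

library(caret)

set.seed(1)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainData  <- iris[trainIndex, ]
testData   <- iris[-trainIndex, ]

# Class proportions are preserved in both subsets
table(trainData$Species) / nrow(trainData)
table(testData$Species)  / nrow(testData)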

1.17 Resampling Techniques


Resampling helps estimate true model performance without requiring a separate test set.
1. ✅ k-Fold Cross-Validation
 Data is split into k equal parts (folds).
 Model is trained on (k–1) folds and tested on the remaining fold.
 Process repeated k times, each fold used as test once.
 Final performance = Average of all k test performances.
 Common choices for k: 5 or 10.
 Stratified k-Fold: Ensures outcome class balance in each fold.
 Repeated k-Fold: Repeats the k-fold procedure several times to improve stability.
Trade-offs:
 Higher k: lower bias, but more computational cost.
 Lower k: faster, but can have high bias and variance.
2. 🔄 Leave-One-Out Cross-Validation (LOOCV)
 Special case of k-fold with k = total samples.
 Each sample used once as test data.
 Accurate but very slow (needs n models).
 Comparable results to 10-fold CV in large datasets.
3. 🔂 Repeated Training/Test Splits (Monte Carlo CV)
 Data is randomly split multiple times into training and test sets.
 More repetitions = More stable estimates.
 Good rule: 75–80% for training, rest for testing.
 Needs 25–200 repetitions for reliable results.
4. 🔁 Bootstrap Method
 Random sampling with replacement.
 Model trained on sampled data; non-selected points (out-of-bag) used to test.
 Around 63.2% of original data appears in each bootstrap sample.
Variants:
 .632 Bootstrap:
o Combines bootstrap and apparent error rate:
 0.632 × Bootstrap error + 0.368 × Apparent error
o Helps reduce bias.
 .632+ Bootstrap:
o Adjusts further for over-fitting.
o Useful for small datasets or highly flexible models.
📌 Comparison Summary of Resampling Methods
Method          | Bias        | Variance      | Computation | Comments
k-Fold CV       | Low (as k↑) | Moderate–High | Moderate    | Best trade-off
LOOCV           | Very Low    | High          | Very High   | Not scalable
Repeated Splits | Moderate    | Moderate–Low  | Moderate    | Useful in larger sets
Bootstrap       | Moderate    | Low           | Moderate    | Better for small sets
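In caret, each of these resampling schemes is requested through trainControl() and passed to train(); a sketch of the corresponding specifications follows.

library(caret)

cvCtrl    <- trainControl(method = "repeatedcv", number = 10, repeats = 5)  # repeated 10-fold CV
loocvCtrl <- trainControl(method = "LOOCV")                                 # leave-one-out CV
mcCtrl    <- trainControl(method = "LGOCV", p = 0.8, number = 50)           # repeated training/test splits
bootCtrl  <- trainControl(method = "boot632", number = 50)                  # .632 bootstrap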
The purpose of data splitting in model building is to ensure that a machine learning model
can generalize well to unseen data and to prevent overfitting. This is achieved by dividing the
available dataset into distinct subsets, typically:
 Training Set:
This subset is used to train the model, allowing it to learn patterns and relationships
within the data.
 Validation Set (Optional but Recommended):
This subset is used during the model development phase to tune hyperparameters and
select the best-performing model architecture. It helps in preventing overfitting to the
training data.
 Test Set:
This independent subset is held out and only used to evaluate the final model's
performance on new, unseen data. This provides an unbiased estimate of how the model
will perform in a real-world scenario.
Key reasons for data splitting:
 Preventing Overfitting:
Overfitting occurs when a model learns the training data too well, including its noise and
idiosyncrasies, leading to poor performance on new data. Data splitting, especially with a
separate test set, helps identify and mitigate overfitting by evaluating the model's ability
to generalize.
 Assessing Generalization Capability:
The test set provides a realistic evaluation of the model's ability to make accurate
predictions on data it has not encountered during training.
 Hyperparameter Tuning and Model Selection:
The validation set allows for iterative refinement of model parameters and selection of
the optimal model without "peeking" at the final test set performance, thus avoiding data
leakage.
 Simulating Real-World Performance:
By using a separate test set, the process simulates how the model would perform when
deployed and exposed to entirely new data in a production environment.
