
Features – Transformation, Construction, and Selection

1. Introduction to Features in Machine Learning

In Machine Learning, "features" are individual measurable properties or characteristics of a phenomenon being observed.

The quality and relevance of features profoundly impact the performance of any
machine learning model.

Why are features so important?

 Model Performance: Good features enable models to learn underlying patterns more effectively, leading to higher accuracy, precision, recall, etc.
 Interpretability: Well-defined features can make models more interpretable
and understandable.
 Computational Efficiency: Relevant and optimized features can reduce
model training and inference time.
 Curse of Dimensionality: Irrelevant or redundant features can lead to the
"curse of dimensionality," making it harder for models to generalize.

2. Kinds of Features

Features can be categorized based on their data type and nature.

2.1. Numerical Features

These features represent quantities and can be either continuous or discrete.

 Continuous Features: Can take any value within a given range. They often
represent measurements.
o Examples: Temperature, height, weight, price, age.
 Discrete Features: Can only take specific, distinct values. They often
represent counts or distinct categories that have an inherent order.
o Examples: Number of rooms in a house (integers), number of
children, ratings on a scale of 1 to 5.
2.2. Categorical Features

These features represent categories or labels, rather than numerical quantities. They can be nominal or ordinal.

 Nominal Features: Represent categories without any intrinsic order or ranking.
o Examples: Color (Red, Blue, Green), City (Mumbai, Delhi,
Bangalore), Gender (Male, Female).
 Ordinal Features: Represent categories with a meaningful order or ranking.
o Examples: Education Level (High School, Bachelor's, Master's,
PhD), Satisfaction Rating (Low, Medium, High), T-shirt Size (S, M,
L, XL).

2.3. Boolean Features (Binary Features)

These are a special type of categorical feature that can only take one of two values,
typically representing true/false, yes/no, 0/1.

 Examples: Is_Spam (True/False), Has_Credit_Card (Yes/No), Is_Premium_User (0/1).

2.4. Date and Time Features

These features represent points in time or durations. While often numerical in their
raw form (e.g., timestamps), their temporal nature often requires special handling.

 Examples: Date of Birth, Transaction Timestamp, Duration of a Call.
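For illustration, the short pandas sketch below decomposes a raw timestamp into simpler usable features; the column name transaction_time and the sample dates are made up for this example.

import pandas as pd

# Hypothetical DataFrame with a raw timestamp column.
df = pd.DataFrame({
    "transaction_time": pd.to_datetime([
        "2024-01-15 09:30:00", "2024-06-03 18:45:00", "2024-12-24 23:10:00"
    ])
})

# Decompose the timestamp into simpler numerical and boolean features.
df["year"] = df["transaction_time"].dt.year
df["month"] = df["transaction_time"].dt.month
df["day_of_week"] = df["transaction_time"].dt.dayofweek   # Monday = 0
df["hour"] = df["transaction_time"].dt.hour
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)   # binary feature
print(df)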

Reference:

 VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media. (Chapter 2: Data Manipulation with Pandas)
3. Feature Transformation

Feature transformation is the process of converting raw features into a format that
is more suitable for machine learning models.

3.1. Scaling and Normalization

Many machine learning algorithms (e.g., gradient descent-based algorithms, K-Nearest Neighbors, SVMs) perform better when numerical input variables are scaled to a standard range.

Purpose: Prevents features with larger values from dominating the learning
process and helps algorithms converge faster.

 3.1.1. Min-Max Scaling (Normalization)
o Scales features to a fixed range, typically [0, 1].
o Formula: X_scaled = (X − X_min) / (X_max − X_min)
o Use Case: When you need features to be within a specific bounded range. Sensitive to outliers.
 3.1.2. Standardization (Z-score Normalization)
o Scales features to have a mean of 0 and a standard deviation of 1.
o Formula: X_scaled = (X − μ) / σ (where μ is the mean, σ is the standard deviation)
o Use Case: When the data has a Gaussian distribution or when
algorithms assume normally distributed data (e.g., linear regression,
logistic regression).
o Less sensitive to outliers than Min-Max scaling.
 3.1.3. Robust Scaling
o Scales features using statistics that are robust to outliers, such as the
median and interquartile range (IQR).
o Formula: X_scaled = (X − median(X)) / IQR(X)
o Use Case: When the dataset contains many outliers.
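A minimal scikit-learn sketch of the three scalers above, applied to a small made-up array with one obvious outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Toy data: a single feature with an obvious outlier (1000).
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# Min-Max scaling: maps values to [0, 1]; the outlier squashes the rest near 0.
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: zero mean, unit standard deviation.
print(StandardScaler().fit_transform(X).ravel())

# Robust scaling: centers on the median and scales by the IQR,
# so the outlier has far less influence on the bulk of the data.
print(RobustScaler().fit_transform(X).ravel())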

Reference:

 Raschka, S., & Mirjalili, V. (2019). Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2 (3rd ed.). Packt Publishing. (Chapter 4: Building Good Training Datasets)
3.2. Encoding Categorical Features

Machine learning models typically work with numerical data. Categorical features
need to be converted into numerical representations.

 3.2.1. One-Hot Encoding
o Converts each category into a new binary feature (0 or 1).
o For N categories, it creates N new features.
o Use Case: For nominal categorical features where no order is implied.
Prevents the model from incorrectly assuming an ordinal relationship.
o Example: 'City' with categories {Mumbai, Delhi, Chennai} becomes:
 City_Mumbai (0/1)
 City_Delhi (0/1)
 City_Chennai (0/1)
 3.2.2. Label Encoding (Ordinal Encoding)
o Assigns a unique integer to each category.
o Use Case: For ordinal categorical features where the assigned integers reflect the inherent order of the categories. Also used for target variables in classification.
o Example: 'Education_Level' with categories {High School,
Bachelor's, Master's, PhD} might become:
 High School: 0
 Bachelor's: 1
 Master's: 2
 PhD: 3
o Caution: If used for nominal features, the model might infer an
incorrect ordinal relationship.

 3.2.3. Binary Encoding
o A hybrid approach. Categories are first converted to integers, then
those integers are converted into binary code, and finally, the binary
digits are used as features.
o Can be useful for high-cardinality categorical features to reduce
dimensionality compared to one-hot encoding.
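A small sketch of one-hot and ordinal encoding with scikit-learn, using the 'City' and 'Education_Level' examples above; the tiny DataFrame is made up, and on scikit-learn versions before 1.2 the argument is sparse=False instead of sparse_output=False.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Chennai", "Delhi"],
    "Education_Level": ["Bachelor's", "High School", "PhD", "Master's"],
})

# One-hot encoding for the nominal 'City' feature.
ohe = OneHotEncoder(sparse_output=False)
city_encoded = ohe.fit_transform(df[["City"]])
print(ohe.get_feature_names_out())
print(city_encoded)

# Ordinal encoding for 'Education_Level', with the category order given explicitly.
order = [["High School", "Bachelor's", "Master's", "PhD"]]
oe = OrdinalEncoder(categories=order)
print(oe.fit_transform(df[["Education_Level"]]).ravel())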

Reference:

 Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. (Chapter 3: Data Preprocessing)
3.3. Log Transformation

 Purpose: Used to transform skewed numerical data (e.g., highly positively skewed distributions like income or population) into a more Gaussian-like distribution.
 Formula: X_transformed = log(X) (often log(X + 1) to handle zero values).
 Use Case: Improves model performance for algorithms sensitive to data
distribution (e.g., linear regression).
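As a quick illustration (the income values below are invented):

import numpy as np

income = np.array([20_000, 35_000, 50_000, 120_000, 1_500_000], dtype=float)

# np.log1p computes log(X + 1), which safely handles zero values.
income_log = np.log1p(income)
print(income_log)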

3.4. Power Transformations (e.g., Box-Cox, Yeo-Johnson)

 Purpose: A family of transformations that can make data more Gaussian-like. They automatically determine the best transformation parameter (λ).
 Box-Cox: Applies to positive data only.
 Yeo-Johnson: Can handle both positive and negative data.
 Use Case: When you need to normalize the distribution of a feature to
improve model performance, especially for linear models or models
assuming normality.
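A short scikit-learn sketch using synthetic, positively skewed data; the exponential sample and the shift applied before Yeo-Johnson are made up purely for illustration.

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(500, 1))        # skewed, strictly positive data

# Box-Cox requires strictly positive values; Yeo-Johnson also handles zero/negative values.
boxcox = PowerTransformer(method="box-cox")
yeojohnson = PowerTransformer(method="yeo-johnson")   # the default method

X_bc = boxcox.fit_transform(X)
X_yj = yeojohnson.fit_transform(X - 1.0)              # shifted data contains negative values

# The fitted lambdas_ attribute holds the estimated transformation parameter (λ).
print(boxcox.lambdas_, yeojohnson.lambdas_)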

Reference:

 James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to
Statistical Learning with Applications in R. Springer. (Chapter 3: Linear
Regression)
4. Feature Construction (Feature Engineering)

Feature construction, often called Feature Engineering, is the art of creating new
features from existing ones to improve the performance of machine learning
models.

It's highly domain-specific and often requires creativity and understanding of the
problem.

4.1. Polynomial Features

 Concept: Creates new features by raising existing features to a power or by multiplying features together to capture non-linear relationships.
 Example: For features X1, X2:
o Degree-2 polynomial features would include: X1, X2, X1², X2², X1·X2
 Use Case: When the relationship between features and the target variable is
non-linear.
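A minimal scikit-learn sketch of this expansion (the three sample rows are arbitrary):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two features X1 and X2 for three samples.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Degree-2 expansion produces X1, X2, X1^2, X1*X2, X2^2;
# include_bias=False drops the constant '1' column that is added by default.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out(["X1", "X2"]))
print(X_poly)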

4.2. Interaction Features

 Concept: Similar to polynomial features, but specifically focuses on the product of two or more existing features. This captures how features interact
with each other.
 Example:
o Age * Income (How does income vary with age?)
o Number_of_Rooms * Area_per_Room (Total area of a house)
 Use Case: When the effect of one feature on the target depends on the value
of another feature.
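One simple way to add such terms by hand with pandas; all column names and values below are invented, and PolynomialFeatures(interaction_only=True) can build them in bulk.

import pandas as pd

df = pd.DataFrame({
    "Age": [25, 40, 60],
    "Income": [30_000, 80_000, 50_000],
    "Number_of_Rooms": [2, 4, 3],
    "Area_per_Room": [150.0, 200.0, 180.0],
})

# Hand-crafted interaction features.
df["Age_x_Income"] = df["Age"] * df["Income"]
df["Total_Area"] = df["Number_of_Rooms"] * df["Area_per_Room"]
print(df)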

4.3. Aggregation Features

 Concept: Creating new features by aggregating information from multiple related rows or groups.
 Examples:
o From Time Series Data: Moving averages, min/max/std over a
window, daily/weekly aggregates.
o From Grouped Data: Average spending per customer, total number
of transactions per user.
 Use Case: When individual data points might not provide enough context,
but collective or historical trends do.
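A brief pandas sketch of both kinds of aggregation, using an invented transaction table and an invented daily series:

import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount": [100.0, 250.0, 50.0, 400.0, 150.0, 75.0],
})

# Grouped aggregates: one row of summary features per customer.
agg = transactions.groupby("customer_id")["amount"].agg(
    avg_spending="mean", total_transactions="count", max_amount="max"
)
print(agg)

# Time-series style aggregate: a 3-point moving average.
daily = pd.Series([10, 12, 9, 14, 20, 18], name="daily_sales")
print(daily.rolling(window=3).mean())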
4.4. Categorization / Binning (Discretization)

 Concept: Transforming continuous numerical features into categorical features by dividing them into bins or intervals.
 Examples:
o Age: (0-18, 19-30, 31-50, 51+)
o Income: (Low, Medium, High)
 Methods: Equal width binning, equal frequency binning, k-means binning.
 Use Case:
o To handle outliers (outliers fall into the first or last bin).
o To make a model more robust to small variations in numerical
features.
o To handle non-linear relationships by making them linear within each
bin.
o To work with models that prefer categorical inputs (e.g., decision
trees often handle bins well).
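A short pandas sketch of fixed-edge binning and equal-frequency binning (the age and income values are made up):

import pandas as pd

ages = pd.Series([5, 17, 22, 29, 35, 48, 67], name="Age")

# Custom bin edges and labels matching the age ranges above.
age_groups = pd.cut(
    ages,
    bins=[0, 18, 30, 50, 120],
    labels=["0-18", "19-30", "31-50", "51+"],
)
print(age_groups)

# Equal-frequency binning into three quantile-based buckets.
income = pd.Series([20_000, 25_000, 40_000, 60_000, 90_000, 300_000], name="Income")
print(pd.qcut(income, q=3, labels=["Low", "Medium", "High"]))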

4.5. Feature Extraction from Text Data (e.g., Bag-of-Words, TF-IDF)

 Concept: Converting unstructured text data into numerical features.
 Bag-of-Words (BoW): Represents text as a bag (multiset) of its words,
disregarding grammar and even word order, but keeping multiplicity.
Features are word counts.
 TF-IDF (Term Frequency-Inverse Document Frequency): Weighs words
by their frequency in a document (TF) and inversely by their frequency
across all documents (IDF). This highlights words that are important in a
specific document but not common everywhere.
 Use Case: Natural Language Processing (NLP) tasks like text classification,
sentiment analysis.
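A minimal scikit-learn sketch of both representations on three invented documents:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great story",
]

# Bag-of-Words: raw word counts per document.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: down-weights words (like 'the') that appear in every document.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))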

Reference:

 Müller, A. C., & Guido, S. (2017). Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media. (Chapter 4: Representing Data and Engineering Features)
5. Feature Selection

Feature selection is the process of choosing a subset of the most relevant features
for use in model construction.

It aims to reduce dimensionality, improve model performance, and reduce overfitting.

Why Feature Selection?

 Reduces Overfitting: Fewer irrelevant features mean the model is less likely to learn noise.
 Improves Accuracy: By removing misleading or noisy features.
 Reduces Training Time: Fewer features mean faster model training.
 Simplifies Models: Easier to interpret and understand.
 Mitigates the Curse of Dimensionality: Especially important for high-dimensional datasets.

5.1. Filter Methods

 Concept: Select features based on their intrinsic characteristics, without involving a machine learning model. They rank features based on statistical measures.
 Advantages: Computationally inexpensive, fast, and can be used as a pre-
processing step.
 Disadvantages: Ignores the interaction between features and the chosen ML
model.
 5.1.1. Variance Threshold
o Removes features with variance below a certain threshold. Features
with very low variance (or zero variance) provide little information.
o Use Case: Simple initial step to remove constant or near-constant
features.
 5.1.2. Correlation Coefficient (e.g., Pearson, Spearman)
o Pearson Correlation: Measures linear relationship between two
continuous variables.
o Spearman Correlation: Measures monotonic relationship (linear or
non-linear) between two ranked variables.
o Use Case: Remove highly correlated features. If two features are
highly correlated, one can often be dropped without significant loss of
information. Also, select features highly correlated with the target
variable.
 5.1.3. Chi-Squared Test (χ²)
o Purpose: Measures the independence between a categorical feature and a categorical target variable.
o Use Case: Selects features with the strongest relationship to the target. A higher χ² value indicates stronger dependence.
 5.1.4. ANOVA F-test (Analysis of Variance)
o Purpose: Tests whether the mean of a numerical feature differs significantly across the categories of a categorical target variable.
o Use Case: Selects features with the largest F-statistic, indicating a stronger difference in means across groups (categories).
 5.1.5. Mutual Information
o Purpose: Measures the dependency between two variables (can be
used for both numerical and categorical features with appropriate
estimation). It quantifies the amount of information obtained about
one random variable by observing the other.
o Use Case: Can capture non-linear relationships.
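A small scikit-learn sketch of these filter methods on the built-in Iris dataset; the variance threshold of 0.2 and k=2 are arbitrary choices for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, f_classif, mutual_info_classif
)

X, y = load_iris(return_X_y=True)

# Drop features whose variance falls below the chosen threshold.
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)
print(X_var.shape)

# Keep the two features with the highest ANOVA F-statistic w.r.t. the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)
print(selector.scores_.round(1))

# Mutual information can also capture non-linear dependence.
print(mutual_info_classif(X, y, random_state=0).round(2))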

Reference:

 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
(Chapter 7: Model Assessment and Selection)

5.2. Wrapper Methods

 Concept: Use a specific machine learning model to evaluate subsets of features. The feature selection process becomes a search problem.
 Advantages: Accounts for feature interactions, often leads to better model
performance.
 Disadvantages: Computationally expensive (trains a model for each feature
subset), can be prone to overfitting the feature selection process itself.
 5.2.1. Forward Selection
o Starts with no features. Adds the feature that gives the best
performance improvement at each step until no further improvement
is observed.
 5.2.2. Backward Elimination
o Starts with all features. Removes the feature that causes the least
performance degradation at each step until no further improvement is
observed.
 5.2.3. Recursive Feature Elimination (RFE)
o Concept: Recursively fits a model and removes the least important
feature (based on coefficients or feature importance scores) until the
desired number of features is reached.
o Use Case: Often used with models that provide feature importance
(e.g., linear models, tree-based models).
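A short RFE sketch with scikit-learn on the built-in breast cancer dataset; keeping 5 features is an arbitrary choice, and the features are standardized so the linear model converges easily.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Recursively fit the model and drop the least important feature until 5 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X_scaled, y)

print(rfe.support_)    # boolean mask over the original 30 features
print(rfe.ranking_)    # rank 1 marks the selected features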

Reference:

 Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection.
Artificial Intelligence, 97(1-2), 273-324. (Classic paper on wrapper methods)

5.3. Embedded Methods

 Concept: Feature selection is built into the model training process itself.
The model learns which features are most important during training.
 Advantages: Less computationally expensive than wrapper methods, more
accurate than filter methods (as they consider feature interactions within the
model).
 5.3.1. L1 Regularization (Lasso Regression)
o Concept: Adds a penalty to the sum of the absolute values of the
coefficients. This penalty forces some feature coefficients to become
exactly zero, effectively performing feature selection.
o Use Case: For linear models (linear regression, logistic regression)
when sparsity (many zero coefficients) is desired.
 5.3.2. Tree-based Models (e.g., Decision Trees, Random Forests,
Gradient Boosting)
o Concept: These models inherently perform feature selection by
assigning importance scores to features based on how much they
reduce impurity (e.g., Gini impurity or entropy) during tree
construction. Features that contribute more to impurity reduction are
considered more important.
o Use Case: Provides a natural way to rank and select features.
 5.3.3. Support Vector Machines (SVMs) with L1 Regularization
o Similar to Lasso, SVMs can also incorporate L1 penalties to achieve
sparsity in feature weights.
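A brief sketch of the two most common embedded approaches on the built-in diabetes dataset; alpha=1.0 and 100 trees are arbitrary illustrative settings.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# L1 regularization: some coefficients are driven exactly to zero.
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
print(lasso.coef_.round(1))          # zero entries correspond to dropped features

# Tree ensembles expose impurity-based importance scores for ranking features.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_.round(2))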

Reference:

 Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer. (Chapter 3:
Linear Methods for Regression)
Conclusion

Understanding and effectively applying feature engineering techniques, covering the different kinds of features and their transformation, construction, and selection, is
paramount for building robust and high-performing machine learning models. It's
often said that "feature engineering is the secret sauce" in applied machine
learning, as it allows domain knowledge to be incorporated into the data, leading to
superior model performance that raw data alone cannot provide. For exam
preparation, focus on not just the definitions but also the why and when to apply
each technique, along with illustrative examples.
