Basics of Feature Engineering
Dr. K. Viswavardhan Reddy
                     Asst Prof.
      Dept. of ETE, RVCE, Bengaluru - 560061
      Introduction to feature
• What is a feature?
     • A feature is an attribute of a data set that is used in a machine learning process.
     • A lot of research is happening in selection of features
• The features in a data set are also called its dimensions. So a data set having ‘n’ features is called an n-dimensional data set.
• Example: the Iris data set is 5-dimensional. Species is the class variable; the other four features are predictor variables.
    Introduction to feature
• What is feature engineering?
   • Feature engineering refers to the process of translating a data set into features such that these features are
     able to represent the data set more effectively and result in a better learning performance.
   • Feature engineering is a very important pre-processing step. It has two major elements: 1. feature transformation and 2. feature
     subset selection
• Feature transformation:
   • Transforms the data, whether structured or unstructured, into a new set of features that can represent the
     underlying problem which ML is trying to solve.
   • There are two variants of feature transformation:
       • 1. feature construction
       • 2. feature extraction
       • Both techniques are sometimes referred to together as feature discovery.
      Introduction to feature engineering
• What is feature construction?
   • The feature construction process discovers missing information about the relationships between features and augments
     the feature space by creating additional features.
   • Hence, if there are ‘n’ features or dimensions in a data set, after feature construction ‘m’ more features or dimensions
     may get added. So at the end, the data set will become ‘n + m’ dimensional.
• Feature extraction is the process of extracting or creating a new set of features from the original set of features using some
  functional mapping.
• Feature subset selection: unlike feature transformation, no new features are added.
    • The objective of feature selection is to derive a subset of features from the full feature set which is most meaningful in
       the context of a specific machine learning problem.
                                       FEATURE Construction
• Feature transformation is used as an effective tool for dimensionality reduction
• Goals:
   • Achieving best reconstruction of the original features in the data set
   • Achieving highest efficiency in the learning task
• Feature construction: It involves transforming a given set of input features to generate a new set of more
  powerful features.
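    • Example: a real estate data set having details of all apartments sold in a specific region.
    • The data set has three features: apartment length, apartment breadth and price. Adding a constructed feature, apartment area (length × breadth), transforms it into a 4-dimensional data set.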
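The apartment example can be sketched in a few lines of Python. This is only an illustration: the records and the key names (`length`, `breadth`, `price`, `area`) are made-up assumptions, not values from the original data set.

```python
# Hypothetical apartment records (values invented for illustration).
apartments = [
    {"length": 80, "breadth": 59, "price": 23_600_000},
    {"length": 54, "breadth": 45, "price": 12_150_000},
    {"length": 78, "breadth": 56, "price": 21_840_000},
]


def add_area(rows):
    """Return copies of the rows with a constructed 'area' feature."""
    return [{**row, "area": row["length"] * row["breadth"]} for row in rows]


augmented = add_area(apartments)

# The data set grows from n = 3 to n + m = 4 dimensions.
print(len(apartments[0]), "->", len(augmented[0]))
```

The original features are kept; feature construction only adds to the feature space.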
                               FEATURE TRANSFORMATION
• Encoding categorical (nominal) variables:
• Ex: athletes data set
• The data set has the features age, city of origin, parents athlete (i.e. whether either of the
parents was an athlete) and chance of win.
• Chance of win is the class variable; the other three are predictor variables.
• However, most ML algorithms work only with numerical data.
• Age is numerical; city of origin, parents athlete and chance of win are categorical.
                                FEATURE TRANSFORMATION
• Encoding categorical (nominal) variables:
• In this case, feature construction can be used to create new dummy features which are usable by machine learning
  algorithms.
• However, on closer examination, we see that the features ‘Parents athlete’ and ‘Chance of win’ in the original data set can
  take only two values each.
• So creating two dummy features for each of them is a kind of duplication, since the value of one dummy feature can be determined from the value
  of the other.
• To avoid this duplication, we can keep one dummy feature and eliminate the other.
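A minimal sketch of dummy (one-hot) encoding in plain Python follows. The helper `one_hot`, its `drop_first` flag, the key names and the three sample rows are all assumptions made for illustration; libraries such as pandas offer equivalent functionality.

```python
def one_hot(rows, column, drop_first=False):
    """Replace `column` with 0/1 dummy features, one per distinct value."""
    values = sorted({row[column] for row in rows})
    if drop_first:
        # For a two-valued feature the dropped dummy is implied by the other.
        values = values[1:]
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            new_row[f"{column}_{value}"] = int(row[column] == value)
        encoded.append(new_row)
    return encoded


# Hypothetical rows from an athletes data set.
athletes = [
    {"age": 18, "city": "City A", "parents_athlete": "Yes", "chance_of_win": "Y"},
    {"age": 20, "city": "City B", "parents_athlete": "No", "chance_of_win": "Y"},
    {"age": 23, "city": "City C", "parents_athlete": "Yes", "chance_of_win": "N"},
]

rows = one_hot(athletes, "city")                          # one dummy per city
rows = one_hot(rows, "parents_athlete", drop_first=True)  # keep a single dummy
rows = one_hot(rows, "chance_of_win", drop_first=True)    # keep a single dummy

for row in rows:
    print(row)
```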
                                 FEATURE TRANSFORMATION
• Encoding categorical (ordinal) variables:
• Ex: student data set (three variables: science marks, maths marks and grade)
• As we can see, the grade is an ordinal variable with values A, B, C, and D.
• To transform this variable to a numeric variable, we can create a feature num_grade mapping a numeric value against each
  ordinal value. In the context of the current example, grades A, B, C, and D are each mapped to a number that preserves their order.
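As a rough sketch, the ordinal mapping could look like this in Python. The numeric codes 1 to 4 and the sample marks are assumptions chosen for illustration; any codes that preserve the order of the grades would serve.

```python
# Assumed mapping: only the ordering of the codes matters.
GRADE_TO_NUM = {"A": 1, "B": 2, "C": 3, "D": 4}


def encode_grade(rows, mapping=GRADE_TO_NUM):
    """Replace the ordinal 'grade' feature with a numeric 'num_grade'."""
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != "grade"}
        new_row["num_grade"] = mapping[row["grade"]]
        encoded.append(new_row)
    return encoded


# Hypothetical student records.
students = [
    {"science_marks": 78, "maths_marks": 75, "grade": "B"},
    {"science_marks": 56, "maths_marks": 62, "grade": "C"},
    {"science_marks": 87, "maths_marks": 90, "grade": "A"},
]

print(encode_grade(students))
```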
                                 FEATURE TRANSFORMATION
• Transforming numeric (continuous) features to categorical features
• Sometimes there is a need of transforming a continuous numerical variable into a categorical variable.
• Ex: we may want to treat the real estate price prediction problem, which is a regression problem, as a real estate price category prediction,
  which is a classification problem.
• In that case, we can ‘bin’ the numerical data into multiple categories based on the data range.
• In the context of the real estate price prediction example, the original data set has a numerical feature apartment_price, which can be binned into price categories.
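A small sketch of binning apartment_price into categories is given below. The bin edges and the labels `low`, `medium` and `high` are assumptions for illustration only; in practice they would be chosen from the range of the data.

```python
from bisect import bisect_right

# Assumed bin edges: price < 5,000,000 is "low",
# 5,000,000 <= price < 10,000,000 is "medium", anything above is "high".
PRICE_EDGES = [5_000_000, 10_000_000]
PRICE_LABELS = ["low", "medium", "high"]


def bin_price(apartment_price):
    """Map a numeric price to its category label."""
    return PRICE_LABELS[bisect_right(PRICE_EDGES, apartment_price)]


prices = [3_200_000, 5_000_000, 9_999_999, 12_500_000]
print([bin_price(p) for p in prices])
```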
                                FEATURE TRANSFORMATION
• Text-specific feature construction
• In the current world, text is arguably the most predominant medium of communication.
• However, making sense of text data, due to the inherent unstructured nature of the data, is not so straightforward.
• Even when we have chunks of text data, there are no readily available features.
• All machine learning models need numerical data as input.
• So the text data in the data sets need to be transformed into numerical features.
• Vectorization is the process of converting words into numbers.
• In this process, word occurrences in all documents belonging to the corpus are consolidated in the form of a bag-of-words.
• There are three major steps that are followed:
     • 1. tokenize
     • 2. count
     • 3. normalize
                                FEATURE TRANSFORMATION
• Text-specific feature construction
• First, the text of each document is split into tokens, i.e. individual words (tokenize).
• Then the number of occurrences of each token is counted, for each document (count).
• Lastly, tokens are weighted with reducing importance when they occur in the majority of the documents (normalize).
• A matrix is then formed with each token representing a column and a specific document of the corpus representing each
  row.
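The tokenize, count and normalize steps can be sketched in plain Python as follows. This is one possible variant, not the only one: the weighting used here is count × log(N / document frequency), which is an assumption, and the function names and the tiny corpus are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Step 1: split a document into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix): one row per document, one column per token."""
    # Step 2: count token occurrences per document.
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))

    # Step 3: down-weight tokens that occur in many documents.
    n_docs = len(documents)
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocabulary}
    weight = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}

    matrix = [[c[t] * weight[t] for t in vocabulary] for c in counts]
    return vocabulary, matrix


corpus = ["The cat sat.", "The dog barked.", "The cat and the dog."]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print([round(value, 2) for value in row])
```

With this weighting, a token that appears in every document (here "the") gets weight 0, so it contributes nothing to telling the documents apart.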
                                   Feature extraction
• In feature extraction, new features are created from a combination of original features.
• Some of the commonly used operators for combining the original features include:
    • 1. For Boolean features: Conjunctions, Disjunctions, Negation, etc.
    • 2. For nominal features: Cartesian product, M of N, etc.
    • 3. For numerical features: Min, Max, Addition, Subtraction, Multiplication, Division, Average, Equivalence,
      Inequality, etc.
                                        Feature extraction
• Various Techniques:
     • Principal Component Analysis
• In PCA, a new set of features is extracted from the original features, and the new features are quite dissimilar in nature.
• So an ‘n’ dimensional feature space gets transformed to an ‘m’ dimensional feature space, where the dimensions are
  orthogonal to each other, i.e. uncorrelated with each other.
• The objective of PCA is to make the transformation in such a way that
• 1. The new features are distinct, i.e. the covariance between the new features (the principal components) is 0.
• 2. The principal components are generated in order of the variability in the data that they capture. Hence, the first principal
  component should capture the maximum variability, the second principal component should capture the next highest
  variability etc.
• 3. The sum of the variances of the new features (the principal components) should be equal to the sum of the variances of the
  original features, provided all components are retained.
                                  Feature extraction - PCA
• PCA works based on a process called eigenvalue decomposition of the covariance matrix of a data set.
• Below are the steps to be followed:
• 1. First, calculate the covariance matrix of the data set.
• 2. Then, calculate the eigenvalues and eigenvectors of the covariance matrix.
• 3. The eigenvector having the highest eigenvalue represents the direction in which there is the highest variance. So this will
  help in identifying the first principal component.
• 4. The eigenvector having the next highest eigenvalue represents the direction in which the data has the highest remaining
  variance and which is also orthogonal to the first direction. So this helps in identifying the second principal component.
• 5. Like this, identify the top ‘k’ eigenvectors having the top ‘k’ eigenvalues so as to get the ‘k’ principal components.
• Apart from PCA, we have SVD (singular value decomposition) and LDA (linear discriminant analysis).
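The PCA steps above can be sketched with NumPy as follows. This is a minimal illustration, assuming NumPy is available; the function name `pca` and the randomly generated data are not from the original material.

```python
import numpy as np


def pca(data, k):
    """Project `data` (rows = samples, columns = features) onto its top-k
    principal components. Returns (projected data, variance per component)."""
    centered = data - data.mean(axis=0)

    # Step 1: covariance matrix of the data set.
    covariance = np.cov(centered, rowvar=False)

    # Step 2: eigenvalues and eigenvectors (eigh returns them in ascending order).
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)

    # Steps 3 to 5: keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]

    return centered @ components, eigenvalues[order]


# Hypothetical 3-dimensional data with correlated features.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
data = rng.normal(size=(50, 3)) @ mixing

projected, variances = pca(data, k=2)
print(projected.shape)
print(variances)
```

Each eigenvalue is the variance captured by its principal component, which is why sorting by eigenvalue orders the components by the variability they explain.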
                                 Feature extraction - LDA
• The objective of LDA is similar to that of PCA in the sense that it intends to transform a data set into a lower dimensional feature space.
• However, unlike PCA, the focus of LDA is not to capture the data set variability.
• Instead, LDA focuses on class separability, i.e. separating the features based on class separability so as to avoid over-fitting
  of the machine learning model.
• Unlike PCA, which calculates eigenvalues of the covariance matrix of the data set, LDA calculates eigenvalues and
  eigenvectors of the within-class and between-class (inter-class) scatter matrices.
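For reference, the two scatter matrices are commonly written as below. The notation is an assumption, not taken from the slides: $D_c$ is the set of samples in class $c$, $N_c$ its size, $\mu_c$ the class mean and $\mu$ the overall mean.

```latex
S_W = \sum_{c=1}^{C} \sum_{x \in D_c} (x - \mu_c)(x - \mu_c)^{T}
\qquad
S_B = \sum_{c=1}^{C} N_c \, (\mu_c - \mu)(\mu_c - \mu)^{T}
\qquad
S_W^{-1} S_B \, w = \lambda \, w
```

The eigenvectors $w$ with the largest eigenvalues $\lambda$ give the directions along which the classes are best separated.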
FEATURE SUBSET SELECTION
• Issues with higher dimensional data:
    • With the rapid innovations in the digital space, the volume of data generated has increased to an unbelievable extent.
    • With breakthroughs in storage technology, storing large quantities of data has become quite cheap.
    • This has motivated the storage and mining of very large and high-dimensional data sets.
• Two application domains with very high-dimensional data:
    • biomedical research
    • text categorization
• A very high quantity of computational resources and a high amount of time will be required to learn from such data.
• The performance of the model, for both supervised and unsupervised machine learning tasks, also
  degrades sharply due to unnecessary noise in the data.
• Also, a model built on an extremely high number of features may be very difficult to understand.
• The objective of feature selection is three-fold:
    •   Having faster and more cost-effective (i.e. less need for computational resources) learning model
    •   Improving the efficiency of the learning model
    •   Having a better understanding of the underlying model that generated the data
                                              FEATURE SUBSET SELECTION
• Key drivers of feature selection – feature relevance and redundancy
    • Feature relevance
         • In supervised learning, each instance of the input data set has a class label attached.
         • Once a model is induced, it assigns a class label to new, unlabelled data.
         • Each predictor variable, or feature, is expected to contribute information towards deciding the class.
         • If a variable contributes no information, it is said to be irrelevant.
         • If it contributes only partially, it is said to be weakly relevant.
         • The remaining variables, which contribute substantially to the class information, are strongly relevant variables.
         • In unsupervised learning, there is no class label or training data; similar items are simply grouped together.
         • Certain variables do not contribute any useful information for deciding the similarity or dissimilarity of items. Those variables make no significant
           information contribution in the grouping process and are irrelevant.
         • Example: in a student data set, the roll number does not contribute any useful information.
                                           FEATURE SUBSET SELECTION
• Key drivers of feature selection – feature relevance and redundancy
    • Feature redundancy
         • With an increase in Age, Weight is expected to increase.
         • Similarly, with an increase in Height, Weight is also expected to increase.
         • Also, Age and Height increase with each other.
         • So, in the context of the Weight prediction problem, Age and Height contribute similar information.
         • So, do we need Age? If Height is already present, Age is potentially redundant.
• Now, the question is how to find out which features are irrelevant and which features are potentially redundant.
          FEATURE SUBSET SELECTION - Measures of feature relevance and redundancy
• Measures of feature relevance
• For supervised learning, mutual information is considered a good measure of the information a feature contributes towards deciding the class label.
• The higher the mutual information of a feature, the more relevant that feature is.
• For unsupervised learning, the entropy of the set of features without one feature at a time is calculated for all the features.
• Then, the features are ranked in descending order of information gain from a feature, and the top ‘β’ percentage (the value
of ‘β’ is a design parameter of the algorithm) of features are selected as relevant features.
• The entropy-based measure is used only for features that take discrete values.
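As a rough sketch of the supervised case, mutual information between a discrete feature and the class label can be estimated from co-occurrence counts. The function name and the toy data below are assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between a discrete feature and class labels."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)

    mi = 0.0
    for (f, c), count in joint.items():
        p_joint = count / n
        p_independent = (feature_counts[f] / n) * (label_counts[c] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


labels = ["win", "win", "lose", "lose"]
informative = [1, 1, 0, 0]    # determines the class completely
uninformative = [1, 0, 1, 0]  # tells nothing about the class

print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
```

A feature that fully determines a balanced two-class label carries 1 bit of information, while a feature that is independent of the label carries 0 bits.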
                                       FEATURE SUBSET SELECTION
• Measures of Feature redundancy: similar information contribution by multiple features.
• 1. Correlation-based measures
• 2. Distance-based measures, and
• 3. Other coefficient-based measure
• Correlation-based similarity measure: correlation is a measure of linear dependency between two random variables.
• For two random feature variables F1 and F2, the Pearson correlation coefficient is defined as:
   $$\rho(F_1, F_2) = \frac{\mathrm{cov}(F_1, F_2)}{\sqrt{\mathrm{var}(F_1)\,\mathrm{var}(F_2)}}$$
   Correlation values range between +1 and -1
   ±1 – perfect linear relationship
   0 – no linear relationship
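A minimal Python sketch of the Pearson correlation coefficient between two features is shown below; the function name and the sample values are illustrative assumptions.

```python
import math


def pearson(f1, f2):
    """Pearson correlation coefficient between two equal-length features."""
    n = len(f1)
    mean1, mean2 = sum(f1) / n, sum(f2) / n
    covariance = sum((a - mean1) * (b - mean2) for a, b in zip(f1, f2))
    variance1 = sum((a - mean1) ** 2 for a in f1)
    variance2 = sum((b - mean2) ** 2 for b in f2)
    # The 1/n factors of covariance and variances cancel out.
    return covariance / math.sqrt(variance1 * variance2)


height = [1, 2, 3, 4]
print(pearson(height, [2, 4, 6, 8]))  # increases with height
print(pearson(height, [8, 6, 4, 2]))  # decreases with height
```

A pair of features whose correlation is close to +1 or -1 carries largely the same information, so one of them is a candidate for removal.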
                                              FEATURE SUBSET SELECTION
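• Distance-based similarity measure: the most common distance measure is the Euclidean distance.
   Distance calculation between features F1 and F2, each having n values:
   $$d(F_1, F_2) = \sqrt{\sum_{i=1}^{n} (F_{1i} - F_{2i})^2}$$
• Minkowski distance: a more general form, which reduces to the Euclidean distance for r = 2 and to the Manhattan distance for r = 1:
   $$d(F_1, F_2) = \left( \sum_{i=1}^{n} |F_{1i} - F_{2i}|^{r} \right)^{1/r}$$
• Hamming distance: the number of positions at which two vectors differ. For example, the Hamming distance between two
  vectors 01101011 and 11001001 is 3.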
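These distance measures can be sketched in a few lines of Python. The function names are illustrative; the Hamming example reuses the two vectors given above.

```python
def minkowski(f1, f2, r):
    """Minkowski distance of order r between two numeric features."""
    return sum(abs(a - b) ** r for a, b in zip(f1, f2)) ** (1 / r)


def euclidean(f1, f2):
    """Minkowski distance with r = 2."""
    return minkowski(f1, f2, 2)


def manhattan(f1, f2):
    """Minkowski distance with r = 1."""
    return minkowski(f1, f2, 1)


def hamming(f1, f2):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(f1, f2))


print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(hamming("01101011", "11001001"))  # 3, as in the example above
```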
                                   FEATURE SUBSET SELECTION
• Other similarity measures
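• The Jaccard index/coefficient is used as a measure of similarity between two features.
• The Jaccard distance, a measure of dissimilarity between two features, is the complement of the Jaccard index (Jaccard distance = 1 - J).
• For two features having binary values, the Jaccard index is measured as
   $$J = \frac{n_{11}}{n_{01} + n_{10} + n_{11}}$$
   where n11 is the number of positions at which both features have value 1, n01 the number at which feature 1 is 0 and feature 2 is 1, and n10 the number at which feature 1 is 1 and feature 2 is 0.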
                                 FEATURE SUBSET SELECTION
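• Other similarity measures
• Simple matching coefficient (SMC): similar to the Jaccard index, except that it also counts the positions at which both features have value 0 (n00):
   $$SMC = \frac{n_{11} + n_{00}}{n_{00} + n_{01} + n_{10} + n_{11}}$$
• Cosine similarity: for two vectors x and y,
   $$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$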
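A short Python sketch comparing the Jaccard index and the SMC for two binary features follows. The function names and the two sample vectors are illustrative assumptions.

```python
def match_counts(f1, f2):
    """Return (n11, n00, n10, n01) for two equal-length binary features."""
    pairs = list(zip(f1, f2))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    return n11, n00, n10, n01


def jaccard_index(f1, f2):
    """Share of positions with a 1 in either feature where both are 1."""
    n11, _, n10, n01 = match_counts(f1, f2)
    return n11 / (n01 + n10 + n11)


def simple_matching_coefficient(f1, f2):
    """Share of all positions where the two features agree (1-1 or 0-0)."""
    n11, n00, n10, n01 = match_counts(f1, f2)
    return (n11 + n00) / (n00 + n01 + n10 + n11)


f1 = [1, 0, 1, 0, 1, 1]
f2 = [1, 1, 0, 0, 1, 0]
print(jaccard_index(f1, f2))
print(simple_matching_coefficient(f1, f2))
```

The Jaccard index ignores 0-0 matches, which makes it the more suitable choice when a shared 0 (e.g. a word absent from both documents) says little about similarity.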
                                     FEATURE SUBSET SELECTION
• Cosine similarity actually measures the angle (refer to Fig.) between the x and y vectors.
• Hence, if cosine similarity has a value of 1, the angle between x and y is 0°, which means x and y point in the same direction and differ only in
  magnitude.
• If cosine similarity is 0, the angle between x and y is 90°. Hence, they do not share any similarity (in the case of text data, no
  term/word is common).
• In the above example, the angle comes out to be 43.2°.
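Cosine similarity and the corresponding angle can be sketched in Python as follows. The function names and sample vectors are assumptions for illustration and are not the vectors of the example referred to above.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def angle_degrees(x, y):
    """Angle between x and y in degrees."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return math.degrees(math.acos(cosine))


print(cosine_similarity([1, 2], [2, 4]))  # same direction, different magnitude
print(angle_degrees([1, 0], [0, 1]))      # nothing in common
```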
                                              FEATURE SUBSET SELECTION
• Overall feature selection process:
     • Feature selection is the process of selecting a subset of features in a data set.
• 1. generation of possible subsets
• 2. subset evaluation
• 3. stop searching based on some stopping criterion
• 4. validation of the result
• Subset generation:
     • For an ‘n’ dimensional data set, 2^n subsets can be generated.
     • So, as the value of ‘n’ becomes high, finding an optimal subset from all the 2^n candidate subsets becomes intractable.
     • Sequential forward selection: start with an empty set and keep adding features.
     • Sequential backward elimination: start with a full set and successively remove features.
• Stopping criterion:
     • 1. the search completes
     • 2. some given bound (e.g. a specified number of iterations) is reached
     • 3. subsequent addition (or deletion) of the feature is not producing a better subset
     • 4. a sufficiently good subset (e.g. a subset having better classification accuracy than the existing benchmark) is selected
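Sequential forward selection with two of the stopping criteria can be sketched as below. The `evaluate` callback is a stand-in for subset evaluation (for example, training a model and measuring its accuracy); the toy scoring function and feature names are hypothetical and exist only to make the sketch runnable.

```python
def sequential_forward_selection(features, evaluate, max_features=None):
    """Greedy search: start from an empty set and keep adding the best feature."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")

    # Stop when a given bound on the subset size is reached.
    while remaining and (max_features is None or len(selected) < max_features):
        candidate, score = max(
            ((f, evaluate(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        # Stop when adding a feature does not produce a better subset.
        if score <= best_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score

    return selected, best_score


# Hypothetical usefulness of each feature, with a small cost per feature used.
GAINS = {"age": 0.05, "height": 0.6, "roll_number": 0.0, "diet": 0.3}


def toy_score(subset):
    return sum(GAINS[f] for f in subset) - 0.1 * len(subset)


selected, score = sequential_forward_selection(GAINS, toy_score)
print(selected, round(score, 2))
```

The greedy search evaluates on the order of n² subsets instead of all 2^n, at the price of possibly missing the optimal subset.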
                                   Feature selection approaches
• There are four types of approach for feature selection:
• 1. Filter approach
• 2. Wrapper approach
• 3. Hybrid approach = filter (statistical) + wrapper (algorithm)
• 4. Embedded approach
• Filter approach: the feature subset is selected based on statistical measures.
• Wrapper approach:
     • In the wrapper approach, identification of the best feature subset is done using the induction algorithm as a black box.
     • The feature selection algorithm searches for a good feature subset.
     • For every candidate subset, the learning model is trained and the result is evaluated by running the learning algorithm.
     • Hence, the wrapper approach is computationally very expensive.
     • However, the performance is generally superior compared to the filter approach.
                                   Feature selection approaches
• There are four types of approach for feature selection:
• 4. Embedded approach
     • The embedded approach is quite similar to the wrapper approach.
     • It also uses the induction algorithm to evaluate the generated feature subsets.
     • However, the difference is that it performs feature selection and classification simultaneously.
                                                                                                             Go, change the world
    RV College of
    Engineering
                              ACTIVE LEARNING
1. Find the Hamming distance between 10001011 and 11001111.
2. Compare the Jaccard index and simple matching coefficient of two features having values (1, 1, 0, 0, 1, 0,
1, 1) and (1, 0, 0, 1, 1, 0, 0, 1).
3. Find the Jaccard index and SMC of the following:
Claim A = (R,R,R,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G)
Claim B = (R,R,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G)
4. Two rows in a document-term matrix have values (2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3, 1).
Find the cosine similarity.
5. Find the cosine similarity between two vectors ‘x’ and ‘y’:
The ‘x’ vector has values, x = { 3, 2, 0, 5 }
The ‘y’ vector has values, y = { 1, 0, 0, 0 }
  6/13/2023                                                                                                                  30
                      ACTIVE LEARNING
Euclidean & manhattan distance
4.3 Overall feature selection process