Chapter 4
Similarity-based Learning

"Anyone who stops learning is old, whether at twenty or eighty." - Henry Ford

Similarity-based Learning is a supervised learning technique that predicts the class label of a test instance by gauging the similarity of this test instance with the training instances. Similarity-based learning refers to a family of instance-based learning methods that is used to solve both classification and regression problems. Instance-based learning makes predictions by computing distances or similarities between the test instance and a specific set of training instances local to the test instance, in an incremental process. In contrast to other learning mechanisms, it considers only the nearest instance or instances to predict the class of unseen instances. This learning methodology improves the performance of classification since it uses only a specific set of instances as an incremental learning task. Similarity-based classification is useful in various fields such as image processing, text classification, pattern recognition, bioinformatics, data mining, information retrieval, natural language processing, etc. A practical application of this learning is predicting daily stock index price changes. This chapter provides an insight into how different similarity-based models predict the class of a new instance.

Learning Objectives
* Understand the fundamentals of instance-based learning
* Know about the concepts of Nearest-Neighbor learning using the algorithm called k-Nearest-Neighbors (k-NN)
* Learn about the Weighted k-Nearest-Neighbor classifier that chooses the neighbors by using the weighted distance
* Gain knowledge about the Nearest Centroid classifier, a simple alternative to k-NN classifiers
* Understand Locally Weighted Regression (LWR), which approximates the linear functions of all k neighbors to minimize the error while predicting

4.1 INTRODUCTION TO SIMILARITY OR INSTANCE-BASED LEARNING

Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a test instance, which works in contrast with other learning mechanisms such as decision trees or neural networks. Similarity-based learning is also called instance-based learning or just-in-time learning, since it does not build an abstract model of the training instances and performs lazy learning when classifying a new instance. This learning mechanism simply stores all data and uses it only when it needs to classify an unseen instance. The advantage of this learning is that processing occurs only when a request to classify a new instance is given. This methodology is particularly useful when the whole dataset is not available in the beginning but is collected in an incremental manner. The drawback of this learning is that it requires a large memory to store the data, since a global abstract model is not constructed initially from the training data.

Classification of instances is done based on a measure of similarity in the form of distance functions over data instances. Several distance metrics are used to estimate the similarity or dissimilarity between instances, as required for clustering, nearest-neighbor classification, anomaly detection, and so on. Popular distance metrics are Hamming distance, Euclidean distance, Manhattan distance, Minkowski distance, Cosine similarity, Mahalanobis distance, Pearson's correlation or correlation similarity, Mean squared difference, Jaccard coefficient, Tanimoto coefficient, etc.
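As a small illustration of how a few of these metrics are computed, the following Python sketch (our own illustrative code; the function names and sample vectors are not taken from the text) evaluates the Euclidean, Manhattan and Hamming distances for a pair of feature vectors.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions at which the (categorical) values differ
    return sum(1 for x, y in zip(a, b) if x != y)

p, q = (9.2, 85, 8), (6.1, 40, 5)                  # two illustrative feature vectors
print(round(euclidean(p, q), 2))                   # ~45.21
print(round(manhattan(p, q), 2))                   # 51.1
print(hamming(("Yes", "No"), ("Yes", "Yes")))      # 1
```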
Generally, similarity-based classification problems formulate the features of the test instance and the training instances in Euclidean space in order to learn the similarity or dissimilarity between instances.

4.1.1 Differences Between Instance- and Model-based Learning

An instance is an entity or an example in the training dataset. It is described by a set of features or attributes, one of which describes the class label or category of the instance. Instance-based methods learn or predict the class label of a test instance only when a new instance is given for classification, and until then they delay the processing of the training dataset. They are also referred to as lazy learning methods, since they do not generalize any model from the training dataset but just keep the training dataset as a knowledge base until a new instance is given.

In contrast, model-based learning, generally referred to as eager learning, tries to generalize the training data to a model before receiving test instances. Model-based machine learning describes all assumptions about the problem domain in the form of a model. These algorithms basically learn in two phases, called the training phase and the testing phase. In the training phase, a model is built from the training dataset, and it is used to classify a test instance during the testing phase. Some examples of models constructed are decision trees, neural networks, and Support Vector Machines (SVM). The differences between instance-based learning and model-based learning are listed in Table 4.1.

Table 4.1: Differences between Instance-based Learning and Model-based Learning

Instance-based Learning                                                        | Model-based Learning
Lazy learners                                                                  | Eager learners
Processing of training instances is done only during the testing phase        | Processing of training instances is done during the training phase
No model is built with the training instances before a test instance arrives  | Generalizes a model from the training instances before a test instance arrives
Predicts the class of the test instance directly from the training data       | Predicts the class of the test instance from the model built
Slow in the testing phase                                                      | Fast in the testing phase
Learns by making many local approximations                                     | Learns by creating a global approximation

Instance-based learning also comes under the category of memory-based models, which normally compare the given test instance with the trained instances that are stored in memory. Memory-based models classify a test instance by checking its similarity with the training instances. Some examples of instance-based learning algorithms are:
1. k-Nearest Neighbor (k-NN)
2. Variants of Nearest Neighbor learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)
5. Self-Organizing Map (SOM)
6. Radial Basis Function (RBF) networks

In this chapter, we will discuss certain instance-based learning algorithms such as k-Nearest Neighbor (k-NN), variants of Nearest Neighbor learning, and Locally Weighted Regression. Self-Organizing Map (SOM) and Radial Basis Function (RBF) networks are discussed along with the concepts of artificial neural networks in Chapter 10, since they can be taken up only after an understanding of neural networks. These instance-based methods have serious limitations regarding the range of feature values taken. Moreover, they are sensitive to irrelevant and correlated features, leading to misclassification of instances.
4.2 NEAREST-NEIGHBOR LEARNING

A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a non-parametric method used for both classification and regression problems. It is a simple and powerful non-parametric algorithm that predicts the category of the test instance according to the 'k' training samples which are closest to the test instance, and classifies it to the category which has the largest probability. A visual representation of this learning is shown in Figure 4.1. There are two classes of objects, called C1 and C2, in the figure. When given a test instance T, the category of this test instance is determined by looking at the class of its k = 3 nearest neighbors. Thus, the class of the test instance T is predicted as the class to which the majority of its three nearest neighbors belong.

Figure 4.1: Visual Representation of k-Nearest Neighbor Learning

The algorithm relies on the assumption that similar objects are close to each other in the feature space. k-NN performs instance-based learning: it just stores the training data instances and learns the instances case by case. The model is also 'memory-based', as it uses the training data at the time when predictions need to be made. It is a lazy learning algorithm, since no prediction model is built beforehand with the training instances and classification happens only after getting the test instance.

The algorithm classifies a new instance by determining the 'k' most similar instances (the k nearest neighbors) and summarizing the output of those 'k' instances. If the target variable is discrete, then it is a classification problem, so the most common class value among the 'k' instances is selected by a majority vote. However, if the target variable is continuous, then it is a regression problem, and hence the mean output value of the 'k' instances is the output of the test instance.

The most popular distance measure, the Euclidean distance, is used in k-NN to determine the 'k' instances which are similar to the test instance. The value of 'k' is best determined by tuning with different 'k' values and choosing the 'k' which classifies the test instance most accurately.

Algorithm 4.1: k-Nearest Neighbor (k-NN)

Inputs: Training dataset T, distance metric d, test instance t, the number of nearest neighbors k
Output: Predicted class or category
Prediction: For test instance t,
1. For each instance i in T, compute the distance between the test instance t and instance i using the distance metric (Euclidean distance).
   [Continuous attributes - Euclidean distance between two points in the plane with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = sqrt((x1 - x2)^2 + (y1 - y2)^2).]
   [Categorical attributes (binary) - Hamming distance: if the values of the two instances are the same, the distance d is 0; otherwise d = 1.]
2. Sort the distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class of the test instance by majority voting (if the target attribute is discrete valued) or by the mean (if the target attribute is continuous valued) of the k selected nearest instances.

Example 4.1: Consider the student performance training dataset of 8 data instances shown in Table 4.2, which describes the performance of individual students in a course along with the CGPA obtained in the previous semesters. The independent attributes are CGPA, Assessment and Project Submitted. The target variable is 'Result', which is a discrete valued variable that takes the two values 'Pass' or 'Fail'.
Based on the performance of a student, classify whether the student will pass or fail the course.

Table 4.2: Training Dataset

S.No.  CGPA  Assessment  Project Submitted  Result
1.     9.2   85          8                  Pass
2.     8.0   80          7                  Pass
3.     8.5   81          8                  Pass
4.     6.0   45          5                  Fail
5.     6.5   50          4                  Fail
6.     8.2   72          7                  Pass
7.     5.8   38          5                  Fail
8.     8.9   91          9                  Pass

Solution: Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail}, also called classes, we need to use the training set to classify the test instance using the Euclidean distance. The task of classification is to assign a category or class to an arbitrary instance. Assign k = 3.

Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, 5) and each of the training instances, as shown in Table 4.3.

Table 4.3: Euclidean Distance

S.No.  (CGPA, Assessment, Project)  Result  Euclidean distance to (6.1, 40, 5)
1.     (9.2, 85, 8)                 Pass    sqrt((9.2 - 6.1)^2 + (85 - 40)^2 + (8 - 5)^2) = 45.2064
2.     (8.0, 80, 7)                 Pass    sqrt((8.0 - 6.1)^2 + (80 - 40)^2 + (7 - 5)^2) = 40.0950
3.     (8.5, 81, 8)                 Pass    sqrt((8.5 - 6.1)^2 + (81 - 40)^2 + (8 - 5)^2) = 41.1796
4.     (6.0, 45, 5)                 Fail    sqrt((6.0 - 6.1)^2 + (45 - 40)^2 + (5 - 5)^2) = 5.001
5.     (6.5, 50, 4)                 Fail    sqrt((6.5 - 6.1)^2 + (50 - 40)^2 + (4 - 5)^2) = 10.0578
6.     (8.2, 72, 7)                 Pass    sqrt((8.2 - 6.1)^2 + (72 - 40)^2 + (7 - 5)^2) = 32.1311
7.     (5.8, 38, 5)                 Fail    sqrt((5.8 - 6.1)^2 + (38 - 40)^2 + (5 - 5)^2) = 2.0224
8.     (8.9, 91, 9)                 Pass    sqrt((8.9 - 6.1)^2 + (91 - 40)^2 + (9 - 5)^2) = 51.2332

Step 2: Sort the distances in ascending order and select the first 3 nearest training data instances to the test instance. The selected nearest neighbors are shown in Table 4.4.

Table 4.4: Nearest Neighbors

Instance  Euclidean Distance  Class
7         2.0224              Fail
4         5.001               Fail
5         10.0578             Fail

Here, we take the 3 nearest neighbors as instances 4, 5 and 7, which have the smallest distances.

Step 3: Predict the class of the test instance by majority voting. The class of the test instance is predicted as 'Fail'.

Data normalization/standardization is required when the features have different ranges or a wide range of possible values, in order to transform all features to a comparable range when computing distances. This is done to eliminate the influence of one feature over another (i.e., to give all features equal chances). For example, if one feature has values in the range [0-1] and another feature has values in the range [0-100], then the second feature will influence the distance more, even for a small variation, than the first feature.

k-NN classifier performance is strictly affected by three factors: the number of nearest neighbors (i.e., the selection of k), the distance metric, and the decision rule. If the selected k value is small, it may result in overfitting or a less stable model, while if it is big, it may include many irrelevant points from other classes. The choice of the distance metric also plays a major role and depends on the type of the independent attributes in the training dataset. The k-NN classification algorithm best suits lower dimensional data, as in a high-dimensional space the nearest neighbors may not be very close at all.
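The worked example above can be reproduced with a few lines of Python. The following sketch is a minimal, illustrative implementation of Algorithm 4.1 on the data of Table 4.2; the function and variable names are our own, not part of the text.

```python
import math
from collections import Counter

# Student performance data of Table 4.2: (CGPA, Assessment, Project) -> Result
train = [
    ((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
    ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass"),
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(test, train, k=3):
    # Steps 1-2: compute all distances and keep the k smallest
    neighbours = sorted(train, key=lambda item: euclidean(test, item[0]))[:k]
    # Step 3: majority vote among the classes of the k nearest neighbours
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((6.1, 40, 5), train))   # 'Fail', matching Example 4.1
```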
4.3 WEIGHTED K-NEAREST-NEIGHBOR ALGORITHM

The Weighted k-NN is an extension of k-NN that chooses the neighbors by using the weighted distance. The k-Nearest Neighbor (k-NN) algorithm has some serious limitations, as its performance is solely dependent on the choice of the k nearest neighbors, the distance metric used and the decision rule. The principal idea of Weighted k-NN is that the k closest neighbors to the test instance are assigned a higher weight in the decision than neighbors that are farther away from the test instance; that is, the weights are inversely proportional to the distances. The selected k nearest neighbors can be assigned uniform weights, which means all the instances in the neighborhood are weighted equally, or the weights can be assigned by the inverse of their distance. In the second case, closer neighbors of a query point have a greater influence than neighbors which are farther away.

Algorithm 4.2: Weighted k-Nearest Neighbor (Weighted k-NN)

Inputs: Training dataset T, distance metric d, weighting function w(i), test instance t, the number of nearest neighbors k
Output: Predicted class or category
Prediction: For test instance t,
1. For each instance i in the training dataset T, compute the distance between the test instance t and instance i using the distance metric (Euclidean distance).
   [Continuous attributes - Euclidean distance between two points in the plane with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = sqrt((x1 - x2)^2 + (y1 - y2)^2).]
   [Categorical attributes (binary) - Hamming distance: if the values of the two instances are the same, the distance d is 0; otherwise d = 1.]
2. Sort the distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class of the test instance by the weighted voting technique (weighting function w(i)) over the k selected nearest instances:
   * Compute the inverse of each distance of the k selected nearest instances.
   * Find the sum of the inverses.
   * Compute each weight by dividing the inverse distance by the sum. (Each weight is a vote for its associated class.)
   * Add the weights of the same class.
   * Predict the class by choosing the class with the maximum vote.

Example 4.2: Consider the same training dataset given in Table 4.2. Use Weighted k-NN and determine the class of the test instance (7.6, 60, 8).

Solution:
Step 1: Given the test instance (7.6, 60, 8) and the set of classes {Pass, Fail}, use the training dataset to classify the test instance using the Euclidean distance and the weighting function. Assign k = 3. The distance calculation is shown in Table 4.5.

Table 4.5: Euclidean Distance

S.No.  (CGPA, Assessment, Project)  Result  Euclidean distance to (7.6, 60, 8)
1.     (9.2, 85, 8)                 Pass    sqrt((9.2 - 7.6)^2 + (85 - 60)^2 + (8 - 8)^2) = 25.0512
2.     (8.0, 80, 7)                 Pass    sqrt((8.0 - 7.6)^2 + (80 - 60)^2 + (7 - 8)^2) = 20.0290
3.     (8.5, 81, 8)                 Pass    sqrt((8.5 - 7.6)^2 + (81 - 60)^2 + (8 - 8)^2) = 21.0193
4.     (6.0, 45, 5)                 Fail    sqrt((6.0 - 7.6)^2 + (45 - 60)^2 + (5 - 8)^2) = 15.3805
5.     (6.5, 50, 4)                 Fail    sqrt((6.5 - 7.6)^2 + (50 - 60)^2 + (4 - 8)^2) = 10.8264
6.     (8.2, 72, 7)                 Pass    sqrt((8.2 - 7.6)^2 + (72 - 60)^2 + (7 - 8)^2) = 12.0565
7.     (5.8, 38, 5)                 Fail    sqrt((5.8 - 7.6)^2 + (38 - 60)^2 + (5 - 8)^2) = 22.2764
8.     (8.9, 91, 9)                 Pass    sqrt((8.9 - 7.6)^2 + (91 - 60)^2 + (9 - 8)^2) = 31.0434

Step 2: Sort the distances in ascending order and select the first 3 nearest training data instances to the test instance. The selected nearest neighbors are shown in Table 4.6.

Table 4.6: Nearest Neighbors

Instance  Euclidean Distance  Class
5         10.8264             Fail
6         12.0565             Pass
4         15.3805             Fail

Step 3: Predict the class of the test instance by the weighted voting technique from the 3 selected nearest instances.
* Compute the inverse of each distance of the 3 selected nearest instances, as shown in Table 4.7.

Table 4.7: Inverse Distance

Instance  Euclidean Distance  Inverse Distance  Class
4         15.3805             0.06502           Fail
5         10.8264             0.09237           Fail
6         12.0565             0.08294           Pass

* Find the sum of the inverses.
  Sum = 0.06502 + 0.09237 + 0.08294 = 0.24033
* Compute each weight by dividing the inverse distance by the sum, as shown in Table 4.8.

Table 4.8: Weight Calculation

Instance  Euclidean Distance  Inverse Distance  Weight = Inverse Distance / Sum  Class
4         15.3805             0.06502           0.270545                         Fail
5         10.8264             0.09237           0.384347                         Fail
6         12.0565             0.08294           0.345109                         Pass

* Add the weights of the same class.
  Fail = 0.270545 + 0.384347 = 0.654892
  Pass = 0.345109
* Predict the class by choosing the class with the maximum vote. The class is predicted as 'Fail'.
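A minimal Python sketch of Algorithm 4.2, again using the Table 4.2 data, is given below; the names are our own and the code is illustrative rather than a reference implementation.

```python
import math
from collections import defaultdict

train = [((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
         ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
         ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass")]

def weighted_knn_predict(test, train, k=3):
    # k nearest neighbours by Euclidean distance
    nearest = sorted(train, key=lambda item: math.dist(test, item[0]))[:k]
    # Inverse-distance weights, normalised so that they sum to 1
    inv = [1.0 / math.dist(test, x) for x, _ in nearest]
    votes = defaultdict(float)
    for w, (_, label) in zip(inv, nearest):
        votes[label] += w / sum(inv)          # each weight is a vote for its class
    return max(votes, key=votes.get)

print(weighted_knn_predict((7.6, 60, 8), train))  # 'Fail' (weight ~0.65 vs ~0.35 for 'Pass')
```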
4.4 NEAREST CENTROID CLASSIFIER

A simple alternative to k-NN classifiers for similarity-based classification is the Nearest Centroid Classifier. It is a simple classifier, also called the Mean Difference classifier. The idea of this classifier is to classify a test instance to the class whose centroid/mean is closest to that instance.

Algorithm 4.3: Nearest Centroid Classifier

Inputs: Training dataset T, distance metric d, test instance t
Output: Predicted class or category
1. Compute the mean/centroid of each class.
2. Compute the distance between the test instance and the mean/centroid of each class (Euclidean distance).
3. Predict the class by choosing the class with the smaller distance.

Example 4.3: Consider the sample data shown in Table 4.9 with two features x and y. The target classes are 'A' or 'B'. Predict the class of a test instance using the Nearest Centroid Classifier.

Table 4.9: Sample Data

x  y  Class
3  1  A
5  2  A
4  3  A
7  6  B
6  7  B
8  5  B

Solution:
Step 1: Compute the mean/centroid of each class. In this example there are two classes, called 'A' and 'B'.
Centroid of class 'A' = ((3 + 5 + 4)/3, (1 + 2 + 3)/3) = (12/3, 6/3) = (4, 2)
Centroid of class 'B' = ((7 + 6 + 8)/3, (6 + 7 + 5)/3) = (21/3, 18/3) = (7, 6)
Now, given a test instance (6, 5), we can predict its class.
Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each of the centroids.
Euc_Dist[(6, 5); (4, 2)] = sqrt((6 - 4)^2 + (5 - 2)^2) = sqrt(13) = 3.606
Euc_Dist[(6, 5); (7, 6)] = sqrt((6 - 7)^2 + (5 - 6)^2) = sqrt(2) = 1.414
The test instance has a smaller distance to class 'B'. Hence, the class of this test instance is predicted as 'B'.
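The centroid computation of Example 4.3 can be sketched in Python as follows (illustrative code; the function and variable names are our own).

```python
import math
from collections import defaultdict

def nearest_centroid_predict(test, train):
    # Step 1: collect the points of each class and compute its centroid (mean)
    groups = defaultdict(list)
    for x, label in train:
        groups[label].append(x)
    centroids = {label: tuple(sum(col) / len(pts) for col in zip(*pts))
                 for label, pts in groups.items()}
    # Steps 2-3: choose the class whose centroid is nearest to the test instance
    return min(centroids, key=lambda label: math.dist(test, centroids[label]))

data = [((3, 1), "A"), ((5, 2), "A"), ((4, 3), "A"),
        ((7, 6), "B"), ((6, 7), "B"), ((8, 5), "B")]   # Table 4.9
print(nearest_centroid_predict((6, 5), data))           # 'B', as in Example 4.3
```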
4.5 LOCALLY WEIGHTED REGRESSION (LWR)

Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that performs local regression by combining a regression model with a nearest neighbor model. LWR is also referred to as a memory-based method, as it requires the training data while making a prediction but uses only the training data instances locally around the point of interest. Using the nearest neighbors algorithm, we find the instances that are closest to the test instance and fit a linear function to those 'k' nearest instances in the local regression model. The key idea is to approximate the linear functions of all 'k' neighbors so as to minimize the error; the resulting prediction is then no longer a single straight line but rather a curve.

Ordinary linear regression finds a linear relationship between the input x and the output y. Given a training dataset T and a hypothesis function h_beta(x), the predicted target output is a linear function, where beta0 is the intercept and beta1 is the coefficient of x. It is given in Eq. (4.1) as:
    h_beta(x) = beta0 + beta1 * x                                          (4.1)
The cost function minimizes the error difference between the predicted value h_beta(x) and the true value y, and is given in Eq. (4.2) as:
    J(beta) = (1/(2m)) * sum_{i=1..m} (h_beta(x_i) - y_i)^2                (4.2)
where 'm' is the number of instances in the training dataset.
The cost function is now modified for locally weighted linear regression by including weights only for the nearest neighbor points. Hence, the cost function is given in Eq. (4.3) as:
    J(beta) = (1/2) * sum_i w_i * (h_beta(x_i) - y_i)^2                    (4.3)
where w_i is the weight associated with each x_i. The weight function used is a Gaussian kernel that gives a higher value for instances that are close to the test instance; for instances far away, it tends to zero but never becomes exactly zero. w_i is computed in Eq. (4.4) as:
    w_i = exp( -(x_i - x)^2 / (2 * tau^2) )                                (4.4)
where tau is called the bandwidth parameter and controls the rate at which w_i reduces to zero with the distance of x_i from x.

Example 4.4: Consider the simple example with four instances shown in Table 4.10 and apply locally weighted regression.

Table 4.10: Sample Data

S.No.  Salary (in lakhs)  Expenditure (in thousands)
1.     3                  25
2.     1                  5
3.     2                  7
4.     1                  8

Solution: Using the linear regression model, assume we have computed the parameters beta0 = 4.72 and beta1 = 0.62.
Given a test instance with x = 2, the predicted y' is:
    y' = beta0 + beta1 * x = 4.72 + 0.62 * 2 = 5.96
Applying the nearest neighbor model, we choose the k = 3 closest instances. Table 4.11 shows the distance calculation for the training instances.

Table 4.11: Distance Calculation

S.No.  Salary (x_i)  Distance |x_i - 2|
1.     3             1
2.     1             1
3.     2             0
4.     1             1

Instances 2, 3 and 4 are chosen as the closer instances with smaller distances. The mean value of their outputs = (5 + 7 + 8)/3 = 20/3 = 6.67.
Using Eq. (4.4), compute the weights for the closest instances with the Gaussian kernel, taking the bandwidth parameter tau so that exp(-1/(2*tau^2)) is approximately 0.043 (tau of about 0.4):
Weight of instance 2: w_2 = exp(-(1 - 2)^2 / (2*tau^2)) = 0.043
Weight of instance 3: w_3 = exp(-(2 - 2)^2 / (2*tau^2)) = 1  [x_3 is closest, hence it gets the highest weight]
Weight of instance 4: w_4 = exp(-(1 - 2)^2 / (2*tau^2)) = 0.043
The predicted outputs for the three closer instances are:
Instance 2: y'_2 = h_beta(x_2) = 4.72 + 0.62 * 1 = 5.34
Instance 3: y'_3 = h_beta(x_3) = 4.72 + 0.62 * 2 = 5.96
Instance 4: y'_4 = h_beta(x_4) = 4.72 + 0.62 * 1 = 5.34
The error value is calculated using Eq. (4.3) as:
    J(beta) = (1/2) * (0.043*(5.34 - 5)^2 + 1*(5.96 - 7)^2 + 0.043*(5.34 - 8)^2) = 0.6953
Now, we need to adjust this cost function to minimize the error difference and obtain the optimal beta parameters.
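The weighting and cost computation of Example 4.4 can be sketched in Python as follows. The bandwidth value tau = 0.4 is an assumption chosen to reproduce the weights used above, and the neighbour list is the one selected in the example; all names are our own.

```python
import math

neighbours = [(1, 5), (2, 7), (1, 8)]   # the three closest instances chosen in Example 4.4
b0, b1 = 4.72, 0.62                     # linear-regression parameters assumed in the example
query, tau = 2, 0.4                     # query point and (assumed) Gaussian bandwidth

# Gaussian kernel weights of Eq. (4.4): nearby points get weights close to 1
weights = [math.exp(-(x - query) ** 2 / (2 * tau ** 2)) for x, _ in neighbours]

# Locally weighted cost of Eq. (4.3): J = 1/2 * sum of w_i * (prediction_i - y_i)^2
cost = 0.5 * sum(w * ((b0 + b1 * x) - y) ** 2 for w, (x, y) in zip(weights, neighbours))

print([round(w, 3) for w in weights])   # [0.044, 1.0, 0.044]
print(round(cost, 3))                   # ~0.699 (0.6953 in the text, where the weights are rounded to 0.043)
```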
Summary
1. Similarity-based learning is a supervised learning technique that predicts the class label of a test instance by measuring the similarity of this test instance with the training instances.
2. Similarity-based classification problems formulate the features of the test instance and the training instances in Euclidean space to learn the similarity or dissimilarity between instances.
3. A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a non-parametric method used for both classification and regression problems.
4. k-NN predicts the category of the test instance according to the 'k' training samples which are closer to the test instance and classifies it to the category which has the largest probability.
5. Data normalization/standardization is required when the features have different ranges or a wide range of possible values when computing distances, in order to transform all features to a comparable range.
6. k-NN best suits lower dimensional data, as in a high-dimensional space the nearest neighbors may not be very close at all.
7. Weighted k-NN assigns a higher weight in the decision to the k closest neighbors of the test instance than to neighbors that are farther away.
8. The Nearest Centroid Classifier is a simple classifier, also called the Mean Difference classifier, that classifies a test instance to the class whose centroid/mean is closest to that instance.
9. Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that performs local regression by combining a regression model with a nearest neighbor model.

Key Terms
* Instance - An entity or an example in the training dataset.
* Instance-based Methods - Learn or predict the class label of a test instance only when a new instance is given for classification.
* Model-based Machine Learning - Describes all assumptions about the problem domain in the form of a model.
* Lazy Learning - Methods that do not generalize any model from the training dataset but just keep the training dataset as a knowledge base until a new instance is given.
* Eager Learning - Methods that generalize the training data to a model before receiving test instances.
* Memory-based Models - Classify a test instance by its similarity with the training instances stored in memory.

Review Questions
1. What do you understand by similarity-based learning?
2. Compare and contrast instance-based learning and model-based learning.
3. Why are instance-based learners called lazy learners?
4. Differentiate between lazy learning and eager learning.
5. Why is the k-NN method called a memory-based method?
6. Why is data normalization/standardization required in k-NN?
7. What are the benefits and limitations of the k-NN algorithm?
8. Consider the following training dataset of 10 data instances shown in Table 4.12, which describes the award performance of individual students based on GPA and the number of projects done. The target variable is 'Award', which is a discrete valued variable that takes the two values 'Yes' or 'No'.

Table 4.12: Training Dataset

S.No.  GPA  No. of Projects Done  Award
1.     9.5  -                     Yes
2.     8.0  -                     Yes
3.     7.2  -                     No
4.     6.5  -                     Yes
5.     9.5  -                     Yes
6.     3.2  -                     No
7.     6.6  -                     No
8.     5.4  -                     No
9.     8.9  -                     Yes
10.    7.2  4                     Yes

Given a test instance (GPA = 7.8, No. of projects done = 4), use the training set to classify the test instance. Choose k = 3.
* k-Nearest Neighbor classifier
* Weighted k-Nearest Neighbor classifier
* Nearest Centroid Classifier

9. A COVID care centre decides to develop a case-based reasoning system to predict whether a person will test positive or negative based on the symptoms. The table below shows a number of possible symptoms and the results of previous cases. The training dataset contains the instances shown in Table 4.13.

Table 4.13: Sample Set of Instances
(Fever, Dry Cough, Tiredness, Sore Throat, Diarrhea, Headache, Loss of Taste or Smell, Shortness of Breath -> Covid Result)

1.   -, -, -, -, -, -, -, -            ->  Yes
2.   Yes, No, Yes, No, No, Yes, No, No ->  No
3.   No, No, No, No, No, No, No, No    ->  No
4.   Yes, Yes, Yes, No, No, No, No, No ->  Yes
5.   Yes, Yes, Yes, Yes, No, No, Yes, Yes -> Yes
6.   Yes, Yes, Yes, Yes, No, Yes, No, No  -> No
7.   Yes, Yes, Yes, Yes, No, No, Yes, Yes -> No
8.   Yes, Yes, Yes, Yes, No, No, No, No   -> No
9.   Yes, Yes, Yes, Yes, No, No, No, No   -> No
10.  No, No, No, No, No, No, No, No       -> No

* Determine k = the number of nearest neighbors that gives a better prediction result.
* Increase the 'k' value and check the prediction. Is it good or bad to have a smaller or larger 'k' value?
* Apply a proper similarity measure [asymmetric binary features] and predict the test result of the instance [Fever = Yes, Dry Cough = Yes, Tiredness = Yes, Sore Throat = Yes, Diarrhea = No, Headache = No, Loss of Taste or Smell = No, Shortness of Breath = No, Chest Pain = No].

10. What is meant by locally weighted regression?

Crossword
Down
1. The entity or an example in the dataset is also known as an _____.
2. k-NN is a parametric method. (Yes/No)
3. Instance-based learning methods are memory-based methods. (Yes/No)
4. The Nearest Centroid classifier is also known as the _____ difference classifier.
5. The Weighted k-NN algorithm assigns a higher weight to the closest neighbors. (Yes/No)
6. Euclidean distance is the most popular method for finding neighbors. (Yes/No)
Across
5. Locally weighted regression is a non-parametric method. (Yes/No)
6. Majority voting is used to determine the class among neighbors in the k-NN algorithm. (Yes/No)
7. k-NN is an example of _____-based learning.
8. Model-based methods are also known as _____ learning.
Answers: Yes, No, Yes, Yes, Yes, Mean, Instance, Instance, Eager, Yes
Chapter 5
Regression Analysis

"Regression analysis is the hydrogen bomb of the statistics arsenal." - Charles Wheelan, Naked Statistics: Stripping the Dread from the Data

Regression analysis is a supervised learning method for predicting continuous variables. The difference between classification and regression analysis is that regression methods are used to predict quantitative variables or continuous numbers, unlike categorical variables or labels. It is used to predict linear or non-linear relationships among the variables of the given dataset. This chapter deals with an introduction to regression and its various types.

Learning Objectives
* Understand the basics of regression analysis
* Introduce the concepts of correlation and causation
* Learn about linear regression and its validation techniques
* Discuss multiple linear regression
* Introduce logistic regression
* Study the concept of regularization
* Study popular regression methods like Ridge, Lasso, and Elastic Net

5.1 INTRODUCTION TO REGRESSION

Regression analysis is the premier method of supervised learning. It is one of the most popular and oldest supervised learning techniques. Given a training dataset D containing N training points (x_i, y_i), where i = 1...N, regression analysis is used to model the relationship between one or more independent variables x_i and a dependent variable y_i. The relationship between the dependent and independent variables can be represented as a function as follows:
    y = f(x)                                                               (5.1)
Here, the feature variable x is also known as an explanatory variable, an exploratory variable, a predictor variable, an independent variable, a covariate, or a domain point. y is the dependent variable. Dependent variables are also called labels, target variables, or response variables.

Regression analysis determines the change in the response variable when one exploratory variable is varied while keeping all other parameters constant. This is used to determine the relationship that each of the exploratory variables exhibits. Thus, regression analysis is used for prediction and forecasting. Regression is used to predict continuous or quantitative variables such as price and revenue. Thus, the primary concern of regression analysis is to find answers to questions such as:
1. What is the relationship between the variables?
2. What is the strength of the relationships?
3. What is the nature of the relationship, such as linear or non-linear?
4. What is the relevance of the attributes?
5. What is the contribution of each attribute?

There are many applications of regression analysis. Some of the applications of regression include predicting:
1. Sales of goods or services
2. Value of bonds in portfolio management
3. Premium on insurance policies
4. Yield of crops in agriculture
5. Prices of real estate

5.2 INTRODUCTION TO LINEARITY, CORRELATION, AND CAUSATION

The quality of the regression analysis is determined by factors such as correlation and causation.

Regression and Correlation
Correlation between two variables can be assessed effectively using a scatter plot, which is a plot of explanatory variables against response variables. It is a 2D graph showing the relationship between two variables. The x-axis of the scatter plot shows the independent, input, or predictor variable, and the y-axis shows the output, dependent, or predicted variable. The scatter plot is useful in exploring data. Some scatter plots are shown in Figure 5.1. The Pearson correlation coefficient is the most common test for determining whether there is an association between two variables. The correlation coefficient is denoted by r. Correlation is discussed in Chapter 2 of this book. Positive, negative, and random correlations are shown in Figure 5.1. In positive correlation, a change in one variable is associated with a change in the other variable in the same direction. In negative correlation, the relationship between the variables is reciprocal, while in random correlation, no relationship exists between the variables.

Figure 5.1: Examples of (a) Positive Correlation (b) Negative Correlation (c) Random Points with No Correlation
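For readers who want to compute r directly, the following short Python sketch (the helper function and the sample data are our own, chosen only for illustration) evaluates the Pearson correlation coefficient for a small, strongly positively correlated dataset.

```python
import math

def pearson_r(x, y):
    # Pearson correlation coefficient: covariance of x and y divided by
    # the product of their standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

weeks = [1, 2, 3, 4, 5]
sales = [1.2, 1.8, 2.6, 3.2, 3.8]          # the data later used in Example 5.1
print(round(pearson_r(weeks, sales), 3))   # ~0.999: strong positive correlation
```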
Regression Analysis + 133 Types of Regression Methods The classification of regression methods is shown in Figure 5.3. Regression methods Linear regression Nor-linear Logical methods regression regression Single linear Multiple linear L] Polynomial regression regression regression Figure 5.3: Types of Regression Methods Linear Regression It is a type of regression where a line is fitted upon given data for finding the linear relationship between one independent variable and one dependent variable to describe relationships. Multiple Regression It is a type of regression where a line is fitted for finding the linear relationship between two or more independent variables and one dependent variable to describe relationships among variables. Polynomial Regression It is a typé of non-linear regression method of describing relation- ships among variables where N® degree polynomial is used to model the relationship between one independent variable and one dependent variable, Polynomial multiple regression is used to model two or more independent variables and one dependant variable. Logistic Regression It is used for predicting categorical variables that involve one or more independent variables and one dependent variable. This is also known as a binary classifier. Lasso and Ridge Regression Methods These are special variants of regression method where regularization methods are used to limit the number and size of coefficients of the independent variables. Limitations of Regression Method 1, Outliers — Outliers are abnormal data. It can bias the outcome of the regression model, as outliers push the regression line towards it. 2. Number of cases - The ratio of independent and dependent variables should be at least 20 : 1. For every explanatory variable, there should be at least 20 samples. Atleast five samples are required in extreme cases. 3. Missing data - Missing data in training data can make the model unfit for the sampled data. 4, Multicollinearity — If exploratory variables are highly correlated (0.9 and above), the regression is vulnerable to bias. Singularity leads to perfect correlation of 1. The remedy is to remove exploratory variables that exhibit correlation more than 1. If there is a tie, then the tolerance (1—R squared) is used to eliminate variables that have the greatest value. 434 6 Machine Learning $$ 5.3 INTRODUCTION TO LINEAR REGRESSION Inthe simplest form, the linear regression model can be created by fitting a line among the scattered data points. The line is of the form given in Eq. (5.2). y=ayta,xxte (62) Here, a, is the intercept which represents the bias and a, represents the slope of the line. These are called regression coefficients. ¢ is the error in prediction. ‘The assumptions of linear regression are listed as follows: 1. The observations (y) are random and are mutually independent. 2. The difference between the predicted and true values is called an error. The error is also mutually independent with the same distributions such as normal distribution with zero mean and constant variables. 3, The distribution of the error term is independent of the joint distribution of explanatory variables. 4, The unknown parameters of the regression models are constants. The idea of linear regression is based on Ordinary Least Square (OLS) approach. This method is also known as ordinary least squares method. In this method, the data points are modelled using a straight line, Any arbitrarily drawn line is not an optimal line. 
In Figure 5.4, three data points and their errors (e,, ¢y ¢,) are shown. The vertical distance between each point and the line (predicted by the approximate line equation y = 4, + 4,x) is called an error. These individual errors are added to compute the total error f the predicted line. This is called sum of residuals. The squares of the individual errors can also be computed and added to give a sum of squared error. The line with the lowest sumvof squared error is called line of best fit. y-axis Figure 5.4: Data Points and their Errors In another words, OLS is an optimization technique where the difference between the data points and the line is optimized. Mathematically, based on Eq. (5.2), the line equations for points (x, X, -...X,) are: =@, 44x) +e, Ha Gtax) te (a, + 4,x,) +e, 63) In general, the exror is given as: ¢,= y,— (a, 4,x) 4) This can be extended into the set of equations as shown in Eq. (5.3). Regression Analysis + 135 Here, the terms (¢,,¢, ...,€,) ate error associated with the data points and denote the difference between the true value of the observation and the point on the line. This is also called as residuals. The residuals can be positive, negative or zero. A regression line is the line of best fit for which the sum of the squares of residuals is minimum. The minimization can be done as minimization of individual errors by finding the parameters a, and a, such that: E= De - Ey, — (a, +43) (6.5) Or as the minimization of sum of absolute values of the individual errors: E=Blel= Ely, -@ +a] 69) Or as the minimization of the sum of the squares of the individual errors: E= See = BU, (4, +43) 67 Sum of the squares of the individual errors, often preferred as individual errors (positive and negative errors), do not get cancelled out and are always positive, and stim of squares results in a large increase even for a small change in the error. Therefore, this is preferred for linear regression. Therefore, linear regression is modelled as a minimization function as follows: Ly, — f@)P By, +03 )F 68) Here, J(ay a,) is the criterion function of parameters a, and a,, This needs to be minimized. This is done by differentiating and substituting to zero. This yields the coefficient values of a, and a,. The values of estimates of a, and a, are given as follows: E)- OY am (5.9) " @2)-@F @°) And the value of a, is given as follows: a,=(@)—4,x¥ (6.10) Let us consider a simple problem to illustrate the usage of the above concept. SSS [EER aGERA Let us consider an example where the five weeks’ sales data (in Thousands) is given as shown below in Table 5.1. Apply linear regression technique to predict the 7 and 9* month sales. Table 5.1: Sample Data (Sales in Thousands) 1 12 2 18 3 26 4 3 32 38 136 + Machine Learning Solution: Here, there are 5 items, ie. i = 1, 2, 3, 4, 5. The computation table is shown below (Table 5.2). Here, there are five samples, so i ranges from 1 to 5. Table 5. ‘omputation Table La Ce baa 1 12 1 12 2 18 4 36 3 26 9 78 4 32 16 128 5 38 2 19 Sum=15 Sum = 126 Sum =55 Sum = 44.4 Average of (x,) Average of (y,) Average of (x7) Average of (x, xy.) 15 5-126 = we its 3 aes oe 3 oS =3 = =n =BE Let us compute the slope and intercept now using Eq. (5.9) as: 9, 888-3252) _ oe * +3 ‘ 4, = 2.52 - 0.663 = 0.54 The fitted line is shown in Figure 5.5. a 3S Y-Dependent 1 2 3 4 5 2Xcindependent —Regression line (7 = 0.66x + 0.54) Figure 5.5: Linear Regression Model Constructed Let us model the relationship as y = 4, +, xx. 
Therefore, the fitted line for the above data is: y = 0.54 40.66% -x. ‘The predicted 7” week sale would be (when x =7), y= 0.54 + 0.66 x 7= 0.54 + 0.66 x 12 =8.46. All sales are in thousands. 5-16 and the 12" month, Scan for ‘Additional Examples’ Regression Analysis + 137 Linear Regression in Matrix Form Matrix notations can be used for representing the values of independent and dependent variables. This is illustrated through Example 5.2. The Eq. (5.3) can be written in the form of matrix as follows: 11) This can be written as: Y= Xa+e, where X is ann x 2 matrix, Y is ann 1 vector, a is a2 x1 column vector and eis an nx 1 column vector. ee OOSSSS—S—————S—S—SS—SSS—— Find linear regression of the data of week and product sales (in Thousands) given in Table 5.3. Use linear regression in matrix form. Table 5.3: Sample Data for Regression a of (Week) || (Product Sales in Thousands), 1 wfe lola 2 3 4 Solution: Here, the dependent variable X is be given as: x7=[1234] And the independent variable is given as follows: yi=[13.48] ‘The data can be given in matrix form as follows: 3 The frst column can be used for setting bias. and Y= Cn) The regression is given as: a= (XP XY The computation order of this equation is shown step by step as: 138 » Machine Learning 14 1. Computation of (xtx)=(1111),]12}_(4 10 1234) |13] (10 30 14 410) (15 05 2. Computation of matrix inverse of (XTX) (i a) = (3 50 a 15 -05) (1111) (1 05 0 -o5 -05 02) (1234) |-03 -01 01 03 1 ji 1 05 0 -05) |3}_(-15)(Intercept 4, Finally, ((X™X)"'X)Y. = inally, CON {its 0.1 0.1 a) 4 (ak slope } 3. Computation of ((X™X)*X") 8 Thus, the substitution of values in Eq. (9.11) using the previous steps yields the fitted line as 22x-15. eee 5.4 VALIDATION OF REGRESSION METHODS The regression model should be evaluated using some metrics for checking the correctness. The following metrics are used to validate the results of regression. Standard Error Residuals or error is the difference between the actual (y) and predicted value (9). If the residuals have normal distribution, then the mean is zero and hence it is desirable. This is a measure of variability in finding the coefficients. It is preferable that the error be less than the coefficient estimate. The standard deviation of residuals is called residual standard error. If it is zero, then it means that the model fits the data correctly. Mean Absolute Error (MAE) MAE is the mean of residuals. It is the difference between estimated or predicted target value and actual target incomes. It can be mathematically defined as follows: (6.12) Here, jj is the estimated or predicted target output and y is the actual target output, and 1 is the number of samples used for regression analysis. Mean Squared Error (MSE) It is the sum of square of residuals. This value is always positive and closer to 0. This is given mathematically as: ee oe EU, -5P 13) Regression Analysis « 139 Root Mean Square Error (RMSE) The square root of the MSE is called RMSE. This is given as: (6.14) Relative MSE Relative MSE is the ratio of the prediction ability of the to the average of the trivial population. The value of zero indicates that the model is perfect and its value ranges between 0 and 1. If the value is more than 1, then the created model is not a good one. 
This is given as follows: 2-97 RelMSE = =~ (6.15) Eu, -97 Coefficient of Variation Coefficient of variation is unit less and is given as: cv= “S (6.16) ee ———sSeSSsSsSSSS—C—sSC—S FER IEGEN Consider the following training set Table 5.4 for predicting the sales of the items. Table 5.4: Training Item Table Tee) | VSSTETRE Sy Cuter 63) * a tn 80. by 90. 4 100 L, 10 I, 120 Consider two fresh items I, and I,, whose actual values are 80 and 75, respectively. A regression model predicts the values of the items I, and I, as 75 and 85, respectively. Find MAE, MSE, RMSE, RelMSE and CV. Solution: The test items' actual and prediction is given in Table 5.5 as: Table 5.5: Test Item Table Tends Peer gates Wy Yi ui i 80 B L % B Mean Absolute Error (MAE) using Eq. (5.12) is given as: 1 | = 15 MAE —73| + |75 — 85] = — x[p0—75|+ 5 —as|= 2 ‘Mean Squared Error (MSE) using Eq. (5.13) is given as: 2 p _ 125 + 75-85 => = 625 75 1 MSE = 5 x|80-7: 140 + Machine Learning Root Mean Square error using Eq. (5-14) is given as: RMSE = ¥MSE = 625 =7.91 For finding RelMSE and CV, the training table should be used to find the average of y. The average of y is aes ~ 100. RelMSE using Eq. (5.15) can be computed as: (80-75) +(75 85) _ 125, ReIMSE = 1.1219 ° (80 — 100)" + (75 — 100y CV can be computed using Eq, (5.16) as SC“? Coefficient of Determination To understand the coefficient of determination, one needs to understand the total variation of coefficients in regression analysis. The sum of the squares of the differences between the y-value of the data pair and the average of y is called total variation. Thus, the following variations can be defined. The explained variation is given as: LG IY G17) The unexplained variation is given as: =Xy,- 9) (5.18) Thus, the total variation is equal to the explained variation and the unexplained variation. The coefficient of determination r? is the ratio of the explained and total variations. Explained variation Total variation 2 6.19) Itis{a measure of how many future samples are likely to be predicted by the regression model. Its value ranges from 1 to -«, where 1 is the most optimum value. It also signifies the proportion of variance. Here, r is the correlation coefficient. If r = 0.95, then r? is given as 0.95 x 0.95 = 0.9025. This means that 90% of the model can be explained by the relationship between x and y. The rest 10% is unexplained and that may be due to various reasons such as noise, chance, or error. Standard Error Estimate Standard error estimate is another useful measure of regression. It is the standard deviation of the observed values to the predicted values. This is given as: (5.20) Here, as usual, y, is the observed value and §, is the predicted value. Here, n is the number of samples. Regression Analysis + 141 Let us consider the data given in the Table 5.3 with actual and predicted values. Find standard error estimate. Solution: The observed value or the predicted value is given below in Table 5.6. Table 5.6: Sample Data 1] 1s 146 (1.5 — 1.46)-= 0.0016 2) 29 2.02 (29 -2.02P =0.7744 3 [27 258 (2.7 - 2.58) = 0.0144 4 | 3a 3.14 (B.1-3.14)= 0.0016 The sum of (y— 9)? for all i=1, 2,3 and 4 (i.e,, number of samples 1 =4) is 0.792. The standard deviation error estimate as given in Eq. (5.20) 0.782 _ 10396 - 0.629 4-2 5.5 MULTIPLE LINEAR REGRESSION Multiple regression model involves multiple predictors or independent variables and one dependent variable. This is an extension of the linear regression problem. 
The basic assumptions of multiple linear regression are that the independent variables are not highly correlated and hence multicollinearity problem does not exist. Also, it is assumed that the residuals are normally distributed. For example, the multiple regression of two variables x, and x, is given as follows: y= fy 4) =a, +4,%, 40%, 21) In general, this is given for ‘n’ independent variables as: Y= fy Xp Ky oH) Hay $OX HO, +. HOT, HE 622) Here, (x x, --- , x,) are predictor variables, y is the dependent variable, (ay a,, ..- , 4,) are the coefficients of the regression equation and ¢ is the error term. This is illustrated through Example 55. o_O [Beets] Apply multiple regression for the values given in Table 5.7 where weekly sales along with sales for products x, and x, are provided. Use matrix approach for finding multiple regression. Table 5.7: Sample Data ea 7 (Product One Sales)|||(Product Two Sales)|| Output Weekly Sales (in Thousands) 1 4 1 2 5 6 3 8 8 4 2 2 142» Machine Learning Solution: Here, the matrices for Y and X are given as follows: 114 1 125 x= and Y = 138 142 12 The coefficient of the multiple regression equation is given as a= ‘The regression coefficient for multiple regression is calculated the same way as linear regression: &= (XTX XY (623) Using Eq. (5.23), and substituting the values (Similar to Problem 5.2), one gets @ as: 114 111i. Het a =|[1234 x]1234 138 4582 4582 142 1.69 =| 348 0.05 Here, the coefficients are @,=~1.69, a, ~ 3.48 and a, ~ -0.05. Hence, the constructed model is: 1.69 +3.48x, — 0.05x, fA _, 5.6 POLYNOMIAL REGRESSION If the relationship between the independent and dependent variables is not linear, then linear regression cannot be used as it will result in large errors. The problem of non-linear regression can be solved by two methods: 1. Transformation of non-linear data to linear data, so that the linear regression can handle the data 2. Using polynomial regression Transformations ‘The first method is called transformation. The trick is to convert non-linear data to linear data that can be handled using the linear regression method. Let us consider an exponential function y= ae. The transformation can be done by applying log function to both sides to get: Iny=bx+Ina (6.24) Regression Analysis + 143 Similarly, power function of the form (y = ax") can be transformed by applying log function on both sides as follows: log oy = blog, «¢ + log, 625) Once the transformation is carried out, linear regression can be performed and after the results are obtained, the inverse functions can be applied to get the desire result. Polynomial Regression Itcan handle non-linear relationships among variables by using n' degree of a polynomial. Instead of applying transforms, polynomial regression can be directly used to deal with different levels of curvilinearity. Polynomial regression provides a non-linear curve such as quadratic and cubic. For example, the second-degree transformation (called quadratic transformation) is given as: y—4,+a,x+a,2° and the third-degree polynomial is called cubic transformation given as: y = a, #a,x + ax? + ax. Generally, polynomials of maximum degree 4 are used, as higher order polynomials take some strange shapes and make the curve more flexible. It leads to a situation of overfitting and hence is avoided. Let us consider a polynomial of 2“ degree. Given points (x, ¥,), &y Yr + &y ¥,)- the objective is to fit a polynomial of degree 2. 
The polynomial of degree 2 is given as: yrataxtae (6.26) Such that the error E = Fly, (a, +4,2;+4,x;)P is minimized. The coefficients ay a, a, of Eq. (6.26) can be obtained by taking partial derivatives with respect to each of the coefficients as ae 2 2 and substituting it with zero. This results in 2+ 1 equations given as follows: , a, (627) The best line is the line that minimizes the error between line and data points. Arranging the coefficients of the above equation in the matrix form results in: n Ex, Ex Ta, Ly, Ex, Ex Bea, |=] 0,4) 628) Ex? Ex Ext fle] [Z02,y) This is of the form Xa=B. One can solve this equation for a as: a=X"B (6.29) e paetih)ixo4 Consider the data provided in Table 58 and fit it using the second-order polynomial. 144» Machine Learning Table 5.8: Sample Data 4 & 15 Solution: For applying polynomial regression, computation is done as shown in Table 5.9. Here, the order is 2 and the sample i ranges from 1 to 4. Table 5.9: Computation Table eA | SA LaA ae LHL oi 1 1 1 1 1 1 1 2 4 8 4 16 8 16 3 9 7 9 al 7 a 4 15 60. 16 240 64 256. ¥ax,=10 | ¥y,=29 | Yxy,=96 |Ex? =30] Derry, = 338 | x= 100 | Sx! = 354 It can be noted that, N = 4, Dy, = 29, Exy,=96, Dx?y, = 338. When the order is 2, the matrix using Eq. (5.28) is given as follows: 4 10 30][a,] [29 10, 30.100 =| 96 30 100 354|/4,] [338 Therefore, using Eq, (5.29), one can get coefficients as: a|-[4 10 30]" [29] (-075 a,|=]10 30 100] x} 96 |=] 0.95). a,| {30 100 a4} {338} | 075 This leads to the regression equation using Eq. (5.26) as: ¥=-0.75 + 0.95 x+0.75 SSS? 5.7 LOGISTIC REGRESSION Linear regression predicts the numerical response but is not suitable for predicting the categorical variables. When categorical variables are involved, it is called classification problem. Logistic regression is suitable for binary classification problem. Here, the output is often a categorical variable. For example, the following scenarios are instances of predicting categorical variables. 1, Is the mail spam or not spam? The answer is yes or no. Thus, categorical dependant variable is a binary response of yes or no. 2. If the student should be admitted or not is based on entrance examination marks, Here, categorical variable response is admitted or not. 3, The student being pass or fail is based on marks secured. Regression Analysis + 145 Thus, logistic regression is used asa binary classifier and works by predicting the probability of the categorical variable. In general, it takes one or more features x and predicts the response y. If the probability is predicted via linear regression, it is given as: P(x)=ay + a,x Hence, logistic regression tries to model the probability of the particular response variable. In email classification problem, say normal email or spam, if the probability of the response variable is 0.7, then there is a 70% possibility of a normal mail. Linear regression generated value is in the range —« to +e, whereas the probability of the response variable ranges between 0 and 1. Hence, there must be a mapping function to map the value -© to 40 to 0-1. The core of the mapping function in logistic regression method is sigmoidal function. A sigmoidal function is a’S’ shaped function that yields values between Qand.1. This is known as logit function. This is mathematically represented as: logit(x) = (530) le Here, xis the independent variable and e is the Euler number. The purpose of the logit function is to map any real number to zero or 1. 
Logistic regression can be viewed as an extension of linear regression, but the only difference is that the output of linear regression can be an extremely high number. This needs to be mapped into the range 0-1, as probability can have values only in the range 0-1. This problem is solved using log odd or logit functions. What is the difference between odds and probability? Odds and probability (or likelihood) are two sides of a coin and represent uncer- tainty. The odds are defined as the ratio of the probability of an event and probability of an event that is not happening. Thisis given probability of an event = mrty oan event (5.31) ot ~ ohabilily of an non —event 63) Log-odds can be taken for the odds, resulting in: p(x) oe =atax (6.32) Here, log(,) is a logit function or log odds function. One can solve for p(x) by taking the inverse of the above function as: exp(a, + 4,x) P(e) = Pe 633) “+ exp(@, + 4,2) This is the same sigmoidal function. It always gives the value in the range 0-1. Dividing the numerator and denominator by the numerator, one gets: 1 P(e) = ——————_ 6.34) rexpea, — 43) ‘One can rearrange this by taking the minus sign outside to get the following logistic function: 1 Pz) = Trept, bax) (6.35) 146» Machine Learning Here, x is the explanatory or predictor variable, ¢ is the Euler number, and ay a, are the regression coefficients. The coefficients 4, a, can be learned and the predictor predicts p(x) directly using the threshold function as: {, if p(x) > 05 0 otherwise (636) a —— FEERyaGEeA Let us assume a binomial logistic regression problem where the classes are pass and fail. The student dataset has entrance mark based on the historic data of those who are selected or not selected. Based on the logistic regression, the values of the learnt parameters are ¢,= 1 and a, =8. Assuming marks of x=60, compute the resultant class. Solution: The values of regression coefficients are a, = 1 and a, = 8, and given that x= 60. Based on the regression coefficients, z can be computed as: z=a, tax +8x60=481 One can fit this in a sigmoidal function using Eq. (5.30) to get the probability as: yn __._1_ 1+ exp(481) 2271 If we assume the threshold value as 0.5, then it is observed that 0.44 < 05, therefore, the candidate with marks 60 is not selected LM To determine the relationship between dependant and independent variables, parameters need to be obtained. In logistic regression, the parameters are obtained through maximum likelihood function (MLE) using the training data. The aim is to lean the values of parameters of the logisti model (¢’s) by minimizing the error in the probability predicted by the model. There can be many different sets of coefficients available. The optimal value of the parameters is obtained by using the MLE function, which is a set of coefficients for which the probability of getting the observed data is maximum. If mis the success of the outcome and 1 — ris the failure of the outcome, then the likelihood function is given as: ueey i fa " J (-m) (637) 1-2, To determine the value of the parameters, the log of the likelihood function is taken. Techniques like Newton method can be used to maximize the log-likelihood of the function. Logistic regression is suitable for binary classification. The idea can be extended for multiple classes called multinomial logistic regression. Let us assume that there are three classes 1, 2 and 3. 
Then, the multinomial logistic regression creates three classification problems — class 1 and Not class 1, class 2 and Not class 2, and finally class 3 and Not class 3. Three problems are simultaneously used to find the maximum probability relative to others to get the appropriate class. Logistic regression is a simple and efficient method for binary classification. The model can be easily interpreted too. The disadvantages of logistic regression are that multinomial logistic regression cannot handle many attributes and can handle only linear features. Also, if all the attributes have multicollinearity problem, the logistic method does not work effectively.
