Unit – III
Introduction to Machine Learning:
Definition:
• Machine Learning is a branch of computer science that deals with programming
systems so that they learn automatically from data.
• It contains a set of algorithms that work on a huge amount of data.
• Example: Handwriting recognition, Robot driving
Classification:
• Machine learning implementations are classified into four major categories,
depending on the nature of the learning “signal” or “response”:
✓ Supervised Learning
✓ Unsupervised Learning
✓ Reinforcement Learning
✓ Semi-supervised Learning
Supervised Learning:
• Supervised Learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.
• The given data is labeled.
• Both Classification and Regression problems are supervised learning problems.
• For example, the inputs could be camera images, each one accompanied by an
output saying “bus” or “pedestrian,” etc.
• An output like this is called a label.
• Classification: Classification algorithms are used to solve classification
problems in which the output variable is categorical, such as “Yes” or “No”. Some
popular classification algorithms are the Random Forest Algorithm, Decision
Tree Algorithm, and Logistic Regression Algorithm.
• Regression: Regression algorithms are used to solve regression problems in
which the output variable is a continuous numeric value, such as sales or salary.
Some popular regression algorithms are the Simple Linear Regression Algorithm
and the Decision Tree Algorithm.
Advantages of supervised learning:
• Works on labeled datasets, so the model can learn from known outputs
• Helpful in predicting the output for new data
Disadvantages of supervised learning:
• Not well suited to very complex tasks.
• May predict the wrong output if the test data differs from the training data.
Applications of supervised learning:
• Image segmentation
• Medical Diagnosis
• Fraud Detection
Unsupervised Learning:
• Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses.
• The machine is trained using an unlabeled dataset.
• Both Clustering and Association problems are unsupervised learning problems.
• For example, when shown millions of images taken from the Internet, a
computer vision system can identify a large cluster of similar images which an
English speaker would call “cats.”
• Clustering: It is an unsupervised method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no
similarities with the objects of another group.
• Association: Association rule learning is an unsupervised learning method which
finds the relationships between variables in a large database.
Advantages of unsupervised learning:
• Used for complicated tasks.
Disadvantages of unsupervised learning:
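• Output may be less accurate, because there are no labels to check it against
• Harder to work with than supervised learning, because the data is unlabeled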
Applications of unsupervised learning:
• Network analysis
Reinforcement Learning:
• Reinforcement learning is the problem of getting an agent to act in the world
so as to maximize its rewards.
• In reinforcement learning the agent learns from a series of reinforcements:
rewards and punishments.
• For example, consider teaching a dog a new trick: we cannot tell it
what to do or what not to do, but we can reward or punish it if it does the
right or wrong thing.
Semi-supervised Learning:
• Semi-Supervised learning is a type of Machine Learning algorithm that
represents the intermediate ground between Supervised and Unsupervised
learning algorithms.
• It uses a combination of labeled and unlabeled datasets during the training
period.
Linear Regression Models:
• Regression is used for machine learning problems in which the output is a
continuous number.
• Linear Regression is a linear approach to modeling the relationship between a
dependent variable and one or more independent variables.
• It is one of the easiest and most popular machine learning algorithms.
• It makes predictions for continuous/real or numeric variables such as sales,
salary, age.
• The linear regression algorithm shows a linear relationship between a dependent (y)
and one or more independent (x) variables, hence the name linear regression.
• The linear regression model provides a sloped straight line representing the
relationship between the variables.
• Let X be the independent variable and Y be the dependent variable.
• A linear relationship between these two variables is written as: Y = mX + c
Where m: slope, c: y-intercept
Types of linear regression:
Simple Linear Regression: If a single independent variable is used to predict
the value of a numerical dependent variable, then such a Linear Regression
algorithm is called Simple Linear Regression.
Equation: Y = b0 + b1x
Multiple Linear Regression: If more than one independent variable is used to
predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Multiple Linear Regression.
Equation: Y = b0 + b1x1 + b2x2 + … + bnxn
Least Squares Regression:
• Least squares is a commonly used method in regression analysis.
• The Least squares method is a form of mathematical regression analysis used to
determine the line of best fit for a set of data.
• The Least Squares Regression line is written as Y’ = bX + a
Where Y’ represents the predicted value,
X represents the known value,
b represents the slope and a represents the y-intercept of the line
• Steps to implement Least Squares Regression in Python are listed below:
Step 1: Import the required Python libraries such as numpy and pandas
Step 2: Read and load the dataset
Step 3: Create a scatter plot to check the relationship between the two variables
Step 4: Assign X and Y as the independent and dependent variables
Step 5: Compute the means of X and Y, which are needed to determine the slope
Step 6: Calculate the slope (m) and y-intercept (c) using the formulas
m = Σ (x - mean(X)) (y - mean(Y)) / Σ (x - mean(X))²
c = mean(Y) - m × mean(X)
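Steps 4 to 6 can be sketched in plain Python as follows. This is a minimal illustration, not the notes' own implementation: the function name `least_squares_fit` and the five data points are assumptions made up for the example, and loading and plotting the data (Steps 1 to 3) are left out.

```python
def least_squares_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: sum of cross-deviations divided by the sum of squared x-deviations.
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    # The fitted line always passes through the point (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = least_squares_fit(xs, ys)
print(slope, intercept)
```

Because the sample points lie exactly on a line, the fit should recover that line's slope and intercept; with noisy data it returns the line that minimizes the sum of squared vertical errors.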
Bayesian Linear Regression:
• Bayesian Regression is used when the data is insufficient in the dataset or the
data is poorly distributed.
• The output of a Bayesian Regression model is obtained from a probability
distribution.
• The aim of Bayesian Linear Regression is to find the ‘posterior’ distribution for
the model parameters.
• The expression for Posterior is:
Posterior = (Likelihood × Prior) / Normalization
i.e., P(H | E) = P(E | H) × P(H) / P(E)
Where,
Posterior: It is the probability that an event H occurs given that another
event E has already occurred, i.e., P(H | E).
Prior: It is the probability that event H occurs before the event E is
observed, i.e., P(H).
Likelihood: It is the probability of observing the event E given that H has
occurred, i.e., P(E | H).
Normalization: It is the overall probability of the event E, i.e., P(E).
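• Bayesian Ridge Regression places a zero-mean Gaussian prior on the coefficients:
p(w | lambda) = N(w | 0, lambda^-1 I)
where ‘w’ is the coefficient vector made up of the elements w0, w1, ...,
‘lambda’ is the precision (inverse variance) of the prior on w,
and ‘y’, the expected value, is modeled as Gaussian around the linear prediction.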
• Implementation steps of Bayesian Regression using Python are listed below:
Step 1: Import the required Python libraries
Step 2: Load the dataset
Step 3: Split the dataset into training and testing sets
Step 4: Create and train the model
Step 5: Predict the test data with the model
Step 6: Evaluate the r2 score of the model against the test dataset
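The idea of a posterior over model parameters can be illustrated with a deliberately small case that has a closed form. The sketch below is not the library workflow of the steps above: it assumes a one-weight model y = w × x with no intercept, a Gaussian prior on w with precision `alpha`, and Gaussian noise with precision `beta`. The names `alpha`, `beta`, and `posterior_for_slope`, and the data, are hypothetical.

```python
def posterior_for_slope(xs, ys, alpha=1.0, beta=25.0):
    """Posterior mean and variance of w for the model y = w * x + noise.

    Assumes prior w ~ N(0, 1/alpha) and noise ~ N(0, 1/beta).
    """
    # Posterior precision = prior precision + noise precision * sum(x^2).
    precision = alpha + beta * sum(x * x for x in xs)
    mean = beta * sum(x * y for x, y in zip(xs, ys)) / precision
    variance = 1.0 / precision
    return mean, variance


# Hypothetical observations scattered around y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

prior_mean, prior_var = posterior_for_slope([], [])   # no data: posterior equals prior
post_mean, post_var = posterior_for_slope(xs, ys)
print(prior_mean, prior_var)
print(post_mean, post_var)
```

With no data the function returns the prior itself; as observations are added, the posterior mean moves toward the slope suggested by the data and the variance shrinks, which is why the approach remains usable on small datasets.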
Advantages:
✓ Very effective when the size of the dataset is small.
✓ Robust
Disadvantages:
✓ Time consuming
✓ Not worthwhile for large amounts of data
Gradient Descent:
• Gradient descent is an iterative optimization algorithm that finds the best-fit
line for a given training dataset by repeatedly adjusting the slope m and the
intercept c to reduce the cost.
• If the MSE is plotted against m and c, the resulting surface has a bowl shape,
and gradient descent moves step by step toward the bottom of the bowl.
• Cost Function: The cost is the error in the predicted values. It is calculated
using the Mean Squared Error (MSE) function.
• Learning Rate: The learning rate L is a scalar factor that controls the size of
each update. The coefficients are updated in the direction that reduces the
error.
• The steps are listed below
Step 1: Initially, let m = 0 and c = 0.
Step 2: Calculate the partial derivative of the loss function with respect to m
to get the derivative Dm.
Step 3: Similarly, find the partial derivative with respect to c, Dc.
Step 4: Update the current values of m and c using the following equations,
where L is the learning rate:
m = m - L × Dm
c = c - L × Dc
Step 5: Repeat this process until the cost function is very small.
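A minimal sketch of these five steps, assuming the Mean Squared Error cost, for which Dm = (-2/n) Σ x (y - (m x + c)) and Dc = (-2/n) Σ (y - (m x + c)). The learning rate, the fixed iteration count used in place of a "cost is very small" check, and the data are assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = m * x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0                       # Step 1: start from m = 0, c = 0
    n = len(xs)
    for _ in range(iterations):           # Step 5: repeat the update
        errors = [y - (m * x + c) for x, y in zip(xs, ys)]
        d_m = (-2.0 / n) * sum(x * e for x, e in zip(xs, errors))   # Step 2: Dm
        d_c = (-2.0 / n) * sum(errors)                               # Step 3: Dc
        m -= learning_rate * d_m          # Step 4: m = m - L * Dm
        c -= learning_rate * d_c          #         c = c - L * Dc
    return m, c


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
m, c = gradient_descent(xs, ys)
print(m, c)
```

If the learning rate is set too high for the data, the updates overshoot and the values of m and c diverge instead of settling; too low a value makes convergence slow.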
Linear Classification Models:
• The Classification algorithm is a Supervised Learning technique that is used to
identify the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and
then classifies new observations into a number of classes or groups, such as
Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
• Classes are also called targets, labels, or categories.
• The output variable of Classification is a category, not a value, such as "Green
or Blue", "fruit or animal", etc.
• A well-known example of an ML classification algorithm is an Email Spam Detector.
• The goal of the classification algorithm is to take a D-dimensional input vector
x and assign it to one of K discrete classes Ck , k = 1, . . . , K.
• There are two types of classification problems: two-class problems and
multi-class problems.
• Two-class problems (Binary Classification): If the classification problem
has only two possible outcomes, then the classifier is called a binary classifier.
Example: Yes or No, Male or Female, Cat or Dog.
• Multi-class problems: If a classification problem has more than two
outcomes, then the classifier is called a multi-class classifier. Example:
Classification of types of music.
Types of ML Classification Algorithms:
✓ Logistic Regression
✓ K-Nearest Neighbors
✓ Support Vector Machines
✓ Kernel SVM
✓ Naïve Bayes
✓ Decision Tree classification
✓ Random Forest classification
Discriminant Function:
• A discriminant function is a function of a set of variables that is evaluated for
samples of events or objects and used as an aid in discriminating between or
classifying them.
• A discriminant function (DF) maps independent (discriminating) variables into
a latent variable D.
• DF is usually postulated to be a linear function: D = a0 + a1x1 + a2x2 + ... + aNxN
• The goal of discriminant analysis is to find values of the coefficients a0, ..., aN
that best separate the classes.
• Whenever there is a requirement to separate two or more classes having multiple
features efficiently, the Linear Discriminant Analysis model is considered the
most common technique to solve such classification problems.
• For example, suppose there are two classes with multiple features that need to be
separated efficiently. If they are classified using only a single feature, the classes
may overlap.
• To overcome the overlapping issue in the classification process, the number of
features used must be increased.
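• A discriminant function that is a linear combination of the components of x can
be written as
g(x) = w^T x + w0
where w is the weight vector and w0 is the bias (threshold weight).
• Written out component by component for a D-dimensional input, the linear
discriminant function g(x) is
g(x) = w1 x1 + w2 x2 + ... + wD xD + w0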
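A small sketch of how a linear discriminant function can be used to assign a class. The decision rule (class C1 when g(x) is non-negative, class C2 otherwise) is a common convention assumed here, and the weights and sample points are hypothetical.

```python
def discriminant(x, w, w0):
    """Evaluate g(x) = w^T x + w0 for lists x and w of equal length."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(x, w, w0):
    """Assumed rule: C1 if g(x) >= 0, otherwise C2."""
    return "C1" if discriminant(x, w, w0) >= 0 else "C2"


# Hypothetical weights for a two-feature problem.
w = [1.0, -1.0]
w0 = 0.5

g_a = discriminant([2.0, 1.0], w, w0)    # 2 - 1 + 0.5
g_b = discriminant([0.0, 3.0], w, w0)    # 0 - 3 + 0.5
label_a = classify([2.0, 1.0], w, w0)
label_b = classify([0.0, 3.0], w, w0)
print(g_a, label_a)
print(g_b, label_b)
```

The set of points where g(x) = 0 is the decision boundary; with two features it is a straight line.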
Probabilistic Discriminative Model:
• Discriminative models are a class of supervised machine learning models which
make predictions by estimating the conditional probability P(y | x).
• For the two-class classification problem, the posterior probability of class C1 can
be written as a logistic sigmoid acting on a linear function of x:
p(C1 | x) = σ(w^T x), where σ(a) = 1 / (1 + exp(-a))
• For the multi-class case, the posterior probability of Ck is given by a softmax
transformation of linear functions of x:
p(Ck | x) = exp(ak) / Σj exp(aj), where ak = wk^T x
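The sigmoid and softmax transformations can be sketched as below. The activation values are hypothetical stand-ins for the linear functions of x, and subtracting the maximum inside `softmax` is a common numerical-stability step added here, not something taken from the notes.

```python
import math


def sigmoid(a):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def softmax(activations):
    """Turn a list of real activations into probabilities that sum to 1."""
    shift = max(activations)              # subtracting the max avoids overflow
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical activations a_k for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)

# With two classes, softmax reduces to the sigmoid of the activation difference.
two_class = softmax([1.5, 0.0])
print(two_class[0], sigmoid(1.5))
```

The last two printed values agree, which shows that the two-class sigmoid form is a special case of the multi-class softmax form.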
Logistic Regression:
• Logistic regression is a Machine Learning algorithm that belongs to the
classification algorithms of the Supervised Learning technique.
• Logistic regression is used to describe data and the relationship between one
dependent variable and one or more independent variables.
• The independent variables can be nominal, ordinal, or of interval type.
• Logistic regression predicts the output of a categorical dependent variable.
• The outcome can be either Yes or No, 0 or 1, True or False, etc., but instead of
the exact value the model gives probabilistic values which lie between 0 and 1.
• Logistic regression is used for solving the classification problems.
Logistic Function (Sigmoid Function):
• The logistic function is also known as Sigmoid function.
• It is a mathematical function used to map the predicted values to probabilities.
• The output of logistic regression must be between 0 and 1, so the function forms
an “S”-shaped curve.
• This S-shaped curve is called the Sigmoid function or logistic function.
Assumptions for logistic regression:
• The dependent variable must be categorical in nature
• The independent variables should not be highly correlated with each other
(no multicollinearity)
Logistic Regression Equation:
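• The equation for a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
• In Logistic Regression, y can be between 0 and 1 only, so divide y by (1 - y):
y / (1 - y), which is 0 for y = 0 and tends to infinity as y approaches 1
• For the range between -∞ and +∞, take the logarithm of this ratio:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn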
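The relationship between the straight-line equation, the sigmoid function, and the log-odds can be checked numerically. In this sketch the coefficients `b0` and `b1` and the input `x` are hypothetical values chosen only for illustration.

```python
import math


def sigmoid(z):
    """Map a real value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(y):
    """Logarithm of the odds y / (1 - y), defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))


b0, b1 = -4.0, 2.0        # hypothetical coefficients
x = 3.0
z = b0 + b1 * x           # straight-line part: any real number
y = sigmoid(z)            # probability between 0 and 1
recovered = log_odds(y)   # taking the log-odds gives back the linear part

print(z, y, recovered)
```

Taking the log-odds of the predicted probability returns the linear expression, which is why logistic regression is described as a linear model for the log-odds.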
Types of Logistic Regression:
Binomial: In binomial logistic regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail.
Multinomial: There can be 3 or more possible unordered types of the dependent variable,
such as “cats”, “dogs”, “sheep”.
Ordinal: There can be 3 or more possible ordered types of the dependent variable, such
as “Low”, “Medium”, or “High”.
Steps In Logistic Regression:
Step 1: Data Pre-Processing step
Step 2: Fitting Logistic Regression to training set
Step 3: Predicting the test result
Step 4: Test accuracy of result
Step 5: Visualizing the test set result.
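Steps 1 to 4 can be sketched without any library as below. This is an illustration under stated assumptions, not a reference implementation: the model is fitted by plain gradient descent on the log-loss, pre-processing consists only of centering the single feature on the training mean, the data (hours studied versus pass/fail) is invented, and the visualization of Step 5 is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, iterations=1000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        preds = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradients of the average log-loss with respect to b0 and b1.
        g0 = sum(p - y for p, y in zip(preds, ys)) / n
        g1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        b0 -= learning_rate * g0
        b1 -= learning_rate * g1
    return b0, b1


def predict(x, b0, b1):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(b0 + b1 * x) >= 0.5 else 0


# Hypothetical data: hours studied -> fail (0) or pass (1).
train_hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [0, 0, 0, 1, 1, 1]
test_hours = [1.5, 3.0, 4.0, 5.5]
test_y = [0, 0, 1, 1]

# Step 1: pre-processing (center the feature on the training mean).
mean_hours = sum(train_hours) / len(train_hours)
train_x = [h - mean_hours for h in train_hours]
test_x = [h - mean_hours for h in test_hours]

# Step 2: fit the model to the training set.
b0, b1 = fit_logistic(train_x, train_y)

# Step 3: predict the test results.
predictions = [predict(x, b0, b1) for x in test_x]

# Step 4: test accuracy.
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```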
Advantages of Logistic Regression:
• It performs better when the data is linearly separable.
• Easy to implement and train a model
• Does not require too many computational resources.
Probabilistic Generative Model:
• Given a model of one conditional probability, and estimated probability
distributions for the variables X and Y, denoted P(X) and P(Y), the opposite
conditional probability can be estimated using Bayes’ Rule:
P(X | Y)P(Y) = P(Y | X)P(X)
• A Generative model is a statistical model of the joint probability distribution
P(X, Y) of the observable variable X and the target variable Y. Given a generative
model for P(X | Y), the posterior can be estimated as
P(Y | X) = P(X | Y)P(Y)/P(X)
• A Discriminative model is a model of the conditional probability P(Y | X) of the
target Y given an observation X. Given a discriminative model for P(Y | X), one
can estimate P(X | Y) = P(Y | X)P(X) / P(Y)
• A classifier based on a generative model is a generative classifier, and a classifier
based on a discriminative model is a discriminative classifier.
• Simple Example:
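One hypothetical illustration of how a generative classifier applies Bayes' rule is sketched below. The spam scenario and every probability in it are invented for this sketch; only the rule P(Y | X) = P(X | Y)P(Y) / P(X) comes from the text.

```python
# Hypothetical numbers: Y is the class, X is the event "the email contains 'offer'".
p_y = {"spam": 0.3, "ham": 0.7}              # prior P(Y)
p_x_given_y = {"spam": 0.8, "ham": 0.1}      # class-conditional P(X | Y)

# Evidence P(X), obtained by summing P(X | Y) P(Y) over both classes.
p_x = sum(p_x_given_y[y] * p_y[y] for y in p_y)

# Posterior P(Y | X) from Bayes' rule.
p_y_given_x = {y: p_x_given_y[y] * p_y[y] / p_x for y in p_y}

# The generative classifier picks the class with the highest posterior.
predicted = max(p_y_given_x, key=p_y_given_x.get)
print(p_x, p_y_given_x, predicted)
```

A discriminative classifier would instead model P(Y | X) directly, without estimating P(X | Y) and P(Y) separately.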
Types of Generative Models:
• Naïve Bayes Classifier or Bayesian network
• Linear Discriminant analysis
Discriminative Models:
• Logistic Regression
• Support Vector Machine
• Decision Tree Learning
• Random Forest
Support Vector Machine(SVM):
• Support Vector Machine(SVM) is a supervised machine learning algorithm used
for both classification and regression.
• The objective of the SVM algorithm is to find a hyperplane in an N-dimensional
space (N being the number of features) that distinctly classifies the data points.
• The objective is to find a plane that has the maximum margin, i.e., the maximum
distance between data points of both classes.
• Hyperplanes are decision boundaries that help classify the data points.
• The dimension of the hyperplane depends upon the number of features.
• If the number of input features is 2, then the hyperplane is just a line.
• If the number of input features is 3, then the hyperplane becomes a two-
dimensional plane.
• It becomes difficult to imagine when the number of features exceeds 3.
• Support vectors are the data points closest to the hyperplane. Using these
support vectors, the margin of the classifier can be maximized.
• Let’s consider two independent variables x1, x2 and one dependent variable
which is either a blue circle or a red box.
• In the SVM algorithm, the loss function that helps to maximize the margin
between the data points and the hyperplane is called Hinge loss.
Hinge Loss:
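• The hinge loss for a point with actual label y (+1 or -1) and predicted value f(x)
is max(0, 1 - y × f(x)).
• The cost is 0 if the predicted value and the actual value have the same sign and
the point lies on or outside the margin (y × f(x) ≥ 1). Otherwise, the loss value is
1 - y × f(x).
• The objective of the regularization parameter is to balance margin
maximization and loss.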
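A minimal sketch of the hinge loss and of a regularized SVM cost, assuming labels of +1 and -1 and a linear score w · x + b. The name `reg` for the regularization parameter, the form of the penalty (reg × ||w||²), and the sample values are assumptions for illustration.

```python
def hinge_loss(y_true, score):
    """Hinge loss for one point: zero when y_true * score >= 1."""
    return max(0.0, 1.0 - y_true * score)


def svm_cost(w, b, xs, ys, reg=0.01):
    """Average hinge loss plus an L2 penalty on the weights."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in xs]
    data_loss = sum(hinge_loss(y, s) for y, s in zip(ys, scores)) / len(xs)
    penalty = reg * sum(wi * wi for wi in w)
    return penalty + data_loss


loss_correct = hinge_loss(1, 2.0)      # correct side, outside the margin
loss_margin = hinge_loss(1, 0.5)       # correct side, but inside the margin
loss_wrong = hinge_loss(-1, 0.5)       # wrong side of the hyperplane
print(loss_correct, loss_margin, loss_wrong)

# Hypothetical weights and three labeled points.
cost = svm_cost([1.0, 0.0], 0.0,
                [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]],
                [1, -1, -1])
print(cost)
```

A larger `reg` puts more weight on keeping the weights small (a wider margin), while a smaller `reg` puts more weight on classifying every training point correctly.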
SVM Kernel:
• The SVM kernel is a function that takes low dimensional input space and
transforms it into high dimensional space.
• It converts non-separable problem to separable problem.
• It is mostly useful in non-linear separation problems.
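The effect of moving to a higher-dimensional space can be shown on a tiny one-dimensional example. The feature map x → (x, x²), the matching kernel, the threshold 2.5, and the data are all assumptions chosen for this sketch; they are not a specific kernel named in the notes.

```python
def feature_map(x):
    """Map a 1-D input into a 2-D space: x -> (x, x^2)."""
    return (x, x * x)


def kernel(a, b):
    """Dot product of feature_map(a) and feature_map(b), computed directly."""
    return a * b + (a * b) ** 2


# Outer points are class +1 and inner points are class -1, so no single
# threshold on x can separate the two classes in the original 1-D space.
points = [-2.0, -1.0, 1.0, 2.0]
labels = [1, -1, -1, 1]

# After the mapping, the second coordinate x^2 separates the classes.
mapped = [feature_map(x) for x in points]
predictions = [1 if x_squared > 2.5 else -1 for (_, x_squared) in mapped]
print(mapped, predictions)

# The kernel gives the same value as the explicit dot product in mapped space.
fa, fb = feature_map(2.0), feature_map(-1.0)
explicit_dot = fa[0] * fb[0] + fa[1] * fb[1]
print(kernel(2.0, -1.0), explicit_dot)
```

The second check illustrates why kernels are convenient: the similarity in the higher-dimensional space is obtained without constructing the mapped vectors.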
Types of SVM:
• Simple SVM: Typically used for linear regression and classification problems.
• Kernel SVM: More flexibility for non-linear data.
Advantages:
✓ Effective on datasets with multiple features.
✓ Memory Efficient
✓ Different kernel functions can be specified for decision functions
Disadvantages:
✓ Works best on small sample sets, because training becomes slow on large datasets
✓ Regularization is crucial.
Applications:
✓ Used to solve various real-world problems
✓ Helpful in text and hypertext categorization
✓ Classification of images
✓ Classification of satellite data
Decision Tree:
• Decision Tree is a supervised learning technique that can be used for both
classification and Regression problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
• In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
• Decision nodes are used to make a decision and have multiple branches, whereas
leaf nodes are the outputs of those decisions and have no further branches.
• The goal of using a Decision Tree is to create a training model that can be used to
predict the class or value of the target variable by learning simple decision rules
inferred from prior data.
• In order to build a tree, use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
Types of Decision Trees:
Categorical Variable Decision Tree: A Decision Tree which has a categorical target
variable is called a Categorical Variable Decision Tree.
Continuous Variable Decision Tree: A Decision Tree which has a continuous target
variable is called a Continuous Variable Decision Tree.
Reason for using Decision Tree:
• Decision Trees usually mimic human thinking ability while making a decision,
so it is easy to understand.
• The logic behind the decision tree can be easily understood because it shows a
tree-like structure.
Decision Tree Terminologies:
• Root Node: It is where the decision tree starts. It represents the entire dataset.
• Leaf Node: It is a final output node; the tree cannot be split further after a leaf
node.
• Splitting: It is the process of dividing a decision node / root node into sub-nodes
• Branch/Sub Tree: A tree formed by splitting the tree
• Pruning: It is the process of removing unwanted branches from the tree.
• Parent/child node: A node that is divided into sub-nodes is called the parent node
of those sub-nodes, and the sub-nodes are called its child nodes.
Working of Decision Tree Algorithm:
Step 1: Begin the tree with the root node, say S, which contains the complete dataset.
Step 2: Find the best attribute in the dataset using an Attribute Selection Measure.
Step 3: Divide S into subsets that contain the possible values of the best attribute.
Step 4: Generate the decision tree node, which contains the best attribute.
Step 5: Recursively make new decision trees using the subsets of the dataset created
in Step 3. Continue this process until a stage is reached where the nodes cannot be
split any further; these final nodes are the leaf nodes.
Algorithms used to construct Decision Tree:
• ID3 → (Iterative Dichotomiser 3)
• C4.5 → (successor of ID3)
• CART → (Classification And Regression Tree)
• CHAID → (Chi-square Automatic Interaction Detection; performs multi-level
splits when computing classification trees)
• MARS → (Multivariate Adaptive Regression Splines)
Attribute Selection Measures:
• Entropy
• Information Gain
• Gini Index
• Gain Ratio
• Reduction in Variance
• Chi – Square
Entropy:
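• Entropy is a metric that measures the impurity of a given attribute.
• Entropy is a measure of the randomness in the information being processed.
• Entropy can be calculated as:
Entropy(S) = - Σ pi log2(pi)
where pi is the proportion of examples in S that belong to class i.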
Information Gain:
• Information gain or IG is a statistical property that measures how well a
given attribute separates the training examples according to their target
classification.
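• It can be calculated using the formula:
IG(S, A) = Entropy(S) - Σv (|Sv| / |S|) × Entropy(Sv)
where Sv is the subset of S for which attribute A has the value v.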
Gini Index:
• The Gini index is a cost function used to evaluate splits in the dataset.
• It can be calculated using the formula:
Gini = 1 - Σ pi²
where pi is the proportion of examples that belong to class i.
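The three measures above can be sketched in a few lines, using the standard definitions Entropy = -Σ p log2(p), Gini = 1 - Σ p², and information gain = parent entropy minus the size-weighted entropy of the subsets. The label lists and the candidate split are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    # p * log2(1 / p) is the same as -p * log2(p).
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no" examples, and one candidate split.
parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

parent_entropy = entropy(parent)
parent_gini = gini(parent)
gain = information_gain(parent, split)
print(parent_entropy, parent_gini, gain)
```

When several attributes are candidates for a split, the one with the highest information gain (or the lowest weighted Gini index) would be chosen as the best attribute.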
Gain Ratio:
• It is defined as the information gain divided by SplitInfo:
Gain Ratio = Information Gain / SplitInfo
Reduction in Variance:
• Reduction in variance is an algorithm that uses the standard formula of
variance to choose the best split.
Chi – Square:
• Chi-square is the measure used by the CHAID (Chi-squared Automatic Interaction
Detector) algorithm.
• It finds out the statistical significance of the differences between sub-nodes
and the parent node.
Advantages:
✓ Simple to understand
✓ Useful for solving decision-related problems.
Disadvantages:
✓ Can become complex when the tree contains many layers
✓ Overfitting issue
Random Forest:
• Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset.
• Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique.
• It can be used for both Classification and Regression problems in ML.
• It is based on the concept of ensemble learning.
Steps in working process:
Step 1: In Random Forest, n random records are taken from a data
set that has k records.
Step 2: An individual decision tree is constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on Majority Voting for classification and on
Averaging for regression.
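The sampling step and the aggregation step can be sketched as below. Building the individual trees (Steps 2 and 3) is not shown, so the tree outputs are hypothetical placeholders; sampling with replacement is an assumption (it is the usual bagging approach), and all function names are illustrative.

```python
import random
from collections import Counter


def bootstrap_sample(records, n, rng):
    """Step 1: draw n records at random, with replacement (assumed)."""
    return rng.choices(records, k=n)


def majority_vote(tree_outputs):
    """Step 4 for classification: the most common class label wins."""
    return Counter(tree_outputs).most_common(1)[0][0]


def average(tree_outputs):
    """Step 4 for regression: the mean of the tree outputs."""
    return sum(tree_outputs) / len(tree_outputs)


rng = random.Random(42)
records = list(range(10))                 # hypothetical dataset of k = 10 record ids
sample = bootstrap_sample(records, 6, rng)
print(sample)

# Hypothetical outputs of three already-trained trees.
final_class = majority_vote(["spam", "ham", "spam"])
final_value = average([2.0, 3.0, 4.0])
print(final_class, final_value)
```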
Need for Random Forest:
• It often takes less training time than many other algorithms.
• It can predict output with high accuracy.
Important Features of Random Forest:
Diversity: Not all attributes/variables/features are considered while making an
individual tree, so each tree is different.
Less affected by the curse of dimensionality: Since each tree does not consider all the
features, the feature space is reduced.
Parallelization: Each tree is created independently out of different data and
attributes, so the trees can be built in parallel.
Train – Test split: In a random forest, a separate validation split is often not needed
for a first estimate of accuracy, because roughly one-third of the records (the
out-of-bag records) are not seen by any given decision tree.
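The share of records that a single tree never sees can be estimated by simulation. This sketch assumes each tree is trained on a bootstrap sample as large as the dataset, drawn with replacement; in that case the expected unseen share is (1 - 1/n)^n, which approaches 1/e (about 0.37) for large n. The dataset size and the seed are arbitrary choices.

```python
import random


def out_of_bag_fraction(num_records, rng):
    """Fraction of records missing from one bootstrap sample of the same size."""
    drawn = {rng.randrange(num_records) for _ in range(num_records)}
    return 1.0 - len(drawn) / num_records


rng = random.Random(0)
fraction = out_of_bag_fraction(10000, rng)
print(fraction)
```

The records left out of a tree's sample can be used to evaluate that tree, which is the basis of the out-of-bag error estimate.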
Advantages:
✓ Performs both classification and regression
✓ Accuracy
Disadvantages:
✓ Less suitable for regression tasks than for classification tasks
Application:
✓ Banking
✓ Medicine
✓ Marketing
✓ Land Use
Difference between Decision Tree and Random Forest: