UNIT – V
Unsupervised Learning: Clustering - K-Means, K-Modes, K-Prototypes, Gaussian
Mixture Models, Expectation-Maximization.
Reinforcement Learning: Exploration and exploitation trade-offs, non-associative
learning, Markov decision processes, Q-learning.
Unsupervised Learning:
Unsupervised learning is a machine learning technique in which models are not
supervised using a training dataset. Instead, the model itself finds the hidden
patterns and insights in the given data. It can be compared to the learning that
takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained
on an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification
problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of the dataset, group the data according to similarities, and
represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never
trained on the given dataset, which means it has no prior knowledge of the
dataset's features. The task of the unsupervised learning algorithm is to
identify the image features on its own. It performs this task by clustering the
image dataset into groups according to the similarities between images.
Below are some main reasons which describe the importance of unsupervised learning:
   o Unsupervised learning is helpful for finding useful insights from data.
   o Unsupervised learning is very similar to how a human learns to think through
      their own experiences, which brings it closer to real AI.
   o Unsupervised learning works on unlabeled and uncategorized data,
      which makes it all the more important.
   o In the real world, we do not always have input data with corresponding
      output, so to solve such cases we need unsupervised learning.
The working of unsupervised learning can be understood as follows:
 Here, we take unlabeled input data, which means it is not categorized and no
 corresponding outputs are given. This unlabeled input data is fed to the machine
 learning model in order to train it. First, the model interprets the raw data to
 find the hidden patterns in the data, and then a suitable algorithm such as
 k-means clustering or hierarchical clustering is applied.
 Once a suitable algorithm is applied, it divides the data objects into groups
 according to the similarities and differences between the objects.
 Unsupervised learning algorithms can be further categorized into two types of
 problems:
   o Clustering: Clustering is a method of grouping objects into clusters
       such that objects with the most similarities remain in one group and have
       little or no similarity with the objects of another group. Cluster analysis
       finds the commonalities between the data objects and categorizes them as per
       the presence and absence of those commonalities.
   o Association: An association rule is an unsupervised learning method which is
       used for finding relationships between variables in a large database. It
       determines the sets of items that occur together in the dataset.
       Association rules make marketing strategies more effective; for example,
       people who buy item X (say, bread) also tend to purchase item Y (butter or
       jam). A typical application of association rules is Market Basket Analysis;
       a minimal code sketch of the underlying metrics follows this list.
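As a minimal illustration of these metrics, the sketch below computes support and confidence for a bread → butter rule on a tiny, made-up transaction list (the items and numbers are purely illustrative):

```python
# Support and confidence for an association rule X -> Y on
# hypothetical transaction data; items and counts are made up.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence(X -> Y) = support(X union Y) / support(X)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"bread"}, transactions))                 # 0.8
print(confidence({"bread"}, {"butter"}, transactions))  # 0.75
```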
K-Means Clustering
     K-Means clustering is an unsupervised, iterative clustering technique.
     It partitions the given data set into k predefined distinct clusters.
     A cluster is defined as a collection of data points exhibiting certain similarities.
      It partitions the data set such that-
      Each data point belongs to the cluster with the nearest mean.
      Data points belonging to one cluster have a high degree of similarity.
      Data points belonging to different clusters have a high degree of dissimilarity.
K-Means Clustering Algorithm- (Refer class notes)
     K-Means Clustering Algorithm involves the following steps-
     Step-01:
     1. Choose the number of clusters K.
     Step-02:
     1. Randomly select any K data points as cluster centers.
     2. Select cluster centers in such a way that they are as far apart
        from each other as possible.
Step-03:
     1. Calculate the distance between each data point and each cluster center.
     2. The distance may be calculated either by using a given distance function or
        by using the Euclidean distance formula.
Step-04:
     1. Assign each data point to some cluster.
     2. A data point is assigned to the cluster whose center is nearest to that data
        point.
Step-05:
     1. Re-compute the center of each newly formed cluster, and repeat Steps 03-05
        until the cluster centers no longer change. A minimal code sketch of these
        steps is given below.
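The following is a minimal NumPy sketch of the above steps; the random initialization and the convergence test are illustrative choices, not part of any fixed recipe:

```python
# A minimal NumPy sketch of the K-Means steps above; the data layout
# (an (n_samples, n_features) array) is an illustrative assumption.
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-02: randomly pick K data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-03: Euclidean distance from every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to the cluster with the nearest center.
        labels = dists.argmin(axis=1)
        # Step-05: re-compute each center as the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when centers settle
            break
        centers = new_centers
    return centers, labels

# Example: the eight points from the worked example below.
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers, labels = k_means(X, k=3)
```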
Example: Cluster the following eight points (with (x, y) representing locations) into
three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4,
9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
Euclidean distance between two points P1(x1, y1) and P2(x2, y2):
d(P1, P2) = √((x2 − x1)² + (y2 − y1)²)
Calculating the distance of each point from each of the three initial centers
(rounded to two decimals):
Point     | From A1(2, 10) | From A4(5, 8) | From A7(1, 2) | Nearest center
A1(2, 10) | 0.00           | 3.61          | 8.06          | A1
A2(2, 5)  | 5.00           | 4.24          | 3.16          | A7
A3(8, 4)  | 8.49           | 5.00          | 7.28          | A4
A4(5, 8)  | 3.61           | 0.00          | 7.21          | A4
A5(7, 5)  | 7.07           | 3.61          | 6.71          | A4
A6(6, 4)  | 7.21           | 4.12          | 5.39          | A4
A7(1, 2)  | 8.06           | 7.21          | 0.00          | A7
A8(4, 9)  | 2.24           | 1.41          | 7.62          | A4
From here, the new clusters are-
Cluster-01:
First cluster contains points-
     A1(2, 10)
Cluster-02:
  Second cluster contains points-
      A3(8, 4)
      A4(5, 8)
      A5(7, 5)
      A6(6, 4)
      A8(4, 9)
 Cluster-03:
  Third cluster contains points-
      A2(2, 5)
      A7(1, 2)
 Now,
      We re-compute the new cluster centers.
         The new cluster center is computed by taking the mean of all the points
    contained in that cluster.
 For Cluster-01:
      We have only one point A1(2, 10) in Cluster-01.
      So, cluster center remains the same.
 For Cluster-02:
  Center of Cluster-02
 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
 = (6, 6)
 For Cluster-03:
 Center of Cluster-03
 = ((2 + 1)/2, (5 + 2)/2)
 = (1.5, 3.5)
 This completes Iteration-01.
 Iteration-02:
      We calculate the distance of each point from each of the centers of the three
clusters: C1(2, 10), C2(6, 6) and C3(1.5, 3.5).
      The distance is calculated using the given distance function (here, Euclidean
distance), rounded to two decimals:
Point     | From C1(2, 10) | From C2(6, 6) | From C3(1.5, 3.5) | Nearest center
A1(2, 10) | 0.00           | 5.66          | 6.52              | C1
A2(2, 5)  | 5.00           | 4.12          | 1.58              | C3
A3(8, 4)  | 8.49           | 2.83          | 6.52              | C2
A4(5, 8)  | 3.61           | 2.24          | 5.70              | C2
A5(7, 5)  | 7.07           | 1.41          | 5.70              | C2
A6(6, 4)  | 7.21           | 2.00          | 4.53              | C2
A7(1, 2)  | 8.06           | 6.40          | 1.58              | C3
A8(4, 9)  | 2.24           | 3.61          | 6.04              | C1
From here, the new clusters are-
Cluster-01:
First cluster contains points-
    A1(2, 10)
    A8(4, 9)
Cluster-02:
Second cluster contains points-
    A3(8, 4)
    A4(5, 8)
    A5(7, 5)
    A6(6, 4)
Cluster-03:
Third cluster contains points-
   A2(2, 5)
   A7(1, 2)
Now,
     We re-compute the new cluster centers.
        The new cluster center is computed by taking the mean of all the points
   contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
  = (6.5, 5.25)
  For Cluster-03:
  Center of Cluster-03
  = ((2 + 1)/2, (5 + 2)/2)
  = (1.5, 3.5)
   This completes Iteration-02.
   After the second iteration, the centers of the three clusters are-
                     C1(3, 9.5)
                     C2(6.5, 5.25)
                     C3(1.5, 3.5)
Continue iterating in the same way until the clusters no longer change between
iterations; a short snippet to check the remaining iterations follows.
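As a quick way to check the remaining iterations, the sketch below (assuming scikit-learn is installed) runs standard K-Means on the same eight points with the same initial centers; passing an explicit init with n_init=1 keeps the run deterministic:

```python
# Re-run the worked example with scikit-learn to see where it converges.
# Assumes scikit-learn is installed; init is the same A1, A4, A7 choice.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
init = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # A1, A4, A7

km = KMeans(n_clusters=3, init=init, n_init=1, max_iter=100).fit(X)
print(km.cluster_centers_)  # settles near (3.67, 9), (7, 4.33), (1.5, 3.5)
print(km.labels_)           # final cluster assignment of each point
```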
Stopping Criteria for K-Means Clustering
There are essentially three stopping criteria that can be adopted to stop the K-means
algorithm:
   1. Centroids of newly formed clusters do not change
   2. Points remain in the same cluster
   3. Maximum number of iterations is reached
We can stop the algorithm if the centroids of newly formed clusters are not changing.
Even after multiple iterations, if we are getting the same centroids for all the clusters,
we can say that the algorithm is not learning any new pattern, and it is a sign to stop
the training.
Another clear sign that we should stop the training process is if the points remain in
the same cluster even after training the algorithm for multiple iterations.
How to choose optimal k value?
We saw in the previous section that the value of k needs to be chosen beforehand. The
performance of the K-means clustering algorithm depends on forming compact,
well-separated clusters. So, let's see how to choose the optimal number of clusters
using the technique given below:
Elbow Method
The Elbow method is the go-to method for finding the optimal number of clusters. It
uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares and
measures the total variation within the clusters: it is the sum of the squared
distances between each data point and the centroid of its cluster. To measure the
distance between data points and centroids, we can use any metric such as Euclidean
distance, Manhattan distance, or cosine distance:
WCSS = Σ (i = 1 to k) Σ (x in Ci) ||x − ci||²,  where ci is the centroid of cluster Ci.
To find the optimal number of clusters, the elbow method follows the steps below:
    1. Perform K-means clustering multiple times using various k values (for
       example, from 1 to 10).
    2. For each value of k, calculate the WCSS value.
    3. Plot a curve between the calculated WCSS values (sum of squared distances)
       and the number of clusters k.
    4. Look for a sharp bend in the curve (it looks like an arm or elbow); that
       point is considered the optimal value of k. A short code sketch of this
       procedure is given below.
[Figure: Elbow method plot of WCSS versus the number of clusters k]
The plot shows a sharp bend that looks like an elbow, which is why this is called the
elbow method.
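A minimal sketch of this procedure, assuming scikit-learn and matplotlib are installed (a fitted KMeans model exposes its WCSS as the inertia_ attribute); the data here is a random placeholder:

```python
# Elbow method: fit K-Means for k = 1..10 and plot WCSS (inertia_).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))      # placeholder data for illustration

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```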
Advantages of k-means
1. Simple and easy to implement: The k-means algorithm is easy to understand and
   implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets
   with high dimensionality.
3. Scalability: K-means can handle large datasets with a large number of data points and
   can be easily scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used with
   different distance metrics and initialization methods.
 Disadvantages of K-Means:
1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids
   and can converge to a suboptimal solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be
   specified before running the algorithm, which can be challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact
   on the resulting clusters.
 K MODE CLUSTERING
 K-mode clustering is an unsupervised machine-learning technique used to group a set of
 data objects into a specified number of clusters based on their categorical attributes.
 The algorithm is called "K-Mode" because it uses modes (i.e., the most frequent values)
 instead of means or medians to represent the clusters.
 K-means clustering can be used on categorical data after converting it into numerical
 form, but this does not give good results, especially for high-dimensional data.
 So, the following changes are made for categorical data:
  Replace Euclidean distance with a dissimilarity metric.
  Replace the mean with the mode for cluster centers.
  Apply a frequency-based method in each iteration to update the mode.
 This is called K-MODE clustering because of the MODE.
 Dissimilarity measurements between two data objects
 K-modes is an algorithm for clustering categorical data. It is used to partition a dataset
 into a specified number of clusters, where each cluster is characterized by a mode, which
 is the most frequent categorical value in the cluster.
 Similarity and dissimilarity measurements are used to determine the distance between
 the data objects in the dataset. In the case of K-modes, these distances are calculated
 using a dissimilarity measure called the Hamming distance. The Hamming distance
 between two data objects is the number of categorical attributes that differ between
 the two objects.
 Let x and y be two categorical data objects defined by m features or attributes.
 Their dissimilarity is
                d(x, y) = Σ (j = 1 to m) δ(xj, yj)
                Where,
                δ(xj, yj) = 0 if xj = yj, and 1 if xj ≠ yj.
 For example, consider the following dataset with three categorical
 attributes:
 Object | Attribute 1 | Attribute 2 | Attribute 3
 1      | A           | B           | C
 2      | A           | B           | D
 3      | A           | C           | E
 4      | B           | C           | E
 To calculate the Hamming distance between data objects 1 and 2, we compare their
 values for each attribute and count the number of differences. In this case, there is
 one difference (Attribute 3 is C for object 1 and D for object 2), so the Hamming
 distance between objects 1 and 2 is 1.
 To calculate the Hamming distance between objects 1 and 3, we again compare their
 values for each attribute and count the number of differences. In this case, there
 are two differences (Attribute 2 is B for object 1 and C for object 3, and Attribute 3 is C
 for object 1 and E for object 3), so the Hamming distance between objects 1 and 3 is 2.
 To calculate the Hamming distance between objects 1 and 4, we again compare their
 values for each attribute and count the number of differences. In this case, there
 are three differences (Attribute 1 is A for object 1 and B for object 4, Attribute 2
 is B for object 1 and C for object 4, and Attribute 3 is C for object 1 and E for
 object 4), so the Hamming distance between objects 1 and 4 is 3.
 Data objects with a smaller Hamming distance are considered more similar, while
 objects with a larger Hamming distance are considered more dissimilar. A small code
 check of these distances follows.
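A minimal sketch of this dissimilarity measure, checked against the four objects from the example above:

```python
# Hamming distance between categorical data objects: the number of
# attributes on which the two objects differ.
def hamming(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)

obj1 = ("A", "B", "C")
obj2 = ("A", "B", "D")
obj3 = ("A", "C", "E")
obj4 = ("B", "C", "E")

print(hamming(obj1, obj2))  # 1
print(hamming(obj1, obj3))  # 2
print(hamming(obj1, obj4))  # 3
```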
The K-modes algorithm consists of the following steps.
1.   Select K data objects as the initial modes, one for each cluster.
2.   Calculate dissimilarities and allocate each data object to the nearest cluster.
3.   Calculate the new modes for all clusters.
4.   Repeat steps 2 and 3 until the clusters become stable. (A minimal code sketch
     of this loop follows the summary below.)
Overall, the goal of K-modes clustering is to minimize the dissimilarities between the
data objects and the centroids (modes) of the clusters, using a measure of
categorical similarity such as the Hamming distance.
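A bare-bones, pure-Python sketch of this loop is given below; the random initial modes and the iteration cap are illustrative assumptions:

```python
# A bare-bones K-Modes sketch: Hamming dissimilarity plus mode updates.
import random
from collections import Counter

def hamming(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)

def k_modes(data, k, max_iters=100, seed=0):
    random.seed(seed)
    modes = random.sample(data, k)        # step 1: pick K objects as modes
    labels = [0] * len(data)
    for _ in range(max_iters):
        # step 2: assign each object to the cluster with the nearest mode
        labels = [min(range(k), key=lambda j: hamming(x, modes[j]))
                  for x in data]
        # step 3: new mode = most frequent value of each attribute per cluster
        new_modes = []
        for j in range(k):
            members = [x for x, l in zip(data, labels) if l == j]
            if not members:               # keep the old mode if a cluster empties
                new_modes.append(modes[j])
                continue
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:            # step 4: stop when modes are stable
            break
        modes = new_modes
    return modes, labels

data = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "E"), ("B", "C", "E")]
print(k_modes(data, 2))
```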
K-Prototypes clustering
K-Prototypes clustering is a partitioning clustering algorithm. We use k-prototypes
clustering to cluster datasets that have categorical as well as numerical attributes.
The K-Prototypes algorithm combines the k-means and k-modes clustering algorithms;
hence, it can handle both numerical and categorical data.
In k-prototypes clustering, we select k prototypes randomly at the start. After that,
we calculate the distance between each data point and the prototypes. Accordingly, all
the data points are assigned to the clusters associated with the different prototypes.
After assigning data points to the clusters, we compute the new prototype for each
cluster, taking the mean of the numerical attributes and the mode of the categorical
attributes. After that, we recalculate the distance of the prototypes from the data
points and reassign the clusters. This process is continued until the clusters
converge. A short usage sketch follows.
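A short usage sketch, assuming the third-party kmodes package (which provides a KPrototypes implementation) is installed; the data, column layout, and parameters are illustrative:

```python
# K-Prototypes on mixed data via the third-party `kmodes` package
# (pip install kmodes). Columns 0-1 are numeric, column 2 categorical;
# the data here is made up for illustration.
import numpy as np
from kmodes.kprototypes import KPrototypes

X = np.array([
    [25, 50000, "urban"],
    [47, 64000, "rural"],
    [52, 58000, "rural"],
    [33, 72000, "urban"],
], dtype=object)

kp = KPrototypes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = kp.fit_predict(X, categorical=[2])   # index of the categorical column
print(labels)
print(kp.cluster_centroids_)
```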
  Reinforcement Learning
   Unlike supervised and unsupervised learning, reinforcement learning is a
   feedback-based approach in which an agent learns by performing actions and
   observing their outcomes. Based on the action's status (good or bad), the agent
   gets positive or negative feedback. For each positive feedback the agent is
   rewarded, whereas for each negative feedback it is penalized.
        Def: "Reinforcement learning is a type of machine learning technique
        where an intelligent agent (computer program) interacts with the environment,
        explores it by itself, and takes actions within it."
        o    Reinforcement learning does not require any labeled data for the learning
             process. The agent learns through the feedback on the actions it
             performs. Moreover, in reinforcement learning, agents also learn from
             past experiences.
        o    Reinforcement learning methods are used to solve tasks where
             decision-making is sequential and the goal is long-term, e.g., robotics,
             online chess, etc.
        o    Reinforcement learning aims to get maximum positive feedback so that
             the agent can improve its performance.
        o    Reinforcement learning involves various actions, which include taking
             an action, changing or keeping the state, and getting feedback. Based on
             these actions, agents learn and explore the environment.
5.6.1 Exploration and Exploitation in Reinforcement Learning:
Before giving a technical description of exploration and exploitation, let's first
understand these terms in simple words. In reinforcement learning, whenever an agent
faces a difficult choice between continuing its current course of action and exploring
something new at a specific time, the situation results in the
Exploration-Exploitation Dilemma, because the agent's knowledge of the states,
actions, rewards, and resulting states is always partial.
Now we will discuss exploitation and exploration in technical terms.
Exploitation in Reinforcement Learning
Exploitation is defined as a greedy approach in which the agent tries to get more
reward by using the estimated values rather than the actual values. So, in this
technique, the agent makes the best decision based on current information.
Exploration in Reinforcement Learning
Unlike exploitation, in exploration techniques, agents primarily focus on improving
their knowledge about each action instead of getting more rewards so that they can get
long-term benefits. So, in this technique, agents work on gathering more information to
make the best overall decision.
Examples of Exploitation and Exploration in Machine Learning
Let's understand exploitation and exploration with some interesting real-world
examples.
Example 1: Let's say we have a scenario of online restaurant selection for a food
order, where you have two options. In the first option, you choose your favorite
restaurant, from where you have ordered food in the past; this is called exploitation,
because here you only use information you already have about a specific restaurant. In
the second option, you try a new restaurant to explore new varieties and tastes of
food; this is called exploration. The food quality might be better with the first
option, but it is also possible that it is more delicious at the new restaurant. An
epsilon-greedy sketch of this trade-off follows.
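A common way to balance the two is an epsilon-greedy rule: exploit the best-known option most of the time and explore at random otherwise. Below is a minimal sketch on a made-up three-armed bandit (the reward probabilities are illustrative):

```python
# Epsilon-greedy on a toy 3-armed bandit ("restaurants"); the reward
# probabilities are made up for illustration.
import random

random.seed(0)
true_reward_prob = [0.3, 0.5, 0.7]   # unknown to the agent
n_arms = len(true_reward_prob)
epsilon = 0.1                        # fraction of time spent exploring

counts = [0] * n_arms                # pulls per arm
values = [0.0] * n_arms              # running mean reward per arm

for step in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(n_arms)                     # explore
    else:
        arm = max(range(n_arms), key=lambda a: values[a])  # exploit
    reward = 1.0 if random.random() < true_reward_prob[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean

print(values)   # estimates approach [0.3, 0.5, 0.7]; arm 2 is pulled most
```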
Non-Associative Learning
   In reinforcement learning, non-associative learning refers to a type of learning that
   does not involve forming associations or relationships between different stimuli or
   actions. It is a simpler form of learning compared to associative learning, which
   involves linking different stimuli or actions together.
   Non-associative learning is typically observed in situations where an agent's
   behavior changes in response to a single stimulus or repeated exposure to the same
   stimulus. There are two common types of non-associative learning: habituation and
   sensitization.
   1. Habituation: Habituation occurs when an agent's response to a particular
   stimulus decreases over time with repeated exposure. It is a form of adaptive
   behavior where the agent learns to ignore irrelevant or harmless stimuli. For example,
   if a robot is repeatedly exposed to a loud noise that is not associated with any
   reward or punishment, it may gradually stop reacting to the noise and become
   habituated to it.
   2. Sensitization: Sensitization is the opposite of habituation and occurs when an
   agent's response to a stimulus increases over time with repeated exposure. It
   involves an increased sensitivity or responsiveness to a stimulus. For example, if a
   robot is repeatedly exposed to a painful stimulus, it may become more sensitive to
   that stimulus and show an increased response.
   Non-associative learning is not directly related to reinforcement learning, as
   reinforcement learning primarily focuses on associative learning, where agents
   learn to associate their actions with rewards or punishments. However,
   non-associative learning mechanisms can play a role in shaping an agent's
   behavior and influencing its responses to stimuli, which can indirectly impact
   the learning process in reinforcement learning scenarios.
Markov-Decision Process
Reinforcement Learning is a type of Machine Learning. It allows machines and
software agents to automatically determine the ideal behavior within a specific context,
in order to maximize its performance. Simple reward feedback is required for the
agent to learn its behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. As a matter of
fact, Reinforcement Learning is defined by a specific type of problem, and all its
solutions are classed as Reinforcement Learning algorithms. In the problem, an
agent is supposed to decide the best action to select based on its current state.
When this step is repeated, the problem is known as a Markov Decision Process.
A Markov Decision Process (MDP) model contains:
   A set of possible world states S.
   A set of Models.
   A set of possible actions A.
   A real-valued reward function R(s, a).
   A policy π, the solution of the Markov Decision Process.
State:
A State is a set of tokens that represent every state that the agent can be in.
Model:
   A Model (sometimes called a Transition Model) gives an action's effect in a state.
   In particular, T(S, a, S') defines a transition T where being in state S and taking
   an action 'a' takes us to state S' (S and S' may be the same). For stochastic
   actions (noisy, non-deterministic) we also define a probability P(S'|S, a), which
   represents the probability of reaching state S' if action 'a' is taken in state S.
   Note that the Markov property states that the effects of an action taken in a
   state depend only on that state and not on the prior history.
Actions
An Action A is a set of all possible actions. A(s) defines the set of actions that can be
taken being in state S.
Reward
A Reward is a real-valued reward function. R(s) indicates the reward for simply being
in the state S. R(S,a) indicates the reward for being in a state S and taking an
action ‘a’. R(S,a,S’) indicates the reward for being in a state S, taking an action ‘a’ and
ending up in a state S’.
Policy
A Policy is a solution to the Markov Decision Process. A policy is a mapping from
states S to actions A. It indicates the action 'a' to be taken while in state S.
Let us take the example of a grid world:
An agent lives in the grid. The example is a 3*4 grid. The grid has a
START state (grid no 1,1). The purpose of the agent is to wander around the grid and
finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the agent
should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a blocked
grid; it acts as a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT
Walls block the agent's path, i.e., if there is a wall in the direction the agent
would have taken, the agent stays in the same place. So, for example, if the agent
says LEFT in the START grid, it would stay put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two
such sequences can be found:
   RIGHT RIGHT UP UP RIGHT
   UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The move is now noisy: 80% of the time the intended action works correctly; 20% of
the time the action the agent takes causes it to move at right angles. For example,
if the agent says UP, the probability of going UP is 0.8, whereas the probability of
going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are
at right angles to UP).
The agent receives a reward at each time step:-
       A small reward each step (it can be negative, in which case it can also be
   termed a punishment; in the above example, entering the Fire grid can have a
   reward of -1).
   Big rewards come at the end (good or bad).
   The goal is to maximize the sum of rewards. A small code sketch of this noisy
   grid world follows.
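A minimal sketch of the noisy transition model described above; the (column, row) coordinate convention with (1, 1) at the bottom-left is an assumption made for illustration:

```python
# Noisy transition model for the 3*4 grid world: the intended move
# succeeds with probability 0.8; with probability 0.1 each the agent
# slips to one of the two directions at right angles.
BLOCKED = {(2, 2)}
COLS, ROWS = 4, 3

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
RIGHT_ANGLES = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def step(state, direction):
    """Where the agent ends up if it actually moves in `direction`."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    # Walls and the blocked cell leave the agent where it was.
    if nxt in BLOCKED or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state
    return nxt

def transition_probs(state, action):
    """P(S'|S, a) for the noisy dynamics: 0.8 intended, 0.1 each side."""
    probs = {}
    for direction, p in [(action, 0.8),
                         (RIGHT_ANGLES[action][0], 0.1),
                         (RIGHT_ANGLES[action][1], 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition_probs((1, 1), "UP"))  # from START: 0.8 up, slips aside
```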
   Q-learning
   Q-learning is a model-free, value-based, off-policy algorithm that will find the best
   series of actions based on the agent's current state. The “Q” stands for quality.
   Quality represents how valuable the action is in maximizing future rewards.
   Model-based algorithms use transition and reward functions to estimate the
   optimal policy and create a model of the environment. In contrast, model-free
   algorithms learn the consequences of their actions through experience, without
   a transition or reward function.
   The value-based method trains the value function to learn which state is more
   valuable and take action. On the other hand, policy-based methods train the policy
   directly to learn which action to take in a given state.
   In an off-policy method, the algorithm evaluates and updates a policy that differs
   from the policy used to take an action. Conversely, an on-policy algorithm
   evaluates and improves the same policy that is used to take an action.
   Before we jump into how Q-learning works, we need to learn a few useful
   terminologies to understand Q-learning's fundamentals.
        States (s): the current position of the agent in the environment.
        Action (a): a step taken by the agent in a particular state.
        Rewards: for every action, the agent receives a reward or a penalty.
        Episodes: the end of the stage, where the agent can't take a new action. It
     happens when the agent has achieved the goal or failed.
        Q(St+1, a): the expected optimal Q-value of doing the action in a particular
     state.
        Q(St, At): the current estimation of Q(St+1, a).
        Q-Table: the agent maintains a Q-table of sets of states and actions.
        Temporal Difference (TD): used to estimate the expected value of Q(St+1, a)
     by using the current state and action and the previous state and action.
Q-Table
The agent will use a Q-table to take the best possible action based on the expected
reward for each state in the environment. In simple words, a Q-table is a data structure
of sets of actions and states, and we use the Q-learning algorithm to update the values
in the table.
Q-Function
The Q-function uses the Bellman equation and takes the state (s) and action (a) as
input. The equation simplifies the calculation of state values and state-action
values. The Q-learning update derived from it is:
Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]
where α is the learning rate and γ is the discount factor.
Q-learning algorithm
The algorithm proceeds as follows: initialize the Q-table (for example, with zeros);
in each step of an episode, choose an action in the current state (for instance,
epsilon-greedy on the Q-table values), perform it, observe the reward and the next
state, and update Q(St, At) with the equation above; repeat until training ends. A
minimal sketch is given below.
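A minimal sketch of Q-learning on the grid world from the MDP section; the rewards (a -0.04 living cost, +1 at the Diamond, -1 at the Fire) and the hyperparameters alpha, gamma, and epsilon are illustrative assumptions:

```python
# Q-learning on the 3*4 grid world; self-contained illustrative sketch.
import random

random.seed(0)
COLS, ROWS = 4, 3
BLOCKED, DIAMOND, FIRE = {(2, 2)}, (4, 3), (4, 2)
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def env_step(state, action):
    """Noisy move: 0.8 intended, 0.1 each right angle; returns (s', r, done)."""
    r = random.random()
    direction = (action if r < 0.8
                 else SLIPS[action][0] if r < 0.9 else SLIPS[action][1])
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in BLOCKED or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        nxt = state                           # walls: stay put
    if nxt == DIAMOND:
        return nxt, 1.0, True
    if nxt == FIRE:
        return nxt, -1.0, True
    return nxt, -0.04, False                  # small living cost

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {((c, r), a): 0.0 for c in range(1, COLS + 1)
     for r in range(1, ROWS + 1) for a in ACTIONS}

for episode in range(5000):
    s = (1, 1)                                # START state
    done = False
    while not done:
        if random.random() < epsilon:         # explore
            a = random.choice(ACTIONS)
        else:                                 # exploit the Q-table
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s2, reward, done = env_step(s, a)
        best_next = max(Q[(s2, a_)] for a_ in ACTIONS)
        # Bellman / temporal-difference update from the section above
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda a_: Q[((1, 1), a_)]))  # learned action at START
```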