KEMBAR78
Unit 2 | PDF | Cluster Analysis | Statistical Classification
0% found this document useful (0 votes)
64 views57 pages

Unit 2

ddd

Uploaded by

smilemadan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
64 views57 pages

Unit 2

ddd

Uploaded by

smilemadan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 57

VIVEKANANDHA

COLLEGE OF TECHNOLOGY FOR WOMEN


Elayampalayam, Tiruchengode – 637205.

DEPARTMENT OF INFORMATION TECHNOLOGY

SUBJECT NAME: BIG DATA ANALYTICS


SUBJECT CODE: CS8091
YEAR/SEM: III / VI

YEAR/SEM: IV / VIII
Unit - 2

Regulation: 2017

Staff in charge HOD/IT Dean

1
Part - A
UNIT II CLUSTERING AND CLASSIFICATION

Advanced Analytical Theory and Methods: Overview of Clustering - K-means - Use Cases -
Overview of the Method - Determining the Number of Clusters - Diagnostics - Reasons to
Choose and Cautions .- Classification: Decision Trees - Overview of a Decision Tree - The
General Algorithm - Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in
R - Naïve Bayes - Bayes‘ Theorem - Naïve Bayes Classifier.

1. What is Clustering?

 Clustering is a Machine Learning technique that involves the grouping of data


points.

 Given a set of data points, we can use a clustering algorithm to classify each
data point into a specific group.

 Data points that are in the same group should have similar properties and/or
features, while data points in different groups should have highly dissimilar
properties and/or features.

 Clustering is a method of unsupervised learning and is a common technique


for statistical data analysis used in many fields.

There are many different clustering models:

 Connectivity models based on connectivity distance.

 Centroid models based on central individuals and distance.

 Density models based on connected and dense regions in space.

 Graph-based models based on cliques and their relaxations.

2. What are the fields in which clustering techniques are used?

 Clustering is used in biology to develop new plants and animal taxonomies.

 Clustering is used in business to enable marketers to develop new distinct


groups of their customers and characterize the customer group on basis of
purchasing.

2
 Clustering is used in the identification of groups of automobiles Insurance
policy customer.

 Clustering is used in the identification of groups of house in a city on the basis


of house type, their cost and geographical location.

 Clustering is used to classify the document on the web for information


discovery.

3. What are the requirements of cluster analysis?

The basic requirements of cluster analysis are

 Dealing with different types of attributes.


 Dealing with noisy data.
 Constraints on clustering.
 Dealing with arbitrary shapes.
 High dimensionality
 Ordering of input data
 Interpretability and usability
 Determining input parameter and
 Scalability

4. What is classification? Define the concept of classification.

 Classification is a data-mining technique that assigns categories to a collection


of data to aid in more accurate predictions and analysis.

 Classification is one of several methods intended to make the analysis of very


large datasets effective.

Two step process

 A model is built describing a predefined set of data classes or concepts. The


model is constructed by analyzing database tuples described by attributes.

 The model is used for classification.

3
5. Define k-means clustering algorithm.

 k-means is one of the simplest unsupervised learning algorithms that solve the
well known clustering problem.

 The procedure follows a simple and easy way to classify a given data set through
a certain number of clusters (assume k clusters) fixed apriori.

 The main idea is to define k centers, one for each cluster.

 These centers should be placed in a cunning way because


of different location causes different result.

 So, the better choice is to place them as much as possible far away from each
other.

 The next step is to take each point belonging to a given data set and associate it to
the nearest center.

 When no point is pending, the first step is completed and an early group
age is done.

 At this point we need to re-calculate k new centroids as bary center of the clusters
resulting from the previous step.

 After we have these k new centroids, a new binding has to be done between the
same data set points and the nearest new center.

 A loop has been generated. As a result of this loop we may notice that the k
centers change their location step by step until no more changes are done or
in other words centers do not move any more.

 Finally, this algorithm aims at minimizing an objective function know as


squared error function given by:

4
where,

 ‘||xi - vj||’ is the Euclidean distance between xi and vj.

 ‘ci’ is the number of data points in ith cluster.

 ‘c’ is the number of cluster centers.

6. What is Decision tree?

 A Decision Tree is an algorithm used for supervised learning problems such


as classification or regression. A decision tree or a classification tree is a tree
in which each internal (non leaf) node is labeled with an input feature.

 A decision tree is a flow chart like tree structures, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test,
and leaf nodes represent classes or class distributions.

 The top most in a tree is the root node.

5
7. Describe Tree pruning methods.

 When a decision tree is built, many of the branches will reflect anomalies in
the training data due to noise or outlier.

 Tree pruning methods address this problem of over fitting the data.

Two Approaches:

 Pre pruning

 Post pruning

8. List the advantages and disadvantages of Decision Trees.

 The major advantage of using decision trees is that they are intuitively very
easy to explain.

 They closely mirror human decision-making compared to other regression and


classification approaches.

 They can be displayed graphically, and they can easily handle qualitative
predictors without the need to create dummy variables.

 However, decision trees generally do not have the same level of predictive
accuracy as other approaches, since they aren't quite robust.

 A small change in the data can cause a large change in the final estimated tree.

 By aggregating many decision trees, using methods like bagging, random


forests, and boosting, the predictive performance of decision trees can be
substantially improved.

9. Define Random Forests method.

 Random Forests is a versatile machine learning method capable of


performing both regression and classification tasks.

 It also undertakes dimensional reduction methods, treats missing values,


outlier values and other essential steps of data exploration, and does a fairly
good job.

 Random Forests provides an improvement over bagged trees by a small tweak


that de correlates the trees.

6
 As in bagging, you build a number of decision trees on bootstrapped training
samples.

 But when building these decision trees, each time a split in a tree is
considered, a random sample of m predictors is chosen as split candidates
from the full set of pp predictors.

 The split is allowed to use only one of those mm predictors.

 This is the main difference between random forests and bagging; because as in
bagging, the choice of predictor m=pm=p.

10. What is Naïve Bayes?

 It is a classification technique based on Bayes' Theorem with an assumption


of independence among predictors.

 In simple terms, a Naive Bayes classifier assumes that the presence of a


particular feature in a class is unrelated to the presence of any other feature.

11. What are the types of Naive Bayes Classifier?

Multinomial Naive Bayes:

 This is mostly used for document classification problem, i.e whether a


document belongs to the category of sports, politics, technology etc. The
features/predictors used by the classifier are the frequency of the words
present in the document.

7
Bernoulli Naive Bayes:

 This is similar to the multinomial naive bayes but the predictors are boolean
variables. The parameters that we use to predict the class variable take up only
values yes or no, for example if a word occurs in the text or not.

Gaussian Naive Bayes:

 When the predictors take up a continuous value and are not discrete, we
assume that these values are sampled from a gaussian distribution.

12. List the Applications of Naive Bayes Algorithms

 Real time Prediction: Naive Bayes is an eager learning classifier and it is


sure fast. Thus, it could be used for making predictions in real time.

 Multi class Prediction: This algorithm is also well known for multi class
prediction feature. Here we can predict the probability of multiple classes of
target variable.

 Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes


classifiers mostly used in text classification (due to better result in multi class
problems and independence rule) have higher success rate as compared to
other algorithms. As a result, it is widely used in Spam filtering (identify spam
e-mail) and Sentiment Analysis (in social media analysis, to identify positive
and negative customer sentiments)

 Recommendation System: Naive Bayes Classifier and Collaborative


Filtering together builds a Recommendation System that uses machine
learning and data mining techniques to filter unseen information and predict
whether a user would like a given resource or not

8
13. Differentiate classification and clustering.

14. Define CLARA and CLARANS?

 Clustering in LARge Applications is called as CLARA.

 The efficiency of CLARA depends upon the size of the representative data
set.

 CLARA does not work properly if any representative dataset from the
selected representative datasets does not find best k-medoids.

9
 To recover this drawback a new algorithm, Clustering Large Applications
based upon RANdomized search (CLARANS) is introduced. The CLARANS
works like

 CLARA, the only difference between CLARA and CLARANS is the


clustering process that is done after selecting the representative data sets.

15. How to determine the optimal number of clusters?

 Compute clustering algorithm (e.g., k-means clustering) for different values of


k.

 For each k, calculate the total within-cluster sum of square (wss).

 Plot the curve of wss according to the number of clusters k.

10
Part – B
1. What is clustering, briefly explain various types of clustering, algorithms and
applications.

Table of Contents

1. Overview
2. Types of Clustering
3. Types of Clustering Algorithms
4. K Means Clustering
5. Hierarchical Clustering
6. Difference between K Means and Hierarchical clustering
7. Applications of Clustering
8. Improving Supervised Learning algorithms with clustering

1. Overview

 Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the same
group than those in other groups.

 In simple words, the aim is to segregate groups with similar traits and assign them into
clusters.

 Let’s understand this with an example. Suppose, you are the head of a rental store and
wish to understand preferences of your costumers to scale up your business.

 Is it possible for you to look at details of each costumer and devise a unique business
strategy for each one of them? Definitely not.

 But, what you can do is to cluster all of your costumers into say 10 groups based on their
purchasing habits and use a separate strategy for costumers in each of these 10 groups.
And this is what we call clustering.

11
 Now, that we understand what is clustering. Let’s take a look at the types of clustering.

2. Types of Clustering

Broadly speaking, clustering can be divided into two subgroups:

 Hard Clustering: In hard clustering, each data point either belongs to a cluster
completely or not. For example, in the above example each customer is put into one
group out of the 10 groups.
 Soft Clustering: In soft clustering, instead of putting each data point into a separate
cluster, a probability or likelihood of that data point to be in those clusters is assigned.
 For example, from the above scenario each costumer is assigned a probability to be in
either of 10 clusters of the retail store.

3. Types of clustering algorithms

 Since the task of clustering is subjective, the means that can be used for achieving this
goal are plenty.

 Every methodology follows a different set of rules for defining the ‘similarity’ among
data points. In fact, there are more than 100 clustering algorithms known.

But few of the algorithms are used popularly; let’s look at them in detail:

Connectivity models

 As the name suggests, these models are based on the notion that the data
points closer in data space exhibit more similarity to each other than the data
points lying farther away.
 These models can follow two approaches. In the first approach, they start with
classifying all data points into separate clusters & then aggregating them as
the distance decreases.
 In the second approach, all data points are classified as a single cluster and
then partitioned as the distance increases.

12
 Also, the choice of distance function is subjective. These models are very easy
to interpret but lack scalability for handling big datasets.
 Examples of these models are hierarchical clustering algorithm and its
variants.

Centroid models

 These are iterative clustering algorithms in which the notion of similarity is


derived by the closeness of a data point to the centroid of the clusters.
 K-Means clustering algorithm is a popular algorithm that falls into this category.
 In these models, the no. of clusters required at the end has to be mentioned
beforehand, which makes it important to have prior knowledge of the dataset.
 These models run iteratively to find the local optima.

Distribution models

 These clustering models are based on the notion of how probable is it that all data
points in the cluster belong to the same distribution (For example: Normal,
Gaussian).
 These models often suffer from over fitting. A popular example of these models is
Expectation-maximization algorithm which uses multivariate normal
distributions.

Density Models

 These models search the data space for areas of varied density of data points in
the data space.
 It isolates various different density regions and assign the data points within these
regions in the same cluster. Popular examples of density models are DBSCAN
and OPTICS.

Now I will be taking you through two of the most popular clustering algorithms in
detail – K Means clustering and Hierarchical is clustering. Let’s begin.

13
4. K Means Clustering

 K means is an iterative clustering algorithm that aims to find local maxima in each
iteration.

 This algorithm works in these 5 steps:

1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-
D space.

2. Randomly assign each data point to a cluster: Let’s assign three points in cluster 1 shown
using red color and two points in cluster 2 shown using grey colors.

14
3. Compute cluster centroids: The centroid of data points in the red cluster is shown using
Red Cross and those in grey cluster using grey cross.

4. Re-assign each point to the closest cluster centroid: Note that only the data point at the
bottom is assigned to the red cluster even though it’s closer to the centroid of grey
cluster. Thus, we assign that data point into grey cluster

15
5. Re-compute cluster centroids: Now, re-computing the centroids for both the clusters.

6. Repeat steps 4 and 5 until no improvements are possible:

 Similarly, we’ll repeat the 4th and 5th steps until we’ll reach global optima.
 When there will be no further switching of data points between two clusters for
two successive repeats.
 It will mark the termination of the algorithm if not explicitly mentioned.

Here is a live coding window where you can try out K Means Algorithm using scikit-learn
library.

5. Hierarchical Clustering

 Hierarchical clustering, as the name suggests is an algorithm that builds hierarchy


of clusters. This algorithm starts with all the data points assigned to a cluster of
their own.

 Then two nearest clusters are merged into the same cluster. In the end, this
algorithm terminates when there is only a single cluster left.

 The results of hierarchical clustering can be shown using dendrogram.

16
The dendrogram can be interpreted as:

 At the bottom, we start with 25 data points, each assigned to separate clusters.
Two closest clusters are then merged till we have just one cluster at the top.

 The height in the dendrogram at which two clusters are merged represents the
distance between two clusters in the data space.

 The decision of the no. of clusters that can best depict different groups can be
chosen by observing the dendrogram.

 The best choice of the no. of clusters is the no. of vertical lines in the dendrogram
cut by a horizontal line that can transverse the maximum distance vertically
without intersecting a cluster.

 In the above example, the best choice of no. of clusters will be 4 as the red
horizontal line in the dendrogram below covers maximum vertical distance AB.

17
Two important things that you should know about hierarchical clustering are:

1. This algorithm has been implemented above using bottom up approach. It


is also possible to follow top-down approach starting with all data points
assigned in the same cluster and recursively performing splits till each
data point is assigned a separate cluster.
2. The decision of merging two clusters is taken on the basis of closeness of
these clusters. There are multiple metrics for deciding the closeness of two
clusters :
 Euclidean distance: ||a-b||2 = √(Σ(ai-bi))
 Squared Euclidean distance: ||a-b||22 = Σ((ai-bi)2)
 Manhattan distance: ||a-b||1 = Σ|ai-bi|
 Maximum distance:||a-b||INFINITY = maxi|ai-bi|
 Mahalanobis distance: √((a-b)T S-1 (-b)) {where, s : covariance matrix}

6. Difference between K Means and Hierarchical clustering

 Hierarchical clustering can’t handle big data well but K Means clustering can. This is
because the time complexity of K Means is linear i.e. O(n) while that of hierarchical
clustering is quadratic i.e. O(n2).

18
 In K Means clustering, since we start with random choice of clusters, the results produced
by running the algorithm multiple times might differ. While results are reproducible in
Hierarchical clustering.
 K Means is found to work well when the shape of the clusters is hyper spherical (like
circle in 2D, sphere in 3D).
 K Means clustering requires prior knowledge of K i.e. no. of clusters you want to divide
your data into. But, you can stop at whatever number of clusters you find appropriate in
hierarchical clustering by interpreting the dendrogram

7. Applications of Clustering

Clustering has a large no. of applications spread across various domains. Some of the most
popular applications of clustering are:

 Recommendation engines
 Market segmentation
 Social network analysis
 Search result grouping
 Medical imaging
 Image segmentation
 Anomaly detection

8. Improving Supervised Learning Algorithms with Clustering

 Clustering is an unsupervised machine learning approach, but can it be used to


improve the accuracy of supervised machine learning algorithms as well by
clustering the data points into similar groups and using these cluster labels as
independent variables in the supervised machine learning algorithm? Let’s find
out.

 Let’s check out the impact of clustering on the accuracy of our model for the
classification problem using 3000 observations with 100 predictors of stock data
to predicting whether the stock will go up or down using R.

19
2. Briefly explain K-means Clustering algorithm with example.

K-means clustering is one of the simplest algorithms which use unsupervised learning
method to solve known clustering issues. K-means clustering require following two inputs.

1. k = number of clusters
2. Training set(m) = {x1, x2, x3,……….., xm}

Let’s say you have an unlabeled data set like the one shown below and you want to group this
data into clusters.

Now, the important question is how should you choose the optimum number of clusters? There
are two possible ways for choosing the number of clusters.

(i) Elbow Method

 Here, you draw a curve between WSS (within sum of squares) and the number of
clusters.
 It is called elbow method because the curve looks like a human arm and the elbow point
gives us the optimum number of clusters.

20
 As you can see that after the elbow point, there is a very slow change in the value of
WSS, so you should take the elbow point value as the final number of clusters.

(ii) Purpose Based

 You can run k-means clustering algorithm to get different clusters based on a variety of
purposes.
 You can partition the data on different metrics and see how well it performs for that
particular case. Let’s take an example of marketing T-shirts of different sizes.
 You can partition the dataset into different number of clusters depending upon the
purpose that you want to meet.
 In the following example, I have taken two different criteria, price and comfort.

Let’s see these two possibilities as shown in the image below.

21
1. K=3: If you want to provide only 3 sizes(S, M, L) so that prices are cheaper, you will
divide the data set into 3 clusters.
2. K=5: Now, if you want to provide more comfort and variety to your customers with more
sizes (XS, S, M, L, XL), then you will divide the data set into 5 clusters.

Now, once we have the value of k with us, let’s understand its execution.
Initialization

 Firstly, you need to randomly initialize two points called the cluster centroids.
 Here, you need to make sure that your cluster centroids depicted by an orange and
blue cross as shown in the image are less than the training data points depicted by
navy blue dots.
 K-means clustering algorithm is an iterative algorithm and it follows next two
steps iteratively.
 Once you are done with the initialization, let’s move on to the next step.

22
Cluster Assignment

 In this step, it will go through all the navy blue data points to compute the distance
between the data points and the cluster centroid initialized in the previous step.
 Now, depending upon the minimum distance from the orange cluster centroid or blue
cluster centroid, it will group itself into that particular group.
 So, data points are divided into two groups, one represented by orange color and the other
one in blue color as shown in the graph.
 Since these cluster formations are not the optimized clusters, so let’s move ahead and see
how to get final clusters.

MoveCentroid

 Now, you will take the above two cluster centroids and iteratively reposition them
for optimization.
 You will take all blue dots, compute their average and move current cluster
centroid to this new location.
 Similarly, you will move orange cluster centroid to the average of orange data
points.
 Therefore, the new cluster centroids will look as shown in the graph.

23
 Moving forward, let’s see how we can optimize clusters which will give us better
insight.

Optimization

 You need to repeat above two steps iteratively till the cluster centroids stop
changing their positions and become static.
 Once the clusters become static, then k-means clustering algorithm is said to be
converged.

Convergence

 Finally, k-means clustering algorithm converges and divides the data points into
two clusters clearly visible in orange and blue.
 It can happen that k-means may end up converging with different solutions
depending on how the clusters were initialized.
 As you can see in the graph below, the three clusters are clearly visible but you
might end up having different clusters depending upon your choice of cluster
centroids.

24
 Below shown are some other possibilities of cluster partitioning based on the different
choice of cluster centroids.
 You may end up having any of these groupings based on your requirements and the goal
that you are trying to achieve.

 Now that you have understood the concepts of clustering, so let’s do some hands-on in R.

K-means clustering case study: Movie clustering

 Let’s say, you have a movie dataset with 28 different attributes ranging from director face
book likes, movie likes, actor likes, budget to gross and you want to find out movies with
maximum popularity amongst the viewers.
 You can achieve this by k-means clustering and divide the entire data into different
clusters and do further analysis based on the popularity.
 For this, I have taken the movie dataset of 5000 values with 28 attributes. You can find
the dataset here.

Step 1. First, I have loaded the dataset in RStudio.

movie_metadata <- read_csv(“~/movie_metadata.csv”)

View(movie_metadata)

25
Step 2. As you can see that there are many NA values in this data, so I will clean the dataset and
remove all the null values from it.

movie <- data.matrix(movie_metadata)

movie <- na.omit(movie)

Step 3. In this example, I have taken first 500 values from the data set for analysis.

smple <- movie[sample(nrow(movie),500),]

Step 4. Further, with the R code below, you can take two attributes budget and gross from the
dataset to make clusters.

smple_short <- smple[c(9,23)]

26
3. What is a Decision Tree? How to Create a Perfect Decision Tree?

 A decision tree is a map of the possible outcomes of a series of related choices.


 It allows an individual or organization to weigh possible actions against one another
based on their costs, probabilities, and benefits.
 As the name goes, it uses a tree-like model of decisions. They can be used either to drive
informal discussion or to map out an algorithm that predicts the best choice
mathematically.
 A decision tree typically starts with a single node, which branches into possible
outcomes. Each of those outcomes leads to additional nodes, which branch off into other
possibilities.
 This gives it a tree-like shape.

There are three different types of nodes:

 Chance nodes,
 Decision nodes

27
 End nodes.
 A chance node, represented by a circle, shows the probabilities of certain results.
 A decision node, represented by a square, shows a decision to be made, and an end node
shows the final outcome of a decision path.

Advantages & Disadvantages of Decision Trees

Advantages

 Decision trees generate understandable rules.


 Decision trees perform classification without requiring much computation.
 Decision trees are capable of handling both continuous and categorical variables.
 Decision trees provide a clear indication of which fields are most important for prediction
or classification.

Disadvantages

 Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
 Decision trees can be computationally expensive to train.
 The process of growing a decision tree is computationally expensive. At each node, each
candidate splitting field must be sorted before its best split can be found.
 In some algorithms, combinations of fields are used and a search must be made for
optimal combining weights.
 Pruning algorithms can also be expensive since many candidate sub-trees must be formed
and compared.

Creating a Decision Tree

 Let us consider a scenario where a new planet is discovered by a group of astronomers.

28
 Now the question is whether it could be ‘the next earth?’ The answer to this question will
revolutionize the way people live. Well, literally!
 There is n number of deciding factors which need to be thoroughly researched to take an
intelligent decision.
 These factors can be whether water is present on the planet, what is the temperature,
whether the surface is prone to continuous storms, flora and fauna survives the climate or
not, etc.
 Let us create a decision tree to find out whether we have discovered a new habitat.
 The habitable temperature falls into the range 0 to 100 Celsius.

Whether water is present or not?

29
Whether flora and fauna flourishes?

The planet has a stormy surface?

30
Classification Rules:

 Classification rules are the cases in which all the scenarios are taken into
consideration and a class variable is assigned to each.

Class Variable:
 Each leaf node is assigned a class-variable. A class-variable is the final output
which leads to our decision.

Let us derive the classification rules from the Decision Tree created:

1. If Temperature is not between 273 to 373K, -> Survival Difficult

2. If Temperature is between 273 to 373K, and water is not present, -> Survival Difficult

3. If Temperature is between 273 to 373K, water is present, and flora and fauna is not present ->
Survival Difficult

4. If Temperature is between 273 to 373K, water is present, flora and fauna is present, and a
stormy surface is not present -> Survival Probable

5. If Temperature is between 273 to 373K, water is present, flora and fauna is present, and a
stormy surface is present -> Survival Difficult

Decision Tree

A decision tree has the following constituents:

 Root Node: The factor of ‘temperature’ is considered as the root in this case.
 Internal Node: The nodes with one incoming edge and 2 or more outgoing edges.
 Leaf Node: This is the terminal node with no out-going edge.

31
 As the decision tree is now constructed, starting from the root-node we check the test
condition and assign the control to one of the outgoing edges, and so the condition is
again tested and a node is assigned.
 The decision tree is said to be complete when all the test conditions lead to a leaf node.
The leaf node contains the class-labels, which vote in favor or against the decision.
 Let's imagine you are playing a game of Twenty Questions.
 Your opponent has secretly chosen a subject, and you must figure out what he/she chose.
 At each turn, you may ask a yes-or-no question, and your opponent must answer
truthfully.
 How do you find out the secret in the fewest number of questions?

 It should be obvious some questions are better than others.


 For example, asking "Can it fly?" as your first question is likely to be unfruitful, whereas
asking "Is it alive?" is a bit more useful. Intuitively, you want each question to
significantly narrow down the space of possibly secrets, eventually leading to your
answer.
 That is the basic idea behind decision trees.
 At each point, you consider a set of questions that can partition your data set.
 You choose the question that provides the best split and again find the best questions for
the partitions. You stop once all the points you are considering are of the same class.
 Then the task of classication is easy. You can simply grab a point, and chuck it down the
tree. The questions will guide it to its appropriate class.
 Since this tutorial is in R, I highly recommend you take a look at our Introduction to
R or Intermediate R course, depending on your level of advancement.
 In this tutorial, you will learn about the different types of decision trees, the advantages
and disadvantages, and how to implement this yourself in R.

32
Let's identify important terminologies on Decision Tree, looking at the image above:

 Root Node represents the entire population or sample. It further gets divided into two or
more homogeneous sets.
 Splitting is a process of dividing a node into two or more sub-nodes.
 When a sub-node splits into further sub-nodes, it is called a Decision Node.
 Nodes that do not split is called a Terminal Node or a Leaf.
 When you remove sub-nodes of a decision node, this process is called Pruning. The
opposite of pruning is Splitting.
 A sub-section of an entire tree is called Branch.
 A node, which is divided into sub-nodes is called a parent node of the sub-nodes;
whereas the sub-nodes are called the child of the parent node.

33
4. Briefly explain the various types of decision trees.

Regression Trees

 Let's take a look at the image below, which helps visualize the nature of
partitioning carried out by a Regression Tree.
 This shows an unpruned tree and a regression tree fit to a random dataset. Both
the visualizations show a series of splitting rules, starting at the top of the tree.
 Notice that every split of the domain is aligned with one of the feature axes. The
concept of axis parallel splitting generalizes straightforwardly to dimensions
greater than two.
 For a feature space of size pp, a subset of RpRp, the space is divided
into MM regions, RmRm, each of which is a pp-dimensional "hyperblock".

34
 In order to build a regression tree, you first use recursive binary splititng to grow
a large tree on the training data, stopping only when each terminal node has fewer
than some minimum number of observations.
 Recursive Binary Splitting is a greedy and top-down algorithm used to minimize
the Residual Sum of Squares (RSS), an error measure also used in linear
regression settings.

The RSS, in the case of a partitioned feature space with M partitions is given by:

 Beginning at the top of the tree, you split it into 2 branches, creating a partition of 2
spaces.
 You then carry out this particular split at the top of the tree multiple times and choose the
split of the features that minimizes the (current) RSS.
 Next, you apply cost complexity pruning to the large tree in order to obtain a sequence of
best subtrees, as a function of αα.
 The basic idea here is to introduce an additional tuning parameter, denoted by αα that
balances the depth of the tree and its goodness of fit to the training data.
 You can use K-fold cross-validation to choose αα. This technique simply involves
dividing the training observations into K folds to estimate the test error rate of the
subtrees.
 Your goal is to select the one that leads to the lowest error rate.

Classification Trees

 A classification tree is very similar to a regression tree, except that it is used to predict a
qualitative response rather than a quantitative one.
 Recall that for a regression tree, the predicted response for an observation is given by the
mean response of the training observations that belong to the same terminal node.
 In contrast, for a classification tree, you predict that each observation belongs to the most
commonly occurring class of training observations in the region to which it belongs.

35
 In interpreting the results of a classification tree, you are often interested not only in the
class prediction corresponding to a particular terminal node region, but also in the class
proportions among the training observations that fall into that region.

 The task of growing a classification tree is quite similar to the task of growing a
regression tree.
 Just as in the regression setting, you use recursive binary splitting to grow a classification
tree.
 However, in the classification setting, Residual Sum of Squares cannot be used as a
criterion for making the binary splits.

Instead, you can use any of these 3 methods below:

 Classification Error Rate: Rather than seeing how far a numerical response is away from
the mean value, as in the regression setting, you can instead define the "hit rate" as the
fraction of training observations in a particular region that don't belong to the most
widely occuring class.
 The error is given by this equation:

E = 1 - argmaxc(π^mcπ^mc)

in which π^mcπ^mc represents the fraction of training data in region Rm that belong
to class c.

36
Gini Index:

The Gini Index is an alternative error metric that is designed to show how "pure" a region
is. "Purity" in this case means how much of the training data in a particular region
belongs to a single class.

 If a region Rm contains data that is mostly from a single class c then the Gini Index value
will be small:

Cross-Entropy:

 A third alternative, which is similar to the Gini Index, is known as the Cross-Entropy or
Deviance:

 The cross-entropy will take on a value near zero if the π^mcπ^mc’s are all near 0 or
near 1.
 Therefore, like the Gini index, the cross-entropy will take on a small value if the mth
node is pure.
 In fact, it turns out that the Gini index and the cross-entropy are quite similar
numerically.
 When building a classification tree, either the Gini index or the cross-entropy are
typically used to evaluate the quality of a particular split, since they are more sensitive to
node purity than is the classification error rate.
 Any of these 3 approaches might be used when pruning the tree, but the classification
error rate is preferable if prediction accuracy of the final pruned tree is the goal.

Advantages and Disadvantages of Decision Trees

 The major advantage of using decision trees is that they are intuitively very
easy to explain.

37
 They closely mirror human decision-making compared to other regression and
classification approaches.
 They can be displayed graphically, and they can easily handle qualitative
predictors without the need to create dummy variables.
 However, decision trees generally do not have the same level of predictive
accuracy as other approaches, since they aren't quite robust.
 A small change in the data can cause a large change in the final estimated tree.
 By aggregating many decision trees, using methods like bagging, random
forests, and boosting, the predictive performance of decision trees can be
substantially improved.

Tree-Based Methods

Bagging

 The decision trees discussed above suffer from high variance, meaning if you split the
training data into 2 parts at random, and fit a decision tree to both halves, the results that
you get could be quite different.
 In contrast, a procedure with low variance will yield similar results if applied repeatedly
to distinct dataset.
 Bagging, or bootstrap aggregation, is a technique used to reduce the variance of your
predictions by combining the result of multiple classifiers modeled on different sub-
samples of the same dataset. Here is the equation for bagging:

in which you generate BB different bootstrapped training datasets.


You then train your method on the bthbth bootstrapped training set in order to get f^b(x)f^b(x),
and finally average the predictions.

38
The visual below shows the 3 different steps in bagging:

Step 1: Here you replace the original data with new data. The new data usually have a
fraction of the original data's columns and rows, which then can be used as hyper-
parameters in the bagging model.
Step 2: You build classifiers on each dataset. Generally, you can use the same classifier
for making models and predictions.
Step 3: Lastly, you use an average value to combine the predictions of all the classifiers,
depending on the problem. Generally, these combined values are more robust than a
single model.
 While bagging can improve predictions for many regression and classification methods, it
is particularly useful for decision trees.
 To apply bagging to regression/classification trees, you simply
construct BB regression/classification trees using BB bootstrapped training sets, and
average the resulting predictions.
 These trees are grown deep, and are not pruned. Hence each individual tree has high
variance, but low bias. Averaging these BB trees reduces the variance.

39
 Broadly speaking, bagging has been demonstrated to give impressive improvements in
accuracy by combining together hundreds or even thousands of trees into a single
procedure.

Random Forests

 Random Forests is a versatile machine learning method capable of performing both


regression and classification tasks.
 It also undertakes dimensional reduction methods, treats missing values, outlier values
and other essential steps of data exploration, and does a fairly good job.
 Random Forests provides an improvement over bagged trees by a small tweak that decor
relates the trees.
 As in bagging, you build a number of decision trees on bootstrapped training samples.
But when building these decision trees, each time a split in a tree is considered, a random
sample of m predictors is chosen as split candidates from the full set of pp predictors.
 The split is allowed to use only one of those mm predictors. This is the main difference
between random forests and bagging; because as in bagging, the choice of
predictor m=pm=p.

In order to grow a random forest, you should:


 First assume that the number of cases in the training set is K. Then, take a random sample
of these K cases, and then use this sample as the training set for growing the tree.

40
 If there are pp input variables, specify a number m<pm<p such that at each node, you can
select mm random variables out of the pp. The best split on these mm is used to split the
node.
 Each tree is subsequently grown to the largest extent possible and no pruning is needed.
 Finally, aggregate the predictions of the target trees to predict new data.

 Random Forests is very effective at estimating missing data and maintaining accuracy
when a large proportion of the data is missing.
 It can also balance errors in datasets where the classes are imbalanced. Most
importantly, it can handle massive datasets with large dimensionality.
 However, one disadvantage of using Random Forests is that you might easily over fit
noisy datasets, especially in the case of doing regression.

Boosting

 Boosting is another approach to improve the predictions resulting from a decision tree.
 Like bagging and random forests, it is a general approach that can be applied to many
statistical learning methods for regression or classification.
 Recall that bagging involves creating multiple copies of the original training dataset using
the bootstrap, fitting a separate decision tree to each copy, and then combining all of the
trees in order to create a single predictive model.
 Notably, each tree is built on a bootstrapped dataset, independent of the other trees.
 Boosting works in a similar way, except that the trees are grown sequentially: each tree is
grown using information from previously grown trees.
 Boosting does not involve bootstrap sampling; instead, each tree is fitted on a modified
version of the original dataset.

41
For regression and classification trees, boosting works like this:
 Unlike fitting a single large decision tree to the data, which amounts to fitting the data
hard and potentially over fitting, the boosting approach instead learns slowly.
 Given the current model, you fit a decision tree to the residuals from the model. That is,
you fit a tree using the current residuals, rather than the outcome YY, as the response.
 You then add this new decision tree into the fitted function in order to update the
residuals. Each of these trees can be rather small, with just a few terminal nodes,
determined by the parameter dd in the algorithm. By fitting small trees to the residuals,
you slowly improve f^f^ in areas where it does not perform well.
 The shrinkage parameter νν slows the process down even further, allowing more and
different shaped trees to attack the residuals.
 Boosting is very useful when you have a lot of data and you expect the decision trees to
be very complex.
 Boosting has been used to solve many challenging classification and regression problems,
including risk analysis, sentiment analysis, predictive advertising, price modeling, sales
estimation and patient diagnosis, among others.

42
5.Illustrate the concepts of Decision Trees in R

Classification Trees

 For this part, you work with the Car seats dataset using the tree package in R.
 Mind that you need to install the ISLR and tree packages in your R Studio environment
first. Let's first load the Car seats data frame from the ISLR package.
library (ISLR)
data (package="ISLR")
carseats<-Carseats

Let's also load the tree package.

require(tree)

The Car seats dataset is a data frame with 400 observations on the following 11 variables:
 Sales: unit sales in thousands
 Comp Price: price charged by competitor at each location
 Income: community income level in 1000s of dollars
 Advertising: local ad budget at each location in 1000s of dollars
 Population: regional pop in thousands
 Price: price for car seats at each site
 Shelve Loc: Bad, Good or Medium indicates quality of shelving location
 Age: age level of the population
 Education: ed level at location
 Urban: Yes/No
 US: Yes/No

Names (carseats)

Let's take a look at the histogram of car sales:

hist(carseats$Sales)

43
Observe that Sales is a quantitative variable.
 You want to demonstrate it using trees with a binary response.
 To do so, you turn Sales into a binary variable, which will be called high.
 If the sales are less than 8, it will be not high. Otherwise, it will be high.
 Then you can put that new variable High back into the data frame.

High = ifelse(carseats$Sales<=8, "No", "Yes")


carseats = data.frame(carseats, High)

 Now let's fill a model using decision trees. Of course, you can't have the Sales variable
here because your response variable High was created from Sales. Thus, let's exclude it
and fit the tree.

tree.carseats = tree(High~.-Sales, data=carseats)

Let's see the summary of your classification tree:

Summary (tree.carseats)

 You can see the variables involved, the number of terminal nodes, the residual mean
deviance, as well as the misclassification error rate. To make it more visual, let's plot the
tree as well, then annotate it using the handy text function:

plot(tree.carseats)

text(tree.carseats, pretty = 0)

 There are so many variables, making it very complicated to look at the tree. At least, you
can see that at each of the terminal nodes, they're labeled Yes or No. At each splitting
node, the variables and the value of the splitting choice are shown (for example, Price <
92.5 or Advertising < 13.5).
 For a detailed summary of the tree, simply print it. It'll be handy if you want to extact
details from the tree for other purposes:

tree.carseats

44
 It's time to prune the tree down. Let's create a training set and a test by splitting
the carseats dataframe into 250 training and 150 test samples.
 First, you set a seed to make the results reproducible. Then you take a random sample of
the ID (index) numbers of the samples.
 Specifically here, you sample from the set 1 to n row number of rows of car seats, which
is 400. You want a sample of size 250 (by default, sample uses without replacement).

set.seed(101)

train=sample(1:nrow(carseats), 250)

 So now you get this index of train, which indexes 250 of the 400 observations.
 You can refit the model with tree, using the same formula except telling the tree to use a
subset equals train. Then let's make a plot:

tree.carseats = tree(High~.-Sales, carseats, subset=train)

plot(tree.carseats)

text(tree.carseats, pretty=0)

 The plot looks a bit different because of the slightly different dataset. Nevertheless, the
complexity of the tree looks roughly the same.
 Now you're going to take this tree and predict it on the test set, using the predict method
for trees. Here you'll want to actually predict the class labels.

tree.pred = predict(tree.carseats, carseats[-train,], type="class")

Then you can evalute the error by using a misclassification table.

with(carseats[-train,], table(tree.pred, High))

 On the diagonals are the correct classifications, while off the diagonals are the incorrect
ones. You only want to recored the correct ones.
 To do that, you can take the sum of the 2 diagonals divided by the total (150 test
observations).

45
(72 + 43) / 150

Ok, you get an error of 0.76 with this tree.

 When growing a big bushy tree, it could have too much variance.
 Thus, let's use cross-validation to prune the tree optimally. Using cv.tree, you'll use the
misclassification error as the basis for doing the pruning.

cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass)

cv.carseats

 Printing out the results shows the details of the path of the cross-validation.
 You can see the sizes of the trees as they were pruned back, the deviances as the pruning
proceeded, as well as the cost complexity parameter used in the process.

Let's plot this out:

plot(cv.carseats)

 Looking at the plot, you see a downward spiral part because of the misclassification error
on 250 cross-validated points. So let's pick a value in the downward steps (12).
 Then, let's prune the tree to a size of 12 to identify that tree. Finally, let's plot and
annotate that tree to see the outcome.

prune.carseats = prune.misclass(tree.carseats, best = 12)

plot(prune.carseats)

text(prune.carseats, pretty=0)

 It's a bit shallower than previous trees, and you can actually read the labels. Let's evaluate
it on the test dataset again.

46
tree.pred = predict(prune.carseats, carseats[-train,], type="class")

with(carseats[-train,], table(tree.pred, High))

(74 + 39) / 150

 Seems like the correct classifications dropped a little bit. It has done about the same as
your original tree, so pruning did not hurt much with respect to misclassification errors,
and gave a simpler tree.
 Often case, trees don't give very good prediction errors, so let's go ahead take a look at
random forests and boosting, which tend to outperform trees as far as prediction and
misclassification are concerned.

Random Forests

 For this part, you will use the Boston housing data to explore random forests and
boosting. The dataset is located in the MASS package.
 It gives housing values and other statistics in each of 506 suburbs of Boston based on a
1970 census.

library(MASS)

data(package="MASS")

boston<-Boston

dim(boston)

names(boston)

Let's also load the randomForest package.

require(randomForest)

47
 To prepare data for random forest, let's set the seed and create a sample training set of
300 observations.

set.seed(101)

train = sample(1:nrow(boston), 300)

 In this dataset, there are 506 surburbs of Boston. For each surburb, you have variables
such as crime per capita, types of industry, average # of rooms per dwelling, average
proportion of age of the houses etc.
 Let's use medv - the median value of owner-occupied homes for each of these surburbs,
as the response variable.
 Let's fit a random forest and see how well it performs. As being said, you use the
response medv, the median housing value (in $1K dollars), and the training sample set.

rf.boston = randomForest(medv~., data = boston, subset = train)

rf.boston

 Printing out the random forest gives its summary: the # of trees (500 were grown), the
mean squared residuals (MSR), and the percentage of variance explained.
 The MSR and % variance explained are based on the out-of-bag estimates, a very clever
device in random forests to get honest error estimates.
 The only tuning parameter in a random Forests is the argument called mtry, which is the
number of variables that are selected at each split of each tree when you make a split.
 As seen here, mtry is 4 of the 13 exploratory variables (excluding medv) in the Boston
Housing data - meaning that each time the tree comes to split a node, 4 variables would
be selected at random, then the split would be confined to 1 of those 4 variables. That's
how randomForests de-correlates the trees.
 You're going to fit a series of random forests. There are 13 variables, so let's
have mtry range from 1 to 13:

In order to record the errors, you set up 2 variables oob.err and test.err.

48
 In a loop of mtry from 1 to 13, you first fit the randomForest with that value of mtry on
the train dataset, restricting the number of trees to be 350.

Then you extract the mean-squared-error on the object (the out-of-bag error).

 Then you predict on the test dataset (boston[-train]) using fit (the fit of randomForest).

Lastly, you compute the test error: mean-squared error, which is equals to mean( (medv - pred) ^
2 ).

oob.err = double(13)

test.err = double(13)

for(mtry in 1:13){

fit = randomForest(medv~., data = boston, subset=train, mtry=mtry, ntree = 350)

oob.err[mtry] = fit$mse[350]

pred = predict(fit, boston[-train,])

test.err[mtry] = with(boston[-train,], mean( (medv-pred)^2 ))

 Basically you just grew 4550 trees (13 times 350). Now let's make a plot using
the matplot command.
 The test error and the out-of-bag error are binded together to make a 2-column matrix.
 There are a few other arguments in the matrix, including the plotting character values
(pch = 23 means filled diamond), colors (red and blue), type equals both (plotting both
points and connecting them with the lines), and name of y-axis (Mean Squared Error).
You can also put a legend at the top right corner of the plot.

49
matplot(1:mtry, cbind(test.err, oob.err), pch = 23, col = c("red", "blue"), type = "b", ylab="Mean
Squared Error")

legend("topright", legend = c("OOB", "Test"), pch = 23, col = c("red", "blue"))

 Ideally, these 2 curves should line up, but it seems like the test error is a bit lower.
 However, there's a lot of variability in these test error estimates.
 Since the out-of-bag error estimate was computed on one dataset and the test error
estimate was computed on another dataset, these differences are pretty much well within
the standard errors.
 Notice that the red curve is smoothly above the blue curve? These error estimates are
very correlated, because the random Forest with mtry = 4 is very similar to the one
with mtry = 5.
 That's why each of the curves is quite smooth. What you see is that mtry around 4 seems
to be the most optimal choice, at least for the test error. This value of mtry for the out-of-
bag error equals 9.
 So with very few tiers, you have fitted a very powerful prediction model using random
forests. How so? The left-hand side shows the performance of a single tree.
 The mean squared error on out-of-bag is 26, and you've dropped down to about 15 (just a
bit above half). This means you reduced the error by half.
 Likewise for the test error, you reduced the error from 20 to 12.

Boosting

 Compared to random forests, boosting grows smaller and stubbier trees and goes at the
bias. You will use the package GBM (Gradient Boosted Modeling), in R.

require(gbm)

 GBM asks for the distribution, which is Gaussian, because you'll be doing squared error
loss. You're going to ask GBM for 10,000 trees, which sounds like a lot, but these are
going to be shallow trees.

50
 Interaction depth is the number of splits, so you want 4 splits in each tree. Shrinkage is
0.01, which is how much you're going to shrink the tree step back.

boost.boston = gbm(medv~., data = boston[train,], distribution = "gaussian", n.trees = 10000,


shrinkage = 0.01, interaction.depth = 4)

summary(boost.boston)

 The summary function gives a variable importance plot. It seems like there are 2
variables that have high relative importance: rm (number of rooms) and lstat (percentage
of lower economic status people in the community). Let's plot these 2 variables:

plot(boost.boston,i="lstat")

plot(boost.boston,i="rm")

 The 1st plot shows that the higher the proportion of lower status people in the suburb, the
lower the value of the housing prices. The 2nd plot shows the reversed relationship with
the number of rooms: the average number of rooms in the house increases as the price
increases.
 It's time to predict a boosted model on the test dataset. Let's look at the test performance
as a function of the number of trees:
 First, you make a grid of number of trees in steps of 100 from 100 to 10,000.
 Then, you run the predict function on the boosted model. It takes n.trees as an argument,
and produces a matrix of predictions on the test data.
 The dimensions of the matrix are 206 test observations and 100 different predict vectors
at the 100 different values of tree.

n.trees = seq(from = 100, to = 10000, by = 100)

predmat = predict(boost.boston, newdata = boston[-train,], n.trees = n.trees)

dim(predmat)

51
 It's time to compute the test error for each of the predict vectors:
 predmat is a matrix, medv is a vector, thus (predmat - medv) is a matrix of differences.
You can use the apply function to the columns of these square differences (the mean).
That would compute the column-wise mean squared error for the predict vectors.
 Then you make a plot using similar parameters to that one used for Random Forest. It
would show a boosting error plot.

boost.err = with(boston[-train,], apply( (predmat - medv)^2, 2, mean) )

plot(n.trees, boost.err, pch = 23, ylab = "Mean Squared Error", xlab = "# Trees", main =
"Boosting Test Error")

abline(h = min(test.err), col = "red")

 The boosting error pretty much drops down as the number of trees increases.
 This is evidence showing that boosting is reluctant to over fit.
 Let's also include the best test error from the random Forest into the plot.
 Boosting actually gets a reasonable amount below the test error for random Forest.

52
6. What is a classifier in big data analytics? Explain in detail.

 A classifier is a machine learning model that is used to discriminate different objects based
on certain features.

Principle of Naive Bayes Classifier:

 A Naive Bayes classifier is a probabilistic machine learning model that’s used for
classification task. The crux of the classifier is based on the Bayes theorem.

Bayes Theorem:

 Using Bayes theorem, we can find the probability of A happening, given that B has
occurred. Here, B is the evidence and A is the hypothesis.
 The assumption made here is that the predictors/features are independent. That is
presence of one particular feature does not affect the other. Hence it is called naive.

Example:

 Let us take an example to get some better intuition. Consider the problem of playing golf.
The dataset is represented as below.

53
 We classify whether the day is suitable for playing golf, given the features of the day.
The columns represent these features and the rows represent individual entries.
 If we take the first row of the dataset, we can observe that is not suitable for playing golf
if the outlook is rainy, temperature is hot, humidity is high and it is not windy.
 We make two assumptions here, one as stated above we consider that these predictors are
independent.
 That is, if the temperature is hot, it does not necessarily mean that the humidity is high.
54
 Another assumption made here is that all the predictors have an equal effect on the
outcome. That is, the day being windy does not have more importance in deciding to play
golf or not.

According to this example, Bayes theorem can be rewritten as:

 The variable y is the class variable (play golf), which represents if it is suitable to play golf
or not given the conditions. Variable X represents the parameters/features.

X is given as,

 Here x_1,x_2….x_n represent the features, i.e they can be mapped to outlook,
temperature, humidity and windy. By substituting for X and expanding using the chain
rule we get,

 Now,
you can obtain the values for each by looking at the dataset and substitute them into the
equation.
 For all entries in the dataset, the denominator does not change, it remain static. Therefore,
the denominator can be removed and proportionality can be introduced.

55
 In our case, the class variable(y) has only two outcomes, yes or no. There could be cases
where the classification could be multivariate.
 Therefore, we need to find the class y with maximum probability.

Using the above function, we can obtain the class, given the predictors.

Types of Naive Bayes Classifier:

Multinomial Naive Bayes:

 This is mostly used for document classification problem, i.e whether a document belongs
to the category of sports, politics, technology etc.

 The features/predictors used by the classifier are the frequency of the words present in the
document.

Bernoulli Naive Bayes:

 This is similar to the multinomial naive bayes but the predictors are boolean variables.

 The parameters that we use to predict the class variable take up only values yes or no, for
example if a word occurs in the text or not.

Gaussian Naive Bayes:

56
 When the predictors take up a continuous value and are not discrete, we assume that these
values are sampled from a gaussian distribution.
Gaussian Distribution(Normal Distribution)

 Since the way the values are present in the dataset changes, the formula for conditional
probability changes to,

57

You might also like