Algorithms New

Evaluation

Uploaded by

Rehan Zahid

Algorithms

K-means clustering

An implementation of K-means clustering:

function: k_mean(x, k)

Takes:

x: input data points, given as a list of data points

eg: x = [[12,34,21], [23,34,23], ...... ]

k: number of clusters into which the data points are divided

eg: k=3

Returns:

cluster: the assigned cluster for every data point

eg: cluster = [1, 0 ,2, ......]

K-Means clustering is used to find intrinsic groups within an unlabelled dataset and to draw
inferences from them. It is a centroid-based clustering method.

Centroid - A centroid is the point at the centre of a cluster, computed as the mean of the
cluster's data points; it need not coincide with an actual data point. In centroid-based
clustering, clusters are represented by their centroids, and the notion of similarity is derived
from how close a data point is to the centroid of a cluster.

K-Means clustering works as follows:- The algorithm uses an iterative procedure to deliver a
final result. It requires the number of clusters K and the data set as input. The data set is a
collection of features for each data point. The algorithm starts with initial estimates for the K
centroids and then iterates between two steps:-

1. Data assignment step

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest
centroid, measured by the squared Euclidean distance. So, if ci is one of the centroids in the set
C, then each data point x is assigned to the cluster whose centroid ci gives the minimum squared
Euclidean distance between ci and x.

2. Centroid update step

In this step, the centroids are recomputed and updated. This is done by taking the mean of all
data points assigned to that centroid’s cluster.

The algorithm then iterates between step 1 and step 2 until a stopping criterion is met. Typical
stopping criteria are that no data points change cluster, that the sum of the distances is
minimized, or that some maximum number of iterations is reached. The algorithm is guaranteed to
converge to a result, but the result may be a local optimum, so running the algorithm more than
once with randomized starting centroids and comparing the runs may give a better outcome.
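The two steps above can be sketched in plain Python as follows. This is a minimal illustration of the k_mean(x, k) interface described earlier, not the original implementation: the max_iter and seed parameters, the squared_distance helper, the choice of k random data points as initial centroids, and the handling of empty clusters are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_mean(x, k, max_iter=100, seed=0):
    """Cluster the points in x into k groups; return one cluster index per point."""
    rng = random.Random(seed)
    # Initial estimates: k data points picked at random (an assumption).
    centroids = [list(p) for p in rng.sample(x, k)]
    cluster = [None] * len(x)

    for _ in range(max_iter):
        # Step 1, data assignment: each point goes to its nearest centroid.
        new_cluster = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in x
        ]
        if new_cluster == cluster:
            break  # stopping criterion: no point changed cluster
        cluster = new_cluster

        # Step 2, centroid update: mean of the points assigned to each cluster.
        for j in range(k):
            members = [p for p, c in zip(x, cluster) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]

    return cluster


# Two well-separated groups of made-up points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(k_mean(points, 2))
```

The cluster numbers themselves are arbitrary; only which points share a number is meaningful. Running with several seeds and keeping the best run is one way to reduce the risk of a poor local optimum.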

3. Choosing the value of K


The K-Means algorithm finds the clusters and data labels for a pre-defined value of K; it does not
determine K itself. To find the number of clusters in the data, we need to run the K-Means
clustering algorithm for different values of K and compare the results. So, the performance of the
K-Means algorithm depends upon the value of K, and we should choose the value of K that gives
the best performance. There are different techniques available to find the optimal value of K.
The most common technique is the elbow method, which is described below.

4. The elbow method


The elbow method is used to determine the optimal number of clusters in K-means clustering.
It plots the value of the cost function (the average distortion) produced by different values of K.

As K increases, average distortion decreases: each cluster has fewer constituent instances, and
the instances are closer to their respective centroids. However, the improvements in average
distortion decline as K increases. The value of K at which the improvement in distortion declines
the most is called the elbow, and this is where we should stop dividing the data into further
clusters.
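As a rough sketch of how the elbow can be located numerically rather than by eye, the snippet below picks the K after which the improvement in distortion drops the most. The function name find_elbow and the distortion values are invented for illustration; in practice the distortions would come from running K-Means once per value of K.

```python
def find_elbow(ks, distortions):
    """Return the K after which the improvement in distortion drops the most."""
    if len(ks) != len(distortions) or len(ks) < 3:
        raise ValueError("need at least three (K, distortion) pairs")
    # improvements[i] is the distortion saved by going from ks[i] to ks[i + 1].
    improvements = [distortions[i] - distortions[i + 1] for i in range(len(ks) - 1)]
    # drops[i] compares the gain of reaching ks[i + 1] with the gain of going past it.
    drops = [improvements[i] - improvements[i + 1] for i in range(len(improvements) - 1)]
    best = max(range(len(drops)), key=lambda i: drops[i])
    return ks[best + 1]


ks = [1, 2, 3, 4, 5, 6]
distortions = [100.0, 60.0, 25.0, 20.0, 17.0, 15.0]  # illustrative values only
print(find_elbow(ks, distortions))  # prints 3
```

Here the gain from K=2 to K=3 is 35 while the gain from K=3 to K=4 is only 5, so K=3 is reported as the elbow. Real distortion curves are often less clear-cut, and a plot is still worth inspecting.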
5. The problem statement

In this project, I implement K-Means clustering with Python and Scikit-Learn. As mentioned
earlier, K-Means clustering is used to find intrinsic groups within an unlabelled dataset and
draw inferences from them. I have used the Facebook Live Sellers in Thailand dataset for this
project. I implement K-Means clustering to find intrinsic groups within this dataset that display
the same status_type behaviour. The status_type variable consists of posts of a different nature
(video, photos, statuses and links).

Naïve Bayes
A custom implementation of a Naive Bayes Classifier written in Python 3.

Dataset

Loan Defaulters

Home Owner | Marital Status | Annual Income | Defaulted Borrower
Yes        | Single         | $125,000      | No
No         | Married        | $100,000      | No
No         | Single         | $70,000       | No
Yes        | Married        | $120,000      | No
No         | Divorced       | $95,000       | Yes
No         | Married        | $60,000       | No
Yes        | Divorced       | $220,000      | No
No         | Single         | $85,000       | Yes
No         | Married        | $75,000       | No
No         | Single         | $90,000       | Yes

Source: Introduction to Data Mining (1st Edition) by Pang-Ning Tan

Naive Bayes assumes that every feature contributes to the probability of a class independently of
the other features. For example, a fruit may be considered to be an apple if it is red, round, and
about 3 inches in diameter. Even if these features depend on each other or upon the existence of
the other features, the classifier treats all of these properties as contributing independently to
the probability that this fruit is an apple, and that is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
its simplicity, Naive Bayes can perform surprisingly well and on some problems is competitive with
far more sophisticated classification methods.

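Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c):

P(c|x) = P(x|c) P(c) / P(x)

Here P(c) is the prior probability of class c, P(x|c) is the likelihood of the features x given the
class, and P(x) is the prior probability of the features.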
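As a small worked example, Bayes' theorem, P(c|x) = P(x|c) P(c) / P(x), can be applied to the Loan Defaulters table above. The sketch below is not the custom classifier mentioned earlier: it uses only the two categorical columns, ignores Annual Income for simplicity, and applies no smoothing, so a feature value never seen in a class yields a zero likelihood. The names ROWS and posterior are made up for the example.

```python
from collections import Counter
from fractions import Fraction

# (home_owner, marital_status, defaulted) rows from the Loan Defaulters table.
ROWS = [
    ("Yes", "Single", "No"),
    ("No", "Married", "No"),
    ("No", "Single", "No"),
    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),
    ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),
    ("No", "Single", "Yes"),
    ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]


def posterior(home_owner, marital_status):
    """Return P(class | features) for each class under the naive assumption."""
    class_counts = Counter(row[2] for row in ROWS)
    scores = {}
    for c, n_c in class_counts.items():
        in_class = [row for row in ROWS if row[2] == c]
        prior = Fraction(n_c, len(ROWS))  # P(c)
        # Per-feature likelihoods, multiplied as if the features were independent.
        p_home = Fraction(sum(r[0] == home_owner for r in in_class), n_c)
        p_marital = Fraction(sum(r[1] == marital_status for r in in_class), n_c)
        scores[c] = prior * p_home * p_marital  # P(x|c) P(c)
    evidence = sum(scores.values())  # P(x), summed over the classes
    return {c: s / evidence for c, s in scores.items()}


result = posterior("No", "Single")
print(result["Yes"], result["No"])  # prints 7/11 4/11
```

For a single borrower who does not own a home, the score for "Yes" is 3/10 × 3/3 × 2/3 = 1/5 and the score for "No" is 7/10 × 4/7 × 2/7 = 4/35, which normalise to 7/11 and 4/11.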

Decision Tree

1. Introduction to Decision Tree algorithm


A Decision Tree algorithm is one of the most popular machine learning algorithms. It uses a
tree-like structure of decisions and their possible outcomes to solve a particular problem. It
belongs to the class of supervised learning algorithms and can be used for both classification and
regression purposes.

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.

2. Classification and Regression Trees (CART)


Nowadays, the Decision Tree algorithm is often known by its modern name CART, which stands
for Classification and Regression Trees. CART is a term introduced by Leo Breiman to refer to
Decision Tree algorithms that can be used for classification and regression modeling problems.
The CART algorithm provides a foundation for other important algorithms like bagged decision
trees, random forest and boosted decision trees.

In this project, I will solve a classification problem, so I will also refer to the algorithm as
Decision Tree Classification.

3. Decision Tree algorithm intuition


The Decision-Tree algorithm is one of the most frequently and widely used supervised machine
learning algorithms that can be used for both classification and regression tasks. The intuition
behind the Decision-Tree algorithm is very simple to understand.

The Decision Tree algorithm intuition is as follows:-

1. For each attribute in the dataset, the Decision-Tree algorithm forms a node. The most
important attribute is placed at the root node.
2. To evaluate a given instance, we start at the root node and work our way down the tree,
at each node following the branch that matches our condition or decision.
3. This process continues until a leaf node is reached. The leaf contains the prediction or the
outcome of the Decision Tree.
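The traversal in steps 2 and 3 can be sketched as follows. The tree, its attributes ("outlook", "windy") and its labels are invented purely for illustration, and the nested-dict representation is just one convenient assumption; building the tree from data is a separate problem covered by the attribute selection measures.

```python
# Internal nodes are dicts naming the attribute they test; leaves are plain
# strings holding the prediction. This toy tree is hypothetical.
TREE = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {"yes": "stay in", "no": "go out"},
        },
        "rainy": "stay in",
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches the sample."""
    while isinstance(node, dict):
        value = sample[node["attribute"]]  # outcome of the test at this node
        node = node["branches"][value]     # follow the corresponding branch
    return node  # a leaf: the prediction


print(predict(TREE, {"outlook": "sunny", "windy": "no"}))  # prints go out
```

A sample whose attribute value has no matching branch raises a KeyError here; a fuller implementation would need a policy for unseen values.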

4. Attribute selection measures


The primary challenge in the Decision Tree implementation is to identify the attribute to place
at the root node and at each subsequent level. This process is known as attribute selection.
There are different attribute selection measures for identifying the attribute to be used as the
splitting node at each level.

There are 2 popular attribute selection measures. They are as follows:-

• Information gain
• Gini index

When using information gain as a criterion, we assume attributes to be categorical, and for the
Gini index attributes are assumed to be continuous. These attribute selection measures are
described below.

Information gain
By using information gain as a criterion, we try to estimate the information contained by each
attribute. To understand the concept of Information Gain, we need to know another concept
called Entropy.

Entropy measures the impurity in the given dataset. In Physics and Mathematics, entropy is
referred to as the randomness or uncertainty of a random variable X. In information theory, it
refers to the impurity in a group of examples. Information gain is the decrease in entropy.
Information gain computes the difference between entropy before split and average entropy after
split of the dataset based on given attribute values.

The ID3 (Iterative Dichotomiser 3) Decision Tree algorithm uses entropy to calculate information
gain. So, by calculating the decrease in entropy for each attribute, we can calculate its
information gain. The attribute with the highest information gain is chosen as the splitting
attribute at the node.
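A minimal sketch of these two quantities, assuming the usual Shannon entropy with base-2 logarithms and categorical attribute values. The function names are chosen for the example, and the data are two columns (Home Owner and Defaulted Borrower) of the Loan Defaulters table from the Naive Bayes section.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy before the split minus the weighted average entropy after it."""
    total = len(labels)
    after = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        after += len(subset) / total * entropy(subset)
    return entropy(labels) - after


home_owner = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(entropy(defaulted))
print(information_gain(home_owner, defaulted))
```

An attribute that separates the classes perfectly has a gain equal to the entropy before the split, while an attribute that leaves every subset as mixed as the whole has a gain of zero.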

Gini index
Another attribute selection measure, used by CART (Classification and Regression Trees), is
the Gini index. CART uses the Gini method to create split points.

The idea behind Gini is that if we randomly select two items from a population, the probability
that they are of the same class is 1 if the population is pure. For two classes with probabilities
p and q, this probability is the Gini score p^2+q^2: the higher the score, the higher the
homogeneity. The Gini index (impurity) is 1 minus this score, so a lower Gini index means a purer
node.

Gini works with a categorical target variable such as “Success” or “Failure”, and it performs
only binary splits.

Steps to Calculate Gini for a split

1. Calculate Gini for the sub-nodes, using the formula: sum of the squares of the probabilities
for success and failure (p^2+q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.

In the case of a discrete-valued attribute, the subset that gives the minimum Gini index for that
attribute is selected as the splitting subset. In the case of continuous-valued attributes, the
strategy is to consider the midpoint between each pair of adjacent values as a possible
split-point, and the point with the smallest Gini index is chosen as the splitting point. The
attribute with the minimum Gini index is chosen as the splitting attribute.
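The two calculation steps can be sketched as below. One assumption is made explicit: the score p^2+q^2 from step 1 grows with purity, so the "minimum Gini index" rule is read here as applying to 1 minus the weighted score (the usual impurity form). The function names and the toy labels are invented for the example.

```python
from collections import Counter


def gini_score(labels):
    """Step 1: sum of squared class proportions (p^2 + q^2 for two classes)."""
    total = len(labels)
    return sum((n / total) ** 2 for n in Counter(labels).values())


def gini_for_split(sub_nodes):
    """Step 2: Gini score of a split, weighted by the size of each sub-node."""
    total = sum(len(node) for node in sub_nodes)
    return sum(len(node) / total * gini_score(node) for node in sub_nodes)


def gini_index(sub_nodes):
    """Impurity form (assumed): 1 minus the weighted score, lower is purer."""
    return 1 - gini_for_split(sub_nodes)


pure_split = [["Success", "Success"], ["Failure", "Failure"]]
mixed_split = [["Success", "Failure"], ["Success", "Failure"]]
print(gini_for_split(pure_split), gini_index(pure_split))    # prints 1.0 0.0
print(gini_for_split(mixed_split), gini_index(mixed_split))  # prints 0.5 0.5
```

Comparing candidate splits then amounts to computing gini_index for each and keeping the smallest.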

5. The problem statement


The problem is to predict the safety of the car. In this project, I build a Decision Tree Classifier
to predict the safety of the car. I implement Decision Tree Classification with Python and Scikit-
Learn. I have used the Car Evaluation Data Set for this project, downloaded from the UCI
Machine Learning Repository website.

Random Forest Regression

A random forest is a meta estimator that fits a number of decision trees on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting. By default, the sub-sample size is the same as the original input sample size, but
the samples are drawn with replacement (this can be changed by the user).

Generally, Decision Tree and Random Forest models are used for classification tasks. However,
the idea of Random Forest as a regularizing meta-estimator over a single decision tree is best
demonstrated by applying both to regression problems. This way it can be shown that, in the
presence of random noise, a single decision tree is prone to overfitting and learns spurious
correlations, while a properly constructed Random Forest model is more immune to such
overfitting.
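The sub-sampling and averaging idea can be illustrated without any library. The sketch below is a deliberately stripped-down stand-in, not scikit-learn's implementation: each "tree" is a one-split regression stump on a single feature, and the random feature selection that a real random forest also performs is omitted. All function names and the step-shaped toy data are assumptions made for the example.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D data by minimising squared error."""
    mean_all = sum(ys) / len(ys)
    best = (float("inf"), None, mean_all, mean_all)  # (error, threshold, left, right)
    points = sorted(zip(xs, ys))
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no threshold fits between equal x values
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: n indices drawn with replacement, same size as the input.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict(forest, x):
    """Average the predictions of all stumps."""
    preds = []
    for threshold, left_mean, right_mean in forest:
        if threshold is None or x <= threshold:
            preds.append(left_mean)
        else:
            preds.append(right_mean)
    return sum(preds) / len(preds)


xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5  # made-up step-shaped target
forest = fit_forest(xs, ys)
print(predict(forest, 0), predict(forest, 9))
```

Because every stump sees a different bootstrap sample, their thresholds differ slightly, and averaging smooths out the quirks of any single one.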

What is make_regression method?


It is a convenience function from scikit-learn that generates a random regression problem. The
input set can either be well conditioned (by default) or have a low rank-fat tail singular
profile.

Import dataset, make Scatter plots and histograms.

How will a Decision Tree regressor do?


Every run will generate a different result, but on most occasions the single decision tree
regressor is likely to learn spurious features, i.e. it will assign small but non-zero importance
to features which are not true regressors.

The output of make_regression is generated by applying a (potentially biased) random linear
regression model with n_informative nonzero regressors to the previously generated input, plus
some centered Gaussian noise with adjustable scale.

Show the relative importance of regressors side by side


For the Random Forest model, show the relative importance of features as determined by the
meta-estimator. For the ordinary least squares (OLS) model, show normalized t-statistic values.

It will be clear that although the Random Forest regressor identifies the important regressors
correctly, it does not assign the same level of relative importance to them as the OLS
t-statistics do.
