UNIT II
Data types, Input and Output of Data Mining Algorithms Decision Trees – Constructing
    Classification Trees – CHAID – CART – Regression Trees – Pruning Model Estimation.
             CHAID:
    CHAID (Chi-square Automatic Interaction Detector)
    Market research is an essential activity for every business and helps you to identify and analyze
    market demand, market size, market trends and the strength of your competition. It also enables
    you to assess the viability of a potential product or service before taking it to market. It is a field
    that recognizes the importance of utilizing data to make evidence based decisions and many
    statistical and analytical methods have become popular in the field of quantitative market
    research.
•   CHAID distinguishes 3 types of indep. variables
                     - Monotonic
                     - Free
                     - Floating
•   Basic components of CHAID analysis
             1. A categorical dep. var.
             2. A set of categorical indep. variables
             3. Settings for various CHAID parameters
             4. Analyze subgroups and identify “best” indep.      var
    What is it for?
    CHAID (Chi-square Automatic Interaction Detector) analysis is an algorithm used for
    discovering relationships between a categorical response variable and other categorical predictor
    variables. It is useful when looking for patterns in datasets with lots of categorical variables and
    is a convenient way of summarizing the data as the relationships can be easily visualized.
    In practice, CHAID is often used in direct marketing to understand how different groups of
    customers might respond to a campaign based on their characteristics. So suppose, for example,
    that we run a marketing campaign and are interested in understanding what customer
    characteristics (e.g., gender, socio-economic status, geographic location, etc.) are associated with
    the response rate achieved. We build a CHAID “tree” showing the effects of different customer
    characteristics on the likelihood of response.
    What statistical techniques are used?
    As indicated in the name, CHAID uses Person’s Chi-square tests of independence, which test for
    an association between two categorical variables. A statistically significant result indicates that
    the two variables are not independent, i.e., there is a relationship between them.
    Alternatives methods
    CHAID is sometimes used as an exploratory method for predictive modeling. Advantage of this
    modeling approach is that we are able to analyze the data all-in-one rather than splitting the data
    into subgroups and performing multiple tests. In particular, where a continuous response variable
    is of interest or there are a number of continuous predictors to consider, we would recommend
    performing a multiple regression analysis instead.
    Example: Opening of Cinema/ Children’s Park/Exhibition Center
•   To find consumer responses to opening of Cinema, Children’s park or Exhibition
•   903 respondents were asked to rate each alternative on a 5 point scale: 1(v.low) to 5 (v.high)
•   The analyst also collected demographic data on the respondents
•   Dependent var - % of positive responses
•   Indep variables (with coding in parenthesis)
                            Gender: Male (1), Female (2)
                            Age: 16-20 (1)
                                    21-24 (2)
                                    25-34 (3)
                                    35-44 (4)
                                    45-54 (5)
                                    55-64 (6)
                                    65+ (7)
                    Socio-economic group had 6 categories:A(1), B(2), C1(3), C2(4) etc
•   CHAID is a dependence method.
•   For given dep var. we want technique that can
            1. Indicate indep. var. that most affect dep. var.
            2. Identify mkt. segments that differ most on these important. indep. var.
•   Early interaction detection method is AID
            AID employs hierarchical binary splitting algorithm
•   General procedure
        1. First select indep. var. whose subgroups differ most w.r.t dep. var.
        2. Each subgroup of this var. is further divided into subgroups on remaining variables
        3. These subgroups are tested for differences on dep. var.
        4. Var. with greatest difference is selected next
        5. Continue until subgroups are too small
    Introduction to Classification & Regression Trees (CART)
    Decision Trees are commonly used in data mining with the objective of creating a model that
    predicts the value of a target (or dependent variable) based on the values of several input (or
    independent variables).
    The CART decision tree methodology. The CART or Classification & Regression Trees
    methodology was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard
    Olshen and Charles Stone as an umbrella term to refer to the following types of decision trees:
   Classification Trees: where the target variable is categorical and the tree is used to identify the
    "class" within which a target variable would likely fall into.
   Regression Trees: where the target variable is continuous and tree is used to predict it's value.
    The CART algorithm is structured as a sequence of questions, the answers to which determine
    what the next question, if any should be. The result of these questions is a tree like structure
    where the ends are terminal nodes at which point there are no more questions. A simple example
    of a decision tree is as follows
   The main elements of CART (and any decision tree algorithm) are:
1. Rules for splitting data at a node based on the value of one variable;
2. Stopping rules for deciding when a branch is terminal and can be split no more; and
3. Finally, a prediction for the target variable in each terminal node.
    Some useful features and advantages of CART
   CART is nonparametric and therefore does not rely on data belonging to a particular type of
    distribution.
   CART is not significantly impacted by outliers in the input variables.
   You can relax stopping rules to "overgrow" decision trees and then prune back the tree to the
    optimal size. This approach minimizes the probability that important structure in the data set
    will be overlooked by stopping too soon.
   CART incorporates both testing with a test data set and cross-validation to assess the goodness
    of fit more accurately.
   CART can use the same variables more than once in different parts of the tree. This capability
    can uncover complex interdependencies between sets of variables.
   CART can be used in conjunction with other prediction methods to select the input set of
    variables.
    PRUNING (Decision Trees)
Pruning is a technique in machine learning that reduces the size of decision trees by removing
sections of the tree that provide little power to classify instances. Pruning reduces the complexity
of the final classifier, and hence improves predictive accuracy by the reduction of overfitting
One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A
tree that is too large risks overfitting the training data and poorly generalizing to new samples. A
small tree might not capture important structural information about the sample space. However,
it is hard to tell when a tree algorithm should stop because it is impossible to tell if the addition
of a single extra node will dramatically decrease error. This problem is known as the horizon
effect. A common strategy is to grow the tree until each node contains a small number of
instances then use pruning to remove nodes that do not provide additional information.[1]
Pruning should reduce the size of a learning tree without reducing predictive accuracy as
measured by a cross-validation set. There are many techniques for tree pruning that differ in the
measurement that is used to optimize performance
Pruning can occur in a top down or bottom up fashion. A top down pruning will traverse nodes
and trim subtrees starting at the root, while a bottom up pruning will start at the leaf nodes.
Below are several popular pruning algorithms.
One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node
is replaced with its most popular class. If the prediction accuracy is not affected then the change
is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and
speed.
Cost complexity pruning generates a series of trees T0 . . . Tm where T0 is the initial tree and Tm is
the root alone. At step i the tree is created by removing a subtree from treei-1 and replacing it
with a leaf node with value chosen as in the tree building algorithm. The subtree that is removed
is chosen as follows. Define the error rate of tree T over data set S as err(T,S). The subtree that
minimizes                                                        is chosen for removal. The
function prune(T,t) defines the tree gotten by pruning the subtrees t from the tree T. Once the
series of trees has been created, the best tree is chosen by generalized accuracy as measured by a
training set or cross-validation.
UNIT III                                                                  (12 Hours)
       Preprocessing and post processing in Data Mining – Steps in preprocessing –
Discretization – Feature Extraction, selection and construction – Post processing – Association
Rule.
Discretization
Pre-processing of numerics.
      Not every learning can handle numerics
      Not every learning can handle numerics very well
      Discretization: convert numbers to a finite number of bins.
      Simple schemes (like nbins), are surprisingly effective.
      E.g. really helps for NaiveBayes.
Two kinds of discretization
Supervised Discretization
Separates the numerics according to the class variable
      E.g. find a cliff where suddenly everything switches from one class to another
           o e.g. class=happy if age under 20; class=sad if age over 20
           o 20 is the cliff
      Best left to the learner
Some details on cliff learning Per numeric attribute, apply the following:
      Sort the instances on the attribute of interest
      Look for potential cut-points.
           o Cut points are points in the sorted list above where the class labels change.
           o Eg. if I had five instances with values for the attribute of interest and labels
           o (1.0,A), (1.4,A), (1.7, A), (2.0,B), (3.0, B), (7.0, A),
           o then there are two cut points of interest (mid-way between the points where the
               classes change from A to B or vice versa):
                    1.85
                    5
      Apply your favorite measure on each of the cuts, and choose the one with the maximum
       value
           o e.g. info gain, gain ratio, gini coefficient, chi-squared test.
           o Common practice is to follow the lead of and use info gain
      Repeat recursively in both subsets (the ones less than and greater than the cut point) until
       either
           o the subset is "pure" i.e. only contains instances of a single class
           o some stopping criterion is reached. e.g. too few examples to proceed
Unsupervised Discretization
Ignores class symbols
e.g. EqualIntervalDiscretization
      Nbins: divide according to (max-min)/N; often N=10
          o So the frequency counts in each bin may be unequal (bumpy histogram).
          o e.g. divide 0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,2,2,3,4,5,10,20,40,80,100 using 10bins
           o   0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,2,2,3,4,5,10,20,40,80,100
           o   -------------------------------------------|--|--|--|--|---|
               bin1                          b2 b3 b5 b9 b10
           o   one bucket would get numbers 1 to 25,
           o   the last 4 numbers would get a bin each.
           o   So our learner would have 5 bins with nothing in it,
                    one bin with 83% of the data and
                    4 bins with 3.3% of the data in each.
           o   Variants:
                    BinLogging: set N via the number of unique numerics
                        N=max(1,log2(uniqueValues))
e.g. EqualFrequencyDiscretization (a.k.a. HistogramDiscretization or PercentileChop):
      Keep the number of numbers in each division equal
      So the frequency counts in each bin is equal (flat histogram).
      In practice, not quite flat. e.g. 10 equal frequency bins on the above data:
0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,2,2,3,4,5,10,20,40,80,100
 -----|----|-----|-----|-----|-----|-----|------|---------|
 bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9
      Note the buckets with repeated entries.
          o Its a design choice what to do with those.
          o We might squash them together such that there are no repeats in the numbers that
              are the boundaries between bins.
           o   0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,2,2,3,4,5,10,20,40,80,100
           o   --------------|-------------|--------|----|-------|-------
               bin1        bin2      bin3    bin4 bin5    bin6
      Variants: ProportionalKIntervalDiscretization
          o EqualFrequencyDiscretization where the number of bins is
              sqrt(numberOfInstances)
          o Best for NaiveBaues
UNIT IV                                                                   (12 Hours)
        Algorithms for classification and Regression – Naïve Bayes – Multiple Regression
Analysis – Logistic K – Nearest Neighbour classification – GMDH – Cluster Analysis –
Partitioning clustering – K – Medoids – Visualization of Multidimensional Data.
GMDH :
       Group method of data handling (GMDH) is a family of inductive algorithms for
        computer-based mathematical modeling of multi-parametric datasets that features fully
        automatic structural and parametric optimization of models.
       GMDH is used in such fields as data mining, knowledge discovery, prediction, complex
        systems modeling, optimization and pattern recognition.
       GMDH algorithms are characterized by inductive procedure that performs sorting-out of
        gradually complicated polynomial models and selecting the best solution by means of the
        so-called external criterion.
       A GMDH model with multiple inputs and one output is a subset of components of
        the base function (1):
where f are elementary functions dependent on different sets of inputs, a are coefficients and m is
the number of the base function components.
Algorithms:
   Combinatorial (COMBI)
   Multilayered Iterative (MIA)
   GN
   Objective System Analysis (OSA)
   Harmonical
   Two-level (ARIMAD)
   Multiplicative-Additive (MAA)
   Objective Computer Clusterization (OCC);
   Pointing Finger (PF) clusterization algorithm;
   Analogues Complexing (AC)
   Harmonical Rediscretization
   Algorithm on the base of Multilayered Theory of Statistical Decisions (MTSD)
   Group of Adaptive Models Evolution (GAME)
Cluster Analysis:
Cluster is a group of objects that belongs to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar objects.
    Points to Remember
   A cluster of data objects can be treated as one group.
   While doing cluster analysis, we first partition the set of data into groups based on data similarity
    and then assign the labels to the groups.
   The main advantage of clustering over classification is that, it is adaptable to changes and helps
    single out useful features that distinguish different groups.
    Applications of Cluster Analysis
   Clustering analysis is broadly used in many applications such as market research, pattern
    recognition, data analysis, and image processing.
   Clustering can also help marketers discover distinct groups in their customer base. And they can
    characterize their customer groups based on the purchasing patterns.
   In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
    with similar functionalities and gain insight into structures inherent to populations.
   Clustering also helps in identification of areas of similar land use in an earth observation
    database. It also helps in the identification of groups of houses in a city according to house type,
    value, and geographic location.
   Clustering also helps in classifying documents on the web for information discovery.
   Clustering is also used in outlier detection applications such as detection of credit card fraud.
   As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
    data to observe characteristics of each cluster.
    Requirements of Clustering in Data Mining
    The following points throw light on why clustering is required in data mining −
   Scalability − We need highly scalable clustering algorithms to deal with large databases.
   Ability to deal with different kinds of attributes − Algorithms should be capable to be applied
    on any kind of data such as interval-based (numerical) data, categorical, and binary data.
   Discovery of clusters with attribute shape − The clustering algorithm should be capable of
    detecting clusters of arbitrary shape. They should not be bounded to only distance measures that
    tend to find spherical cluster of small sizes.
   High dimensionality − The clustering algorithm should not only be able to handle low-
    dimensional data but also the high dimensional space.
   Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
    algorithms are sensitive to such data and may lead to poor quality clusters.
   Interpretability − The clustering results should be interpretable, comprehensible, and usable.
    Clustering Methods
    Clustering methods can be classified into the following categories −
   Partitioning Method
   Hierarchical Method
   Density-based Method
   Grid-Based Method
   Model-Based Method
   Constraint-based Method
    Partitioning Method
    Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
    partition of data. Each partition will represent a cluster and k ≤ n. It means that it will classify the
    data into k groups, which satisfy the following requirements −
   Each group contains at least one object.
   Each object must belong to exactly one group.
    Points to remember −
   For a given number of partitions (say k), the partitioning method will create an initial
    partitioning.
   Then it uses the iterative relocation technique to improve the partitioning by moving objects
    from one group to other.
    Hierarchical Methods
    This method creates a hierarchical decomposition of the given set of data objects. We can
    classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
    There are two approaches here −
   Agglomerative Approach
   Divisive Approach
    Agglomerative Approach
    This approach is also known as the bottom-up approach. In this, we start with each object
    forming a separate group. It keeps on merging the objects or groups that are close to one another.
    It keep on doing so until all of the groups are merged into one or until the termination condition
    holds.
    Divisive Approach
    This approach is also known as the top-down approach. In this, we start with all of the objects in
    the same cluster. In the continuous iteration, a cluster is split up into smaller clusters. It is down
    until each object in one cluster or the termination condition holds. This method is rigid, i.e., once
    a merging or splitting is done, it can never be undone.
    Approaches to Improve Quality of Hierarchical Clustering
    Here are the two approaches that are used to improve the quality of hierarchical clustering −
   Perform careful analysis of object linkages at each hierarchical partitioning.
   Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to
    group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
    Density-based Method
    This method is based on the notion of density. The basic idea is to continue growing the given
    cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
    point within a given cluster, the radius of a given cluster has to contain at least a minimum
    number of points.
    Grid-based Method
    In this, the objects together form a grid. The object space is quantized into finite number of cells
    that form a grid structure.
    Advantage
   The major advantage of this method is fast processing time.
   It is dependent only on the number of cells in each dimension in the quantized space.
    Model-based methods
    In this method, a model is hypothesized for each cluster to find the best fit of data for a given
    model. This method locates the clusters by clustering the density function. It reflects spatial
    distribution of the data points.
    This method also provides a way to automatically determine the number of clusters based on
    standard statistics, taking outlier or noise into account. It therefore yields robust clustering
    methods.
    Constraint-based Method
    In this method, the clustering is performed by the incorporation of user or application-oriented
    constraints. A constraint refers to the user expectation or the properties of desired clustering
    results. Constraints provide us with an interactive way of communication with the clustering
    process. Constraints can be specified by the user or the application requirement.
    k-means:
    What does it do? k-means creates groups from a set of objects so that the members of a group
    are more similar. It’s a popular cluster analysis technique for exploring a dataset.
    what’s cluster analysis? Cluster analysis is a family of algorithms designed to form groups such
    that the group members are more similar versus non-group members. Clusters and groups are
    synonymous in the world of cluster analysis.
    Is there an example of this? Definitely, suppose we have a dataset of patients. In cluster
    analysis, these would be called observations. We know various things about each patient like
    age, pulse, blood pressure, VO2max, cholesterol, etc. This is a vector representing the patient.
     You can basically think of a vector as a list of numbers we know about the patient. This list can
     also be interpreted as coordinates in multi-dimensional space. Pulse can be one dimension, blood
     pressure another dimension and so forth.
     You might be wondering:
     Given this set of vectors, how do we cluster together patients that have similar age, pulse, blood
     pressure, etc?
     Want to know the best part?
     You tell k-means how many clusters you want. K-means takes care of the rest.
     How does k-means take care of the rest? k-means has lots of variations to optimize for certain
     types of data.
     At a high level, they all do something like this:
1.   k-means picks points in multi-dimensional space to represent each of the k clusters. These are
     called centroids.
2.   Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the
     same one, so they’ll form a cluster around their nearest centroid.
3.   What we have are k clusters, and each patient is now a member of a cluster.
4.   k-means then finds the center for each of the k clusters based on its cluster members (yep, using
     the patient vectors!).
5. This center becomes the new centroid for the cluster.
6. Since the centroid is in a different place now, patients might now be closer to other centroids. In
   other words, they may change cluster membership.
7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships
   stabilize. This is called convergence.
     Is this supervised or unsupervised? It depends, but most would classify k-means as
     unsupervised. Other than specifying the number of clusters, k-means “learns” the clusters on its
     own without any information about which cluster an observation belongs to. k-means can
     be semi-supervised.
     Why use k-means? I don’t think many will have an issue with this:
     The key selling point of k-means is its simplicity. Its simplicity means it’s generally faster and
     more efficient than other algorithms, especially over large datasets.
     It gets better:
k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster
analysis on the sub-clusters. k-means can also be used to rapidly “play” with k and explore
whether there are overlooked patterns or relationships in the dataset.
Drawbacks:
Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial
choice of centroids. One final thing to keep in mind is k-means is designed to operate on
continuous data — you’ll need to do some tricks to get it to work on discrete data.
Where is it used? A ton of implementations for k-means clustering are available online:
Apache Mahout, Julia, R, SciPy, Weka, MATLAB, SAS,
Apriori:
What does it do? The Apriori algorithm learns association rules and is applied to a database
containing a large number of transactions.
What are association rules? Association rule learning is a data mining technique for learning
correlations and relations among variables in a database.
What’s an example of Apriori? Let’s say we have a database full of supermarket transactions.
You can think of a database as a giant spreadsheet where each row is a customer transaction and
every column represents a different grocery item.
Here’s the best part:
By applying the Apriori algorithm, we can learn the grocery items that are purchased together
a.k.a association rules.
The power of this is:
You can find those items that tend to be purchased together more frequently than other items —
the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets.
For example:
You can probably quickly see that chips + dip and chips + soda seem to frequently occur
together. These are called 2-itemsets. With a large enough dataset, it will be much harder to
“see” the relationships especially when you’re dealing with 3-itemsets or more. That’s precisely
what Apriori helps with!
   You might be wondering how Apriori works? Before getting into the nitty gritty of algorithm,
   you’ll need to define 3 things:
1. The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
2. The second is your support or the number of transactions containing the itemset divided by the
   total number of transactions. An itemset that meets the support is called a frequent itemset.
3. The third is your confidence or the conditional probability of some item given you have certain
   other items in your itemset. A good example is given chips in your itemset, there is a 67%
   confidence of having soda also in the itemset.
   The basic Apriori algorithm is a 3 step approach:
1. Join. Scan the whole database for how frequent 1-itemsets are.
2. Prune. Those itemsets that satisfy the support and confidence move onto the next round for 2-
   itemsets.
3. Repeat. This is repeated for each itemset level until we reach our previously defined size.
   Is this supervised or unsupervised? Apriori is generally considered an unsupervised learning
   approach, since it’s often used to discover or mine for interesting patterns and relationships.
   But wait, there’s more…
   Apriori can also be modified to do classification based on labelled data.
   Why use Apriori? Apriori is well understood, easy to implement and has many derivatives.
   The algorithm can be quite memory, space and time intensive when generating itemsets.
   Where is it used? Plenty of implementations of Apriori are available. Some popular ones are
   the ARtool, Weka, and Orange.
   k-Nearest Neighbor:
   What does it do? kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs
   from the classifiers previously described because it’s a lazy learner.
   What’s a lazy learner? A lazy learner doesn’t do much during the training process other than
   store the training data. Only when new unlabeled data is input does this type of learner look to
   classify.
   On the other hand, an eager learner builds a classification model during training. When new
   unlabeled data is input, this type of learner feeds the data into the classification model.
   How does C4.5, SVM and AdaBoost fit into this? Unlike kNN, they are all eager learners.
   Here’s why:
1. C4.5 builds a decision tree classification model during training.
2. SVM builds a hyperplane classification model during training.
3. AdaBoost builds an ensemble classification model during training.
   So what does kNN do? kNN builds no such classification model. Instead, it just stores the
   labeled training data.
   When new unlabeled data comes in, kNN operates in 2 basic steps:
1. First, it looks at the closest labeled training data points — in other words, the k-nearest
   neighbors.
2. Second, using the neighbors’ classes, kNN gets a better idea of how the new data should be
   classified.
   You might be wondering…
   How does kNN figure out what’s closer? For continuous data, kNN uses a distance metric
   like Euclidean distance. The choice of distance metric largely depends on the data. Some even
   suggest learning a distance metric based on the training data. There’s tons more details and
   papers on kNN distance metrics.
   For discrete data, the idea is transform discrete data into continuous data. 2 examples of this are:
1. Using Hamming distance as a metric for the “closeness” of 2 text strings.
2. Transforming discrete data into binary features.
   These 2 Stack Overflow threads have some more suggestions on dealing with discrete data:
 KNN classification with categorical data, Using k-NN in R with categorical values
   How does kNN classify new data when neighbors disagree? kNN has an easy time when all
   neighbors are the same class. The intuition is if all the neighbors agree, then the new data point
   likely falls in the same class.
   I’ll bet you can guess where things get hairy…
   How does kNN decide the class when neighbors don’t have the same class?
   2 common techniques for dealing with this are:
1. Take a simple majority vote from the neighbors. Whichever class has the greatest number of
   votes becomes the class for the new data point.
2. Take a similar vote except give a heavier weight to those neighbors that are closer. A simple way
   to do this is to use reciprocal distance e.g. if the neighbor is 5 units away, then weight its vote
   1/5. As the neighbor gets further away, the reciprocal distance gets smaller and smaller… exactly
   what we want!
    Is this supervised or unsupervised? This is supervised learning, since kNN is provided a
    labeled training dataset.
   Why use kNN? Ease of understanding and implementing are 2 of the key reasons to use kNN.
   Depending on the distance metric, kNN can be quite accurate.
   But that’s just part of the story…
   Here are 5 things to watch out for:
1. kNN can get very computationally expensive when trying to determine the nearest neighbors on
   a large dataset.
2. Noisy data can throw off kNN classifications.
3. Features with a larger range of values can dominate the distance metric relative to features that
   have a smaller range, so feature scaling is important.
4. Since data processing is deferred, kNN generally requires greater storage requirements than
   eager classifiers.
5. Selecting a good distance metric is crucial to kNN’s accuracy.
    Where is it used? A number of kNN implementations exist:
   MATLAB k-nearest neighbor classification, scikit-learn KNeighborsClassifier, k-Nearest
    Neighbour Classification in R
    Naive Bayes:
    What does it do? Naive Bayes is not a single algorithm, but a family of classification algorithms
    that share one common assumption:
    Every feature of the data being classified is independent of all other features given the class.
    What does independent mean? 2 features are independent when the value of one feature has no
    effect on the value of another feature.
    For example:
    Let’s say you have a patient dataset containing features like pulse, cholesterol level, weight,
    height and zip code. All features would be independent if the value of all features have no effect
    on each other. For this dataset, it’s reasonable to assume that the patient’s height and zip code are
    independent, since a patient’s height has little to do with their zip code.
    But let’s not stop there, are the other features independent?
    Sadly, the answer is no. Here are 3 feature relationships which are not independent:
   If height increases, weight likely increases.
   If cholesterol level increases, weight likely increases.
   If cholesterol level increases, pulse likely increases as well.
    In my experience, the features of a dataset are generally not all independent.
    And that ties in with the next question…
    Why is it called naive? The assumption that all features of a dataset are independent is precisely
    why it’s called naive — it’s generally not the case that all features are independent.
    What’s Bayes? Thomas Bayes was an English statistician for which Bayes’ Theorem is named
    after. You can click on the link to find about more about Bayes’ Theorem.
    In a nutshell, the theorem allows us to predict the class given a set of features using probability.
    What does the equation mean? The equation finds the probability of Class A given Features 1
    and 2. In other words, if you see Features 1 and 2, this is the probability the data is Class A.
    The equation reads: The probability of Class A given Features 1 and 2 is a fraction.
   The fraction’s numerator is the probability of Feature 1 given Class A multiplied by the
    probability of Feature 2 given Class A multiplied by the probability of Class A.
   The fraction’s denominator is the probability of Feature 1 multiplied by the probability of
    Feature 2.
    What is an example of Naive Bayes? Below is a great example taken from a Stack Overflow
    thread (Ram’s answer).
    Here’s the deal:
   We have a training dataset of 1,000 fruits.
   The fruit can be a Banana, Orange or Other (these are the classes).
   The fruit can be Long, Sweet or Yellow (these are the features).
    What do you see in this training dataset?
   Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
   Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
   Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.
    If we are given the length, sweetness and color of a fruit (without knowing its class), we can now
    calculate the probability of it being a banana, orange or other fruit.
    Suppose we are told the unknown fruit is long, sweet and yellow.
    Here’s how we calculate all the probabilities in 4 steps:
    Step 1: To calculate the probability the fruit is a banana, let’s first recognize that this looks
    familiar. It’s the probability of the class Banana given the features Long, Sweet and Yellow or
    more succinctly:
    This is exactly like the equation discussed earlier.
    Step 2: Starting with the numerator, let’s plug everything in.
    Multiplying everything together (as in the equation), we get:
    Step 3: Ignore the denominator, since it’ll be the same for all the other calculations.
    Step 4: Do a similar calculation for the other classes:
    Since the         is greater than        , Naive Bayes would classify this long, sweet and yellow
    fruit as a banana.
    Is this supervised or unsupervised? This is supervised learning, since Naive Bayes is provided
    a labeled training dataset in order to construct the tables.
    Why use Naive Bayes? As you could see in the example above, Naive Bayes involves simple
    arithmetic. It’s just tallying up counts, multiplying and dividing.
    Once the frequency tables are calculated, classifying an unknown fruit just involves calculating
    the probabilities for all the classes, and then choosing the highest probability.
    Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it’s been found to
    be effective for spam filtering.
    Where is it used? Implementations of Naive Bayes can be found in Orange, scikit-
    learn, Weka and R.
    CART:
    What does it do? CART stands for classification and regression trees. It is a decision tree
    learning technique that outputs either classification or regression trees. CART is a classifier.
    Is a classification tree like a decision tree? A classification tree is a type of decision tree. The
    output of a classification tree is a class.
    For example, given a patient dataset, you might attempt to predict whether the patient will get
    cancer. The class would either be “will get cancer” or “won’t get cancer.”
What’s a regression tree? Unlike a classification tree which predicts a class, regression trees
predict a numeric or continuous value e.g. a patient’s length of stay or the price of a smartphone.
Classification trees output classes, regression trees output numbers.
Is this supervised or unsupervised? CART is a supervised learning technique, since it is
provided a labeled training dataset in order to construct the classification or regression tree
model.
Why use CART? CART is a decision tree learning technique. Things like ease of interpretation
and explanation also apply to CART as well. CART is quite fast, quite popular and the output is
human readable.
Where is it used? scikit-learn implements CART in their decision tree classifier. R’s tree
package has an implementation of CART. Weka and MATLAB also have implementations.
Finally, Salford Systems has the only implementation of the original proprietary CART code
based on the theory introduced by world-renowned statisticians at Stanford University and the
University of California at Berkeley