Unit – III
Machine Learning
What is machine learning?
Mike Roberts
“Machine learning is the process by which a computer can work more
accurately as it collects and learns from the data it is given.”
Arthur Samuel, 1959
“Machine learning is a field of study that gives computers the ability to
learn without being explicitly programmed.”
Applications for machine learning in data science
●   Regression and classification are of primary importance to a data scientist. To
    achieve these goals, one of the main tools a data scientist uses is machine
    learning.
●   The uses for regression and automatic classification are wide ranging, such as
    the following:
        ■ Finding oil fields, gold mines, or archeological sites based on existing
        sites (classification and regression)
        ■ Finding place names or persons in text (classification)
        ■ Identifying people based on pictures or voice recordings (classification)
        ■ Recognizing birds based on their whistle (classification)
■ Identifying profitable customers (regression and classification)
■ Proactively identifying car parts that are likely to fail (regression)
■ Identifying tumors and diseases (classification)
■ Predicting the amount of money a person will spend on a product (regression)
■ Predicting the number of eruptions of a volcano in a period (regression)
■ Predicting your company’s yearly revenue (regression)
■ Predicting which team will win the Champions League in soccer
(classification)
Occasionally data scientists build a model (an abstraction of reality) that
provides insight into the underlying processes of a phenomenon. When the goal
of a model isn’t prediction but interpretation, it’s called root cause analysis
(identifying the “root causes” of problems or events). Here are a few examples:
    ■ Understanding and optimizing a business process, such as determining
    which products add value to a product line
    ■ Discovering what causes diabetes
    ■ Determining the causes of traffic jams
Where machine learning is used in the data science process
Although machine learning is mainly linked to the data-modeling step of the data
science process, it can be used at almost every step.
The modeling process
The modeling phase consists of four steps:
1 Feature engineering and model selection
2 Training the model
3 Model validation and selection
4 Applying the trained model to unseen data
●   It’s possible to chain or combine multiple techniques. When you chain
    multiple models, the output of the first model becomes an input for the
    second model. When you combine multiple models, you train them
    independently and combine their results. This last technique is also known
    as ensemble learning.
●   A model consists of constructs of information called features or
    predictors and a target or response variable. Your model’s goal is to
    predict the target variable, for example, tomorrow’s high temperature.
    The variables that help you do this and are (usually) known to you are the
    features or predictor variables such as today’s temperature, cloud
    movements, current wind speed, and so on.
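The difference between chaining and combining models can be sketched with toy functions. Everything below is hypothetical: the three "models" are hand-written formulas standing in for trained models, and their names and coefficients are made up for illustration.

```python
# Toy stand-ins for trained models (hypothetical formulas, not fitted to data).
def temperature_model(today_temp):
    """First model: predicts tomorrow's high temperature from today's."""
    return 0.8 * today_temp + 5.0


def sales_model(predicted_temp):
    """Second model: predicts ice cream sales from a temperature."""
    return 10.0 * predicted_temp


def persistence_model(today_temp):
    """Alternative temperature model: tomorrow will be the same as today."""
    return today_temp


today = 20.0

# Chaining: the output of the first model becomes an input for the second.
chained_prediction = sales_model(temperature_model(today))

# Combining (ensemble learning): two independent models predict the same
# target and their results are merged, here with a simple average.
ensemble_prediction = (temperature_model(today) + persistence_model(today)) / 2

print(chained_prediction)   # 210.0
print(ensemble_prediction)  # 20.5
```

Averaging is only one way to combine results; voting or weighting the models are common alternatives.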
1. Engineering features and selecting a model
●   When engineering features, you must come up with and create possible predictors
    for the model. This is one of the most important steps in the process because a
    model recombines these features to achieve its predictions.
●   Often you may need to consult an expert or the appropriate literature to come up
    with meaningful features.
●   Certain features are the variables you get from a data set, as is the case with the
    provided data sets in our exercises and in most school exercises.
●   In practice you’ll need to find the features yourself, which may be scattered
    among different data sets.
●   In several projects we had to bring together more than 20 different data sources
    before we had the raw data we required. Often you’ll need to apply a
    transformation to an input before it becomes a good predictor or to combine
    multiple inputs.
●   Sometimes you have to use modeling techniques to derive features: the output of
    a model becomes part of another model.
●   In text mining, documents can first be annotated to classify the content into
    categories, or you can count the number of geographic places or persons in the
    text.
●   This counting is often more difficult than it sounds; models are first applied to
    recognize certain words as a person or a place. All this new information is then
    poured into the model you want to build.
●   One of the biggest mistakes in model construction is the availability bias.
●   It refers to the human tendency to judge an event by the ease with which
    examples of the event can be retrieved from memory or constructed anew.
    Such heuristics can lead to inaccurate judgments about how commonly things
    occur and about how representative certain things may be. In modeling terms,
    the risk is relying only on the features that were easiest to obtain.
●   When the initial features are created, a model can be trained to the data.
2. Training your model
●   With the right predictors in place and a modeling technique in mind, you can
    progress to model training.
●   In this phase you present to your model data from which it can learn.
●   The most common modeling techniques have industry-ready implementations in
    almost every programming language, including Python. These enable you to train
    your models by executing a few lines of code. For more state-of-the-art data
    science techniques, you’ll probably end up doing heavy mathematical calculations
    and implementing them with modern computer science techniques.
●   Once a model is trained, it’s time to test whether it can be extrapolated to
    reality: model validation.
3. Validating a model
 ●   Data science has many modeling techniques, and the question is which one is the right one
     to use.
 ●   A good model has two properties: it has good predictive power and it generalizes well to
     data it hasn’t seen.
 ●   To achieve this you define an error measure (how wrong the model is) and a validation
     strategy.
Error Measures:
 ●   Two common error measures in machine learning are the classification error rate
     for classification problems and the mean squared error for regression problems.
 ●   The classification error rate is the percentage of observations in the test data
     set that your model mislabeled; lower is better.
 ●   The mean squared error is the average of the squared differences between the
      predicted and the actual values; lower is better.
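Both error measures are short enough to compute by hand. The sketch below uses plain Python and made-up example values; the function names are illustrative, and the error rate is returned as a fraction (multiply by 100 for a percentage).

```python
def classification_error_rate(actual, predicted):
    """Fraction of observations whose predicted label differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up classification example: one of four labels is wrong.
labels_true = ["cat", "dog", "dog", "cat"]
labels_pred = ["cat", "dog", "cat", "cat"]
error_rate = classification_error_rate(labels_true, labels_pred)
print(error_rate)  # 0.25

# Made-up regression example: squared errors are 1, 0 and 4.
values_true = [3.0, 5.0, 2.0]
values_pred = [2.0, 5.0, 4.0]
mse = mean_squared_error(values_true, values_pred)
print(mse)  # (1 + 0 + 4) / 3
```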
Validation strategies:
 ●   Dividing your data into a training set with X% of the observations and keeping the
      rest as a holdout data set (a data set that’s never used for model creation): this
      is the most common technique.
 ●   K-folds cross validation: this strategy divides the data set into k parts and uses
      each part one time as a test data set while using the others as a training data set.
      This has the advantage that you use all the data available in the data set.
 ●   Leave-one-out: this approach is the same as k-folds but with k equal to the number
      of observations. You always leave one observation out and train on the rest of the
      data. This is used only on small data sets, so it’s more valuable to people evaluating
      laboratory experiments than to big data analysts.
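The splitting logic behind these strategies can be sketched in plain Python. The helper name k_fold_indices is hypothetical, and the sketch assumes the observations are already in random order (no shuffling is done).

```python
def k_fold_indices(n_observations, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_observations))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_observations // k + (1 if i < n_observations % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Three folds over six observations: every observation is tested exactly once.
folds = list(k_fold_indices(6, 3))
for train, test in folds:
    print("train:", train, "test:", test)

# With k equal to the number of observations, each fold leaves one observation out.
leave_one_out = list(k_fold_indices(6, 6))
print(len(leave_one_out))  # 6
```

A single holdout split corresponds to using just one of these (train, test) pairs.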
●   Another popular term in machine learning is regularization.
●   In the context of machine learning, regularization is a technique that reduces
    error on unseen data by fitting the function appropriately on the given training
    set and avoiding overfitting.
●   When applying regularization, you incur a penalty for every extra variable used to
    construct the model. With L1 regularization you ask for a model with as few predictors
    as possible. This is important for the model’s robustness: simple solutions tend to hold
    true in more situations.
●   L2 regularization aims to keep the variance between the coefficients of the predictors
    as small as possible. Overlapping variance between predictors makes it hard to make out
    the actual impact of each predictor. Keeping their variance from overlapping will
    increase interpretability. To keep it simple: regularization is mainly used to stop a model
    from using too many features and thus prevent overfitting.
L1 regularization
●   A regression model that uses the L1 regularization technique is called
    LASSO (Least Absolute Shrinkage and Selection Operator) regression.
●   Lasso regression adds the “absolute value of magnitude” of each coefficient as a
    penalty term to the loss function (L).
L2 regularization
●   A regression model that uses the L2 regularization technique is called Ridge
    regression.
●   Ridge regression adds the “squared magnitude” of each coefficient as a penalty
    term to the loss function (L).
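Written out, the two penalized loss functions could look as follows. The symbols are assumptions introduced here for illustration: L is the unpenalized loss, the coefficients are written as beta_1 to beta_p, and lambda (zero or larger) sets how strongly the penalty counts.

```latex
\begin{align}
L_{\text{lasso}}(\beta) &= L(\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
L_{\text{ridge}}(\beta) &= L(\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2}
\end{align}
```

With lambda set to zero, both reduce to the original loss function.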
Difference between L1 and L2
●   The key difference between these techniques is that Lasso shrinks the
    coefficients of the less important features to zero, thus removing some
    features altogether. This works well for feature selection when we have a
    huge number of features.
●   From a practical standpoint, L1 tends to shrink coefficients to zero whereas
    L2 tends to shrink coefficients evenly. L1 is therefore useful for feature
    selection, as we can drop any variables associated with coefficients that go to
    zero. L2, on the other hand, is useful when you have collinear/codependent
    features.
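The different shrinking behavior can be seen in a simplified one-coefficient setting. The sketch below assumes each coefficient can be penalized on its own (as with uncorrelated, standardized predictors); the function names and the starting coefficients are made up for illustration.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*abs(b), also known as soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return z / (1.0 + lam)


unpenalized = [4.0, 1.0, -0.5, 0.2]  # hypothetical coefficients before the penalty
lam = 1.0                            # hypothetical penalty strength

lasso_coefs = [lasso_shrink(z, lam) for z in unpenalized]
ridge_coefs = [ridge_shrink(z, lam) for z in unpenalized]

print(lasso_coefs)  # [3.0, 0.0, 0.0, 0.0]
print(ridge_coefs)  # [2.0, 0.5, -0.25, 0.1]
```

The L1 penalty removes three of the four features, while the L2 penalty keeps all of them and scales each by the same factor.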
4. Predicting new observations
●   If you’ve implemented the first three steps successfully, you now have a
    performant model that generalizes to unseen data. The process of applying
    your model to new data is called model scoring.
●   Model scoring involves two steps. First, you prepare a data set that has
    features exactly as defined by your model. This boils down to repeating the
    data preparation you did in step one of the modeling process but for a new
    data set. Then you apply the model on this new data set, and this results in a
    prediction.
Types of machine learning
●   Supervised learning techniques attempt to discern results and learn by trying
    to find patterns in a labeled data set. Human interaction is required to label
    the data.
●   Unsupervised learning techniques don’t rely on labeled data and attempt to find
    patterns in a data set without human interaction.
●   Semi-supervised learning techniques need labeled data, and therefore human
    interaction, to find patterns in the data set, but they can still progress toward
    a result and learn even if passed unlabeled data as well.
Supervised learning - CASE STUDY: DISCERNING DIGITS FROM IMAGES
●   One of the many common approaches on the web to stopping computers
    from hacking into user accounts is the Captcha check—a picture of text
    and numbers that the human user must decipher and enter into a form
    field before sending the form back to the web server.
     Figure 3.3 A simple Captcha control can be used to prevent automated spam
     being sent through an online web form.
●   With the help of the Naïve Bayes classifier, a simple yet powerful algorithm to
    categorize observations into classes that’s explained in more detail in the
    sidebar, you can recognize digits from textual images.
●   These images aren’t unlike the Captcha checks many websites have in place to
    make sure you’re not a computer trying to hack into the user accounts.
●   Let’s see how hard it is to let a computer recognize images of numbers.
●   Our research goal is to let a computer recognize images of numbers (step
    one of the data science process).
●   The data we’ll be working on is the MNIST data set, which is often used in
    the data science literature for teaching and benchmarking.
●   The MNIST database is a large database of handwritten digits that is
    commonly used for training various image processing systems.
●   The MNIST images can be found in the datasets package of Scikit-learn and are
    already normalized for you (all scaled to the same size: 8x8 pixels, so 64 values
    per image), so we won’t need much data preparation (step three of the data
    science process).
●   But let’s first fetch our data as step two of the data science process, with the
    following listing.
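The listing itself is not reproduced in these notes. A minimal sketch of the fetch step, assuming Scikit-learn's load_digits() loader is the one intended, might look like this:

```python
from sklearn.datasets import load_digits

# Assumption: the digit images referred to in the text are the ones returned
# by load_digits(), which ships with Scikit-learn and needs no download.
digits = load_digits()

# images holds one small grayscale matrix per digit, target holds the labels.
print(digits.images.shape)  # (1797, 8, 8)
print(digits.target.shape)  # (1797,)
```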
In the case of a gray image, you put a value in every matrix entry that depicts the gray
value to be shown. The following code demonstrates this process and is step four of the
data science process: data exploration.
●   For a grayscale image, the pixel value is a single number that represents the brightness
    of the pixel. The most common pixel format is the byte image, where this number is
    stored as an 8-bit integer giving a range of possible values from 0 to 255.
●   Typically zero is taken to be black, and 255 is taken to be white.
●   The Naïve Bayes classifier is expecting a list of values, but each entry of
    digits.images is a two-dimensional array (a matrix) reflecting the shape of the
    image, which is what pl.matshow() displays. To flatten it into a list, we need to
    call reshape() on digits.images.
●   The net result will be a flat array of 64 values per image that looks something like this:
    array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0.,
    0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0.,
    0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0.,
    0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.]])
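The flattening step could be sketched as follows, again assuming the images come from Scikit-learn's load_digits(). The variable names are illustrative.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each image is a small matrix of grayscale values.
print(digits.images[0].shape)  # (8, 8)

# reshape() with -1 lets NumPy work out the row length, turning every
# image matrix into one flat row of pixel values.
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
print(X.shape)  # (1797, 64)
```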
Figure 3.4 Blurry grayscale representation of the number 0 with its corresponding matrix.
The higher the number, the closer it is to white; the lower the number, the closer it is to
black.
Figure 3.5 We’ll turn an image into something usable by the Naïve Bayes classifier by
getting the grayscale value for each of its pixels (shown on the right) and putting those
values in a list.
●   Now that we have a way to pass the contents of an image into the classifier, we
    need to pass it a training data set so it can start learning how to predict the
    numbers in the images.
●   We mentioned earlier that Scikit-learn contains a subset of the MNIST database
    (about 1,800 images), so we’ll use that.
●   Each image is also labeled with the number it actually shows. Training on these
    labeled images will build a probabilistic model in memory of the most likely digit
    shown in an image given its grayscale values.
●   Once the program has gone through the training set and built the model, we can
    then pass it the test set of data to see how well it has learned to interpret the
    images using the model.
●   The end result of this code is called a confusion matrix, such as the one shown in
    figure 3.6. Returned as a two-dimensional array, it shows how often the number
    predicted was the correct number on the main diagonal and also in the matrix
    entry (i,j), where j was predicted but the image showed i.
●   Looking at figure 3.6 we can see that the model predicted the number 2 correctly
    17 times (at coordinates 3,3), but also that the model predicted the number 8
    fifteen times when it was actually the number 2 in the image (at 9,3).
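The train, predict and evaluate steps described above might be sketched like this. It is a sketch under assumptions rather than the original listing: it uses Scikit-learn's GaussianNB as the Naïve Bayes variant, the default 75/25 split of train_test_split, and a fixed random_state so the run is repeatable. The counts it prints will not necessarily match figure 3.6.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))  # one flat row per image
y = digits.target                                    # the digit each image shows

# Keep part of the data aside as a test set the model never sees in training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the classifier on the labeled training images, then predict the rest.
model = GaussianNB()
model.fit(X_train, y_train)
predicted = model.predict(X_test)

# Rows are the digits actually shown, columns are the predicted digits,
# so correct predictions sit on the main diagonal.
cm = confusion_matrix(y_test, predicted)
print(cm)
```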
Figure 3.6 Confusion matrix produced by predicting what number is depicted by a blurry image
The model was correct in (35+40) 75 cases and incorrect in (15+10) 25 cases,
resulting in a (75 correct/100 total observations) 75% accuracy.
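The same arithmetic can be read straight off a confusion matrix. In the sketch below the four counts are the ones quoted above, but which of the two off-diagonal cells holds the 15 and which the 10 is an assumption; it does not change the accuracy.

```python
# Two-class confusion matrix: rows are the actual classes, columns the predictions.
confusion = [[35, 15],
             [10, 40]]

# Correct predictions are on the main diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(correct, total, accuracy)  # 75 100 0.75
```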
Figure 3.7 For each blurry image a number is predicted; only the number 2 is misinterpreted as 8.
Then an ambiguous number is predicted to be 3 but it could as well be 5; even to human eyes this
isn’t clear.
●   By discerning which images were misinterpreted, we can train the model further by labeling
    them with the correct number they display and feeding them back into the model as a new
    training set (step 5 of the data science process).
●   This will make the model more accurate, so the cycle of learn, predict, correct continues and
    the predictions become more accurate.
Unsupervised learning
 ●   Unsupervised Learning is a machine learning technique in which the users do not need to
     supervise the model.
 ●   Instead, it allows the model to work on its own to discover patterns and information that was
     previously undetected.
 ●   It mainly deals with the unlabelled data.
Types of Unsupervised Learning
     Unsupervised learning problems are further grouped into clustering and association
     problems. Commonly listed techniques include:
 ●   Hierarchical clustering
 ●   K-means clustering
 ●   K-NN (k nearest neighbors), in its unsupervised neighbor-search form
 ●   Principal Component Analysis
 ●   Singular Value Decomposition
 ●   Independent Component Analysis
Applications of unsupervised machine learning
Some applications of unsupervised machine learning techniques are:
 ●   Clustering automatically splits the dataset into groups based on their similarities.
 ●   Anomaly detection can discover unusual data points in your dataset. It is useful
      for finding fraudulent transactions.
 ●   Association mining identifies sets of items that often occur together in your
      dataset.
 ●   Latent variable models are widely used for data preprocessing, such as reducing the
      number of features in a dataset or decomposing the dataset into multiple
      components.
DISCERNING A SIMPLIFIED LATENT STRUCTURE FROM YOUR DATA
●   In statistics, latent variables are variables that are not directly observed but are
    rather inferred (through a mathematical model) from other variables that are
    observed (directly measured).
●   A latent variable is hidden, and therefore can’t be observed.
●   Example: actual customer satisfaction is a hidden or latent factor that can only
    be measured through manifest variables, or observable factors.
●   The company might choose to study observable variables, such as sales numbers, the
    price per sale, regional trends of purchasing, the gender of the customer, age of the
    customer, percentage of return customers, and how high a customer ranked the
    product on various sites, all in the pursuit of the latent factor, namely customer
    satisfaction.
CASE STUDY: FINDING LATENT VARIABLES IN A WINE QUALITY DATA SET
●   In this short case study, you’ll use a technique known as Principal Component Analysis
    (PCA) to find latent variables in a data set that describes the quality of wine.
●   Then you’ll compare how well a set of latent variables works in predicting the quality of
    wine against the original observable set.
●   Along the way you’ll see how to identify and derive those latent variables.
●   You’ll also analyze where the sweet spot is (how many new variables return the
    most utility) by generating and interpreting a scree plot from the PCA results.
●   A scree plot is a line plot of the eigenvalues of factors or principal components in an
    analysis.
●   The scree plot is used to determine the number of factors to retain in an exploratory
    factor analysis.
Main components of this example
Data set
●   The University of California, Irvine (UCI) has an online repository of 325
    data sets for machine learning exercises at http://archive.ics.uci.edu/ml/.
●   We’ll use the Wine Quality Data Set for red wines created by P. Cortez,
    A. Cerdeira, F. Almeida, T. Matos, and J. Reis. It’s 1,600 lines long and
    has 11 variables per line, as shown in table 3.2.
The first three rows of the Red Wine Quality Data Set
Principal Component Analysis: a technique to find the latent variables in your
data set while retaining as much information as possible.
Scikit-learn: we use this library because it already implements PCA for us
and offers a way to generate the scree plot.
Part one of the data science process is to set our research goal: We want to
explain the subjective “wine quality” feedback using the different wine
properties.
With the initial data preparation behind you, you can execute the PCA. The resulting
scree plot (which will be explained shortly) is shown in figure 3.8. Because PCA is an
explorative technique, we now arrive at step four of the data science process: data
exploration, as shown in the following listing.
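The listing is not reproduced in these notes. As a stand-in sketch, the code below runs PCA on Scikit-learn's bundled load_wine() data. This is an assumption made so the example runs without a download: it is a different wine data set (13 chemical measurements and no quality score), so its numbers will not match figure 3.8. The values it prints are the ones a scree plot would draw, one point per component.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): NOT the UCI red wine quality data from the text.
X = load_wine().data

# Standardize first so that variables measured on large scales do not dominate.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components kept so the full scree curve is available.
pca = PCA()
pca.fit(X_scaled)

# Share of the total variance captured by each component, largest first.
ratios = pca.explained_variance_ratio_
for number, ratio in enumerate(ratios, start=1):
    print(f"component {number}: {ratio:.3f}")
```

Plotting ratios against the component number gives the scree plot; the elbow is where the curve flattens out.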
The plot generated from the wine data set is shown in figure 3.8. What you hope to see is an
elbow or hockey stick shape in the plot. This indicates that a few variables can represent
the majority of the information in the data set while the rest only add a little more. In our
plot, PCA tells us that reducing the set down to one variable can capture approximately 28%
of the total information in the set.
An elbow shape in the plot suggests that five variables can hold most of the information found
inside the data. At this point, we could go ahead and see if the original data set recoded with five
latent variables is good enough to predict the quality of the wine accurately.
INTERPRETING THE NEW VARIABLES
 ●   With the initial decision made to reduce the data set from 11 original variables to 5 latent
     variables, we can check to see whether it’s possible to interpret or name them based on
     their relationships with the originals.
 ●    Actual names are easier to work with than codes such as lv1, lv2, and so on.
The rows in the resulting table (table 3.4) show the mathematical correlation between each
latent variable and the original variables. In plain terms, the first latent variable lv1, which
captures approximately 28% of the total information in the set, has the following formula:
lv1 = (fixed acidity * 0.489314) + (volatile acidity * -0.238584) + … + (alcohol * -0.113232)
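That a latent variable is just a weighted sum of the original variables can be checked in code. The sketch below again uses Scikit-learn's load_wine() as stand-in data (an assumption; it is not the red wine quality data, so the weights differ from table 3.4) and keeps five components.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 13 measurements per wine, standardized.
X = StandardScaler().fit_transform(load_wine().data)

# Keep five latent variables and recode every wine in terms of them.
pca = PCA(n_components=5)
latent = pca.fit(X).transform(X)
print(latent.shape)  # (178, 5)

# Each row of components_ holds the weights of one latent variable.
weights_lv1 = pca.components_[0]

# lv1 computed by hand: centered data times the weights, summed per wine.
lv1_by_hand = (X - pca.mean_) @ weights_lv1
print(np.allclose(lv1_by_hand, latent[:, 0]))  # True
```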
We can now recode the original data set with only the five latent variables.
Doing this is data preparation again, so we revisit step three of the data
science process: data preparation.
Already we can see high values for wine 0 in volatile acidity, while wine 2 is
particularly high in persistent acidity.
COMPARING THE ACCURACY OF THE ORIGINAL DATA SET WITH LATENT VARIABLES
●   Now that we’ve decided our data set should be recoded into 5 latent
    variables rather than the 11 originals, it’s time to see how well the new
    data set works for predicting the quality of wine when compared to the
    original.
●   We’ll use the Naïve Bayes Classifier algorithm we saw in the previous
    example for supervised learning to help. Let’s start by seeing how well
    the original 11 variables could predict the wine quality scores.
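A comparison of this kind might be sketched as follows. Several things here are assumptions: the data is Scikit-learn's load_wine() stand-in, where the target is the grape cultivar rather than a quality score, the classifier is GaussianNB, and accuracy on a single held-out test set is used as the yardstick.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): predicts cultivar, not the quality score of the text.
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Model 1: Naive Bayes on all original variables.
original_model = GaussianNB().fit(X_train_s, y_train)
original_accuracy = original_model.score(X_test_s, y_test)

# Model 2: Naive Bayes on five latent variables derived with PCA.
pca = PCA(n_components=5).fit(X_train_s)
latent_model = GaussianNB().fit(pca.transform(X_train_s), y_train)
latent_accuracy = latent_model.score(pca.transform(X_test_s), y_test)

print("original variables:", original_accuracy)
print("latent variables:  ", latent_accuracy)
```

Repeating the second model with a different number of components shows how the accuracy changes as latent variables are added.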
GROUPING SIMILAR OBSERVATIONS TO GAIN INSIGHT FROM THE DISTRIBUTION OF YOUR DATA
●   The general technique we’re describing here is known as clustering.
●   In this process, we attempt to divide our data set into observation
    subsets, or clusters, wherein observations should be similar to those in
    the same cluster but differ greatly from the observations in other
    clusters.
●   Figure 3.10 gives you a visual idea of what clustering aims to achieve. The
    circles in the top left of the figure are clearly close to each other while
    being farther away from the others.
●   Scikit-learn implements several common algorithms for clustering data in
    its sklearn.cluster module, including the k-means algorithm, affinity
    propagation, and spectral clustering.
●   k-means is a good general-purpose algorithm with which to get started.
    However, like many clustering algorithms, it requires you to specify the
    number of desired clusters in advance, which often results in a process of
    trial and error before reaching a satisfying conclusion.
● The following listing uses an iris data set to see if the algorithm can
    group the different types of irises.
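# Cluster the iris data with k-means (Python 3)
import pandas as pd
from sklearn import cluster, datasets

# Load the iris measurements into a data frame with named columns
data = datasets.load_iris()
X = pd.DataFrame(data.data, columns=list(data.feature_names))
print(X[:5])

# Fit k-means with three clusters; random_state makes the run repeatable
model = cluster.KMeans(n_clusters=3, random_state=25)
results = model.fit(X)

# Add the assigned cluster and the true species to the data frame
X["cluster"] = results.predict(X)
X["target"] = data.target
X["c"] = "lookatmeIamimportant"  # dummy column so there is something to count
print(X[:5])

# Count observations per (cluster, target) pair to compare clusters with species
classification_result = (
    X[["cluster", "target", "c"]].groupby(["cluster", "target"]).agg("count"))
print(classification_result)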
Semi-supervised learning
●   Semi-supervised learning is the type of machine learning that uses a
    combination of a small amount of labeled data and a large amount of unlabeled
    data to train models.
●   This approach to machine learning is a combination of supervised machine
    learning, which uses labeled training data, and unsupervised learning, which
    uses unlabeled training data.
●   Unlabeled data, when used in conjunction with a small amount of labeled data,
    can produce considerable improvement in learning accuracy.
●   A common semi-supervised learning technique is label propagation.
●   In this technique, you start with a labeled data set and give the same label to
    similar data points.
●   This is similar to running a clustering algorithm over the data set and labeling each
    cluster based on the labels it contains.
●   One special approach to semi-supervised learning worth mentioning here is active
    learning.
●   Active learning is a special case of machine learning in which a learning algorithm
    can interactively query a user (or some other information source) to label new data
    points with the desired outputs.
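Label propagation, mentioned above, can be tried out with Scikit-learn's LabelPropagation class. The setup below is an illustrative assumption: it takes the iris data, keeps the true label for only five flowers per species, marks the rest as unlabeled with -1 (the marker this class expects), and uses the k-nearest-neighbors kernel.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

data = load_iris()
X, y_true = data.data, data.target

# Hide most labels: -1 marks an observation as unlabeled.
rng = np.random.RandomState(0)
y_partial = np.full_like(y_true, -1)
labeled_idx = []
for cls in np.unique(y_true):
    cls_idx = np.where(y_true == cls)[0]
    # Keep five labeled examples per class (hypothetical choice).
    labeled_idx.extend(rng.choice(cls_idx, size=5, replace=False))
y_partial[labeled_idx] = y_true[labeled_idx]

# Spread the known labels to similar (neighboring) data points.
model = LabelPropagation(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the label assigned to every observation.
propagated = model.transduction_
agreement = (propagated == y_true).mean()
print("share of observations given the true label:", agreement)
```

In an active learning setup, the observations the model is least sure about would be the ones sent to a person for labeling.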