Module 3
Fitting a Model to Data
              What is a model?
In certain fields of Statistics & Economics, the bare model with unspecified
parameters is called “The Model”.
The Model: the model is a convenient fiction that necessarily glosses over
some of the details of the actual thing being modeled. It is meant to capture
the structure of the data as simply as possible.
The basic structure of a statistical model is:
data = model + error
We generally describe a model in terms of its parameters, which are
values that we can change in order to modify the predictions of the
model.
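As a minimal sketch of data = model + error, assume a hypothetical straight-line model with two parameters, slope and intercept. The function names and the numbers are invented for illustration. Changing the parameters changes the predictions, and therefore the errors.

```python
# Hypothetical model: a straight line with two parameters.
def predict(x, slope, intercept):
    return slope * x + intercept

# Made-up observations, for illustration only.
xs = [1.0, 2.0, 3.0]
data = [2.5, 4.0, 6.5]

def errors(slope, intercept):
    # error = data - model, one value per observation
    return [y - predict(x, slope, intercept) for x, y in zip(xs, data)]

errors_a = errors(slope=2.0, intercept=0.0)   # one choice of parameters
errors_b = errors(slope=1.0, intercept=1.0)   # a different choice
print(errors_a)
print(errors_b)
```

The first parameter choice leaves smaller errors on this made-up data than the second, which is the sense in which one set of parameters can fit better than another.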
What is Model Fitting?
Model fitting is a measure of how well a machine learning model generalizes to data
that is similar to the data on which it was trained. The fitting process is generally
built into models and runs automatically. A well-fit model will accurately approximate
the output when given new data.
Classification via Mathematical Functions
Decision Boundaries
A main purpose of creating homogeneous regions is that we can predict the
target variable of a new, unseen instance by determining which segment it
falls into.
For example, in Figure 4-1, if a new customer falls into the lower-left
segment, we can conclude that the target value is very likely to be “•”.
Similarly, if it falls into the upper-right segment, we can predict its
value as “+”.
Figure panels: Instance Space; Linear Classifier
                  Linear Discriminant Functions
Our goal is going to be to fit our model to the data, and to do so it is
quite helpful to represent the model mathematically.
You may recall that the equation of a line in two dimensions is
y = mx + b, where m is the slope of the line and b is the y intercept
(the y value when x = 0).
The line in Figure 4-3 can be expressed in this form (with Balance in
thousands) as:
We would classify an instance x as a + if it is above the line,
  and as a • if it is below the line.
For this example decision boundary, the classification solution is:
• Linear discriminant:
  class(x) = + if 1.0 × Age − 1.5 × Balance + 60 > 0
             • if 1.0 × Age − 1.5 × Balance + 60 ≤ 0
• We now have a parameterized model: the weights of the linear function
  are the parameters.
For our purposes, the important thing is that we can express the model
as a weighted sum of the attribute values.
• The weights are often loosely interpreted as importance indicators of the
  features.
Thus, this linear model is a different sort of multivariate supervised
segmentation.
For example, consider our feature vector x, with the individual component
features being xi.
A linear model can then be written as follows in Equation 4-2.
  f(x) = w0 + w1x1 + w2x2 + ...
Equation 4-2. A general linear model
The concrete example from Equation 4-1 can be written in this form:
  f(x) = 60 + 1.0 × Age − 1.5 × Balance
To use this model as a linear discriminant, for a given instance
represented by a feature vector x, we check whether f(x) is positive or
negative.
In the two-dimensional case, this corresponds to seeing whether the
instance x falls above or below the line.
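The check on the sign of f(x) can be sketched as follows. The weights are the ones from the Age/Balance discriminant above. The function names and the two customers are hypothetical, invented only to exercise the rule.

```python
def f(x, w0, weights):
    """General linear model: intercept plus weighted sum of the feature values."""
    return w0 + sum(w * xi for w, xi in zip(weights, x))

def classify(x, w0, weights):
    # Positive f(x) -> "+", otherwise "•" (the rule of the linear discriminant).
    return "+" if f(x, w0, weights) > 0 else "•"

# Weights from the example discriminant; features are (Age, Balance in thousands).
w0, weights = 60.0, [1.0, -1.5]

# Two hypothetical customers, invented for illustration.
label_a = classify([30.0, 20.0], w0, weights)   # f = 60 + 30 - 30 = 60
label_b = classify([20.0, 80.0], w0, weights)   # f = 60 + 20 - 120 = -40
print(label_a, label_b)
```

The same two functions work for any number of features, because the model is just a weighted sum.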
In Figure 4-5, there actually are many different linear discriminants that
can separate the classes perfectly.
They have very different slopes and intercepts, and each represents a
different model of the data. In fact, there are infinitely many lines (models)
that classify this training set perfectly.
Which should we pick?
      Optimizing an Objective Function
What should be our goal or objective in choosing the parameters?
 Our general procedure will be to define an objective function that represents
our goal, and can be calculated for a particular set of weights and a particular
set of data.
We will then find the optimal value for the weights by maximizing or minimizing
the objective function.
What can easily be overlooked is that these weights are “best” only if we
believe that the objective function truly represents what we want to achieve.
Unfortunately, creating an objective function that matches the true goal of the
data mining is usually impossible.
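A rough sketch of this procedure: define an objective that can be computed from a set of weights and a set of data, then prefer the weights with the best objective value. The objective below (number of training errors) is an assumption chosen for simplicity, not the objective of any method named in these slides, and the "optimization" is only a comparison of three hand-picked candidates on made-up data.

```python
def f(x, w):
    # w[0] is the intercept; the remaining weights pair up with the features.
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def training_errors(w, instances, labels):
    """Objective to minimize: how many training instances are misclassified."""
    predictions = [1 if f(x, w) > 0 else -1 for x in instances]
    return sum(1 for p, y in zip(predictions, labels) if p != y)

# Tiny made-up training set: two features, labels +1 / -1.
instances = [[1.0, 1.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0]]
labels = [-1, -1, 1, 1]

# Hand-picked candidate weight vectors (hypothetical).
candidates = [
    [-8.0, 1.0, 1.0],   # boundary x1 + x2 = 8
    [-3.0, 1.0, 0.0],   # boundary x1 = 3
    [0.0, 1.0, -1.0],   # boundary x1 = x2
]

scores = [training_errors(w, instances, labels) for w in candidates]
best = min(candidates, key=lambda w: training_errors(w, instances, labels))
print(scores, best)
```

Two of the candidates tie with zero training errors, so this objective alone cannot choose between them. That is one reason other objective functions are used in practice.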
Several choices have been shown to be remarkably effective. One of these
choices creates the so-called “support vector machine.”
Linear regression, logistic regression, and support vector machines
are all very similar instances of our basic technique:
fitting a (linear) model to data.
The key difference is that each uses a different objective function.
Regression
• Technique used for the modeling and analysis of numerical data.
• Exploits the relationship between two or more variables so that we can
gain information about one of them through knowing values of the other.
• Regression can be used for prediction, estimation, hypothesis testing,
and modeling causal relationships.
Y = X1 + X2 + X3
Y                      X1, X2, X3
Dependent Variable     Independent Variable
Outcome Variable       Predictor Variable
Response Variable      Explanatory Variable
Linear Regression
Linear regression is the type of regression that models the relationship between
the target variable and one or more independent variables with a straight
line. Linear regression is represented by the equation
Y = a + b*X + e,
where a represents the intercept, b represents the slope of the
regression line, e represents the error, and X and Y represent the predictor
and target variables, respectively.
If X is made up of more than one variable, the model is termed multiple linear
regression.
In linear regression, the best-fit line is found with the least squares
method, which minimizes the sum of the squared deviations of each data point
from the regression line. Because the deviations are squared, positive and
negative deviations do not cancel out.
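For a single predictor, the least squares line has a well-known closed form, sketched below in plain Python. The function names and the four data points are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) minimizing the sum of squared deviations of ys from a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

def sum_squared_error(a, b, xs, ys):
    # Squaring keeps positive and negative deviations from canceling.
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # made-up data
a, b = least_squares_line(xs, ys)
print(a, b, sum_squared_error(a, b, xs, ys))
```

Nudging either a or b away from the fitted values increases the sum of squared errors, which is what "best fit" means under this objective.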
Linear Regression Line
A straight line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:
Positive Linear Relationship:
If the dependent variable (Y-axis) increases as the independent variable
(X-axis) increases, the relationship is termed a positive linear
relationship.
Negative Linear Relationship:
If the dependent variable (Y-axis) decreases as the independent variable
(X-axis) increases, the relationship is termed a negative linear
relationship.
                            Logistic Regression
This type of statistical model (also known as logit model) is often used for
classification and predictive analytics.
 Logistic regression estimates the probability of an event occurring, such as
voted or didn’t vote, based on a given dataset of independent variables.
Since the outcome is a probability, the dependent variable is bounded
between 0 and 1.
In logistic regression, a logit transformation is applied to the odds, that is, the
probability of success divided by the probability of failure.
The result is commonly known as the log odds, or the natural logarithm of the
odds. With p the probability of success and z the linear function of the
independent variables:
logit(p) = ln(p / (1 − p)) = z
The inverse of the logit is the logistic function, which maps z back to a
probability:
p = 1 / (1 + exp(−z))
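The relationship between odds, log odds, and probability can be checked numerically. This is a small sketch using the standard definitions, where the logit is ln(p / (1 − p)) and the logistic function is its inverse. The function names and the probability 0.8 are chosen only for illustration.

```python
import math

def logit(p):
    """Log odds: natural log of p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
odds = p / (1.0 - p)             # about 4, i.e. 4 to 1
log_odds = logit(p)              # natural log of the odds
recovered = logistic(log_odds)   # back to the original probability
print(odds, log_odds, recovered)
```

Because the logistic function always returns a value strictly between 0 and 1, a linear function of the features can be turned into a probability estimate.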
     Logistic regression is a misnomer
• The distinction between classification and regression is whether the value for
  the target variable is categorical or numeric
• For logistic regression, the model produces a numeric estimate.
• However, the values of the target variable in the data are
  categorical.
• Logistic regression is estimating the probability of class membership
  (a numeric quantity) over a categorical class.
• Logistic regression is a class probability estimation model and not a
  regression model.
An Example of Mining a Linear Discriminant from Data
 • To illustrate linear discriminant functions, we use an adaptation of the Iris
   dataset.
 • This is an old and fairly simple dataset representing various types of iris, a
   genus of flowering plant.
 • The original dataset includes three species of irises represented with
   four attributes, and the data mining problem is to classify each instance
   as belonging to one of the three species based on the attributes.
 • For this illustration we’ll use just two species of irises, Iris Setosa and Iris
   Versicolor.
 • The dataset describes a collection of flowers of these two species, each
   described with two measurements: the Petal width and the Sepal width
   (Figure 4-6).
Classifying Flowers
• The flower dataset is plotted in Figure 4-7, with these two attributes on the x and y
  axes, respectively.
• Each instance is one flower and corresponds to one dot on the graph.
• The filled dots are of the species Iris Setosa and the circles are instances of
  the species Iris Versicolor.
• Two different separation lines are shown in the figure, one generated by
  logistic regression and the second by another linear method, a support
  vector machine (which will be described shortly).
• Note that the data comprise two fairly distinct clumps, with a few outliers.
  Logistic regression separates the two classes completely: all the Iris
  Versicolor examples are to the left of its line and all the Iris Setosa to
  the right.
• The Support vector machine line is almost midway between the clumps,
  though it misclassifies the starred point at (3, 1).
• Notice that the methods produce different boundaries because they’re
  optimizing different functions.
Ranking Instances and Class Probability Estimation
• In many applications, we don’t simply want a yes or no prediction of whether an
  instance belongs to the class, but we want some notion of which examples are more
  or less likely to belong to the class. For example:
   • Which consumers are most likely to respond to this offer?
   • Which customers are most likely to leave when their contracts expire?
• Ranking
   • Tree induction
   • Linear discriminant functions (e.g., linear regressions, logistic regressions,
    SVMs)
      • Ranking is free
• Class Probability Estimation
   • Tree induction
   • Logistic regression
The many faces of classification:
Classification / Probability Estimation / Ranking
Increasing difficulty: Classification → Ranking → Probability
 Ranking:
 • Business context determines the number of actions (“how far down the
   list”)
 Probability:
 • You can always rank / classify if you have probabilities!
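A short sketch of why probabilities give ranking and classification for free. The customer names, the probability estimates, the 0.5 threshold, and the "top 2" cutoff are all assumptions made up for illustration.

```python
# Hypothetical class probability estimates for four customers.
estimates = {"A": 0.15, "B": 0.90, "C": 0.55, "D": 0.40}

# Ranking: order by estimated probability, most likely first.
ranking = sorted(estimates, key=estimates.get, reverse=True)

# Classification: apply a threshold (0.5 here is an assumed choice).
threshold = 0.5
classes = {name: p >= threshold for name, p in estimates.items()}

# Business context decides how far down the list to act, e.g. the top 2.
top_two = ranking[:2]
print(ranking, classes, top_two)
```

The reverse does not hold: a ranking or a set of yes/no labels does not determine the underlying probabilities.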
Ranking: Examples
• Search engines
  • Whether a document is relevant to a topic / query
 Class Probability Estimation: Examples
• MegaTelCo
  • Ranking vs. Class Probability Estimation.
• Identify accounts or transactions as likely to have been defrauded
  • The director of the fraud control operation may want the analysts to focus
    not simply on the cases most likely to be fraud, but on accounts where the
    expected monetary loss is higher.
     • We need to estimate the actual probability of fraud.
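To see why the actual probability matters here, consider a sketch in which expected loss is the probability of fraud times the amount at risk. The account names, probabilities, and amounts are invented for illustration.

```python
# Hypothetical accounts: (estimated probability of fraud, amount at risk).
accounts = {
    "acct_1": (0.90, 100.0),
    "acct_2": (0.20, 5000.0),
    "acct_3": (0.50, 400.0),
}

# Expected loss = probability of fraud x monetary loss if it is fraud.
expected_loss = {name: p * amount for name, (p, amount) in accounts.items()}

by_probability = sorted(accounts, key=lambda n: accounts[n][0], reverse=True)
by_expected_loss = sorted(expected_loss, key=expected_loss.get, reverse=True)
print(by_probability)
print(by_expected_loss)
```

In this made-up data the account least likely to be fraudulent has the highest expected loss, so the two orderings disagree. A ranking by likelihood alone would not be enough to prioritize by expected loss.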
Logistic regression (“sigmoid”) curve
Application of Logistic Regression
 • The Wisconsin Breast Cancer Dataset
Wisconsin Breast Cancer dataset
 •   From each of these basic characteristics, three values were
     computed: the mean (_mean), standard error (_SE), and “worst” or
     largest
 Support Vector Machines
In machine learning, support vector machines (SVMs, also support vector networks) are
supervised learning models with associated learning algorithms that analyze data for
classification and regression analysis.
Find a linear hyperplane (decision boundary) that will separate the data.
Many hyperplanes can do so: B1 is one possible solution, B2 is another, and
there are other possible solutions.
Which one is better, B1 or B2?
How do you define better?
Definitions
Define the hyperplane H such that:
xi•w+b ≥ +1 when yi = +1
xi•w+b ≤ -1 when yi = -1
H1 and H2 are the planes:
H1: xi•w+b = +1
H2: xi•w+b = -1
The points on the planes H1 and H2 are the support vectors.
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d-.
              Maximizing the margin
We want a classifier with as big a margin as possible.
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is
|A x0 + B y0 + c| / sqrt(A^2 + B^2)
The distance between H and H1 is:
|w•x+b| / ||w|| = 1 / ||w||
The distance between H1 and H2 is: 2 / ||w||
In order to maximize the margin, we need to minimize ||w||, with the
condition that there are no datapoints between H1 and H2:
xi•w+b ≥ +1 when yi = +1
xi•w+b ≤ -1 when yi = -1
These can be combined into yi(xi•w+b) ≥ 1
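The margin formula and the combined constraint yi(xi•w+b) ≥ 1 can be checked on a toy example. The points, labels, and the two candidate hyperplanes below are made up. Both candidates satisfy the constraints, and the one with the smaller ||w|| has the wider margin.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin(w):
    """Distance between H1 and H2: 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

def satisfies_constraints(w, b, points, labels):
    """True if every point has yi * (xi . w + b) >= 1, so none lies between H1 and H2."""
    return all(y * (dot(x, w) + b) >= 1 for x, y in zip(points, labels))

# Made-up, linearly separable points with labels +1 / -1.
points = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
labels = [1, 1, -1, -1]

# Two hypothetical separating hyperplanes (w, b).
w_a, b_a = [0.5, 0.5], -1.0    # (2, 2) and (0, 0) lie exactly on H1 and H2
w_b, b_b = [1.0, 1.0], -2.0    # same direction, larger ||w||

ok_a = satisfies_constraints(w_a, b_a, points, labels)
ok_b = satisfies_constraints(w_b, b_b, points, labels)
margin_a = margin(w_a)
margin_b = margin(w_b)
print(ok_a, ok_b, margin_a, margin_b)
```

With the first candidate, the points (2, 2) and (0, 0) sit exactly on the margin planes, so they play the role of support vectors in this example.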
Optimization Problem
Figure: decision boundary B1 (w•x + b = 0) with its two margin boundaries,
b11 and b12, on which w•x + b = +1 and w•x + b = −1.
f(x) = 1 if w•x + b ≥ 1
f(x) = −1 if w•x + b ≤ −1
Margin = 2 / ||w||
Figure: B1 with margin boundaries b11 and b12, and B2 with margin boundaries
b21 and b22.
Find the hyperplane that maximizes the margin => B1 is better than B2
We want to maximize: Margin = 2 / ||w||
Which is equivalent to minimizing: L(w) = ||w||^2 / 2
But subject to the following constraints:
w•xi + b ≥ 1 if yi = 1
w•xi + b ≤ −1 if yi = −1
This is a constrained optimization problem.
Numerical approaches can be used to solve it (e.g., quadratic programming).
         Loss Functions
• Loss functions define what a good prediction is and isn’t. In
  short, the choice of loss function determines how well your
  estimator will perform.
• Loss functions measure how far an estimated value is from its
  true value.
• A loss function maps decisions to their associated costs.
• Loss functions are not fixed; they change depending on the
  task at hand and the goal to be met.
• Zero-one loss assigns a loss of zero for a correct decision
  and one for an incorrect decision.
• Squared error specifies a loss proportional to the square of
  the distance from the boundary.
   • Squared error loss usually is used for numeric value
     prediction (regression), rather than classification.
  • The squaring of the error has the effect of greatly penalizing
   predictions that are grossly wrong.
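Both losses are short enough to write out. This is a sketch with hypothetical function names, and the squared error is shown for numeric prediction, where it is usually applied.

```python
def zero_one_loss(predicted, actual):
    """0 for a correct decision, 1 for an incorrect one."""
    return 0 if predicted == actual else 1

def squared_error(predicted, actual):
    """Square of the difference between the estimate and the true value."""
    return (predicted - actual) ** 2

correct = zero_one_loss("+", "+")
incorrect = zero_one_loss("+", "-")
small_miss = squared_error(9.0, 10.0)    # off by 1
gross_miss = squared_error(0.0, 10.0)    # off by 10
print(correct, incorrect, small_miss, gross_miss)
```

An error ten times larger costs a hundred times more under squared error, which is the sense in which grossly wrong predictions are greatly penalized.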
Hinge Loss functions
• In machine learning, the hinge loss is a loss function used for
  training classifiers. The hinge loss is used for "maximum-
  margin" classification.
• Support vector machines use hinge loss.
• The hinge loss is a special type of cost function that not only
  penalizes misclassified samples but also correctly classified
  ones that are within a defined margin from the decision
  boundary.
• Hinge loss incurs no penalty for an example that is not on the
  wrong side of the margin.
• The hinge loss only becomes positive when an example is on
  the wrong side of its margin: either misclassified, or correctly
  classified but within the margin.
   • Loss then increases linearly with the example’s distance from
     the margin.
   • Penalizes points more the farther they are from the separating
     boundary.
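The slides do not give a formula, so the sketch below assumes the standard definition of hinge loss, max(0, 1 − y·f(x)), with labels +1 and −1. The function name and the sample outputs are chosen only for illustration.

```python
def hinge_loss(y, fx):
    """y is the true label (+1 or -1); fx is the model output f(x)."""
    return max(0.0, 1.0 - y * fx)

# A positive example scored at decreasing values of f(x):
# well beyond the margin, on the margin, inside the margin,
# on the boundary, and on the wrong side of the boundary.
outputs = (2.0, 1.0, 0.5, 0.0, -1.0, -3.0)
losses = [hinge_loss(+1, fx) for fx in outputs]
print(losses)
```

The loss is zero at or beyond the margin and then grows linearly as the example moves farther onto the wrong side.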
Characteristics of SVM
The learning problem is formulated as a convex optimization problem.
   • Efficient algorithms are available to find the global minimum.
   • Many of the other methods use greedy approaches and find locally
      optimal solutions.
   • High computational complexity for building the model.
• Robust to noise.
• Overfitting is handled by maximizing the margin of the decision boundary.
• SVM can handle irrelevant and redundant attributes better than many
  other techniques.
• The user needs to provide the type of kernel function and cost function.
• Difficult to handle missing values.
Simple Neural Network
          Non-linear Functions
• Linear functions can actually represent nonlinear models, if we include
  more complex features in the functions.
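As an illustrative sketch, adding a squared term as an extra feature lets a weighted sum draw a boundary that is not a single threshold on the original feature. The feature expansion and the weights below are assumptions chosen so that f(x) = x^2 − 4.

```python
def expand(x):
    """Hypothetical feature expansion: add the square of the original feature."""
    return [x, x * x]

def f(features, w0, weights):
    # Still a weighted sum, so the model is linear in its weights.
    return w0 + sum(w * v for w, v in zip(weights, features))

# Assumed weights, chosen so that f(x) = x^2 - 4.
w0, weights = -4.0, [0.0, 1.0]

inputs = (-3.0, -1.0, 0.0, 1.0, 3.0)
signs = [f(expand(x), w0, weights) > 0 for x in inputs]
print(signs)
```

The model labels both ends of the range positive and the middle negative, which no single cut on x alone could do, yet fitting it is still a matter of choosing weights for a linear function.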