MODULE 1
Module 1: Foundations of Data Science, Process, and Tools
 Introduction to data science
 Properties of data, asking interesting questions
 Classification of data science
 Data science process
 Collecting, cleaning, and visualizing data
 Languages and models for data science


Introduction to data science
 Data science is a collection of techniques used to extract value from data.
 It has become an essential tool for any organization that collects, stores, and processes data as part of its operations.
 Data science techniques rely on finding useful patterns, connections, and relationships within data.
 Data science is also commonly referred to as:
  knowledge discovery
  machine learning
  predictive analytics
  data mining

Reference: Kotu, V., & Deshpande, B. (2019). Data Science: Concepts and Practice. Morgan Kaufmann.
AI, ML and DS
[Figure: the relationship between artificial intelligence, machine learning, and data science]
 Computer scientists, by nature, don't respect data.
 Examples of the cultural differences between computer science and real science include:
  Data vs. method centrism
  Concern about results
  Robustness
  Precision

Reference: Skiena, S. S. (2017). The Data Science Design Manual. Springer.
Data vs. method centrism
 Scientists are data driven, while computer scientists are algorithm driven.
 Real scientists spend enormous amounts of effort collecting data to answer their question of interest.
 Computer scientists obsess about methods:
  which algorithm is better than which other algorithm
  which programming language is best for a job
  which program is better than which other program
Concern about results
 Real scientists care about answers.
 They analyze data to discover something about how the world works.
 Bad computer scientists worry about producing plausible-looking numbers.
  They are personally less invested in what can be learned from a computation, as opposed to getting it done quickly and efficiently.
Robustness
 Real scientists are comfortable with the idea that data has errors; computer scientists are not.
 Scientists think a lot about possible sources of bias or error in their data, and how these possible problems can affect the conclusions derived from them.
 Computer scientists chant "garbage in, garbage out."
Precision
 Nothing is ever completely true or false in science.
 Everything is either true or false in computer science or mathematics.
 Computer scientists care what a number is, while real scientists care what it means.
Asking Interesting Questions from Data
 What things might you be able to learn from a given data set?
 What do you/your people really want to know about the world?
 What will it mean to you once you find out?
 The Baseball Encyclopedia

 The Internet Movie Database (IMDb)

 Google Ngrams
 New York Taxi Records
 Prepare new questions from these datasets (minimum 3).
Properties of Data
 Structured vs. Unstructured Data
 Quantitative vs. Categorical Data
 Big Data vs. Little Data
Structured vs. Unstructured Data
 Structured data
  Data sets are nicely structured, like the tables in a database or spreadsheet program.
  Data is often represented by a matrix, where:
   the rows of the matrix represent distinct items or records
   the columns represent distinct properties of these items
  For example, a data set about U.S. cities might contain one row for each city, with columns representing features like state, population, and area.
 Unstructured data
  Records information about the state of the world, but in a more heterogeneous way (e.g., a collection of tweets from Twitter).
  The first step is to build a matrix to structure the data.
  A bag of words model will construct a matrix with a row for each tweet, and a column for each frequently used vocabulary word.
  Matrix entry M[i, j] then denotes the number of times tweet i contains word j.
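A minimal sketch of this bag of words construction, using scikit-learn's CountVectorizer (the sample tweets are made-up examples, not from the source):

    from sklearn.feature_extraction.text import CountVectorizer

    tweets = [
        "data science is fun",
        "machine learning is data driven",
        "fun with machine learning",
    ]

    vectorizer = CountVectorizer()        # one column per vocabulary word
    M = vectorizer.fit_transform(tweets)  # sparse matrix: M[i, j] = count of word j in tweet i

    print(vectorizer.get_feature_names_out())
    print(M.toarray())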
Quantitative vs. Categorical Data
 Quantitative data
  Consists of numerical values, like height and weight.
  Data can be incorporated directly into algebraic formulas and mathematical models, or displayed in conventional graphs and charts.
 Categorical data
  Consists of labels describing the properties of the objects under investigation, like gender, hair color, and occupation.
  This descriptive information can be every bit as precise and meaningful as numerical data, but it cannot be worked with using the same techniques.
  Categorical data can usually be coded numerically. For example, gender might be represented as male = 0 or female = 1.
  Similarly: gray hair = 0, red hair = 1, and blond hair = 2.
 We cannot really treat these values as numbers, for anything other than simple identity testing.
  Does it make any sense to talk about the maximum or minimum hair color?
  What is the interpretation of my hair color minus your hair color?
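A hedged sketch of this numeric coding with pandas (the column name and code values are hypothetical, following the hair color example above):

    import pandas as pd

    df = pd.DataFrame({"hair_color": ["gray", "red", "blond", "red"]})

    codes = {"gray": 0, "red": 1, "blond": 2}      # gray = 0, red = 1, blond = 2
    df["hair_code"] = df["hair_color"].map(codes)

    # The codes support identity testing only; max/min or subtraction is meaningless.
    print(df)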
Big Data vs. Little Data
 Big data
  The analysis of massive data sets resulting from computer logs and sensor devices.
  In principle, having more data is always better than having less, because you can always throw some of it away by sampling to get a smaller set if necessary.
  There are difficulties in working with large data sets.
 The challenges of big data include:
  The analysis cycle time slows as data size grows: computational operations on data sets take longer as their volume increases.
  Large data sets are complex to visualize: plots with millions of points on them are impossible to display on computer screens or printed images, let alone conceptually understand.
 Simple models do not require massive data to fit or evaluate.
  A typical data science task might be to make a decision (say, whether to offer this fellow life insurance?) on the basis of a small number of variables: say age, gender, height, weight, and the presence or absence of existing medical conditions.
 Big data is sometimes called bad data.
  We might have to go to heroic efforts to make sense of something just because we have it.
Classification of data science
Supervised Learning
 Data science problems can be broadly categorized into supervised or unsupervised learning models.
 Supervised or directed data science tries to infer a function or relationship based on labeled training data and uses this function to map new unlabeled data.
 Supervised techniques predict the value of the output variables based on a set of input variables.
 A model is developed from a training dataset where the values of input and output are previously known.
 The model generalizes the relationship between the input and output variables and uses it to predict for a dataset where only input variables are known.
 The output variable that is being predicted is also called a class label or target variable.
 Supervised data science needs a sufficient number of labeled records to learn the model from the data.
Unsupervised Learning
 Unsupervised or undirected data science uncovers hidden patterns in unlabeled data.
 There are no output variables to predict.
 It finds patterns in data based on the relationship between data points themselves.
Data Science Tasks
[Figure: overview of data science tasks]
Classification and Regression
 Classification and regression techniques predict a target variable based on input variables.
 The prediction is based on a generalized model built from a previously known dataset.
 In regression tasks, the output variable is numeric (e.g., the mortgage interest rate on a loan).
 Classification tasks predict output variables, which are categorical or polynomial (e.g., the yes or no decision to approve a loan).
 Deep learning is a more sophisticated artificial neural network used for classification and regression problems.
Clustering
 Clustering is the process of identifying the natural groupings in a dataset.
  Generalize the uniqueness of each cluster.
 Market basket analysis (association analysis)
  Identifies pairs of items that are purchased together, so that specific items can be bundled or placed next to each other.
  Commonly used in cross-selling.
 Recommendation engines are systems that recommend items to users based on individual user preference.
 Anomaly or outlier detection identifies the data points that are significantly different from other data points in a dataset.
  Example: credit card transaction fraud detection.
 Time series forecasting is the process of predicting the future value of a variable (e.g., temperature) based on past historical values that may exhibit a trend and seasonality.
 Text mining is a data science application where the input data is text,
  which can be in the form of documents, messages, emails, or web pages.
  Each text file is converted to a document vector.
  Standard data science tasks such as classification, clustering, etc., can then be applied to the vectors.
 Feature selection is a process in which the attributes in a dataset are reduced to the few attributes that really matter (e.g., using height, weight, age, and eye color to predict the probability of heart disease).
 A complete data science application can contain elements of both supervised and unsupervised techniques.
 In marketing analytics, clustering can be used to find the natural clusters in customer records.
  Each customer is assigned a cluster label at the end of the clustering process.
  The labeled customer dataset can then be used to develop a model that assigns a cluster label to any new customer record with a supervised classification technique.
 For data science tasks and examples, see Kotu & Deshpande (2019), p. 25.
Data science process
 The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
 The standard data science process involves:
  (1) understanding the problem
  (2) preparing the data samples
  (3) developing the model
  (4) applying the model on a dataset to see how the model may work in the real world
  (5) deploying and maintaining the models
Modeling
 Modeling is the process of building representative models that can be inferred from the sample dataset, which can be used for:
  either predicting (predictive modeling), or
  describing the underlying pattern in the data (descriptive or explanatory modeling).
 There are many data science tools that can automate the model building.
 The most time-consuming part of the overall data science process is the preparation of data, followed by data and business understanding.
 Crucial to the success of the data science process:
  Asking the right business question
  Gaining in-depth business understanding
  Sourcing and preparing the data for the data science task
  Mitigating implementation considerations
  Integrating the model into the business process
  Gaining knowledge from the dataset
Business Understanding
PRIOR KNOWLEDGE
 Prior knowledge refers to information that is already known about a subject.
 The data science problem doesn't emerge in isolation;
  it always develops on top of existing subject matter and contextual information that is already known.
 The prior knowledge step in the data science process helps to define:
  what problem is being solved
  how it fits in the business context, and
  what data is needed in order to solve the problem
Objective
 The data science process starts with a need for analysis, a question, or a business objective.
 Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
 Consumer loan business
  A loan is provisioned for individuals against the collateral of assets like a home or car, that is, a mortgage or an auto loan.
  An important component of the loan is the interest rate at which the borrower repays the loan on top of the principal.
  The interest rate on a loan depends on a gamut of variables like:
   the current federal funds rate as determined by the central bank
   the borrower's credit score and income level
   home value and initial down payment amount
   current assets and liabilities of the borrower, etc.
 The business objective of this hypothetical case is:
  If the interest rate of past borrowers with a range of credit scores is known, can the interest rate for a new borrower be predicted?
Subject Area
 The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes.
 It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to the answer of the objective question.
 The lending business
  The objective is to predict the lending interest rate.
  It is important to know how the lending business works:
   why the prediction matters,
   what happens after the rate is predicted,
   what data points can be collected from borrowers,
   what data points cannot be collected because of external regulations and internal policies,
   what other external factors can affect the interest rate,
   how to verify the validity of the outcome
Data
 Prior knowledge in the data can also be gathered.
 Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process.
 This step surveys all the data available to answer the business question and narrows down the new data that need to be sourced.
 There are quite a range of factors to consider:
  quality of the data, quantity of data,
  availability of data, gaps in data,
  does the lack of data compel the practitioner to change the business question?
 The objective of this step is to come up with a dataset to answer the business question through the data science process.
 It is critical to recognize that an inferred model is only as good as the data used to create it.
Dataset
 A dataset (example set) is a collection of data with a defined structure.
 This structure is also sometimes referred to as a "data frame."
Dataset Example

Borrower ID | Credit Score | Interest Rate (%)
01          | 500          | 7.31
02          | 600          | 6.70
03          | 700          | 5.95
04          | 700          | 6.40
05          | 800          | 5.40
06          | 800          | 5.70
07          | 750          | 5.90
08          | 550          | 7.00
09          | 650          | 6.50
10          | 825          | 5.70
Data point
 A data point (record, object, or example) is a single instance in the dataset.
 Each row in the dataset table is a data point.
 Each instance contains the same structure as the dataset.
Attribute
 An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset.
 Each column is an attribute.
 Attributes can be numeric, categorical, date-time, text, or Boolean data types.
 Both the credit score and the interest rate are numeric attributes.
Label
 A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes.
 In the example, the interest rate is the output variable.
Identifiers
 Identifiers are special attributes that are used for locating or providing context to individual records.
  Examples: names, account numbers, and employee ID numbers.
 Identifiers are often used as lookup keys to join multiple datasets.
 They bear no information that is suitable for building data science models and should, thus, be excluded from the actual modeling.
  Borrower ID is an identifier.
DATA PREPARATION
 Preparing the dataset to suit a data science task is the most time-consuming part of the process.
 Data needs to be structured in a tabular format with records in the rows and attributes in the columns.
 If the data is in any other format, it would need to be transformed by applying pivot, type conversion, join, or transpose functions, etc., to condition the data into the required structure.
Data Exploration
 Data exploration, also known as exploratory data analysis, provides a set of simple tools to achieve a basic understanding of the data.
 Data exploration approaches involve computing descriptive statistics and visualization of data.
 They can expose the structure of the data, the distribution of the values, the presence of extreme values, and highlight the inter-relationships within the dataset.
 Descriptive statistics like mean, median, mode, standard deviation, and range for each attribute provide an easily readable summary of the key characteristics of the distribution of data.
 A visual plot of data points provides an instant grasp of all the data points condensed into one chart.
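A minimal exploration sketch on the borrower dataset shown earlier (requires pandas; the plot line additionally assumes matplotlib is installed):

    import pandas as pd

    df = pd.DataFrame({
        "credit_score":  [500, 600, 700, 700, 800, 800, 750, 550, 650, 825],
        "interest_rate": [7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70],
    })

    print(df.describe())   # mean, std, min/max, quartiles for each attribute
    print(df.corr())       # inter-relationships within the dataset
    df.plot.scatter(x="credit_score", y="interest_rate")  # visual plot of the data points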
Data Quality
 Errors in data will impact the representativeness of the model.
 Organizations use data alerts, cleansing, and transformation techniques to improve and manage the quality of the data and store them in companywide repositories called data warehouses.
 Data sourced from well-maintained data warehouses have higher quality, as there are proper controls in place to ensure a level of data accuracy for new and existing data.
 The data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute values, substitution of missing values, etc.
Missing Values
 Understand the reason why the values are missing.
 Tracking the data lineage (provenance) of the data source can lead to the identification of systemic issues during data capture or errors in data transformation.
 Knowing the source of a missing value will often guide which mitigation methodology to use.
 The missing value can be substituted with a range of artificial data so that the issue can be managed with marginal impact on the later steps in the data science process.
 This method is useful if the missing values occur randomly and the frequency of occurrence is quite rare.
 Alternatively, to build the representative model, all the data records with missing values or records with poor data quality can be ignored.
  This method reduces the size of the dataset.
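A minimal sketch of both mitigation options with pandas (the values are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"credit_score":  [500, np.nan, 700, 800],
                       "interest_rate": [7.31, 6.70, np.nan, 5.40]})

    imputed = df.fillna(df.mean())  # substitute missing values with the attribute mean
    dropped = df.dropna()           # or ignore records with missing values (smaller dataset)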
 Some data science algorithms are good at handling records with missing values, while others expect the data preparation step to handle it before the model is inferred.
  The k-nearest neighbor (k-NN) algorithm for classification tasks is often robust to missing values.
  Neural network models for classification tasks do not perform well with missing attributes, and thus, the data preparation step is essential for developing neural network models.
Data Types and Conversion
 The attributes in a dataset can be of different types:
  Continuous numeric (interest rate)
  Integer numeric (credit score)
  Categorical (poor, good, excellent credit score)
 In the case of linear regression models, the input attributes have to be numeric.
 If the available data are categorical, they must be converted to numeric attributes.
  A specific numeric score can be encoded for each category value, such as poor = 400, good = 600, excellent = 700, etc.
 Numeric values can be converted to categorical data types by a technique called binning.
  A range of values is specified for each category; for example, a score between 400 and 500 can be encoded as "low".
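A minimal binning sketch with pandas, following the credit score example above (the bin boundaries beyond 400-500 are assumptions for illustration):

    import pandas as pd

    scores = pd.Series([420, 480, 560, 610, 680, 720])

    bins = [400, 500, 650, 800]          # a score between 400 and 500 is encoded as "low"
    labels = ["low", "medium", "high"]
    print(pd.cut(scores, bins=bins, labels=labels))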
Transformation
 Input attributes are expected to be numeric and normalized for algorithms like k-NN.
 The algorithm compares the values of different attributes and calculates the distance between the data points.
 Normalization prevents one attribute dominating the distance results because of large values.
 For example, consider income (in thousands) and credit score (in hundreds).
 The distance calculation will always be dominated by slight variations in income.
 One solution is to convert the range of income and credit score to a more uniform scale from 0 to 1 by normalization.
 This way, a consistent comparison can be made between the two different attributes with different units.
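A minimal min-max normalization sketch (the income and credit score values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"income": [45000, 120000, 80000],
                       "credit_score": [520, 810, 660]})

    normalized = (df - df.min()) / (df.max() - df.min())
    print(normalized)   # every attribute now lies on a uniform scale from 0 to 1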
Outliers
 Outliers are anomalies in a given dataset.
 Outliers may occur because of:
  correct data capture (a few people with income in the tens of millions), or
  erroneous data capture (human height recorded as 1.73 cm instead of 1.73 m)
 The presence of outliers needs to be understood and will require special treatment.
 The purpose of creating a representative model is to generalize a pattern or a relationship within a dataset, and the presence of outliers skews the representativeness of the inferred model.
 Detecting outliers may be the primary purpose of some data science applications, like fraud or intrusion detection.
Feature Selection
 Many data science problems involve a dataset with hundreds to thousands of attributes.
 A large number of attributes in the dataset significantly increases the complexity of a model and may degrade the performance of the model due to the curse of dimensionality.
 Reducing the number of attributes, without significant loss in the performance of the model, is called feature selection.
 It leads to a simpler model.
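A hedged feature selection sketch using scikit-learn's SelectKBest on synthetic data (the data and the choice of k are illustrative assumptions):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))            # 10 attributes, most of them irrelevant
    y = (X[:, 0] + X[:, 3] > 0).astype(int)   # target depends only on attributes 0 and 3

    selector = SelectKBest(score_func=f_classif, k=2)
    X_reduced = selector.fit_transform(X, y)
    print(selector.get_support(indices=True))  # indices of the retained attributes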
Curse of Dimensionality
Data Sampling
 Sampling is a process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling.
 The sample data serve as a representative of the original dataset with similar properties, such as a similar mean.
 Sampling reduces the amount of data that need to be processed and speeds up the model-building process.
 Stratified sampling is a process of sampling where each class is equally represented in the sample; this allows the model to focus on the difference between the patterns of each class, that is, normal and outlier records (see the sketch below).
 In classification applications, sampling is used to create multiple base models, each developed using a different set of sampled training datasets.
 These base models are used to build one meta model, called the ensemble model, where the error rate is improved when compared to that of the base models.
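A minimal stratified sampling sketch with pandas (the dataset and column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"amount": range(10),
                       "label": ["normal"] * 8 + ["outlier"] * 2})

    # Draw the same number of records from each class
    sample = df.groupby("label", group_keys=False).sample(n=2, random_state=1)
    print(sample["label"].value_counts())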
MODELING
 A model is the abstract representation of the data and the relationships in a given dataset.
Modeling steps
[Figure: steps in the modeling stage of the data science process]
Training and Testing Datasets
 The modeling step creates a representative model inferred from the data.
 The dataset used to create the model, with known attributes and target, is called the training dataset.
 The validity of the created model will also need to be checked with another known dataset called the test dataset or validation dataset.
 To facilitate this process, the overall known dataset can be split into a training dataset and a test dataset.
 A standard rule of thumb is that two-thirds of the data are to be used for training and one-third as a test dataset.
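A minimal split sketch using scikit-learn, applying the two-thirds rule of thumb to the borrower records from the dataset example:

    from sklearn.model_selection import train_test_split

    X = [[500], [600], [700], [700], [800], [800], [750], [550], [650], [825]]
    y = [7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=42)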
Learning Algorithms
 The business question and the availability of data will dictate which data science task (association, classification, regression, etc.) can be used.
 Interest rate prediction is a regression problem.
  A simple linear regression technique will be used to model and generalize the relationship between credit score and interest rate.
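A minimal regression sketch fitting credit score against interest rate, using the ten borrower records from the dataset example (the new borrower's score of 625 is a hypothetical input):

    from sklearn.linear_model import LinearRegression

    X = [[500], [600], [700], [700], [800], [800], [750], [550], [650], [825]]
    y = [7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70]

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)
    print(model.predict([[625]]))   # predicted interest rate for a new borrower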
Evaluation of the Model
 A model should not memorize and output the same values that are in the training records.
 The phenomenon of a model memorizing the training data is called overfitting.
 An overfitted model just memorizes the training records and will underperform on real, unlabeled new data.
 The model should generalize, or learn, the relationship between credit score and interest rate.
 To evaluate this relationship, the validation or test dataset, which was not previously used in building the model, is used.
 The actual value of the interest rate can be compared against the value predicted using the model, and thus, the prediction error can be calculated.
 As long as the error is acceptable, this model is ready for deployment.
 The error rate can be used to compare this model with other models developed using different algorithms like neural networks or Bayesian models, etc.
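A minimal error sketch (the actual and predicted rates are hypothetical test-set values):

    from sklearn.metrics import mean_absolute_error

    actual    = [6.50, 7.00, 5.70]
    predicted = [6.35, 6.90, 5.85]
    print(mean_absolute_error(actual, predicted))  # about 0.13 percentage points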
Ensemble Modeling
 Ensemble modeling is a process where multiple diverse base models are used to predict an outcome.
 The motivation for using ensemble models is to reduce the generalization error of the prediction.
 As long as the base models are diverse and independent, the prediction error decreases when the ensemble approach is used.
 The approach seeks the wisdom of crowds in making a prediction.
 Even though the ensemble model has multiple base models within it, it acts and performs as a single model.
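A hedged ensemble sketch averaging three diverse base regressors with scikit-learn's VotingRegressor (the synthetic data loosely mimics the credit score example):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor, VotingRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(300, 850, size=(100, 1))               # synthetic credit scores
    y = 10 - 0.005 * X.ravel() + rng.normal(0, 0.2, 100)   # synthetic interest rates

    ensemble = VotingRegressor([
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(random_state=0)),
        ("knn", KNeighborsRegressor()),
    ]).fit(X, y)
    print(ensemble.predict([[625]]))   # acts and performs as a single model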
 At the end of the modeling stage of the data science process, one has:
  (1) analyzed the business question;
  (2) sourced the data relevant to answer the question;
  (3) selected a data science technique to answer the question;
  (4) picked a data science algorithm and prepared the data to suit the algorithm;
  (5) split the data into training and test datasets;
  (6) built a generalized model from the training dataset; and
  (7) validated the model against the test dataset.
APPLICATION
 Deployment is the stage at which the model becomes production ready or live.
 In business applications, the results of the data science process have to be assimilated into the business process, usually in software applications.
 The model deployment stage has to deal with:
  assessing model readiness,
  technical integration,
  response time,
  model maintenance, and
  assimilation.
Collecting, cleaning and visualizing data
Collecting Data
 The most critical issue in any data science or modeling project is finding the right data set.
  Who might actually have the data I need?
  Why might they decide to make it available to me?
  How can I get my hands on it?
 Hunting
 Scraping
 Logging
Hunting
 Who has the data, and how can you get it?
  Companies and Proprietary Data Sources
  Government Data Sources
  Academic Data Sets
Companies and Proprietary Data Sources
 Large companies like Facebook, Google, Amazon, American Express, and Blue Cross have amazing amounts of exciting data about users and transactions.
 Companies are reluctant to share data for two good reasons:
  Business issues, and the fear of helping their competition.
  Privacy issues, and the fear of offending their customers.
 Many responsible companies like The New York Times, Twitter, Facebook, and Google do release certain data.
  Providing customers and third parties with data can increase sales.
  For example, releasing data about query frequency and ad pricing can encourage more people to place ads on a given platform.
 Most organizations have internal data sets of relevance to their business.
 As an employee, you should be able to get privileged access while you work there.
 Be aware that companies have internal data access policies, so you will still be subject to certain restrictions.
 Violating the terms of these policies is an excellent way to become an ex-employee.
Government Data Sources
 City, state, and federal governments have become increasingly committed to open data, to facilitate novel applications and improve how government can fulfill its mission.
 Government data differs from industrial data in that it belongs to the People.
 The Freedom of Information Act (FOI) enables any citizen to make a formal request for any government document or data set.
 Such a request triggers a process to determine what can be released without compromising the national interest or violating privacy.
 Preserving privacy is typically the biggest issue in deciding whether a particular government data set can be released.
Academic Data Sets
 An increasing fraction of academic research involves the creation of large data sets.
 Many journals now require making source data available to other researchers prior to publication.
 Expect to be able to find vast amounts of economic, medical, demographic, historical, and scientific data.
 The key to finding these data sets is to track down the relevant papers.
 Google Scholar is the most accessible source of research publications.
 Research publications will typically provide pointers to where their associated data can be found.
 If not, contacting the author directly with a request should quickly yield the desired result.
 Someone else has worked hard to analyze published data sets before you got to them.
 But bringing fresh questions to old data generally opens new possibilities.
Scraping
 Web pages often contain valuable text and numerical data.
 Spidering is the process of downloading the right set of pages for analysis.
 Scraping is the fine art of stripping this content from each page to prepare it for computational analysis.
 Traditionally, scraping programs were site-specific scripts hacked up to look for particular HTML patterns flanking the content of interest.
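A hedged scraping sketch with requests and BeautifulSoup (the URL and the HTML pattern being searched for are entirely hypothetical):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/listings", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Strip the content of interest flanked by a known HTML pattern.
    for row in soup.find_all("tr", class_="listing"):
        print([td.get_text(strip=True) for td in row.find_all("td")])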
 The most advanced form of spidering is web crawling,
  where you systematically traverse all outgoing links from a given root page,
  continuing recursively until you have visited every page on the target website.
 Every major website contains a terms of service document that restricts what you can legally do with any associated data.
  Recall the Aaron Swartz case.
 If you are attempting a web-scraping project professionally, be sure that management understands the terms of service before you get too creative with someone else's property.
Logging
 Internal access to a web service, communications device, or laboratory instrument grants you the right and responsibility to log all activity for downstream analysis.
 Amazing things can be done with ambient data collection from weblogs and sensing devices.
  The accelerometers in cell phones can be used to measure the strength of earthquakes.
  Filter out people driving on bumpy roads or leaving their phones in a clothes dryer.
 Monitoring the GPS data of a fleet of taxi cabs tracks traffic congestion on city streets.
 Computational analysis of image and video streams opens the door to countless applications.
 Another cool idea is to use cameras as weather instruments, by looking at the color of the sky in the background of the millions of photographs uploaded to photo sites daily.
Logging
 The important considerations in designing any logging system are:
  Build it to endure with limited maintenance. Set it and forget it, by provisioning it with enough storage for unlimited expansion, and a backup.
  Store all fields of possible value, without going crazy.
  Use a human-readable format or transactions database, so you can understand exactly what is in there when the time comes, months or years later, to sit down and analyze your data.
Cleaning Data
 Garbage in, garbage out.
 Process the data before the real analysis, to make sure that the garbage never gets in in the first place.
 Errors vs. Artifacts
Errors vs. Artifacts
 Under ancient Jewish law, if a suspect on trial was unanimously found guilty by all judges, then this suspect would be acquitted.
  Unanimous agreement often indicates the presence of a systemic error in the judicial process.
 When something seems too good to be true, a mistake has likely been made somewhere.
 Data errors represent information that is fundamentally lost in acquisition.
  Gaussian noise blurring the resolution of our sensors represents error: precision which has been permanently lost.
  The two hours of missing logs because the server crashed represent data error: it is information which cannot be reconstructed again.
 Artifacts are systematic problems arising from processing done to the raw information the data was constructed from.
 Processing artifacts can be corrected, so long as the original raw data set remains available.
 These artifacts must be detected before they can be corrected.
 The key to detecting processing artifacts is the "sniff test."
 Something bad is usually something unexpected or surprising.
Models for data science
Modeling
 Modeling is the process of encapsulating information into a tool which can forecast and make predictions.
 Predictive models are structured around some idea of what causes future events to happen.
 Extrapolating from recent trends and observations assumes a world view that the future will be like the past.
Occam's Razor
 Occam's razor is the philosophical principle that the "simplest explanation is the best explanation."
 Given two models or theories which do an equally accurate job of making predictions, we should opt for the simpler one as sounder and more robust.
 It is more likely to be making the right decision for the right reasons.
 Occam's notion of simpler generally refers to reducing the number of assumptions employed in developing the model.
  Minimize the parameter count of a model.
 Overfitting occurs when a model tries too hard to achieve accurate performance on its training data.
  This happens when there are so many parameters that the model can essentially memorize its training set, instead of generalizing appropriately to minimize the effects of error and outliers.
 Accuracy is not the best metric to use in judging the quality of a model.
 Simpler models tend to be more robust and understandable than complicated alternatives.
Bias-Variance Trade-Offs
 The bias error is an error from erroneous assumptions in the learning algorithm.
  High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
 The variance is an error from sensitivity to small fluctuations in the training set. If our training set contains sampling or measurement error, this noise introduces variance into the resulting model.
  High variance may result from an algorithm modeling the random noise in the training data (overfitting).
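For squared-error loss, this trade-off is often summarized by the standard textbook decomposition (an addition for clarity, not shown in the slides):

    E\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2

where \sigma^2 is the irreducible noise in the data.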
 Errors of bias produce underfit models.
  They do not fit the training data as tightly as possible.
 Errors of variance result in overfit models:
  their quest for accuracy causes them to mistake noise for signal.
A Taxonomy of Models
 Linear vs. Non-Linear Models
 Blackbox vs. Descriptive Models
 First-Principle vs. Data-Driven Models
 Stochastic vs. Deterministic Models
 Flat vs. Hierarchical Models
Linear vs. Non-Linear Models
 Linear models are governed by equations that weigh each feature variable by a coefficient reflecting its importance, and sum up these values to produce a score.
 Powerful machine learning techniques, such as linear regression, can be used to identify the best possible coefficients to fit training data.
 The world is not linear.
  Richer mathematical descriptions include higher-order polynomials, logarithms, and exponentials.
 These permit models that fit training data much more tightly than linear functions can.
 It is much harder to find the best possible coefficients to fit non-linear models.
 But linear models offer substantial benefits.
  They are readily understandable, generally defensible, easy to build, and avoid overfitting on modest-sized data sets.
 Occam's razor tells us that the simplest explanation is the best explanation.
 A robust linear model yielding an accuracy of x% may be preferable to a complex non-linear beast that is only a few percentage points better on limited testing data.
Blackbox vs. Descriptive Models
 Black boxes are devices that do their job, but in some unknown manner.
 Descriptive models provide some insight into why they are making their decisions.
 Theory-driven models are generally descriptive.
Descriptive Models
 Linear regression models are descriptive, because:
  one can see exactly which variables receive the most weight, and
  measure how much they contribute to the resulting prediction.
 Decision tree models enable you to follow the exact decision path used to make a classification.
Blackbox Models
 Blackbox modeling techniques such as deep learning can be extremely effective.
 Neural network models are generally completely opaque as to why they do what they do.
 Example: a system built for the military to distinguish images of cars from trucks.
  It performed well in training, but disastrously in the field.
  Only later was it realized that the training images for cars were shot on a sunny day and those of trucks on a cloudy day, so the system had learned to link the sky in the background with the class of the vehicle.
First-Principle vs. Data-Driven Models
 First-principle models are based on a belief of how the system under investigation really works.
 It might be a theoretical explanation, like Newton's laws of motion.
 Such models can employ the full weight of classical mathematics: calculus, algebra, geometry, and more.
 Example of a first-principle belief: voters are unhappy if the economy is bad, therefore variables which measure the state of the economy should help us predict who will win the election.
 Data-driven models are based on observed correlations between input parameters and outcome variables.
  The same basic model might be used to predict tomorrow's weather or the price of a given stock, differing only in the data it was trained on.
 Machine learning methods make it possible to build an effective model on a domain one knows nothing about, provided we are given a good enough training set.
Stochastic vs. Deterministic Models
 Stochastic is a fancy word meaning "randomly determined."
 Techniques that explicitly build some notion of probability into the model include logistic regression and Monte Carlo simulation.
 It is important that your model observe the basic properties of probabilities, including:
  Each probability is a value between 0 and 1.
  The probabilities of all possible outcomes must sum to 1.
  Rare events do not have probability zero: any event that is possible must have a greater-than-zero probability of occurrence.
 That deterministic models always return the same answer helps greatly in debugging their implementation.
 This speaks to the need to optimize repeatability during model development.
 Fix the initial seed if you are using a random number generator, so you can rerun it and get the same answer.
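A minimal repeatability sketch (the seed value 42 is an arbitrary choice):

    import random
    import numpy as np

    random.seed(42)      # fix the seeds of the random number generators
    np.random.seed(42)   # so a stochastic run can be reproduced exactly

    print(np.random.rand(3))   # the same three numbers on every rerun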
Flat vs. Hierarchical Models
 Interesting problems often exist on several different levels, each of which may require independent submodels.
 Imposing a hierarchical structure on a model permits it to be built and evaluated in a logical and transparent way, instead of as a black box.
 Hierarchical models are descriptive: one can trace a final decision back to the appropriate top-level subproblem, and report how strongly it contributed to making the observed result.
 Predicting the future price for a particular stock really should involve submodels for analyzing such separate issues as:
  (a) the general state of the economy,
  (b) the company's balance sheet, and
  (c) the performance of other companies in its industrial sector.
 The first step to building a hierarchical model is explicitly decomposing our problem into subproblems.
 Deep learning models can be thought of as being both flat and hierarchical at the same time.
Baseline Models
 The first step in assessing the complexity of your task involves building baseline models: the simplest reasonable models that produce answers we can compare against.
 More sophisticated models should do better than baseline models, but verifying that they really do, and if so by how much, puts their performance into the proper context.
Evaluating Models
 The best way to assess models involves out-of-sample predictions: results on data that you never saw when you built the model.
 Good performance on the data that you trained models on is very suspect, because models can easily be overfit.
Evaluating Classifiers
 Consider a problem with two distinct labels or classes (binary classification).
 Take the smaller and more interesting of the two classes as positive and the larger/other class as negative.
 In a spam classification problem, the spam would typically be positive and the ham (non-spam) would be negative.
 There are four possible results of what the classification model could do on any given instance, which defines the confusion matrix or contingency table:
 True Positives (TP):
  Here our classifier labels a positive item as positive, resulting in a win for the classifier.
 True Negatives (TN):
  Here the classifier correctly determines that a member of the negative class deserves a negative label.
 False Positives (FP):
  The classifier mistakenly calls a negative item positive, resulting in a type I classification error.
 False Negatives (FN):
  The classifier mistakenly declares a positive item negative, resulting in a type II classification error.
Baseline Evaluators
 We must defend our classifier against two baseline opponents: the sharp and the monkey.
 The sharp is the opponent who knows what evaluation system we are using, and picks the baseline model which will do best according to it.
 The sharp will try to make the evaluation statistic look bad, by achieving a high score with a useless classifier.
 That might mean declaring all items positive, or perhaps all negative.
 The monkey randomly guesses on each instance.
 To interpret our model's performance, it is important to establish by how much it beats both the sharp and the monkey (a minimal sketch follows).
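A minimal baseline sketch: the monkey guesses at random, the sharp always predicts the majority (negative) class; the 5% positive rate matches the cancer example below:

    import random

    labels = ["pos"] * 5 + ["neg"] * 95                      # p = 0.05
    monkey = [random.choice(["pos", "neg"]) for _ in labels]
    sharp = ["neg"] * len(labels)                            # declare everyone negative

    def accuracy(pred):
        return sum(p == t for p, t in zip(pred, labels)) / len(labels)

    print(accuracy(monkey))   # about 0.5 in expectation
    print(accuracy(sharp))    # 1 - p = 0.95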
Accuracy of the classifier
 Accuracy is the ratio of the number of correct predictions over total predictions:
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
 By multiplying such fractions by 100, we can get a percentage accuracy score.
 Accuracy alone has limitations as an evaluation metric, particularly when the positive class is much smaller than the negative class.
 Consider the development of a classifier to diagnose whether a patient has cancer, where the positive class has the disease (i.e., tests positive) and the negative class is healthy.
 The prior distribution is that the vast majority of people are healthy: p = positive / (positive + negative) << 1/2.
 The expected accuracy of a fair-coin monkey would still be 0.5:
  it should get an average of half of the positives and half the negatives right.
 But the sharp would declare everyone to be healthy, achieving an accuracy of 1 - p.
 Suppose that only 5% of the test takers really had the disease.
  The sharp could brag about her accuracy of 95%.
Precision
 We need evaluation metrics that are more sensitive to getting the positive class right.
 Precision measures how often the classifier is correct when it dares to say positive:
  Precision = TP / (TP + FP)
 Achieving high precision is impossible for either a sharp or a monkey, because the fraction of positives (p = 0.05) is so low.
 If the classifier issues too many positive labels, it is doomed to low precision.
Recall
 In the cancer diagnosis case, we might be more ready to tolerate false positives (errors where we scare a healthy person with a wrong diagnosis) than false negatives (errors where we kill a sick patient by misdiagnosing their illness).
 Recall measures how often you prove right on all positive instances:
  Recall = TP / (TP + FN)
 A high recall implies that the classifier has few false negatives.
 The easiest way to achieve this declares that everyone has cancer, as done by a sharp always answering yes.
  This classifier has high recall but low precision.
F-score (F1-score)
 The F-score is the harmonic mean of precision and recall:
  F = 2 * (Precision * Recall) / (Precision + Recall)
 The harmonic mean is always less than or equal to the arithmetic mean.
 Achieving a high F-score requires both high recall and high precision.
 The higher the F-score, the better the predictive power of the classification procedure.
  A score of 1 means the classification procedure is perfect.
  The lowest possible F-score is 0.
  0 <= F <= 1
 Accuracy is a misleading statistic when the class sizes are substantially different.
 Recall equals accuracy if and only if the classifier is balanced.
 High precision is very hard to achieve with unbalanced class sizes.
 The F-score does the best job of any single statistic, but all four measures work together to describe the performance of a classifier.
Problem 1
 Suppose a computer program for recognizing dogs in photographs identifies eight dogs in a picture containing 12 dogs and some cats.
 Of the eight dogs identified, five actually are dogs while the rest are cats.
 Compute the precision and recall of the computer program.
Problem 1
 TP = 5
 FP = 3
 FN = 7

 The precision is P = TP/(TP + FP) = 5/(5 + 3) = 5/8.

 The recall is R = TP/(TP + FN) = 5/(5 + 7) = 5/12.
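
 The arithmetic can be cross-checked with scikit-learn (an assumed dependency; the cat count is invented, since true negatives affect neither score):

from sklearn.metrics import precision_score, recall_score

# 1 = dog, 0 = cat. The picture has 12 dogs; assume 5 cats for illustration
# (any count >= 3 works, since true negatives enter neither precision nor recall).
y_true = [1] * 12 + [0] * 5
# 5 dogs labeled correctly, 7 dogs missed, 3 cats mislabeled as dogs, 2 cats correct:
y_pred = [1] * 5 + [0] * 7 + [1] * 3 + [0] * 2

print(precision_score(y_true, y_pred))  # 0.625      = 5/8
print(recall_score(y_true, y_pred))     # 0.41666... = 5/12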
Problem 2
 Let there be 10 balls (6 white and 4 red
balls) in a box and let it be required to
pick up the red balls from them.

 Suppose we pick up 7 balls as the red balls, of which only 2 are actually red.

 What are the values of precision and recall in picking red balls?
Problem 2
 TP = 2
 FP = 7 − 2 = 5
 FN = 4 − 2 = 2

 The precision is P = TP/(TP + FP) = 2/(2 + 5) = 2/7.

 The recall is R = TP/(TP + FN) = 2/(2 + 2) = 1/2.
Problem 3
 A database contains 80 records on a particular topic
of which 55 are relevant to a certain investigation.
A search was conducted on that topic and 50
records were retrieved.
 Of the 50 records retrieved, 40 were relevant.
Construct the confusion matrix for the search and
calculate the precision and recall scores for the
search.
 Each record may be assigned a class label “relevant” or “not relevant”.
 All the 80 records were tested for relevance. The
test classified 50 records as “relevant”.
 But only 40 of them were actually relevant.
Problem 3
                           Actual ‘Relevant’   Actual ‘Not Relevant’
Predicted ‘Relevant’              40                    10
Predicted ‘Not Relevant’          15                    15
Problem 3
 TP = 40
 FP = 10
 FN = 15

 The precision is P = TP/(TP + FP) = 40/(40 + 10) = 4/5.

 The recall is R = TP/(TP + FN) = 40/(40 + 15) = 40/55 = 8/11.
Other measures of
performance
 Using the data in the confusion matrix of a classifier on a two-class dataset, several measures of performance have been defined.

 Accuracy = (TP + TN)/( TP + TN + FP + FN )

 Error rate = 1− Accuracy

 Sensitivity = TP/( TP + FN)

 Specificity = TN /(TN + FP)

 F-measure = (2 × TP)/( 2 × TP + FP + FN)
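
 Applied to the confusion matrix of Problem 3, these measures can be computed directly (a minimal sketch, not from the reference text):

tp, fp, fn, tn = 40, 10, 15, 15   # confusion matrix of Problem 3

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 55/80  = 0.6875
error_rate  = 1 - accuracy                      # 25/80  = 0.3125
sensitivity = tp / (tp + fn)                    # 40/55  = recall
specificity = tn / (tn + fp)                    # 15/25  = 0.6
f_measure   = (2 * tp) / (2 * tp + fp + fn)     # 80/105 ~ 0.7619

print(accuracy, error_rate, sensitivity, specificity, f_measure)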


Receiver Operating Characteristic (ROC)

 The acronym ROC stands for Receiver Operating Characteristic, a term that comes from signal detection theory.

 The ROC curve was first developed by electrical and radar engineers during World War II for detecting enemy objects on battlefields.

 ROC curves are now increasingly used in machine learning and data mining research.
TPR and FPR
 Let a binary classifier classify a collection of test data.

 TP = Number of true positives


 TN = Number of true negatives
 FP = Number of false positives
 FN = Number of false negatives

 TPR = True Positive Rate = TP/(TP + FN) = fraction of positive examples correctly classified = Sensitivity

 FPR = False Positive Rate = FP/(FP + TN) = fraction of negative examples incorrectly classified = 1 − Specificity
ROC space
 We plot the values of FPR along the horizontal axis (that is, the x-axis) and the values of TPR along the vertical axis (that is, the y-axis) in a plane.

 For each classifier, there is a unique point in this plane with coordinates (FPR, TPR).

 The ROC space is the part of the plane whose points correspond to (FPR, TPR).

 Each prediction result or instance of a confusion matrix represents one point in the ROC space.
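
 For example, a minimal sketch (hypothetical code) mapping the confusion matrix of Problem 3 to its point in ROC space:

def roc_point(tp, fp, fn, tn):
    tpr = tp / (tp + fn)   # sensitivity
    fpr = fp / (fp + tn)   # 1 - specificity
    return (fpr, tpr)      # (x, y) coordinates in ROC space

# The search classifier from Problem 3:
print(roc_point(40, 10, 15, 15))   # (0.4, 0.7272...)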
ROC space
 The position of the point (FPR,TPR) in the
ROC space gives an indication of the
performance of the classifier.

 For example, let us consider some special points in the space.

 (When a classifier's predictions are traced out one example at a time, the plotted path moves one step higher for each positive example and one step to the right for each negative example.)
Special points in ROC space
 The left bottom corner point (0, 0):
 Always negative prediction
 A classifier which produces this point in the
ROC space never classifies an example as
positive, neither rightly nor wrongly,
because for this point TP = 0 and FP = 0.
 It always makes negative predictions.
 All positive instances are wrongly predicted
and all negative instances are correctly
predicted.
 It commits no false positive errors.
Special points in ROC space
 The right top corner point (1, 1):
 Always positive prediction
 A classifier which produces this point in the
ROC space always classifies an example as
positive because for this point FN = 0 and
TN = 0.
 All positive instances are correctly
predicted and all negative instances are
wrongly predicted.
 It commits no false negative errors.
Special points in ROC space
 The left top corner point (0, 1):
 Perfect prediction
 A classifier which produces this point in the
ROC space may be thought as a perfect
classifier.
 It produces no false positives and no false
negatives
Special points in ROC space
 Points along the diagonal:
 Random performance
 Consider a classifier where the class labels are
randomly guessed, say by flipping a coin.
 Then, the corresponding points in the ROC space
will be lying very near the diagonal line joining
the points (0, 0) and (1, 1).
ROC curve
 In the case of certain classification algorithms,
the classifier may depend on a parameter.

 Different values of the parameter will give different classifiers, and these in turn give different values of TPR and FPR.

 The ROC curve is the curve obtained by plotting in the ROC space the points (FPR, TPR) obtained by assigning all possible values to the parameter in the classifier.
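
 The following minimal sketch (with invented scores and labels, not from the reference text) traces such a curve by sweeping the decision threshold of a score-based classifier; note how it steps up for positives and right for negatives:

# Hypothetical classifier scores and true labels.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.2]

pairs = sorted(zip(scores, labels), reverse=True)  # decreasing confidence
P = sum(labels)              # total positives
N = len(labels) - P          # total negatives

tp = fp = 0
points = [(0.0, 0.0)]        # start at the always-negative corner
for _, label in pairs:
    if label == 1:
        tp += 1              # one step up for a positive example
    else:
        fp += 1              # one step right for a negative example
    points.append((fp / N, tp / P))   # (FPR, TPR)

print(points)                # ends at the always-positive corner (1, 1)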
ROC curve
 The closer the ROC curve is to the top left
corner (0, 1) of the ROC space, the better the
accuracy of the classifier.

 Among three classifiers A, B, and C whose ROC curves are plotted together, suppose the curve of classifier C is closest to the top left corner of the ROC space.

 Hence, among the three, C gives the best accuracy in predictions.
Area under the ROC curve
(AUC)
 The measure of the area under the ROC curve is denoted by the acronym AUC.

 The value of AUC is a single-number measure of the performance of a classifier.

 For the perfect classifier, AUC = 1.0; for a random classifier along the diagonal, AUC = 0.5.
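
 A minimal sketch (hypothetical helper, not from the reference text) computes AUC with the trapezoid rule from a list of (FPR, TPR) points, such as the one traced earlier:

def auc(points):
    pts = sorted(points)     # order points by increasing FPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between neighbors
    return area

perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # hugs the top left corner
random_guess = [(0.0, 0.0), (1.0, 1.0)]          # the diagonal
print(auc(perfect))       # 1.0
print(auc(random_guess))  # 0.5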


Evaluating Multiclass Systems

Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
Evaluating Value Prediction
Models
 For numerical values, error is a function
of the difference between a forecast y’ =
f(x) and the actual result y.

 Measuring the performance of a value prediction system involves two decisions:
 (1) fixing the specific individual error function, and
 (2) selecting the statistic that best represents the full error distribution.

Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
 Absolute error:
 The value Δ = y′ − y has the virtue of being simple and symmetric;
 the sign can distinguish the case where y′ > y from y > y′.

Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
 Relative error:
 The absolute magnitude of error is meaningless without
a sense of the units involved.
 An absolute error of 1.2 in a person's predicted height is good if it is measured in millimeters, but terrible if measured in meters.

 Normalizing the error by the magnitude of the observation produces a unit-less quantity, which can be sensibly interpreted as a fraction or (multiplied by 100%) as a percentage:

 ϵ = (y − y′)/y.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
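
 The height example above can be made concrete with a short sketch (invented measurements, assuming the same absolute error of 1.2 in both unit systems):

# Heights expressed in millimeters and in meters, same absolute error of 1.2.
y_mm, pred_mm = 1752.0, 1753.2
y_m,  pred_m  = 1.752,  2.952

for y, y_pred in [(y_mm, pred_mm), (y_m, pred_m)]:
    delta = y_pred - y          # absolute error: 1.2 in both cases
    eps = (y - y_pred) / y      # relative error: unit-less
    print(f"absolute = {delta:.1f}, relative = {eps:.2%}")
# absolute = 1.2, relative = -0.07%
# absolute = 1.2, relative = -68.49%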
 Squared error:
 The value 2 = (y’ - y)2 is always positive

 Large errors values contribute


disproportionately to the total when
squaring: 2 for = 2 is four times larger
than 2 for = 1.

Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
 A commonly used statistic is mean squared error (MSE), computed as MSE = (1/n) Σ (y′ᵢ − yᵢ)² over the n test instances. Because it weighs each term quadratically, outliers have a disproportionate effect.

 Thus median squared error might be a more informative statistic for noisy instances.

Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
 Root mean squared error (RMSD) is simply the square root of mean squared error: RMSD = √MSE.
 The advantage of RMSD is that its magnitude is interpretable on the same scale as the original values.

Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
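
 The statistics above can be computed in a few lines (a minimal sketch with invented forecasts; NumPy is an assumed dependency):

import numpy as np

y      = np.array([10.0, 12.0, 11.0, 9.0, 10.0])   # actual values
y_pred = np.array([10.5, 11.5, 11.0, 9.5, 25.0])   # last forecast is an outlier

sq_err = (y_pred - y) ** 2
mse    = sq_err.mean()        # 45.15 -- dominated by the single outlier
med_se = np.median(sq_err)    # 0.25  -- robust to the outlier
rmsd   = np.sqrt(mse)         # ~6.72 -- same scale as the original values

print(mse, med_se, rmsd)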