MODULE 1
Module 1: Foundations of Data Science, Process, and Tools
Introduction to data science
Properties of data, Asking interesting
questions
Classification of data science
Data science process
Collecting, cleaning, and visualizing data
Languages and models for data science
Introduction to data science
Data science is a collection of techniques
used to extract value from data.
It has become an essential tool for any
organization that collects, stores, and
processes data as part of its operations.
Data science techniques rely on finding
useful patterns, connections, and
relationships within data.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Data science is also commonly referred
to as
knowledge discovery,
machine learning,
predictive analytics,
data mining
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
AI, ML and DS
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Computer scientists, by nature, don't
respect data
Examples of the cultural differences
between computer science and real science
include:
Data vs. method centrism
Concern about results
Robustness
Precision
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
Data vs. method centrism
Scientists are data driven, while computer
scientists are algorithm driven.
Real scientists spend enormous amounts of effort
collecting data to answer their question of
interest
computer scientists obsess about methods:
which algorithm is better than which other algorithm
which programming language is best for a job
which program is better than which other program.
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
Concern about results
Real scientists care about answers.
They analyze data to discover something
about how the world works
Bad computer scientists worry about
producing plausible looking numbers.
They are personally less invested in what can
be learned from a computation, as opposed to
getting it done quickly and efficiently
Reference: Skiena, S. S.
(2017). The data science
design manual., Springer.
Robustness
Real scientists are comfortable with the idea
that data has errors.
computer scientists are not
Scientists think a lot about possible sources of bias or error in their data, and how these possible problems can affect the conclusions derived from them.
Computer scientists chant “garbage in, garbage
out"
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
Precision
Nothing is ever completely true or false
in science
Everything is either true or false in computer science or mathematics.
Computer scientists care what a number
is, while real scientists care what it
means
Reference: Skiena, S. S.
(2017). The data science
design manual., Springer.
Asking Interesting Questions
from Data
What things might you be able to learn
from a given data set?
What do you/your people really want to
know about the world?
What will it mean to you once you find
out?
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
The Baseball Encyclopedia
The Internet Movie Database (IMDb)
Google Ngrams
New York Taxi Records
Prepare new questions from datasets
(minimum 3)
Properties of Data
Structured vs. Unstructured Data
Quantitative vs. Categorical Data
Big Data vs. Little Data
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
Structured vs. Unstructured
Data
Structured data
Data sets are nicely structured, like the tables in a
database or spreadsheet program
Data is often represented by a matrix, where
the rows of the matrix represent distinct items or records
the columns represent distinct properties of these items.
For example, a data set about U.S. cities might contain
one row for each city, with columns representing features
like state, population, and area.
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
Unstructured data
Record information about the state of the world,
but in a more heterogeneous way
Collection of tweets from Twitter
The first step is to build a matrix to structure it (see the sketch below).
A bag of words model constructs a matrix with a row for each tweet and a column for each frequently used vocabulary word.
The matrix entry M[i, j] then denotes the number of times tweet i contains word j.
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
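To make the bag of words idea concrete, here is a minimal Python sketch (the tweets and vocabulary are illustrative, not taken from the reference) that builds the count matrix M[i, j]:

```python
from collections import Counter

tweets = [                                  # hypothetical unstructured input
    "data science is fun",
    "science needs data and more data",
]

# Vocabulary: every word that appears in the collection
vocab = sorted({word for tweet in tweets for word in tweet.split()})

# M[i][j] = number of times tweet i contains vocabulary word j
M = [[Counter(tweet.split())[word] for word in vocab] for tweet in tweets]

print(vocab)
print(M)        # e.g. the second row counts "data" twice
```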
Quantitative vs. Categorical Data
Quantitative data
Consists of numerical values, like height
and weight.
Data can be incorporated directly into
algebraic formulas and mathematical
models, or displayed in conventional
graphs and charts.
Reference: Skiena, S. S.
(2017). The data science
design manual., Springer.
Categorical Data
Categorical data consists of labels describing the properties of
the objects under investigation, like gender, hair color, and
occupation.
This descriptive information can be every bit as precise and
meaningful as numerical data, but it cannot be worked with
using the same techniques.
Categorical data can usually be coded numerically. For example, gender might be represented as male = 0 or female = 1, and hair color as gray hair = 0, red hair = 1, and blond hair = 2.
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
We cannot really treat these values as
numbers, for anything other than simple
identity testing.
Does it make any sense to talk about the
maximum or minimum hair color?
What is the interpretation of my hair
color minus your hair color?
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
Big Data vs. Little Data
Big data
The analysis of massive data sets resulting
from computer logs and sensor devices.
In principle, having more data is always
better than having less, because you can
always throw some of it away by sampling
to get a smaller set if necessary
There are difficulties in working with large
data sets.
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
The challenges of big data include:
The analysis cycle time slows as data size
grows
Computational operations on data sets take
longer as their volume increases.
Large data sets are complex to visualize
Plots with millions of points on them are impossible to display on computer screens or printed images, let alone conceptually understand.
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
Simple models do not require massive data to
fit or evaluate
A typical data science task might be to make a
decision (say, whether I should offer this fellow life
insurance?) on the basis of a small number of
variables: say age, gender, height, weight, and the
presence or absence of existing medical conditions.
Big data is sometimes called bad data
We might have to go to heroic efforts to make
sense of something just because we have it.
Reference: Skiena, S. S. (2017). The data science design
manual., Springer.
Classification of data science
Supervised Learning
Data science problems can be broadly categorized
into supervised or unsupervised learning models.
Supervised or directed data science tries to infer a
function or relationship based on labeled training
data and uses this function to map new unlabeled
data.
Supervised techniques predict the value of the
output variables based on a set of input variables.
A model is developed from a training dataset where the values of input and output are previously known.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
The model generalizes the relationship between
the input and output variables and uses it to
predict for a dataset where only input variables
are known.
The output variable that is being predicted is
also called a class label or target variable.
Supervised data science needs a sufficient
number of labeled records to learn the model
from the data
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Unsupervised Learning
Unsupervised or undirected data science
uncovers hidden patterns in unlabeled
data.
There are no output variables to predict.
Find patterns in data based on the
relationship between data points
themselves.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Data Science Tasks
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Classification and
Regression
Classification and regression techniques predict a target
variable based on input variables.
The prediction is based on a generalized model built from
a previously known dataset.
In regression tasks, the output variable is numeric (e.g.,
the mortgage interest rate on a loan).
Classification tasks predict output variables, which are
categorical or polynomial (e.g., the yes or no decision to
approve a loan)
Deep learning is a more sophisticated artificial neural network used for classification and regression problems.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
Clustering
Clustering is the process of identifying the natural
groupings in a dataset
Generalize the uniqueness of each cluster
Market basket analysis or Association analysis
Identify pairs of items that are purchased together, so that
specific items can be bundled or placed next to each other.
commonly used in cross selling.
Recommendation engines are the systems that
recommend items to the users based on individual user
preference
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Anomaly or outlier detection identifies the
data points that are significantly different
from other data points in a dataset
Credit card transaction fraud detection
Time series forecasting is the process of
predicting the future value of a variable
(e.g., temperature) based on past
historical values that may exhibit a trend
and seasonality.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Text mining is a data science application where
the input data is text,
which can be in the form of documents, messages,
emails, or web pages
text file is converted to document vectors
standard data science tasks such as classification,
clustering, etc., can be applied to vectors
Feature selection is a process in which the attributes in a dataset are reduced to the few attributes that really matter (e.g., using height, weight, age, and eye color to predict the probability of heart disease).
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
A complete data science application can contain
elements of both supervised and unsupervised
technique
In marketing analytics, clustering can be used to find
the natural clusters in customer records.
Each customer is assigned a cluster label at the end
of the clustering process.
A labeled customer dataset can now be used to
develop a model that assigns a cluster label for any
new customer record with a supervised classification
technique
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Data science tasks and examples page
no. 25
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Data science process
The methodical discovery of useful relationships
and patterns in data is enabled by a set of
iterative activities collectively known as the data
science process.
The standard data science process involves
(1) understanding the problem
(2) preparing the data samples
(3) developing the model
(4) applying the model on a dataset to see how the
model may work in the real world
(5) deploying and maintaining the models
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and
practice., Morgan
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Modeling
It is the process of building representative
models that can be inferred from the
sample dataset which can be used for
Either predicting (predictive modeling) or
Describing the underlying pattern in the
data (descriptive or explanatory modeling).
There are many data science tools that can automate the model building.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Most time-consuming part of the overall data
science process is the preparation of data,
followed by data and business understanding.
• Crucial to the success of the data science process
Asking the right business question
Gaining in-depth business understanding
Sourcing and preparing the data for the data science
task
Mitigating implementation considerations
Integrating the model into the business process
Gaining knowledge from the dataset
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Business Understanding
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
PRIOR KNOWLEDGE
Prior knowledge refers to information that is already
known about a subject
The data science problem doesn’t emerge in isolation; it always develops on top of existing subject matter and contextual information that is already known.
The prior knowledge step in the data science process helps to
define what problem is being solved,
how it fits in the business context, and
what data is needed in order to solve the problem.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Objective
The data science process starts with a
need for analysis, a question, or a
business objective
Without a well-defined statement of the
problem, it is impossible to come up with
the right dataset and pick the right data
science algorithm
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Consumer loan business
a loan is provisioned for individuals against the collateral
of assets like a home or car, that is, a mortgage or an auto
loan
An important component of the loan is the interest rate at which the borrower repays the loan on top of the principal.
The interest rate on a loan depends on a gamut of
variables like
the current federal funds rate as determined by the central
bank,
borrower’s credit score, income level,
home value, initial down payment amount,
current assets and liabilities of the borrower, etc.
The business objective of this
hypothetical case is:
If the interest rate of past borrowers with a
range of credit scores is known, can the
interest rate for a new borrower be
predicted?
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Subject Area
The process of data science uncovers
hidden patterns in the dataset by
exposing relationships between
attributes
It is up to the practitioner to sift through
the exposed patterns and accept the
ones that are valid and relevant to the
answer of the objective question
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The lending business
The objective is to predict the lending interest
rate
Important to know how the lending business works,
why the prediction matters,
what happens after the rate is predicted,
what data points can be collected from borrowers,
what data points cannot be collected because of the
external regulations and the internal policies,
what other external factors can affect the interest
rate,
how to verify the validity of the outcome
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
Data
Prior knowledge in the data can also be
gathered.
Understanding how the data is collected,
stored, transformed, reported, and used is
essential to the data science process
This step surveys all the data available to answer the business question and narrows down the new data that need to be sourced.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
There are quite a range of factors to consider:
quality of the data, quantity of data,
availability of data, gaps in data,
does lack of data compel the practitioner to change the
business question
The objective of this step is to come up with a
dataset to answer the business question through
the data science process.
It is critical to recognize that an inferred model is
only as good as the data used to create it.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Dataset
A dataset (example set) is a collection of
data with a defined structure
This structure is also sometimes referred
to as a “data frame”.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Dataset Example
Borrower ID Credit Score Interest Rate (%)
01 500 7.31
02 600 6.70
03 700 5.95
04 700 6.40
05 800 5.40
06 800 5.70
07 750 5.90
08 550 7.00
09 650 6.50
10 825 5.70
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
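As a small illustration, the table above can be loaded as a data frame in pandas (an assumed tool choice; the values are copied from the table):

```python
import pandas as pd

loans = pd.DataFrame({
    "Borrower ID":       ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10"],
    "Credit Score":      [500, 600, 700, 700, 800, 800, 750, 550, 650, 825],
    "Interest Rate (%)": [7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70],
})

print(loans.head())     # each row is a data point; each column is an attribute
```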
Data point
A data point (record, object or example)
is a single instance in the dataset.
Each row in the dataset table is a data point.
Each instance contains the same
structure as the dataset.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Attribute
An attribute (feature, input, dimension, variable,
or predictor) is a single property of the dataset.
Each column is an attribute.
Attributes can be numeric, categorical, date-
time, text, or Boolean data types.
Both the credit score and the interest rate are
numeric attributes
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Label
A label (class label, output, prediction,
target, or response) is the special
attribute to be predicted based on all the
input attributes.
The interest rate is the output variable
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Identifiers
Identifiers are special attributes that are used for
locating or providing context to individual records.
names, account numbers, and employee ID numbers
Identifiers are often used as lookup keys to join
multiple datasets.
They bear no information that is suitable for building data science models and should thus be excluded from the actual modeling.
Borrower ID is an identifier
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
DATA PREPARATION
Preparing the dataset to suit a data science task
is the most time-consuming part of the process
Data needs to be structured in a tabular format, with records in the rows and attributes in the columns.
If the data is in any other format, the data would
need to be transformed by applying pivot, type
conversion, join, or transpose functions, etc., to
condition the data into the required structure.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Data Exploration
Also known as exploratory data analysis, provides a set of simple
tools to achieve basic understanding of the data.
Data exploration approaches involve computing descriptive statistics
and visualization of data.
They can expose the structure of the data, the distribution of the
values, the presence of extreme values, and highlight the inter-
relationships within the dataset.
Descriptive statistics like mean, median, mode, standard deviation,
and range for each attribute provide an easily readable summary of
the key characteristics of the distribution of data.
A visual plot of data points provides an instant grasp of all the data
points condensed into one chart
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
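A minimal exploration sketch, assuming pandas (and matplotlib for the plot) are available; the values reuse the borrower table shown earlier:

```python
import pandas as pd

loans = pd.DataFrame({
    "Credit Score":      [500, 600, 700, 700, 800, 800, 750, 550, 650, 825],
    "Interest Rate (%)": [7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70],
})

print(loans.describe())                        # mean, std, min, quartiles, max per attribute
print(loans["Interest Rate (%)"].median())     # a single descriptive statistic

# Visual plot of all data points in one chart (requires matplotlib)
loans.plot.scatter(x="Credit Score", y="Interest Rate (%)")
```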
Data Quality
Errors in data will impact the representativeness of the model.
Organizations use data alerts, cleansing, and transformation
techniques to improve and manage the quality of the data and
store them in companywide repositories called data
warehouses.
Data sourced from well-maintained data warehouses have
higher quality, as there are proper controls in place to ensure
a level of data accuracy for new and existing data.
The data cleansing practices include elimination of duplicate
records, quarantining outlier records that exceed the bounds,
standardization of attribute values, substitution of missing
values, etc
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice.,
Morgan
Missing Values
Understand the reason behind why the values are
missing.
Tracking the data lineage (provenance) of the data source
can lead to the identification of systemic issues during
data capture or errors in data transformation.
Knowing the source of a missing value will often guide
which mitigation methodology to use.
The missing value can be substituted with a range of
artificial data so that the issue can be managed with
marginal impact on the later steps in the data science
process
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
This method is useful if the missing values
occur randomly and the frequency of
occurrence is quite rare.
Alternatively, to build the representative
model, all the data records with missing
values or records with poor data quality can
be ignored.
This method reduces the size of the dataset
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
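A small sketch of the two mitigation options described above, using pandas on hypothetical data (substitution with an artificial value versus dropping incomplete records):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_score": [500, 600, np.nan, 700, 800],
    "income":       [40, 55, 60, np.nan, 90],
})

filled = df.fillna(df.mean())    # option 1: substitute artificial values (here, column means)
dropped = df.dropna()            # option 2: ignore records with missing values (shrinks the dataset)

print(filled)
print(dropped)
```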
Some data science algorithms are good at
handling records with missing values, while
others expect the data preparation step to handle
it before the model is inferred.
The k-nearest neighbor (k-NN) algorithm for classification tasks is often robust to missing values.
Neural network models for classification tasks do not
perform well with missing attributes, and thus, the
data preparation step is essential for developing
neural network models.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Data Types and Conversion
The attributes in a dataset can be of
different types
Continuous numeric (interest rate)
Integer numeric (credit score)
Categorical (poor, good, or excellent credit score)
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
In case of linear regression models, the input attributes
have to be numeric.
If the available data are categorical, they must be
converted to numeric attribute.
A specific numeric score can be encoded for each category value, such as poor (400), good (600), excellent (700), etc.
Numeric values can be converted to categorical data
types by a technique called binning
A range of values are specified for each category, for example,
a score between 400 and 500 can be encoded as “low”
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and
practice., Morgan
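A sketch of both conversions, using pandas with illustrative category scores and bin edges (the exact encodings are assumptions, not prescribed by the reference):

```python
import pandas as pd

ratings = pd.Series(["poor", "good", "excellent", "good"])
scores = pd.Series([450, 520, 610, 680, 745, 820])

# Categorical to numeric: encode a specific score for each category value
rating_to_score = {"poor": 400, "good": 600, "excellent": 700}
numeric_ratings = ratings.map(rating_to_score)

# Numeric to categorical: binning, with a range of values specified for each category
binned = pd.cut(scores, bins=[300, 500, 700, 850], labels=["low", "medium", "high"])

print(numeric_ratings.tolist())   # [400, 600, 700, 600]
print(binned.tolist())            # ['low', 'medium', 'medium', 'medium', 'high', 'high']
```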
Transformation
Input attributes are expected to be
numeric and normalized for algorithms
like KNN
The algorithm compares the values of
different attributes and calculates
distance between the data points.
Normalization prevents one attribute from dominating the distance results because of its large values.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
For example, consider income (in thousands) and credit
score (in hundreds).
The distance calculation will always be dominated by
slight variations in income.
One solution is to convert the range of income and
credit score to a more uniform scale from 0 to 1 by
normalization.
This way, a consistent comparison can be made
between the two different attributes with different units
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
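A minimal min-max normalization sketch, assuming pandas and illustrative income and credit score values:

```python
import pandas as pd

df = pd.DataFrame({
    "income":       [40_000, 55_000, 90_000, 120_000],   # in dollars
    "credit_score": [520, 610, 700, 820],
})

# Min-max normalization: every column is rescaled to the range [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)    # distances computed on these columns are no longer dominated by income
```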
Outliers
Outliers are anomalies in a given
dataset.
Outliers may occur because of
correct data capture (a few people with incomes in the tens of millions), or
erroneous data capture (human height recorded as 1.73 cm instead of 1.73 m).
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The presence of outliers needs to be understood
and will require special treatments.
The purpose of creating a representative model is
to generalize a pattern or a relationship within a
dataset and the presence of outliers skews the
representativeness of the inferred model.
Detecting outliers may be the primary purpose of
some data science applications, like fraud or
intrusion detection.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
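One simple way to flag candidate outliers is the 1.5 × IQR rule of thumb; this is a sketch on made-up height data (the rule and the data are assumptions, not from the reference):

```python
import pandas as pd

heights_m = pd.Series([1.60, 1.73, 1.68, 1.80, 1.75, 0.0173])   # last record: 1.73 m mis-entered
q1, q3 = heights_m.quantile(0.25), heights_m.quantile(0.75)
iqr = q3 - q1

# Flag values far outside the interquartile range for inspection (do not drop them silently)
mask = (heights_m < q1 - 1.5 * iqr) | (heights_m > q3 + 1.5 * iqr)
print(heights_m[mask])    # flags the erroneous 0.0173 entry
```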
Feature Selection
Many data science problems involve a dataset with
hundreds to thousands of attributes
A large number of attributes in the dataset significantly
increases the complexity of a model and may degrade
the performance of the model due to the curse of
dimensionality.
Reducing the number of attributes, without significant
loss in the performance of the model, is called feature
selection.
It leads to a more simplified model
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
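One possible feature-selection sketch, using scikit-learn's univariate selection on a synthetic dataset (the tool choice and the dataset are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 200 records, 50 attributes, only a few of them actually informative
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)   # keep the 5 attributes that matter most
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)               # (200, 50) -> (200, 5)
```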
Curse of Dimensionality
Data Sampling
Sampling is a process of selecting a subset of
records as a representation of the original
dataset for use in data analysis or modeling.
The sample data serve as a representative of the
original dataset with similar properties, such as a
similar mean.
Sampling reduces the amount of data that need
to be processed and speeds up the build process
of the modeling
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and
practice., Morgan
Stratified sampling is a process of sampling where each class is equally represented in the sample; this allows the model to focus on the differences between the patterns of each class, that is, normal and outlier records (see the sketch below).
In classification applications, sampling is also used to create multiple base models, each developed using a different set of sampled training datasets.
These base models are used to build one meta model,
called the ensemble model, where the error rate is
improved when compared to that of the base models.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
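Returning to stratified sampling, here is a minimal pandas sketch (illustrative data, pandas 1.1+ assumed) that draws the same number of records from each class:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.normal(50, 10, size=1000),
    "label":  ["normal"] * 950 + ["outlier"] * 50,    # heavily imbalanced classes
})

# Draw the same number of records from each class
sample = df.groupby("label").sample(n=50, random_state=0)

print(df["label"].value_counts())        # normal: 950, outlier: 50
print(sample["label"].value_counts())    # normal: 50,  outlier: 50
```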
MODELING
A model is the abstract representation of
the data and the relationships in a given
dataset.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Modeling steps
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Training and Testing
Datasets
The modeling step creates a representative
model inferred from the data.
The dataset used to create the model, with known attributes and target, is called the training dataset.
The validity of the created model will also
need to be checked with another known
dataset called the test dataset or validation
dataset.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
To facilitate this process, the overall
known dataset can be split into a
training dataset and a test dataset.
A standard rule of thumb is two-thirds of
the data are to be used as training and
one-third as a test dataset
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
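A minimal sketch of the two-thirds / one-third rule of thumb in plain Python (the record IDs are placeholders):

```python
import random

records = list(range(30))            # stand-ins for 30 labeled records
random.seed(42)
random.shuffle(records)

cut = (2 * len(records)) // 3        # two-thirds of the known data for training
train, test = records[:cut], records[cut:]

print(len(train), "training records,", len(test), "test records")   # 20 and 10
```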
Learning Algorithms
The business question and the availability of data will dictate which data science task (association, classification, regression, etc.) can be used.
Interest rate prediction is a regression
problem.
A simple linear regression technique will be
used to model and generalize the relationship
between credit score and interest rate.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
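A minimal sketch of this regression step, assuming scikit-learn and reusing the ten borrower records from the earlier table (the tool choice is an assumption; the reference is tool-agnostic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Ten known borrowers from the earlier table: credit score (input) and interest rate (label)
credit_score = np.array([500, 600, 700, 700, 800, 800, 750, 550, 650, 825]).reshape(-1, 1)
interest_rate = np.array([7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70])

X_train, X_test, y_train, y_test = train_test_split(
    credit_score, interest_rate, test_size=1 / 3, random_state=0)

model = LinearRegression().fit(X_train, y_train)     # generalizes the score-rate relationship
print(model.coef_[0], model.intercept_)              # slope and intercept of the fitted line

# Prediction error on the held-out test dataset, and a prediction for a new borrower
print(np.mean(np.abs(model.predict(X_test) - y_test)))
print(model.predict(np.array([[720]])))
```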
Evaluation of the Model
A model should not memorize and output the same values
that are in the training records.
The phenomenon of a model memorizing the training data
is called overfitting.
An overfitted model just memorizes the training records and
will underperform on real unlabeled new data.
The model should generalize or learn the relationship
between credit score and interest rate.
To evaluate this relationship, the validation or test dataset, which was not previously used in building the model, is used for evaluation.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
The actual value of the interest rate can be compared
against the predicted value using the model, and
thus, the prediction error can be calculated.
As long as the error is acceptable, this model is ready
for deployment.
The error rate can be used to compare this model
with other models developed using different
algorithms like neural networks or Bayesian models,
etc.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Ensemble Modeling
Ensemble modeling is a process where multiple diverse base
models are used to predict an outcome.
The motivation for using ensemble models is to reduce the
generalization error of the prediction.
As long as the base models are diverse and independent, the
prediction error decreases when the ensemble approach is used.
The approach seeks the wisdom of crowds in making a
prediction.
Even though the ensemble model has multiple base models
within the model, it acts and performs as a single model.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice.,
Morgan
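A minimal ensemble sketch that averages the predictions of a few diverse base regression models; the particular base models are illustrative assumptions, not the ones used in the reference:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Borrower data from the earlier table
X = np.array([500, 600, 700, 700, 800, 800, 750, 550, 650, 825]).reshape(-1, 1)
y = np.array([7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70])

# Three diverse, independent base models
base_models = [LinearRegression(),
               KNeighborsRegressor(n_neighbors=3),
               DecisionTreeRegressor(random_state=0)]
for m in base_models:
    m.fit(X, y)

# The ensemble acts as a single model: average the base predictions for a new borrower
new_borrower = np.array([[720]])
prediction = np.mean([m.predict(new_borrower)[0] for m in base_models])
print(prediction)
```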
At the end of the modeling stage of the data
science process, one has
(1) analyzed the business question;
(2) sourced the data relevant to answer the
question;
(3) selected a data science technique to answer
the question;
(4) picked a data science algorithm and prepared
the data to suit the algorithm;
(5) split the data into training and test datasets;
(6) built a generalized model from the training
dataset; and
(7) validated the model against the test dataset
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice.,
Morgan
APPLICATION
Deployment is the stage at which the model
becomes production ready or live.
In business applications, the results of the data
science process have to be assimilated into the
business process—usually in software applications.
The model deployment stage has to deal with:
assessing model readiness,
technical integration,
response time,
model maintenance, and
assimilation.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice.,
Morgan
Collecting, cleaning and visualizing
data
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Collecting Data
The most critical issue in any data
science or modeling project is finding the
right data set.
Who might actually have the data I need?
Why might they decide to make it available
to me?
How can I get my hands on it?
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Hunting
Scraping
Logging
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Hunting
Who has the data, and how can you get
it?
Companies and Proprietary Data
Sources
Government Data Sources
Academic Data Sets
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Companies and Proprietary
Data Sources
Large companies like Facebook, Google,
Amazon, American Express, and Blue
Cross have amazing amounts of exciting
data about users and transactions
Companies are reluctant to share data for
two good reasons:
Business issues, and the fear of helping
their competition.
Privacy issues, and the fear of offending
their customers
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Many responsible companies like The
New York Times, Twitter, Facebook, and
Google do release certain data
Providing customers and third parties with
data that can increase sales.
For example, releasing data about query
frequency and ad pricing can encourage
more people to place ads on a given
platform.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Most organizations have internal data sets of
relevance to their business.
As an employee, you should be able to get
privileged access while you work there.
Be aware that companies have internal data
access policies, so you will still be subject to
certain restrictions.
Violating the terms of these policies is an
excellent way to become an ex-employee
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Government Data Sources
City, state, and federal governments have become increasingly committed to open data, to facilitate novel applications and improve how government can fulfill its mission.
Government data differs from industrial data in that, it belongs to
the People.
The Freedom of Information Act (FOI) enables any citizen to make
a formal request for any government document or data set.
Such a request triggers a process to determine what can be
released without compromising the national interest or violating
privacy.
Preserving privacy is typically the biggest issue in deciding whether a particular government data set can be released.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
Academic Data Sets
An increasing fraction of academic research
involves the creation of large data sets.
Many journals now require making source
data available to other researchers prior to
publication.
Expect to be able to find vast amounts of
economic, medical, demographic, historical,
and scientific data
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The key to finding these data sets is to track down
the relevant papers
Google Scholar is the most accessible source of
research publications.
Research publications will typically provide pointers
to where its associated data can be found.
If not, contacting the author directly with a request
should quickly yield the desired result
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Someone else has usually worked hard to analyze published data sets before you got to them.
But bringing fresh questions to old data
generally opens new possibilities.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Scraping
Web pages often contain valuable text
and numerical data
Spidering is the process of downloading
the right set of pages for analysis.
Scraping is the fine art of stripping this
content from each page to prepare it for
computational analysis.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Scraping programs were site-specific
scripts hacked up to look for particular
HTML patterns flanking the content of
interest
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The most advanced form of spidering is
web crawling
where you systematically traverse all
outgoing links from a given root page
continuing recursively until you have visited
every page on the target website.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
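A bare-bones scraping sketch, assuming the requests and BeautifulSoup libraries are installed and using a placeholder URL; always review a site's terms of service before scraping it:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/"                       # placeholder root page
html = requests.get(url, timeout=10).text          # download (spider) the page
soup = BeautifulSoup(html, "html.parser")

text = soup.get_text(" ", strip=True)              # scrape: strip content from the markup
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]  # outgoing links to crawl

print(text[:200])
print(links)
```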
Every major website contains a terms of
service document that restricts what you can
legally do with any associated data
Aaron Swartz case
If you are attempting a web-scraping project
professionally, be sure that management
understands the terms of service before you
get too creative with someone else's property.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Logging
Internal access to a web service, communications
device, or laboratory instrument grants you the
right and responsibility to log all activity for
downstream analysis.
Amazing things can be done with ambient data
collection from weblogs and sensing devices
The accelerometers in cell phones can be used to
measure the strength of earthquakes
Filter out people driving on bumpy roads or leaving their phones in a clothes dryer.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
Monitoring the GPS data of a fleet of taxi cabs
tracks traffic congestion on city streets.
Computational analysis of image and video
streams opens the door to countless
applications.
Another cool idea is to use cameras as weather
instruments, by looking at the color of the sky in
the background of the millions of photographs
uploaded to photo sites daily.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Logging
The important considerations in designing any
logging system are:
Build it to endure with limited maintenance. Set it
and forget it, by provisioning it with enough
storage for unlimited expansion, and a backup.
Store all fields of possible value, without going
crazy.
Use a human-readable format or transactions
database, so you can understand exactly what is
in there when the time comes, months or years
later, to sit down and analyze your data
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice.,
Morgan
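One way to satisfy these considerations is to append each event as a human-readable JSON line; the field names below are illustrative assumptions:

```python
import json
import time

def log_event(path, **fields):
    record = {"timestamp": time.time(), **fields}   # keep a timestamp on every record
    with open(path, "a", encoding="utf-8") as f:    # append-only; survives restarts
        f.write(json.dumps(record) + "\n")          # one human-readable record per line

log_event("activity.log", user="u123", action="search",
          query="interest rates", latency_ms=42)
```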
Cleaning Data
Garbage in, garbage out
Process the data before the real analysis to make sure that the garbage never gets in in the first place.
Errors vs. Artifacts
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Errors vs. Artifacts
Under ancient Jewish law, if a suspect on trial
was unanimously found guilty by all judges,
then this suspect would be acquitted.
Unanimous agreement often indicates the
presence of a systemic error in the judicial
process.
When something seems too good to be true,
a mistake has likely been made somewhere.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Data errors represent information that is
fundamentally lost in acquisition
Gaussian noise blurring the resolution of our sensors represents error: precision that has been permanently lost.
The two hours of missing logs because the
server crashed represents data error: it is
information which cannot be reconstructed
again.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Artifacts are systematic problems arising
from processing done to the raw information
it was constructed from
Processing artifacts can be corrected, so
long as the original raw data set remains
available.
These artifacts must be detected before
they can be corrected.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The key to detecting processing artifacts is the “sniff test”:
something bad is usually something unexpected or surprising.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Models for data science
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Modeling
The process of encapsulating information into
a tool which can forecast and make
predictions
Predictive models are structured around some
idea of what causes future events to happen.
Extrapolating from recent trends and
observations assumes a world view that the
future will be like the past.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Occam's Razor
Occam's razor is the philosophical principle that
the “simplest explanation is the best
explanation”.
Given two models or theories which do an
equally accurate job of making predictions, we
should opt for the simpler one as sounder and
more robust.
It is more likely to be making the right decision
for the right reasons.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Occam’s notion of simpler generally refers to reducing the
number of assumptions employed in developing the model.
Minimize the parameter count of a model.
Overfitting occurs when a model tries too hard to
achieve accurate performance on its training data.
This happens when there are so many parameters
that the model can essentially memorize its training
set, instead of generalizing appropriately to minimize
the effects of error and outliers.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and
practice., Morgan
Accuracy is not the best metric to use in
judging the quality of a model.
Simpler models tend to be more robust
and understandable than complicated
alternatives
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Bias-Variance Trade-Offs
The bias error is an error from erroneous
assumptions in the learning algorithm.
High bias can cause an algorithm to miss the relevant
relations between features and target outputs
(underfitting).
The variance is an error from sensitivity to small fluctuations in the training set. If our training set contains sampling or measurement error, this noise introduces variance into the resulting model.
High variance may result from an algorithm modeling the
random noise in the training data (overfitting).
Errors of bias produce underfit models.
They do not fit the training data as tightly as possible.
Errors of variance result in overfit
models:
their quest for accuracy causes them to
mistake noise for signal
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
A Taxonomy of Models
Linear vs. Non-Linear Models
Blackbox vs. Descriptive Models
First-Principle vs. Data-Driven Models
Stochastic vs. Deterministic Models
Flat vs. Hierarchical Models
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
Linear vs. Non-Linear Models
Linear models are governed by
equations that weigh each feature
variable by a coefficient reflecting its
importance, and sum up these values to
produce a score.
Powerful machine learning techniques,
such as linear regression, can be used to
identify the best possible coefficients to
fit training data
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The world is not linear.
Richer mathematical descriptions include
higher-order polynomials, logarithms, and
exponentials.
These permit models that fit training data
much more tightly than linear functions can.
It is much harder to find the best possible
coefficients to fit non-linear models
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
But linear models offer substantial benefits.
They are readily understandable,
generally defensible, easy to build, and
avoid overfitting on modest-sized data sets.
Occam's razor tells us that the simplest
explanation is the best explanation.
A robust linear model yielding an accuracy of x% may be preferable to a complex non-linear beast that is only a few percentage points better on limited testing data.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Blackbox vs. Descriptive Models
Black boxes are devices that do their job,
but in some unknown manner
Descriptive models provide some insight
into why they are making their decisions
Theory-driven models are generally
descriptive
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Descriptive Models
Linear regression models are descriptive,
because
one can see exactly which variables receive
the most weight, and
measure how much they contribute to the
resulting prediction.
Decision tree models enable you to
follow the exact decision path used to
make a classification
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Blackbox Models
Blackbox modeling techniques such as
deep learning can be extremely
effective.
Neural network models are generally
completely opaque as to why they do
what they do.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and
practice., Morgan
A system built for the military to distinguish
images of cars from trucks.
It performed well in training, but disastrously in
the field.
Only later was it realized that the training
images for cars were shot on a sunny day and
those of trucks on a cloudy day, so the system
had learned to link the sky in the background
with the class of the vehicle
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
First-Principle vs. Data-Driven Models
First-principle models are based on a belief
of how the system under investigation really
works.
It might be a theoretical explanation, like
Newton's laws of motion.
Such models can employ the full weight of
classical mathematics: calculus, algebra,
geometry, and more.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
For example, voters are unhappy if the economy is bad; therefore, variables which measure the state of the economy should help us predict who will win the election.
Data-driven models are based on observed correlations
between input parameters and outcome variables
The same basic model might be used to predict
tomorrow's weather or the price of a given stock,
differing only on the data it was trained on.
Machine learning methods make it possible to build an
effective model on a domain one knows nothing about,
provided we are given a good enough training set.
Stochastic vs. Deterministic Models
Stochastic is a fancy word meaning
randomly determined.
Techniques that explicitly build some
notion of probability into the model
include logistic regression and Monte
Carlo simulation.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
It is important that your model observe the
basic properties of probabilities, including:
Each probability is a value between 0 and 1
That they must sum to 1
Rare events do not have probability zero:
Any event that is possible must have a greater
than zero probability of occurrence
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The fact that deterministic models always return the same answer helps greatly in debugging their implementation.
This speaks to the need to optimize
repeatability during model development.
Fix the initial seed if you are using a random
number generator, so you can rerun it and
get the same answer.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Flat vs. Hierarchical Models
Interesting problems often exist on several different
levels, each of which may require independent
submodels
Imposing a hierarchical structure on a model permits
it to be built and evaluated in a logical and
transparent way, instead of as a black box.
Hierarchical models are descriptive: one can trace a
final decision back to the appropriate top-level
subproblem, and report how strongly it contributed to
making the observed result
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Predicting the future price for a particular
stock really should involve submodels for
analyzing such separate issues as
(a) the general state of the economy,
(b) the company’s balance sheet, and
(c) the performance of other companies in its
industrial sector.
The first step to build a hierarchical
model is explicitly decomposing our
problem into subproblems.
Deep learning models can be thought of as being both flat and hierarchical at the same time.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Baseline Models
The first step to assess the complexity of
your task involves building baseline models:
the simplest reasonable models that
produce answers we can compare against.
More sophisticated models should do better than baseline models, but verifying that they really do, and if so by how much, puts their performance into the proper context.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Evaluating Models
But the best way to assess models
involves out-of-sample predictions,
results on data that you never saw when
you built the model.
Good performance on the data that you
trained models on is very suspect,
because models can easily be overfit.
Evaluating Classifiers
Two distinct labels or classes (binary
classification)
The smaller and more interesting of the two classes is taken as positive and the larger/other class as negative.
In a spam classification problem, the spam would typically be positive and the ham (non-spam) would be negative.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
There are four possible results of what the
classification model could do on any given
instance, which defines the confusion matrix or
contingency table
True Positives (TP):
Here our classifier labels a positive item as positive, resulting in a win for the classifier.
True Negatives (TN):
Here the classifier correctly determines that a member of the negative class deserves a negative label.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
False Positives (FP):
The classifier mistakenly calls a negative item positive, resulting in a type I classification error.
False Negatives (FN):
The classifier mistakenly declares a positive item as negative, resulting in a type II classification error.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Baseline Evaluators
We must defend our classifier against two baseline opponents, the sharp and the monkey.
The sharp is the opponent who knows what evaluation system we are using, and picks the baseline model which will do best according to it.
The sharp will try to make the evaluation statistic look bad, by achieving a high score with a useless classifier.
That might mean declaring all items positive, or
perhaps all negative.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The monkey randomly guesses on each
instance.
To interpret our model's performance, it
is important to establish by how much it
beats both the sharp and the monkey.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
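A small sketch comparing both baselines on made-up labels with 5% positives, matching the cancer-screening discussion that follows:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.random(10_000) < 0.05                  # True = positive class (rare, about 5%)

sharp = np.zeros(y_true.size, dtype=bool)           # the sharp: declare every item negative
monkey = rng.random(y_true.size) < 0.5              # the monkey: fair-coin random guessing

print("sharp accuracy: ", np.mean(sharp == y_true))    # roughly 0.95, i.e. 1 - p
print("monkey accuracy:", np.mean(monkey == y_true))   # roughly 0.50
# A useful classifier must be judged by how much it beats both of these baselines.
```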
Accuracy of the classifier
The ratio of the number of correct predictions
over total predictions
By multiplying such fractions by 100, we can
get a percentage accuracy score.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Accuracy alone has limitations as an evaluation
metric, particularly when the positive class is much
smaller than the negative class
Consider the development of a classifier to diagnose whether a patient has cancer, where the positive class has the disease (i.e., tests positive) and the negative class is healthy.
The prior distribution is that the vast majority of people are healthy: p = positive/(positive + negative) << 1/2.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The expected accuracy of a fair-coin monkey would
still be 0.5:
it should get an average of half of the positives and half
the negatives right.
But the sharp would declare everyone to be healthy,
achieving an accuracy of 1- p.
Suppose that only 5% of the test takers really had the
disease.
The sharp could brag about her accuracy of 95%
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Precision
We need evaluation metrics that are
more sensitive to getting the positive
class right.
Precision = TP/(TP + FP) measures how often the classifier is correct when it dares to say positive.
Achieving high precision is impossible for either a sharp or a monkey, because the fraction of positives (p = 0.05) is so low.
If the classifier issues too many positive labels, it is doomed to low precision.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Recall
In the cancer diagnosis case, we might
be more ready to tolerate false positives
(errors where we scare a healthy person
with a wrong diagnosis) than false
negatives (errors where we kill a sick
patient by misdiagnosing their illness).
Recall = TP/(TP + FN) measures how often you prove right on all positive instances.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
A high recall implies that the classifier has few false negatives.
The easiest way to achieve this is to declare that everyone has cancer, as done by a sharp always answering yes.
Such a classifier has high recall but low precision.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
F-score (F1-score)
Harmonic mean of precision and recall: F = (2 × precision × recall)/(precision + recall)
The harmonic mean is always less than
or equal to the arithmetic mean
Achieving a high F-score requires both
high recall and high precision
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
The higher the F-score, the better the predictive power of the classification procedure.
A score of 1 means the classification procedure is perfect.
The lowest possible F-score is 0, so 0 ≤ F ≤ 1.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
Accuracy is a misleading statistic when the class sizes
are substantially different
Recall equals accuracy if and only if the classifiers are
balanced
High precision is very hard to achieve in unbalanced
class sizes:
F-score does the best job of any single statistic, but all four work together to describe the performance of a classifier.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and
practice., Morgan
Problem 1
Suppose a computer program for
recognizing dogs in photographs
identifies eight dogs in a picture
containing 12 dogs and some cats.
Of the eight dogs identified, five actually
are dogs while the rest are cats.
Compute the precision and recall of the
computer program.
Problem 1
TP = 5
FP = 3
FN = 7
The precision P is P = TP/( TP + FP)
= 5/( 5 + 3) = 5/ 8
The recall R is R = TP/( TP + FN)
= 5/( 5 + 7) = 5/ 12
Problem 2
Let there be 10 balls (6 white and 4 red
balls) in a box and let it be required to
pick up the red balls from them.
Suppose we pick up 7 balls as the red
balls of which only 2 are actually red
balls.
What are the values of precision and recall in picking red balls?
Problem 2
TP = 2
FP = 7 − 2 = 5
FN = 4 − 2 = 2
The precision P is P = TP/( TP + FP)
= 2/( 2 + 5) = 2/ 7
The recall R is R = TP/( TP + FN )
= 2/(2 + 2) = 1/2
Problem 3
A database contains 80 records on a particular topic
of which 55 are relevant to a certain investigation.
A search was conducted on that topic and 50
records were retrieved.
Of the 50 records retrieved, 40 were relevant.
Construct the confusion matrix for the search and
calculate the precision and recall scores for the
search.
Each record may be assigned a class label
“relevant" or “not relevant”.
All the 80 records were tested for relevance. The
test classified 50 records as “relevant”.
But only 40 of them were actually relevant.
Problem 3
                            Actual ‘Relevant’    Actual ‘Not Relevant’
Predicted ‘Relevant’               40                      10
Predicted ‘Not Relevant’           15                      15
Problem 3
TP = 40
FP = 10
FN = 15
The precision P is P = TP/( TP + FP)
= 40/( 40 + 10) = 4/ 5
The recall R is R = TP/( TP + FN)
= 40/( 40 + 15) = 40/ 55
Other measures of
performance
Using the data in the confusion matrix of a classifier of
two-class dataset, several measures of performance
have been defined.
Accuracy = (TP + TN)/( TP + TN + FP + FN )
Error rate = 1− Accuracy
Sensitivity = TP/( TP + FN)
Specificity = TN /(TN + FP)
F-measure = (2 × TP)/( 2 × TP + FP + FN)
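These measures written out as small Python functions and checked against the numbers from Problem 3 (TP = 40, FP = 10, FN = 15, TN = 15):

```python
def accuracy(tp, tn, fp, fn): return (tp + tn) / (tp + tn + fp + fn)
def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)              # also called sensitivity
def specificity(tn, fp): return tn / (tn + fp)
def f_measure(tp, fp, fn): return 2 * tp / (2 * tp + fp + fn)

tp, fp, fn, tn = 40, 10, 15, 15                        # confusion matrix from Problem 3
print(accuracy(tp, tn, fp, fn))    # 55/80  = 0.6875
print(precision(tp, fp))           # 40/50  = 0.8
print(recall(tp, fn))              # 40/55  ≈ 0.727
print(specificity(tn, fp))         # 15/25  = 0.6
print(f_measure(tp, fp, fn))       # 80/105 ≈ 0.762
```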
Receiver Operating Characteristic (ROC)
The acronym ROC stands for Receiver Operating
Characteristic, a terminology coming from signal
detection theory.
The ROC curve was first developed by electrical
engineers and radar engineers during World War
II for detecting enemy objects in battlefields.
They are now increasingly used in machine
learning and data mining research.
TPR and FPR
Let a binary classifier classify a collection of test data.
TP = Number of true positives
TN = Number of true negatives
FP = Number of false positives
FN = Number of false negatives
TPR = True Positive Rate = TP/( TP + FN )= Fraction of
positive examples correctly classified = Sensitivity
FPR = False Positive Rate = FP /(FP + TN) = Fraction of
negative examples incorrectly classified = 1 −
Specificity
ROC space
We plot the values of FPR along the horizontal
axis (that is , x-axis) and the values of TPR along
the vertical axis (that is, y-axis) in a plane.
For each classifier, there is a unique point in this
plane with coordinates (FPR,TPR).
The ROC space is the part of the plane whose
points correspond to (FPR,TPR).
Each prediction result or instance of a confusion
matrix represents one point in the ROC space.
ROC space
The position of the point (FPR,TPR) in the
ROC space gives an indication of the
performance of the classifier.
For example, let us consider some
special points in the space
One step higher for positive examples
and one step right for negative examples
Special points in ROC space
The left bottom corner point (0, 0):
Always negative prediction
A classifier which produces this point in the
ROC space never classifies an example as
positive, neither rightly nor wrongly,
because for this point TP = 0 and FP = 0.
It always makes negative predictions.
All positive instances are wrongly predicted
and all negative instances are correctly
predicted.
It commits no false positive errors.
Special points in ROC space
The right top corner point (1, 1):
Always positive prediction
A classifier which produces this point in the
ROC space always classifies an example as
positive because for this point FN = 0 and
TN = 0.
All positive instances are correctly
predicted and all negative instances are
wrongly predicted.
It commits no false negative errors.
Special points in ROC space
The left top corner point (0, 1):
Perfect prediction
A classifier which produces this point in the
ROC space may be thought as a perfect
classifier.
It produces no false positives and no false
negatives
Special points in ROC space
Points along the diagonal:
Random performance
Consider a classifier where the class labels are
randomly guessed, say by flipping a coin.
Then, the corresponding points in the ROC space
will be lying very near the diagonal line joining
the points (0, 0) and (1, 1).
ROC curve
In the case of certain classification algorithms,
the classifier may depend on a parameter.
Different values of the parameter will give
different classifiers and these in turn give
different values to TPR and FPR.
The ROC curve is the curve obtained by plotting in the ROC space the points (FPR, TPR) obtained by assigning all possible values to the parameter in the classifier.
ROC curve
The closer the ROC curve is to the top left
corner (0, 1) of the ROC space, the better the
accuracy of the classifier.
For example, among three classifiers A, B, and C with plotted ROC curves, if the curve for classifier C lies closest to the top left corner of the ROC space, then C gives the best accuracy in predictions.
Area under the ROC curve
(AUC)
The measure of the area under the ROC
curve is denoted by the acronym AUC .
The value of AUC is a measure of the
performance of a classifier.
For the perfect classifier, AUC = 1.0
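A hand-rolled sketch of the threshold-sweeping idea above: each threshold value gives one classifier, hence one (FPR, TPR) point, and the trapezoidal area under the resulting curve approximates the AUC (the labels and scores are made up):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])         # 1 = positive class (made-up labels)
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3])  # classifier scores

points = {(0.0, 0.0), (1.0, 1.0)}                          # end points of every ROC curve
for t in scores:                                           # each threshold = one classifier
    y_pred = scores >= t
    tp = np.sum(y_pred & (y_true == 1)); fn = np.sum(~y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0)); tn = np.sum(~y_pred & (y_true == 0))
    points.add((fp / (fp + tn), tp / (tp + fn)))           # (FPR, TPR) for this threshold

curve = sorted(points)                                     # the ROC curve, left to right
auc = sum((x2 - x1) * (y1 + y2) / 2                        # trapezoidal area under the curve
          for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
print(curve)
print("AUC =", auc)                                        # 1.0 would be a perfect classifier
```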
Evaluating Multiclass Systems
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Evaluating Value Prediction
Models
For numerical values, error is a function
of the difference between a forecast y’ =
f(x) and the actual result y.
Measuring the performance of a value
prediction system involves two
decisions:
(1) fixing the specific individual error
function,
(2) selecting the statistic to best represent the full error distribution.
Reference: Kotu, V., & Deshpande, B. (2019). Data science: Concepts and practice., Morgan
Absolute error:
The value Δ = y' − y has the virtue of being simple and symmetric;
the sign can distinguish the case where y' > y from y > y'.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Relative error:
The absolute magnitude of error is meaningless without
a sense of the units involved.
An absolute error of 1.2 in a person's predicted height is good if it is measured in millimeters, but terrible if measured in meters.
Normalizing the error by the magnitude of the
observation produces a unit-less quantity, which
can be sensibly interpreted as a fraction or
(multiplied by 100%) as a percentage:
ε = (y − y')/y.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Squared error:
The value Δ² = (y' − y)² is always positive.
Large error values contribute disproportionately to the total when squaring: Δ² for Δ = 2 is four times larger than Δ² for Δ = 1.
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
A commonly used statistic is the mean squared error (MSE), computed as the average of the squared errors, MSE = (1/n) Σ (y'ᵢ − yᵢ)².
Because it weighs each term quadratically, outliers have a disproportionate effect.
Thus median squared error might be a
more informative statistic for noisy
instances
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan
Root mean squared deviation (RMSD) error is simply the square root of the mean squared error: RMSD = √MSE.
The advantage of RMSD is that its
magnitude is interpretable on the same
scale as the original values
Reference: Kotu, V., & Deshpande, B. (2019). Data
science: Concepts and practice., Morgan