Unit 2.
Data science process:
business problems and
data science solution
Assoc. Prof Nguyen Manh Tuan
Opening
An important principle of data science is that data mining is a process with
fairly well-understood stages, or a set of fairly well-defined subtasks.
- Some involve the application of IT, such as the automated discovery and
evaluation of patterns from data, while others mostly require an analyst’s
creativity, business knowledge, and common sense.
Each data-driven business decision-making problem is unique, comprising
its own combination of goals, desires, constraints, and even personalities.
The solutions to the subtasks can then be composed to solve the overall
problem. Some of the subtasks are unique to the particular business problem,
but others are common data mining tasks.
Despite the large number of specific data mining algorithms developed over
the years, there are only a handful of fundamentally different types of tasks
these algorithms address.
9/20/2022 internal use
CRISP-DM process
Cross Industry Standard Process for Data Mining
- End-to-end, multi-step, iterative process
- Going back and forth between steps, at times returning to the first step to
redefine the data science problem statement
Process
Business Understanding
It is vital to understand the business problem to be solved (which is NOT the
same as building a prediction model!), and then to design a data analytics solution for it.
A part of the craft where the analysts’ creativity plays a large role.
The design team should think carefully about the use scenario.
Data Understanding
The data comprise the available raw material from which the solution will
be built.
Estimating the costs and benefits of each data source and deciding
whether further investment is merited.
Understanding the different kinds of data contained in these sources
Process
Data Preparation
Often proceeds along with data understanding.
Including all activities required to convert the disparate data sources to a well-formed
analytics base table
Ex:
converting data to tabular format.
removing or inferring missing values.
converting data to different types.
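The preparation examples above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the records, the field names (`income`, `credit_score`) and the choice of filling missing values with the column mean are all assumptions made for the example.

```python
# Hypothetical raw records; every value is made up for illustration.
raw_records = [
    {"id": "A1", "income": "52000", "credit_score": "710"},
    {"id": "A2", "income": "", "credit_score": "640"},
    {"id": "A3", "income": "61000", "credit_score": "n/a"},
]

def to_number(text):
    """Convert a string to float; return None when it is missing or invalid."""
    try:
        return float(text)
    except ValueError:
        return None

def prepare(records, columns):
    """Build a small tabular structure with numeric columns and no gaps."""
    # Step 1: type conversion (strings become floats, bad values become None).
    table = [{"id": r["id"], **{c: to_number(r[c]) for c in columns}} for r in records]
    # Step 2: infer missing values with the column mean (one simple choice).
    for c in columns:
        known = [row[c] for row in table if row[c] is not None]
        mean = sum(known) / len(known)
        for row in table:
            if row[c] is None:
                row[c] = mean
    return table

table = prepare(raw_records, ["income", "credit_score"])
for row in table:
    print(row)
```

Removing incomplete rows instead of inferring values is the other option named above; which is better depends on how much data can be spared.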
Modeling
Different data mining tasks are used to build relevant predictive models
Output of modeling is some sort of model or pattern capturing regularities in the data.
Process
Evaluation
Assess the data mining results rigorously to gain confidence that they are valid
and reliable before moving on.
Usually, a data mining solution is only a piece of the larger solution, and it needs to
be evaluated as such.
Deployment
The data mining results are put into real use in order to realize some return on investment.
The clearest cases of deployment involve implementing a predictive model in some
information system or business process.
Process
[Figure: the data science process in five steps]
1. Prior Knowledge: Business Understanding, Data Understanding
2. Data Preparation: Prepare Data
3. Modeling: Training Data, Building Model using Algorithms; Test Data, Applying Model and Performance Evaluation
4. Application: Deployment
5. Posterior Knowledge: Knowledge and Actions
Types of modeling:
- Predictive modeling
- Descriptive/explanatory modeling
1. Prior Knowledge
Gaining information on:
- Objective of the problem
- Subject area of the problem and contextual information
- Data
If the interest rate of past borrowers with a range of credit scores is
known, can the interest rate for a new borrower be predicted?
Consumer loan business (case)
- Interest rate (vs principal)
- Federal funds rate (central/national bank)
- Borrower’s credit score/income level/initial down payment amount/current assets/liabilities
- Lender’s reward (interest) vs risk (default on the loan)
An individual: default status is Boolean
Group of borrowers: default rate is a continuous numeric variable that
indicates the percentage of borrowers who default
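The difference between an individual's Boolean default status and a group's continuous default rate can be shown with a short sketch. The flags below are made up, and `default_rate` is a hypothetical helper name.

```python
# Hypothetical default flags for a group of five borrowers (made-up values).
defaulted = [False, True, False, False, True]

def default_rate(flags):
    """Fraction of borrowers in the group who defaulted (between 0.0 and 1.0)."""
    # True counts as 1 and False as 0, so the sum is the number of defaults.
    return sum(flags) / len(flags)

rate = default_rate(defaulted)
print(f"Default rate: {rate:.0%}")
```

The same raw data therefore supports two kinds of target: a Boolean label per borrower, or a numeric rate per group.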
1. Prior Knowledge
Correlation Analysis
- Two factors are correlated when values of x have some predictive power on the value of y.
- The correlation coefficient of X and Y measures the degree to which Y is a function
of X (and vice versa).
- Correlation ranges from -1 (anti-correlated) to 1 (fully correlated) through 0
(uncorrelated).
• SAT scores and freshman GPA (r=0.47)
• Income and coronary disease (r=-0.717)
• Smoking and mortality rate (r=0.716)
• Video games and violent behaviour (r=0.19)
Causation versus Correlation
Correlation does not mean causation.
The number of police active in a precinct correlates strongly with the local crime
rate, but the police do not cause the crime.
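As a rough illustration of the -1 to 1 range, the sketch below computes a correlation coefficient by hand. It assumes the Pearson coefficient, which the slide does not name explicitly, and uses made-up numbers rather than any of the data sets quoted above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises exactly with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly as x rises
r_none = pearson_r(x, [2, 1, 3, 1, 2])   # no linear relationship
print(r_up, r_down, r_none)
```

A value near 1 or -1 says only that the two variables move together; it says nothing about which one, if either, causes the other.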
1. Prior Knowledge
A dataset (example set) (sometimes data frame)
A data point (example, record, object)
An attribute (feature, dimension, variable, field, predictor/antecedent, input)
A label (class label, target, response, prediction/consequence, output/outcome)
Identifiers: for locating/providing context for individual records; excluded in
modeling.
Attribute types
- numeric/ continuous
- categorical/ nominal
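The vocabulary above maps onto a small data structure as follows. This is only a sketch: the field names (`borrower_id`, `credit_score`, `home_owner`, `interest_rate`) and all values are invented, and the type check is deliberately crude.

```python
# Hypothetical dataset (example set); every value is made up for illustration.
dataset = [
    {"borrower_id": "B01", "credit_score": 500, "home_owner": "no", "interest_rate": 7.31},
    {"borrower_id": "B02", "credit_score": 600, "home_owner": "yes", "interest_rate": 6.70},
    {"borrower_id": "B03", "credit_score": 700, "home_owner": "yes", "interest_rate": 5.95},
]

ID_FIELD = "borrower_id"        # identifier: locates a record, excluded from modeling
LABEL_FIELD = "interest_rate"   # label: the value to be predicted

def split_record(record):
    """Separate one data point into (identifier, attributes, label)."""
    attributes = {k: v for k, v in record.items() if k not in (ID_FIELD, LABEL_FIELD)}
    return record[ID_FIELD], attributes, record[LABEL_FIELD]

def attribute_type(value):
    """Classify a value as numeric or categorical."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return "numeric" if is_number else "categorical"

record_id, attributes, label = split_record(dataset[0])
types = {name: attribute_type(value) for name, value in attributes.items()}
print(record_id, attributes, label)
print(types)
```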
2. Data Preparation
Data Exploration
descriptive statistics
visualization of data
Data quality
Handling missing values
Data type conversion
Transformation
Outliers
Feature selection
Sampling
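Three of the items listed above (descriptive statistics, outliers, transformation) can be sketched with the standard library. The income figures are made up, and both the 1.5 × IQR outlier rule and min-max scaling are common conventions chosen here as assumptions, not methods prescribed by the slide.

```python
import statistics

# Hypothetical incomes in thousands (made-up values); 300 is deliberately extreme.
incomes = [42, 45, 47, 50, 52, 55, 58, 300]

# Data exploration: descriptive statistics.
mean = statistics.mean(incomes)
median = statistics.median(incomes)

# Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in incomes if v < lower or v > upper]
cleaned = [v for v in incomes if lower <= v <= upper]

# Transformation: min-max scaling of the remaining values to the range [0, 1].
lo, hi = min(cleaned), max(cleaned)
scaled = [(v - lo) / (hi - lo) for v in cleaned]

print("mean:", mean, "median:", median)
print("outliers:", outliers)
print("scaled:", scaled)
```

Note how far the mean sits from the median here: a single extreme value is enough to distort it, which is why outliers are checked before modeling.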
3. Modeling
[Figure: Training Data is used to build the model; Test Data is used for evaluation; the result is the Final Model]
3. Modeling
[Figure: data mining builds a model from “training” data, which have all values specified. The model is then applied to a new data item that has some value unknown (e.g., will she leave?) to produce a prediction.]
3. Modeling
Splitting training and test data sets
3. Modeling
Splitting training and test data sets (rule of thumb: 2/3 for training; 1/3
for testing)
Training Data
Test Data
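The 2/3 to 1/3 rule of thumb can be sketched as a random split. The function name `train_test_split`, the fixed seed and the use of integers as stand-in records are all assumptions made for the example.

```python
import random

def train_test_split(records, train_fraction=2 / 3, seed=42):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = records[:]
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(30))  # stand-ins for 30 data points
train, test = train_test_split(records)
print(len(train), "training records,", len(test), "test records")
```

Shuffling before cutting matters when the records are stored in some meaningful order (for example by date or by credit score); otherwise the test set would not resemble the training set.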
3. Modeling
Modeling the relationship as a first-degree (linear) equation
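A first-degree model has the form y = a·x + b. The sketch below fits one by ordinary least squares, which is an assumption, since the slide does not say how the line is obtained. The credit scores and interest rates are made-up training values.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data (made-up values): credit score -> interest rate in %.
credit_scores = [500, 600, 700, 800]
interest_rates = [9.0, 8.0, 7.0, 6.0]

slope, intercept = fit_line(credit_scores, interest_rates)

def predict(credit_score):
    """Apply the fitted first-degree equation to a new borrower."""
    return slope * credit_score + intercept

print("rate =", slope, "* score +", intercept)
print("predicted rate for score 650:", predict(650))
```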
3. Modeling
Evaluation of test dataset
Actual data vs. predicted values
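Evaluation on the test set compares each actual value with the model's prediction. The sketch below does this with two common error measures, mean absolute error and root mean squared error. Both are assumptions, as the slide names no metric, and the model coefficients and test values are made up.

```python
import math

def predict(credit_score):
    """Hypothetical fitted model: interest rate (%) as a linear function of score."""
    return 14.0 - 0.01 * credit_score

# Hypothetical test data, held out from training (made-up values).
test_scores = [550, 650, 750]
actual_rates = [8.7, 7.4, 6.6]

predicted_rates = [predict(s) for s in test_scores]
errors = [actual - predicted for actual, predicted in zip(actual_rates, predicted_rates)]

# Mean absolute error: average size of the miss, in the label's own units.
mae = sum(abs(e) for e in errors) / len(errors)
# Root mean squared error: penalizes large misses more heavily.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

for score, actual, predicted in zip(test_scores, actual_rates, predicted_rates):
    print(score, actual, round(predicted, 2))
print("MAE:", round(mae, 3), "RMSE:", round(rmse, 3))
```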
4. Application
Deployment: the stage at which the model becomes production ready
or live.
The results of the data science process have to be assimilated into the
business process (usually in business apps).
Product readiness
Technical integration
Model response time
Remodeling
Assimilation
5. Posterior Knowledge
Posterior knowledge
THE END