CS 5638 – Principles of Data Science
• Motivation for Data Science Methodology
• Data Science Methodology
• From Problem to Approach
• Working with the Data
• Deriving the Answer
1
Motivation
Source: IDC 2018
[Figure: bar chart of data size in zettabytes, 2018 vs. 2025; y-axis from 0 to 200]
2
Motivation
Source: 451 Research
• 80% of the data will be unstructured by 2025
• It is challenging to obtain information from unstructured data
• Need for a clear, well-thought-out, and standardized methodology for data science
3
Methodology
Source: Wikipedia
• Methodology is the systematic, theoretical analysis of the methods applied to a field of study.
ia@uetpeshawar.edu.pk
4
CRISP-DM
• CRISP-DM
• Cross-Industry Standard Process for Data Mining
• Goal
• Encourage interoperable tools across the entire data mining process
5
Need for a Standard Process
• Framework for recording experience
• Allows projects to be replicated
• Aid to project planning and management
• “Comfort factor” for new adopters
• Reduces dependency on “stars”
• Encourages best practices and helps obtain better results
6
Data Science Methodology
[Figure: the Data Science Methodology cycle: Business Understanding, Analytic Approach, Data Requirements, Data Collection, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Feedback]
8
Data Science Methodology
3 Broad Steps:
• From Problem to Approach
• Working with the Data
• Deriving the Answer
10 Key Questions to Answer
9
1. From Problem to Approach
• What is the problem that is being solved?
• How can the data be used to answer the question?
10
2. Working with the Data
• What data is needed to answer the question?
• Where is the data coming from, or how can it be obtained?
• Is the data collected representative of the problem being solved?
• What additional work is required to manipulate and work with the data?
11
3. Deriving the Answer
• In what way can the data be visualized to get to the answer?
• Does the model used really answer the initial question, or does it need to be adjusted?
• Can the model be put into practice?
• Can constructive feedback be obtained to answer the question?
12
Case Study
• A limited budget for providing health care to the public
• Hospital readmission: a sign of failure of the system
• Assess the patient's condition prior to discharge
• How can data science help?
• Providing new data-driven tools for timely decisions
13
1. From Problem to Approach
• Business Understanding
• What is the problem that is being solved?
• Analytic Approach
• How can the data be used to answer the question?
16
1.1 Business Understanding
• Determine Data Science Goals
• Translate the business questions into data science goals
• Specify the data science problem type
• Specify the criterion for model assessment
• Produce Project Plan
• Define the initial process plan and discuss feasibility with stakeholders
• Put identified goals and selected techniques into a coherent procedure
• Estimate the effort and resources needed; identify critical steps
17
Business Understanding
Case Study
• Question
• What is the best way to allocate the limited health-care budget
to maximize its use in providing quality health care?
18
Business Understanding
Case Study
• Goal
• To provide quality care without increasing
costs
• Objective
• To review the process to identify inefficiencies.
19
Business Understanding
Case Study
• Patients readmitted to a rehabilitation center
• 35% within one year
• 50% within five years
20
Business Understanding
Case Study
• Review the data
• Findings
• Patients with Congestive Heart Failure (CHF) were at the top of the readmission data
• A decision model can be applied to investigate why this is happening
21
Business Understanding
Case Study
• Four business requirements are identified
1. Predict the CHF readmission outcome (0 or 1) for each patient
2. Predict the readmission risk for each patient
3. Understand explicitly what combination of events led to the predicted outcome for each patient
4. Apply an easy-to-understand process to new patients to predict their readmission risk
22
1.2 Analytic Approach
• Predictive model: determine the probability of an action
• Descriptive model: show relationships
• Classification model: yes/no answer
23
Analytic Approach
Case Study
• Predictive Model
• To predict an outcome
• Decision Tree Classification
• Categorical outcome
• Explicit decision path showing conditions leading to high risk
• Easy to understand and apply
24
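The "explicit decision path" property can be illustrated with a small hand-written tree. This is only a sketch: the feature names, thresholds, and leaf labels below are hypothetical and carry no clinical meaning, and a real tree would be learned from the training data rather than typed in.

```python
# Hypothetical tree: features, thresholds, and leaf labels are invented
# for illustration only and are not taken from the case study.
TREE = {
    "feature": "prior_admissions", "threshold": 1,
    "low": {"feature": "age", "threshold": 75,
            "low": {"label": "N"},
            "high": {"label": "Y"}},
    "high": {"feature": "length_of_stay", "threshold": 5,
             "low": {"label": "N"},
             "high": {"label": "Y"}},
}


def classify(patient, node=TREE):
    """Return (label, path); path lists every condition tested on the way down."""
    path = []
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        value = patient[feature]
        if value > threshold:
            path.append(f"{feature} = {value} > {threshold}")
            node = node["high"]
        else:
            path.append(f"{feature} = {value} <= {threshold}")
            node = node["low"]
    return node["label"], path


label, path = classify({"prior_admissions": 2, "age": 68, "length_of_stay": 9})
print("Predicted readmission:", label)
for step in path:
    print("  because", step)
```

The returned path is what makes the model easy to explain: every prediction comes with the list of conditions that produced it.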
Analytic Approach
Case Study
[Figure: example decision tree for heart failure, with nodes such as reduced ability to exercise, rapid weight gain, fatigue, and lack of appetite; branches labeled True/False and leaves labeled Y/N]
25
2. Working with the Data
• What data is needed to answer the question?
• Where is the data coming from, or how can it be obtained?
• Is the data collected representative of the problem being solved?
• What additional work is required to manipulate and work with the data?
28
2.1 Data Requirements
• How to cook Tiramisu?
• Problem to resolve
• How to cook Tiramisu?
• Data
• Ingredients
• Which ingredients are required?
• How to collect them?
• How to prepare the ingredients to cook the desired
dish?
29
2.1 Data Requirements
What are data requirements?
Six Key Questions
• What: What type of data is required?
• Where: Where do you get the data?
• When: When do you need the data?
• Why: Why do we need the data?
• How: How will the data be used?
• How: How do you obtain the data?
30
2.1 Data Requirements
Case Study
• Define data requirements for the decision tree classification approach
• Define and select cohort
• In-patient within health insurance provider’s service area
• Primary diagnosis of CHF in one year
• Continuous enrollment for at least 6 months prior to primary CHF admission
• Disqualifying conditions
• Patients with other significant medical conditions
31
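The cohort criteria above can be expressed as a simple filter. The sketch below assumes a hypothetical `Patient` record whose field names are invented here; how each criterion is actually encoded in the source systems is not specified in the slides.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All field names are hypothetical stand-ins for the cohort criteria.
    patient_id: str
    in_service_area: bool
    primary_diagnosis: str
    months_enrolled_before_admission: int
    has_disqualifying_condition: bool


def in_cohort(p: Patient) -> bool:
    """Apply the four cohort rules: area, diagnosis, enrollment, exclusions."""
    return (
        p.in_service_area
        and p.primary_diagnosis == "CHF"
        and p.months_enrolled_before_admission >= 6
        and not p.has_disqualifying_condition
    )


# Made-up patients for illustration.
patients = [
    Patient("p1", True, "CHF", 12, False),
    Patient("p2", True, "CHF", 3, False),      # enrolled for too short a time
    Patient("p3", True, "CHF", 24, True),      # disqualifying condition
    Patient("p4", False, "CHF", 18, False),    # outside the service area
    Patient("p5", True, "Pneumonia", 9, False),
]
cohort_ids = [p.patient_id for p in patients if in_cohort(p)]
print(cohort_ids)
```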
Defining the Data
Case Study
• Contents, formats, representation suitable for decision tree
classifier
• One record per patient
• Columns representing variables
• Contents covering all aspects of patient’s clinical history
• Transactional format
• Transformation required
32
2.2 Data Collection
• After the initial data collection, the data scientist must assess the collected data.
• Determine whether the data is what is required.
• Some data might be missing.
• Some might be hard to get.
33
2.2 Data Collection
• Various techniques can be applied to assess the content and quality of the data and to gain initial insights.
• Visualization
• Descriptive Statistics
35
Data Collection
Case Study
• Available data source
• Corporate data warehouse
• Single source of medical and claims
• In-patient record system
• Claim payment system
• Disease management program
information
36
Data Collection
Case Study
• Data wanted but not available
• Pharmaceutical records
• OK to defer
37
Data Collection
Case Study
Merging the data
Eliminate the redundant
data
38
2.3 Data Understanding
• Is the data to be collected representative of the problem to be
solved?
• What does it mean to “prepare” or “clean” the data?
39
2.3 Data Understanding
• Describe Data
• Check data volume and examine its properties
• Accessibility and availability of attributes
• Attribute types, ranges, correlations, identifiers
• Understand the meaning of each attribute and attribute value in business terms
• For each attribute, compute basic statistics
• Distribution
• Average, max, min
• Standard deviation, variance, skewness
40
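As a minimal sketch of the per-attribute statistics listed above, the following uses only Python's standard `statistics` module. The sample values are made up, and the population (not sample) formulas are an arbitrary choice for the illustration.

```python
import statistics


def describe(values):
    """Basic univariate statistics for one numeric attribute."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Population skewness: mean of cubed z-scores (0 for a symmetric sample).
    skew = sum(((v - mean) / std) ** 3 for v in values) / n if std else 0.0
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "variance": statistics.pvariance(values),
        "skewness": skew,
    }


length_of_stay = [2, 3, 3, 4, 4, 5, 6, 14]  # made-up values, in days
stats = describe(length_of_stay)
for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```

A clearly positive skewness, as in this sample, signals a long right tail that is worth inspecting before modeling.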
2.3 Data Understanding
• Explore Data
• Analyze properties of interesting attributes in detail
• Verify Data Quality
• Identify special values and catalogue their meaning
• Does it cover all the cases required?
• Does it contain errors?
• Identify missing attributes.
• Do the meanings of the attributes and the values they contain fit together?
• Check the spelling of values (“The case of exploding mangoes” vs. “the Case of Exploding Mangoes”)
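One way to catch spelling variants like the pair above is to group values by a normalized key. This sketch only handles differences in letter case and whitespace; real typos would need fuzzy matching, which is not shown. The third sample title is a placeholder added for the illustration.

```python
from collections import defaultdict


def spelling_variants(values):
    """Group values that differ only by letter case or extra whitespace."""
    groups = defaultdict(set)
    for value in values:
        key = " ".join(value.split()).casefold()
        groups[key].add(value)
    # Report only keys that occur under more than one spelling.
    return {key: sorted(variants)
            for key, variants in groups.items() if len(variants) > 1}


titles = [
    "The case of exploding mangoes",
    "the Case of Exploding Mangoes",
    "Another Title",  # placeholder value
]
variants = spelling_variants(titles)
print(variants)
```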
41
2.3 Data Understanding
Case Study
• Run descriptive statistics against the data columns that can become variables in the model.
• Descriptive Statistics
• Univariate Statistics
• Pairwise Correlation
• Histogram
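Pairwise correlation can be sketched with a hand-written Pearson coefficient. The column names and values below are made up; highly correlated pairs are candidates for dropping one of the two variables before modeling.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up columns for illustration.
columns = {
    "age": [55, 62, 70, 78, 85],
    "length_of_stay": [3, 4, 6, 7, 9],
    "prior_admissions": [2, 0, 1, 3, 1],
}
for a, b in combinations(columns, 2):
    print(f"{a} vs {b}: r = {pearson(columns[a], columns[b]):+.2f}")
```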
42
2.3 Data Understanding
Case Study
• Data Quality
• Missing Values
• Invalid or misleading
values
43
2.3 Data Understanding
Case Study
• Iterative Data Collection and Understanding
• Refined definition of “CHF admission”
• Initial definition
• Primary diagnosis of CHF
• Refine the definition based on the clinical information
44
2.4 Data Preparation
• Data Cleaning
• Correct, remove or ignore noise
• Decide how to deal with special values and their meaning
• e.g., 0 = Male, 1 = Female
• Aggregation Level
• Outliers
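The slides do not say how outliers are identified. One common rule, used here purely as an assumption, flags values more than 1.5 interquartile ranges outside the quartiles. The sample values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


length_of_stay = [2, 3, 3, 4, 4, 5, 6, 41]  # made-up values, in days
outliers = iqr_outliers(length_of_stay)
print(outliers)
```

Whether a flagged value is corrected, removed, or kept is a domain decision: a 41-day stay may be a data entry error or a genuinely severe case.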
45
2.4 Data Preparation
• Feature Engineering
• The process of using domain knowledge to create features that make machine learning algorithms work.
46
2.4 Data Preparation
• Integrate Data
• Integrate sources and store the result
• Format Data
• Rearrange attributes
• First field is the identifier, last field is the label
• Reorder records
• Reformat within values
• Remove illegal characters
• Convert upper case to lower case, etc.
47
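The formatting steps above might look like the following sketch. Which characters count as "illegal" is an assumption made here (anything other than letters, digits, whitespace, and hyphens), and the record fields are hypothetical.

```python
import re


def clean_value(text):
    """Lower-case a value, drop disallowed characters, collapse whitespace."""
    text = text.lower()
    # Assumed whitelist: letters, digits, whitespace, and hyphens.
    text = re.sub(r"[^a-z0-9\s-]", "", text)
    return " ".join(text.split())


def reorder(record, id_field, label_field):
    """Put the identifier first and the label last; keep other fields in order."""
    middle = [k for k in record if k not in (id_field, label_field)]
    return {k: record[k] for k in [id_field, *middle, label_field]}


print(clean_value("  Congestive  HEART Failure!! "))
row = reorder({"age": 70, "readmitted": "Y", "patient_id": "p1"},
              id_field="patient_id", label_field="readmitted")
print(list(row))
```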
2.4 Data Preparation
Case Study
• CHF broad definition
• Define the readmission criterion
• Index admission
• Readmission
• Based on expert advice and the data, a 30-day time frame is set for readmission.
48
2.4 Data Preparation
Case Study
• Aggregating Records
• Claims:
• Professional provider, facility, pharmaceutical
• Inpatient and outpatient records
• Diagnosis, procedure, prescription, etc.
• Possibly thousands per patient (depending on clinical history)
49
2.4 Data Preparation
Case Study
• Aggregate to patient level
• Roll up to 1 record per patient
• Create new columns representing the transactions
• Outpatient visits
• Inpatient episodes
• Frequency
• Recency
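Rolling transactions up to one record per patient can be sketched as below. The transaction layout, the column names, and the `as_of` reference date are assumptions for the illustration; counts stand in for frequency and days since the last visit for recency.

```python
from collections import defaultdict
from datetime import date

# Made-up transactional records: (patient_id, visit_type, visit_date).
transactions = [
    ("p1", "outpatient", date(2020, 1, 10)),
    ("p1", "inpatient", date(2020, 3, 2)),
    ("p1", "outpatient", date(2020, 5, 20)),
    ("p2", "inpatient", date(2020, 2, 14)),
]


def roll_up(transactions, as_of):
    """Aggregate transactions into one row per patient."""
    patients = defaultdict(lambda: {
        "outpatient_visits": 0, "inpatient_episodes": 0, "last_visit": None})
    for patient_id, visit_type, visit_date in transactions:
        row = patients[patient_id]
        if visit_type == "outpatient":
            row["outpatient_visits"] += 1      # frequency
        elif visit_type == "inpatient":
            row["inpatient_episodes"] += 1     # frequency
        if row["last_visit"] is None or visit_date > row["last_visit"]:
            row["last_visit"] = visit_date
    for row in patients.values():
        row["days_since_last_visit"] = (as_of - row["last_visit"]).days  # recency
    return dict(patients)


table = roll_up(transactions, as_of=date(2020, 6, 30))
for patient_id, row in table.items():
    print(patient_id, row)
```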
50
2.4 Data Preparation
Case Study
• More or less data needed?
• Literature review of important factors for CHF
readmission
51
2.4 Data Preparation
Case Study
• Completing the Data Set
• Merge all records
• List of variables used in modeling
• Target
• CHF readmission within 30 days (Yes/No) following discharge from
CHF hospitalization
52
2.4 Data Preparation
Case Study
• Target
• CHF readmission within 30 days (Yes/No) following discharge from
CHF hospitalization
• Measures: Gender, Age, Primary Drug, Length of Stay, Prior Admission, CHF Diagnosis Importance (Primary, Secondary, Tertiary)
• Diagnosis Flags: CHF, Renal Failure, Hypertension, Diabetes, Pneumonia
53
2.4 Data Preparation
Case Study
• Using Training Set
• Total records: 2,343
• Randomly divide into training and test sets (70%/30% split)
• Training – 1,640
• Testing – 703
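The 70%/30% split and the resulting set sizes can be reproduced with a short sketch. The integer record IDs stand in for the real patient rows, and the fixed seed is an arbitrary choice that makes the shuffle repeatable.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it at the training fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(2343))  # stand-in for the 2,343 patient records
train, test = train_test_split(records)
print(len(train), len(test))
```

With 2,343 records, 70% rounds down to 1,640 training rows, leaving 703 for testing, which matches the figures on the slide.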
54
3. Deriving the Answer
• In what way can the data be visualized to get to the answer?
• Does the model used really answer the initial question, or does it need to be adjusted?
• Can the model be put into practice?
• Can constructive feedback be obtained to answer the question?
56
3.1 Modeling
• In what way can the data be visualized to get the required answer?
• Select Modeling Technique
• Select technique
• Identify any assumptions the technique makes about the data
• Compare these assumptions with the data description report
• Ensure there is no mismatch
57
3.1 Modeling
• Build Model
• Set initial parameters and document the reasons for choosing those values
• Run the selected technique on the input data set
• Record the parameter settings used to produce the model
• Describe the model, its special features, behavior and interpretation
58
3.2 Evaluation
• Does the model used really answer the initial question, or does it need to be adjusted?
• Evaluate Results
• Understand the results and cross-verify them against the goals
• Check results against the knowledge base (usefulness and novelty)
• Rank results with respect to the business success criteria
• State conclusions for future projects
60
3.1/2 Modeling / Evaluation
Case Study
• Confusion Matrix
                     Actual Positive   Actual Negative
Predicted Positive   TP                FP
Predicted Negative   FN                TN
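The four cells of the confusion matrix, and the accuracy, sensitivity, and specificity derived from them, can be computed as in this sketch. The actual and predicted labels are made up; "Y" is treated as the positive class (readmitted).

```python
def confusion_matrix(actual, predicted, positive="Y"):
    """Count TP, FP, FN, TN for a binary classification."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Derive the summary measures used to compare models."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # true-positive rate (Y accuracy)
        "specificity": tn / (tn + fp),   # true-negative rate (N accuracy)
    }


# Made-up labels for illustration.
actual = ["Y", "Y", "Y", "N", "N", "N", "N", "N"]
predicted = ["Y", "Y", "N", "Y", "N", "N", "N", "N"]
cells = confusion_matrix(actual, predicted)
summary = rates(*cells)
print(cells, summary)
```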
61
3.1/2 Modeling / Evaluation
Case Study
• Analyzing the 3 models
Model   Relative Cost (Y:N)   Overall Accuracy (% of correct Y & N)   Sensitivity (Y accuracy)   Specificity (N accuracy)
1       1:1                   85%                                     45%                        97%
2       9:1                   49%                                     97%                        35%
3       4:1                   81%                                     68%                        85%
62
3.1/2 Modeling / Evaluation
Case Study
• How to determine the optimal model?
• Balance the true-positive rate and the false-positive rate for the best model
Model   Relative Cost (Y:N)   TP Rate (Sensitivity)   Specificity (N accuracy)   FP Rate (1 - Specificity)
1       1:1                   0.45                    0.97                       0.03
2       1.5:1                 0.60                    0.92                       0.08
3       4:1                   0.68                    0.85                       0.15
4       9:1                   0.97                    0.35                       0.65
64
3.1/2 Modeling / Evaluation
Case Study
• Using ROC Curve
• Classification model
performance
• TP rate vs. FP rate
• Optimal model at max
separation
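If "max separation" is read as the largest gap between the true-positive rate and the false-positive rate (an interpretation assumed here, not stated on the slide), the choice among the four models in the preceding table can be sketched as:

```python
# TP and FP rates copied from the four-model comparison table.
models = {
    1: {"relative_cost": "1:1", "tpr": 0.45, "fpr": 0.03},
    2: {"relative_cost": "1.5:1", "tpr": 0.60, "fpr": 0.08},
    3: {"relative_cost": "4:1", "tpr": 0.68, "fpr": 0.15},
    4: {"relative_cost": "9:1", "tpr": 0.97, "fpr": 0.65},
}


def separation(model):
    """Vertical distance of the model's ROC point above the diagonal."""
    return model["tpr"] - model["fpr"]


for number, model in models.items():
    print(f"Model {number} ({model['relative_cost']}): "
          f"separation = {separation(model):.2f}")

best = max(models, key=lambda number: separation(models[number]))
print("Largest separation: model", best)
```

Under this reading, the 4:1 model has the largest gap, narrowly ahead of the 1.5:1 model.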
65
3.3 Deployment
• Can the model be put into practice?
• Plan Deployment
• How will the knowledge or information be propagated to users?
• How will the use of the results be monitored or its benefits measured?
• Identify possible problems when deploying the data mining results.
67
3.3 Deployment
Case Study
• Assimilate knowledge for business
• Practical understanding of the meaning of model results
• Implications of model results for designing intervention
actions
68
3.3 Deployment
Case Study
• Gathering Application Requirements
• Automated, near real-time risk assessments of CHF
inpatients
• Easy to use
• Automated data preparation and scoring
• Up-to-date risk assessment to help clinicians target
high-risk patients
69
3.3 Deployment
Case Study
• Additional Requirements
• Training for clinical staff
• Tracking / monitoring
processes
70
3.4 Feedback
• Can constructive feedback be obtained to answer the question?
73
3.4 Feedback
Case Study
• Define review process
• To measure the results of applying the risk model to the CHF patient population
• Track patients who received intervention
• Actual readmission outcomes
• Measure effectiveness of intervention
• Compare readmission rates before and after model implementation
74
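The before/after comparison can be sketched with a pair of readmission rates. The outcome lists are made up (1 = readmitted within 30 days, 0 = not); a raw difference like this does not by itself show that the intervention caused the change.

```python
def readmission_rate(outcomes):
    """Share of patients readmitted, given 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for illustration.
before = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
after = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

rate_before = readmission_rate(before)
rate_after = readmission_rate(after)
absolute_change = rate_after - rate_before
relative_change = absolute_change / rate_before

print(f"before: {rate_before:.0%}, after: {rate_after:.0%}")
print(f"absolute change: {absolute_change:+.0%}, relative change: {relative_change:+.0%}")
```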
3.4 Feedback
Case Study
• Refine Model
• Initial review after the first year of implementation
• Based on feedback data and knowledge gained
• Possibly incorporate the detailed pharmaceutical data originally deferred
75
3.4 Feedback
Case Study
• Redeploy
• Continue modeling, deployment, feedback, and refinement
throughout the life of the intervention program
76
Acknowledgement
• The material presented in these slides is adapted from various sources, including
• IBM Data Science Course.
• João Mendes Moreira lecture on CRISP-DM
75