Preprocessing and Feature Engineering

Bergner, Borchert, da Cruz, Konak, Dr. Schapranow


Data Management for Digital Health
Winter 2019
Agenda

[Figure: course agenda with three tracks (Medical Use Cases, Technology Foundation, Machine Learning). Topics shown: Oncology, Nephrology and Intensive Care, Biology Recap, Data Sources, Data Formats, Processing and Analysis, Software Architectures, Data, ML, Refine, Evaluate, Prediction + Probability]

Data Preparation

[Figure: analytics process] Requirements Analysis → Data Acquisition → Data Preparation → Predictive Modeling → Evaluation → Deployment
Artifacts shown: Model requirements, Raw data, Training data, Test data
Data Preparation comprises:
- Exploration
- Quality assessment
- Cleansing
- Labeling
- Imputation
- Feature engineering
Roles: Data Scientist, Domain Expert, (Data) Engineer
Icons made by Smashicons from www.flaticon.com

What Is Data Preparation

Data preparation can make or break the predictive ability of your model
According to Kuhn and Johnson, data preparation is the process of adding, deleting, or transforming training set data
Sometimes, preprocessing of data can lead to unexpected improvements in model accuracy
Data preparation is an important step: experiment with the preprocessing steps that are appropriate for your data to see whether they give the desired boost in model accuracy

Data Preparation Importance
Motivation

Data in healthcare → sparse and incomplete
Preparing the proper input dataset, compatible with the machine learning algorithm requirements
Integral step in machine learning
Directly affects the ability of our model to learn
Make sure that the data is in a useful scale and format, and even that meaningful features are included
Improving the performance of machine learning models
Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
https://elitedatascience.com/feature-engineering

Why Data Preparation Is so Important in Digital Health

Source: https://www.researchgate.net/publication/332436103_Impact_of_Preprocessing_Methods_on_Healthcare_Predictions

Data Preparation Steps

How do I clean up the data? → Data Cleaning
How do I provide accurate data? → Data Transformation
How do I incorporate and adjust data? → Data Integration
How do I unify and scale data? → Data Normalization
How do I handle missing data? → Missing Data Imputation
How do I detect and manage noise? → Noise Identification

Data Preparation Process

The process for getting data ready for a machine learning algorithm can be summarized in three steps:
- Step 1: Select Data
- Step 2: Preprocess Data
- Step 3: Transform Data
Follow this process in a linear manner
https://statistik-dresden.de/archives/1128

Select Data

There is always a strong desire to include all the data that is available, in the hope that the maxim “more is better” will hold. This may or may not be true
Consider what data you actually need to address the question or problem you are working on
Questions to help you think:
- What is the extent of the data you have available?
- What data is not available that you wish you had available?
- What data don’t you need to address the problem?
http://uniquerecall.com/

Preprocess Data
Better Data > Fancier Algorithms

Formatting: Selected data may not be in a suitable format
Cleaning: Removal or fixing of missing data
- Some data instances are incomplete and do not carry the data needed to address the problem
- Sensitive information → anonymized or removed
- Identifying incomplete, incorrect, inaccurate, irrelevant parts of the data
Sampling: More selected data available than needed
- Longer running times for algorithms
- Larger computational and memory requirements
- Take a smaller representative sample before considering the whole dataset
https://www.flickr.com/photos/marc_smith/1473557291/sizes/l/

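The sampling step above can be sketched with the Python standard library. This is only an illustration: the function name `representative_sample`, the 10% fraction and the stand-in records are assumptions, and a simple random sample is just one way to obtain a representative subset.

```python
import random

def representative_sample(rows, fraction, seed=0):
    """Return a simple random sample containing `fraction` of the rows."""
    rng = random.Random(seed)                 # fixed seed keeps the sample reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)                # sampling without replacement

records = list(range(1000))                   # stand-in for 1000 data records
subset = representative_sample(records, fraction=0.1)
print(len(subset))
```

Prototyping on `subset` first keeps running time and memory low; the whole dataset can be used once the pipeline works.
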
Dummy Variables (also known as One-Hot Encoding)

Transforming a categorical attribute into numerical attributes
Each dummy attribute has the value 0 or 1
Full Dummy Variables: Represent n categories using n dummy variables, one variable for each level
Dummy Variables with Reference Group: Represent the categorical variable with n categories using n-1 dummy variables
Dummy Variables for Ordered Categorical Variable with Reference Group: Assume the mathematical ordering Small < Medium < Large. To indicate the ordering, use more 1s for higher categories
https://de.mathworks.com/help/stats/dummy-indicator-variables.html

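The three encodings can be sketched in plain Python. The function names are illustrative assumptions, and the first level is assumed to be the reference group.

```python
def full_dummies(value, levels):
    """n categories -> n indicator variables (one-hot)."""
    return [1 if value == level else 0 for level in levels]

def reference_dummies(value, levels):
    """n categories -> n-1 indicators; the first level is the reference (all zeros)."""
    return [1 if value == level else 0 for level in levels[1:]]

def ordered_dummies(value, levels):
    """Ordered categories -> n-1 indicators with more 1s for higher levels."""
    rank = levels.index(value)
    return [1 if rank >= i else 0 for i in range(1, len(levels))]

levels = ["Small", "Medium", "Large"]
for size in levels:
    print(size, full_dummies(size, levels),
          reference_dummies(size, levels), ordered_dummies(size, levels))
```

In the ordered encoding, "Large" carries every indicator that "Medium" carries plus one more, which is how the ordering is preserved.
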
Transformed Attributes

Data transformation changes relative differences among individual values
Types of transformation:
- Linear: adding a constant or multiplying by a constant
- Non-linear: log transformation, square-root transformation, etc.
https://www.davidzeleny.net/anadat-r/doku.php/en:data_preparation

Transformed Attributes
Box-Cox

The log transformation is suitable for strongly right-skewed data; the sqrt transformation is suitable for slightly right-skewed data
https://www.davidzeleny.net/anadat-r/doku.php/en:data_preparation

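A small sketch of how the two non-linear transformations reduce right skew. The sample values are made up for illustration, and the skewness measure (third standardized moment) is an assumption about how skew is quantified.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical right-skewed measurements (strictly positive, as the log requires).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
sqrt_t = [math.sqrt(v) for v in raw]
log_t = [math.log(v) for v in raw]

for name, data in [("raw", raw), ("sqrt", sqrt_t), ("log", log_t)]:
    print(f"{name:>4}: skewness = {skewness(data):.2f}")
```

On this sample the square root pulls the long right tail in moderately and the logarithm pulls it in more strongly, matching the rule of thumb above.
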
How to Handle Missing Data

There is NO good way to deal with missing data!
Different solutions for data imputation depending on the kind of problem (time series analysis, ML, regression, etc.)
No general solution
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4

Data Imputation
(Mean/Median) Values

Calculating the mean/median of the non-missing values in a column and using it to replace the missing values in that column

Pros:
- Easy and fast
- Works well with small numerical datasets
Cons:
- Doesn’t factor in the correlations between features; it only works on the column level
- Will give poor results on encoded categorical features (do NOT use it on categorical features)
- Not very accurate
- Doesn’t account for the uncertainty in the imputations

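A minimal sketch of column-wise mean/median imputation, assuming missing entries are represented as `None`. The column name and values are hypothetical.

```python
import statistics

def impute_column(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in values]

heart_rate = [72, None, 88, 64, None, 120]    # hypothetical column with gaps
mean_filled = impute_column(heart_rate, "mean")
median_filled = impute_column(heart_rate, "median")
print(mean_filled)
print(median_filled)
```

Each column is treated in isolation, which is exactly why this approach ignores correlations between features.
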
Data Imputation
(Most Frequent) or (Zero/Constant) Values

Most Frequent is another statistical strategy to impute missing values
Replacing missing data with the most frequent value within each column

Pros:
- Works well with categorical features
Cons:
- It also doesn’t factor in the correlations between features
- It can introduce bias in the data

Zero or Constant imputation replaces the missing values with either zero or any constant value you specify

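Both strategies fit in a few lines of plain Python. The helper names and sample columns are illustrative assumptions, with `None` standing for a missing value.

```python
from collections import Counter

def impute_most_frequent(values):
    """Replace None with the most frequent observed value of the column."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_constant(values, fill_value=0):
    """Replace None with a fixed value chosen by the user."""
    return [fill_value if v is None else v for v in values]

blood_type = ["A", "O", None, "A", "B", None]   # hypothetical categorical column
print(impute_most_frequent(blood_type))
print(impute_constant([1.5, None, 2.0], fill_value=0))
```

Filling every gap with the majority category inflates that category's share, which is the bias mentioned above.
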
Data Imputation
k-NN

k nearest neighbours is an algorithm that is used for simple classification
The algorithm uses ‘feature similarity’ to predict the values of any new data points
A new point is assigned a value based on how closely it resembles the points in the training set

Pros:
- Can be much more accurate than the mean, median or most frequent imputation methods (it depends on the dataset)
Cons:
- Computationally expensive: k-NN works by storing the whole training dataset in memory
- k-NN is quite sensitive to outliers in the data (unlike SVM)

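A minimal sketch of k-NN imputation, under several assumptions: missing values are `None`, distance is Euclidean over the features a row actually has, neighbours are drawn only from complete rows, and the gap is filled with the neighbours' mean. Library implementations differ in these details.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        known = [i for i, v in enumerate(row) if v is not None]

        def dist(other):
            # Euclidean distance using only the features observed in `row`.
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in known))

        neighbours = sorted(complete, key=dist)[:k]
        imputed.append([
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ])
    return imputed

# Hypothetical (age, systolic blood pressure) rows; one pressure value is missing.
data = [[25, 118], [30, 121], [62, 140], [65, 144], [28, None]]
filled = knn_impute(data, k=2)
print(filled[-1])
```

The missing pressure is estimated from the two patients closest in age, so the result depends on the other features rather than on the column alone. In practice the features would usually be scaled first so that no single feature dominates the distance.
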
Data Imputation
Multivariate Imputation

Filling in the missing data multiple times
Multiple imputations (MIs) are much better than a single imputation because they measure the uncertainty of the missing values in a better way
The chained equations approach is also very flexible and can handle variables of different data types
https://www.youtube.com/watch?v=zX-pacwVyvU

Data Reduction

How do I reduce the dimensionality of data? → Feature Selection (FS)
How do I remove redundant and/or conflictive examples? → Instance Selection (IS)
How do I simplify the domain of an attribute? → Discretization
How do I fill in gaps in data? → Feature Extraction and/or Instance Generation

Projection
Principal Component Analysis (PCA)

As the amount of data in the world grows, the size of the datasets available for ML development also grows
Dimensionality reduction involves transforming data to new dimensions in a way that facilitates discarding some dimensions without losing any key information
Large-scale problems bring about several dimensions that can become very difficult to visualize
Some of these dimensions can easily be dropped for a better visualization
https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial

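To make the idea concrete, here is a sketch of PCA for the two-dimensional case only, where the eigen-decomposition of the covariance matrix has a closed form. The data points are made up, the function name is an assumption, and the data is only centered here, not standardized.

```python
import math

def pca_first_component(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    # Largest eigenvalue and its eigenvector (closed form for a symmetric 2x2 matrix).
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = (b, lam - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    scores = [x * vx + y * vy for x, y in centered]   # 1-D representation
    explained = lam / (a + c)                         # share of total variance retained
    return scores, (vx, vy), explained

pts = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
scores, direction, explained = pca_first_component(pts)
print(direction, round(explained, 4))
```

Because the two features are almost perfectly correlated, one dimension retains nearly all of the variance and the second can be dropped.
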
Applications of PCA

Pros:
- Removes correlated features
- Improves algorithm performance
- Reduces overfitting
- Improves visualization
Cons:
- Independent variables become less interpretable
- Data standardization is a must before PCA
- Information loss
http://setosa.io/ev/principal-component-analysis/

Fourier Transformation

Fourier showed that any periodic signal s(t) can be written as a sum of sine waves with various amplitudes, frequencies and phases
For example, a square wave can be written as such a Fourier expansion
http://mriquestions.com/fourier-transform-ft.html

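The formula from the slide did not survive extraction. As an illustration, the standard expansion for a square wave is given below, under the assumptions that the wave has unit amplitude, odd symmetry and fundamental frequency f (a symbol introduced here). The normalization used on the original slide may differ.

```latex
\[
s(t) = \frac{4}{\pi} \sum_{k = 1, 3, 5, \dots} \frac{1}{k} \sin(2\pi k f t)
     = \frac{4}{\pi} \left( \sin(2\pi f t) + \frac{1}{3}\sin(6\pi f t) + \frac{1}{5}\sin(10\pi f t) + \dots \right)
\]
```

Only odd harmonics appear, and their amplitudes fall off as 1/k.
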
Fast Fourier Transform

https://giphy.com/gifs/fourier-transform-Km4XeiMqFNCDK

Discrete Fourier Transform

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}$$

Fourier series in 1822
http://mriquestions.com/fourier-transform-ft.html
https://de.wikipedia.org/wiki/Joseph_Fourier

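The definition can be translated directly into code. This is a naive O(N²) sketch for illustration, not the fast algorithm; the test signal is an assumption.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform: X_k = sum_n x_n * exp(-i*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# A sine wave with 2 cycles per window, sampled at N = 8 points.
N = 8
signal = [math.sin(2 * math.pi * 2 * n / N) for n in range(N)]
magnitudes = [abs(X) for X in dft(signal)]
print([round(m, 6) for m in magnitudes])
```

All of the energy lands in bin k = 2 and its mirror bin k = N - 2 = 6, each with magnitude N/2.
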
Filter

[Figures: Low Pass Filter, High Pass Filter]
https://www.adinstruments.com/tips/data-quality

Fourier Transformation

Important signal processing tool
Used to decompose a signal into its sine and cosine components
The output of the transformation represents the signal in the Fourier or frequency domain
Mathematical operations can then eliminate certain frequency components very easily
Applying the inverse Fourier transform recovers the time signal
https://slideplayer.com/slide/4173668/

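The transform, modify, invert workflow can be sketched as an idealized low-pass filter. The naive transform, the hard cutoff and the synthetic signal are assumptions made for illustration; practical filters usually use an FFT and smoother frequency responses.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT; with inverse=True it computes the inverse transform."""
    N = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N) for n in range(N))
           for k in range(N)]
    return [v / N for v in out] if inverse else out

def low_pass(signal, cutoff):
    """Zero every frequency bin above `cutoff` cycles per window, then invert."""
    N = len(signal)
    spectrum = dft(signal)
    for k in range(N):
        freq = min(k, N - k)            # bins k and N-k carry the same frequency
        if freq > cutoff:
            spectrum[k] = 0
    return [v.real for v in dft(spectrum, inverse=True)]

N = 32
slow = [math.sin(2 * math.pi * 1 * n / N) for n in range(N)]          # wanted component
fast = [0.5 * math.sin(2 * math.pi * 10 * n / N) for n in range(N)]   # high-frequency disturbance
noisy = [s + f for s, f in zip(slow, fast)]
filtered = low_pass(noisy, cutoff=3)
print(max(abs(a - b) for a, b in zip(filtered, slow)))
```

Removing the disturbance amounts to setting a few spectrum entries to zero, which is much simpler than doing the same in the time domain. A high-pass filter would zero the bins below the cutoff instead.
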
Correlation

Way to understand the relationship between multiple variables and attributes in your dataset
Using correlation, you can get insights such as:
- One or multiple attributes depend on another
- One or multiple attributes are associated with other attributes
Can help in predicting one attribute from another (great way to impute missing values)
Can (sometimes) indicate the presence of a causal relationship, although correlation alone does not prove causation
http://www.sthda.com/english/wiki/correlation-analyses-in-r

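As an illustration, the Pearson correlation coefficient can be computed in plain Python. The slide does not name a specific coefficient, so Pearson is an assumption, and the sample values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

weight_kg = [60, 72, 80, 95, 110]
bmi = [21.0, 23.5, 26.1, 30.2, 34.8]
steps = [9000, 4000, 12000, 5000, 8000]
print(round(pearson(weight_kg, bmi), 3), round(pearson(weight_kg, steps), 3))
```

A coefficient near 1 or -1, as for weight and BMI here, suggests that one attribute could be used to predict or impute the other; a value near 0 indicates no linear association.
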
Autocorrelation

Heavily used in time series analysis and forecasting
Measure of the correlation between a time series and lagged values of itself
Helps uncover hidden patterns in data
Helps identify seasonality and trend in time series data
https://en.wikipedia.org/wiki/Autocorrelation
https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/

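A short sketch of the sample autocorrelation at a given lag, using the common biased estimator. The synthetic series with a period of 12 samples is an assumption chosen to mimic monthly seasonality.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (standard biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Hypothetical signal with a period of 12 samples.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
for lag in (1, 6, 12):
    print(lag, round(autocorrelation(series, lag), 3))
```

The autocorrelation is strongly positive at lag 12 (one full period) and strongly negative at lag 6 (half a period), which is how seasonality shows up in an autocorrelation plot.
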
Transform Data

Scaling: The preprocessed data may contain attributes with a mixture of scales for various quantities. Many machine learning methods prefer data attributes to have the same scale
Decomposition: There may be features that represent a complex concept and that may be more useful to a machine learning method when split into their constituent parts
- Example → Date
Aggregation: There may be features that can be aggregated into a single feature
https://blog.dellemc.com/en-us/digital-transformation-just-got-easier-with-analytic-insights/

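Decomposition and aggregation can be sketched as follows. The timestamp, the chosen parts and the visit records are hypothetical examples, not taken from the slides.

```python
from datetime import datetime

def decompose_timestamp(ts):
    """Split an ISO timestamp into parts a model can use separately."""
    dt = datetime.fromisoformat(ts)
    return {"year": dt.year, "month": dt.month, "weekday": dt.weekday(),
            "hour": dt.hour, "is_weekend": dt.weekday() >= 5}

parts = decompose_timestamp("2019-11-23T14:30:00")
print(parts)

# Aggregation: collapse several hypothetical visit records into one feature per patient.
visits = [("p1", "2019-01-03"), ("p1", "2019-02-10"), ("p2", "2019-03-01")]
visit_count = {}
for patient, _date in visits:
    visit_count[patient] = visit_count.get(patient, 0) + 1
print(visit_count)
```

A raw timestamp is opaque to most models, whereas "hour" or "is_weekend" can directly capture patterns such as night shifts or weekend admissions.
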
Standardization (Variance Scaling)

$$\tilde{x} = \frac{x - \operatorname{mean}(x)}{\sqrt{\operatorname{var}(x)}}$$

It subtracts off the mean of the feature (over all data points) and divides by the standard deviation (the square root of the variance)
It can also be called variance scaling
The resulting scaled feature has a mean of 0 and a variance of 1
If the original feature has a Gaussian distribution, then the scaled feature does too
Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, Alice Zheng and Amanda Casari, O’Reilly, 2018

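The formula can be applied directly to one feature. This sketch assumes the population variance (dividing by n); the sample values are hypothetical.

```python
import math

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n    # population variance
    return [(v - mean) / math.sqrt(var) for v in values]

creatinine = [0.7, 0.9, 1.1, 1.4, 3.9]                # hypothetical lab values
scaled = standardize(creatinine)
print([round(v, 3) for v in scaled])
```

The scaled values have mean 0 and variance 1, and their ordering and relative spacing are unchanged.
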
Min-Max Scaling

$$\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Let x be an individual feature value (i.e., a value of the feature in some data point), and let min(x) and max(x), respectively, be the minimum and maximum values of this feature over the entire dataset
Min-max scaling squeezes (or stretches) all feature values to be within the range of [0, 1]
Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, Alice Zheng and Amanda Casari, O’Reilly, 2018

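A minimal sketch of the formula. The guard for a constant feature is an added assumption, since the formula is undefined when max(x) equals min(x).

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 35, 52, 69, 86]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

Because the result depends only on the minimum and maximum, a single extreme outlier compresses all other values into a narrow part of [0, 1].
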
Why Scaling?

https://blog.dellemc.com/en-us/digital-transformation-just-got-easier-with-analytic-insights/

Feature Engineering

Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering. ~ Andrew Ng
The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering. ~ Luca Massaron
Good data preparation and feature engineering is integral to better prediction ~ Marios Michailidis (KazAnova), Kaggle GrandMaster, Kaggle #3, former #1
Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, Alice Zheng and Amanda Casari, O’Reilly, 2018

Feature Engineering

Data may be hard to understand and process
Conduct feature engineering to make the data easier for our machine learning models to read
Feature engineering is the process of transforming the given data into a form that is easier to interpret
In general: Features can also be generated so that data visualizations prepared for people without a data-related background are more digestible
Different models often require different approaches for the different kinds of data
Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, Alice Zheng and Amanda Casari, O’Reilly, 2018

Feature Engineering
Example: Coordinate Transformation

Not possible to separate using a linear classifier

Feature Engineering
Example: Coordinate Transformation

What if you use polar coordinates instead?

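The figures for this example are not available in the text, so the following sketch assumes a typical setup: two classes arranged as concentric rings, which no straight line in (x, y) can separate. The data, the threshold and the function names are illustrative assumptions.

```python
import math

def to_polar(x, y):
    """Convert Cartesian coordinates to (radius, angle)."""
    return math.hypot(x, y), math.atan2(y, x)

# Hypothetical data: class 0 lies on an inner ring (radius 1), class 1 on an outer ring (radius 3).
angles = [2 * math.pi * i / 8 for i in range(8)]
inner = [(1 * math.cos(a), 1 * math.sin(a)) for a in angles]
outer = [(3 * math.cos(a), 3 * math.sin(a)) for a in angles]

def predict(x, y, threshold=2.0):
    """In polar coordinates, a single threshold on the radius is a linear decision rule."""
    radius, _angle = to_polar(x, y)
    return int(radius > threshold)

print([predict(*p) for p in inner], [predict(*p) for p in outer])
```

The engineered feature (the radius) turns a problem that is not linearly separable into one that a single threshold solves.
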
Iterative Process of Feature Engineering

Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal
Devise features: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two
Select features: Use different feature importance scorings and feature selection methods to prepare one or more “views” for your models to operate upon
Evaluate models: Estimate model accuracy on unseen data using the chosen features

Aspects of Feature Engineering

Feature Selection: Most useful and relevant features are selected from the available data
Feature Extraction: Existing features are combined to develop more useful ones
Feature Addition: New features are created by gathering new data
Feature Filtering: Filter out irrelevant features to make the modeling step easy

Feature Selection

Process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested
Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data

Feature Selection

Three benefits of performing feature selection before modeling your data are:
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise
- Improves Accuracy: Less misleading data means modeling accuracy improves
- Reduces Training Time: Less data means that algorithms train faster
https://quantdare.com/what-is-the-difference-between-feature-extraction-and-feature-selection/
https://towardsdatascience.com/feature-selection-techniques-1bfab5fe0784

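One simple way to select features automatically is a filter method that ranks features by their absolute correlation with the target. The slides do not prescribe a specific method, so this is only one possible sketch; the feature names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def select_k_best(features, target, k):
    """Rank features by |correlation| with the target and keep the top k names."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

features = {
    "pack_years": [0, 5, 10, 20, 30, 40],
    "age": [30, 45, 40, 60, 55, 70],
    "shoe_size": [42, 38, 44, 40, 43, 39],
}
risk_score = [1, 2, 3, 5, 7, 9]
selected = select_k_best(features, risk_score, k=2)
print(selected)
```

The irrelevant feature is dropped before modeling. Note that a correlation filter only detects linear relationships and scores each feature in isolation.
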
Feature Extraction

Aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features)
The new, reduced set of features should then be able to summarize most of the information contained in the original set
Creating some interaction (e.g., multiply or divide) between each pair of variables → lengthy process
Deep feature synthesis (DFS) is an algorithm which enables you to quickly create new variables with varying depth
https://matlab1.com/feature-extraction-image-processing/

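The pairwise interaction step can be sketched with `itertools.combinations`. The variable names and the choice of multiplication are illustrative assumptions; this is not an implementation of deep feature synthesis.

```python
from itertools import combinations

def pairwise_products(row):
    """Create one multiplicative interaction feature per pair of variables."""
    return {f"{a}*{b}": row[a] * row[b] for a, b in combinations(sorted(row), 2)}

patient = {"height_m": 1.80, "weight_kg": 81.0, "age": 40}
interactions = pairwise_products(patient)
print(interactions)
```

With n variables there are n(n-1)/2 pairs, so the number of candidate features grows quadratically, which is why doing this by hand quickly becomes lengthy.
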
To Know More

Here are some generally relevant papers:
- JMLR Special Issue on Variable and Feature Selection
Here are some generally relevant and interesting slides:
- Feature Engineering (PDF), Knowledge Discovery and Data Mining 1, by Roman Kern, Knowledge Technologies Institute
- Feature Engineering and Selection (PDF), CS 294: Practical Machine Learning, Berkeley
- Feature Engineering Studio, Course Lecture Slides and Materials, Columbia
- Feature Engineering (PDF), Leon Bottou, Princeton
And a video for some good practical tips:
- Feature Engineering

Time Series
Let’s Compare ECG Signals

“What are you doing there?”
“I’m comparing the curves and trying to find similarities and abnormalities.”
“Let me show you how to do it.”
https://en.wikipedia.org/wiki/Dr._Nick
https://en.wikipedia.org/wiki/Professor_Frink
https://www.cvphysiology.com/Arrhythmias/A009.htm

Euclidean Distance Metric
Comparing Two Time Series

Let’s assume we want to compare two time series
About 80% of published work in data mining uses Euclidean distance
https://en.wikipedia.org/wiki/Professor_Frink
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf

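For reference, a minimal sketch of the Euclidean distance between two time series of equal length. The symbols q and c and the sample values are assumptions introduced here.

```python
import math

def euclidean_distance(q, c):
    """D(Q, C) = sqrt(sum_i (q_i - c_i)^2) for two equal-length series."""
    if len(q) != len(c):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

q = [0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 2.0, 4.0, 2.0, 0.0]
print(euclidean_distance(q, c))
```

The two series have the same shape and differ only in amplitude, yet the distance is clearly non-zero; the measure compares values point by point, not shapes.
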
Data Preparation Time Series

If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results
Euclidean distance is very sensitive to some “distortions” in the data. For most problems these distortions are not meaningful → should remove them
4 most common distortions:
- Offset Translation
- Amplitude Scaling
- Linear Trends
- Noise
https://en.wikipedia.org/wiki/Dr._Nick
https://en.wikipedia.org/wiki/Professor_Frink

Preprocessing the Data
Offset Translation

http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf

Preprocessing the Data
Amplitude Scaling

Zero-mean
Unit-variance
Widely used for normalization in many machine learning algorithms
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf

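Offset translation and amplitude scaling can be removed together by normalizing each series to zero mean and unit variance before comparing. The following sketch uses synthetic series; the function names are assumptions.

```python
import math

def z_normalize(series):
    """Remove offset (zero mean) and amplitude scaling (unit variance)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / std for v in series]

def euclidean(q, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

base = [math.sin(2 * math.pi * t / 20) for t in range(40)]
shifted_scaled = [5 + 3 * v for v in base]     # same shape, different offset and amplitude

raw_distance = euclidean(base, shifted_scaled)
normalized_distance = euclidean(z_normalize(base), z_normalize(shifted_scaled))
print(raw_distance, normalized_distance)
```

On the raw series the distance is dominated by the offset and the scale; after normalization the two series are recognized as having the same shape.
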
Preprocessing the Data
Linear Trend

Removing linear trend:
- Fit the best fitting straight line to the time series, then
- subtract that line from the time series
[Figure labels: Removed offset translation, Removed amplitude scaling, Remove linear trend]
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf

Preprocessing the Data
Noise

The intuition behind removing noise is: average each data point's value with its neighbors
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf

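The remaining two distortions, linear trend and noise, can be handled with a least-squares line fit and a moving average. This is a sketch under assumptions: samples are equally spaced, and the `half_width` window parameter is an illustrative choice.

```python
def remove_linear_trend(series):
    """Fit a least-squares line over t = 0..n-1 and subtract it."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]

def moving_average(series, half_width=1):
    """Average each point with up to `half_width` neighbours on each side."""
    out = []
    for i in range(len(series)):
        window = series[max(0, i - half_width): i + half_width + 1]
        out.append(sum(window) / len(window))
    return out

trend = [2.0 * t + 1.0 for t in range(10)]
detrended = remove_linear_trend(trend)       # residuals of a pure line are numerically zero
smoothed = moving_average([0, 0, 9, 0, 0])   # the spike is spread over its neighbours
print(detrended)
print(smoothed)
```

A wider window smooths more aggressively but also blurs genuine short events, so the window size is a trade-off.
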
Feature Engineering for Time Series

Date Time Features: These are components of the time step itself for each observation
Lag Features: These are values at prior time steps
Window Features: These are a summary of values over a fixed window of prior time steps
https://tsfresh.readthedocs.io/en/latest/text/introduction.html

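The three feature types can be built by hand for a small daily series. The dates, values, lag and window size are hypothetical, and this sketch does not use the tsfresh library.

```python
from datetime import date, timedelta

def time_series_features(dates, values, lag=1, window=3):
    """Build one feature row per time step that has enough history."""
    rows = []
    for i in range(max(lag, window), len(values)):
        prior = values[i - window:i]
        rows.append({
            "month": dates[i].month,                     # date-time feature
            "weekday": dates[i].weekday(),               # date-time feature
            f"lag_{lag}": values[i - lag],               # lag feature
            f"mean_prev_{window}": sum(prior) / window,  # window feature
            "target": values[i],
        })
    return rows

start = date(2019, 12, 2)                                # a Monday
dates = [start + timedelta(days=d) for d in range(6)]
values = [10, 12, 11, 15, 14, 18]
rows = time_series_features(dates, values)
for row in rows:
    print(row)
```

Only prior values enter the lag and window features, so no information from the value being predicted leaks into its own feature row.
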
Automated Feature Engineering
Why Do It?

We’re interested in features: we want to know which are relevant. If we fit a model, it should be interpretable
- What causes lung cancer?
  – Features are aspects of a patient’s medical history
  – Binary response variable: did the patient develop lung cancer?
  – Which features best predict whether lung cancer will develop? Might want to legislate against these features.
https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063

What Next?

http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
http://amid.fish/anomaly-detection-with-k-means-clustering

What to Take Home?

Data preparation allows simplification of data to make it ready for machine learning and involves data selection, preprocessing, and transformation
Step 1: Data Selection. Consider what data is available, what data is missing and what data can be removed
Step 2: Data Preprocessing. Organize your selected data by formatting, cleaning and sampling from it
Step 3: Data Transformation. Transform preprocessed data ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation