machine learning workflow with data input.pptx
MACHINE LEARNING
INFORMATION ENGINEERING, KSU
JASON TSENG
Machine Learning Workflow
MACHINE LEARNING MODEL
• A formal definition of ML given by Professor Tom Mitchell:
“A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P,
improves with experience E.”
• ML is a field of AI consisting of learning algorithms that
• Improve their performance
• At executing some task
• Over time with experience
TASK
• From the perspective of the problem, we may define the task T as the real-world problem to be
solved.
• The problem can be anything, like finding the best house price in a specific location or finding the
best marketing strategy. In machine learning, however, the definition of a task is different,
because ML-based tasks are difficult to solve with a conventional programming approach.
• A task T is an ML-based task when it is defined by the process the system must follow to
operate on data points. Examples of ML-based tasks are classification, regression,
structured annotation, clustering, and transcription.
EXPERIENCE E
• As the name suggests, experience is the knowledge gained from the data points provided to the
algorithm or model.
• Once provided with the dataset, the model runs iteratively and learns inherent
patterns. The learning thus acquired is called experience.
• By analogy with human learning, this is like a person gaining experience from various
attributes such as situations and relationships.
• Supervised, unsupervised, and reinforcement learning are some ways to learn or gain experience.
The experience gained by our ML model or algorithm is used to solve the task T.
PERFORMANCE P
• An ML algorithm is supposed to perform its task and gain experience with the passage
of time.
• The measure that tells whether an ML algorithm is performing as expected is its
performance.
• P is a quantitative metric that tells how well a model performs the task T
using its experience E.
• Many metrics help to assess ML performance, such as
accuracy, F1 score, the confusion matrix, precision, recall, and sensitivity.
CHALLENGES IN MACHINE LEARNING
• Quality of data − Obtaining good-quality data for ML algorithms is one of the
biggest challenges. Low-quality data leads to problems in data
preprocessing and feature extraction.
• Time-consuming tasks − Another challenge faced by ML models is the time
consumed, especially in data acquisition, feature extraction, and
retrieval.
• Lack of specialists − As ML technology is still in its infancy,
finding expert practitioners is difficult.
CHALLENGES IN MACHINE LEARNING
• No clear objective for formulating business problems − Having no clear objective
and well-defined goal for business problems is another key challenge for ML, because
the technology is not that mature yet.
• Overfitting and underfitting − If the model overfits or underfits, it
cannot represent the problem well.
• Curse of dimensionality − Data points with too many features are
another real hindrance for ML models.
• Difficulty in deployment − The complexity of ML models makes them quite difficult to
deploy in real life.
APPLICATIONS OF MACHINE LEARNING
ML can be used to solve many complex
real-world problems that cannot be solved
with traditional approaches.
• Emotion analysis
• Sentiment analysis
• Error detection and prevention
• Weather forecasting and prediction
• Stock market analysis and
forecasting
• Speech synthesis, speech
recognition
• Customer segmentation, object
recognition
• Fraud detection, fraud prevention
• Recommendation of products to
customers in online shopping
TYPES OF LEARNING
• Is this A or B? (classification)
• How much? How many? (regression)
• How is this organized? (clustering)
• What should I do now? (reinforcement learning)
DATA LOADING FOR ML PROJECTS
• The most common data format for ML projects is CSV (comma-separated
values), a simple file format used to store tabular data (numbers and text),
such as a spreadsheet, in plain text.
• File header: the header contains the name of each field.
• Two cases related to the CSV file header must be considered −
• Case I: the data file has a header − The names of each column of data
are assigned automatically from the header.
• Case II: the data file has no header − We need to assign the
names of each column of data manually.
LOADING OPTIONS: PANDAS, THE CSV MODULE, OR NUMPY
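The three loading options above can be sketched as follows; the tiny CSV here is hypothetical and held in memory (via io.StringIO) so the snippet is self-contained:

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical CSV data with a header row (Case I).
csv_text = "height,weight,label\n1.7,65,0\n1.8,80,1\n1.6,55,0\n"

# Option 1: the csv module from the standard library.
rows = list(csv.reader(io.StringIO(csv_text)))
header, data_rows = rows[0], rows[1:]

# Option 2: NumPy -- skip the header row and parse everything as numbers.
array = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Option 3: pandas -- column names are taken from the header automatically.
df = pd.read_csv(io.StringIO(csv_text))

# Case II: no header row -- assign the column names manually.
no_header = "1.7,65,0\n1.8,80,1\n"
df2 = pd.read_csv(io.StringIO(no_header), header=None,
                  names=["height", "weight", "label"])
```

pandas is usually the most convenient of the three, since it keeps column names and mixed types together in one structure.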
UNDERSTANDING DATA WITH STATISTICS
• Looking at Raw Data
• Checking Dimensions of Data
• Statistical Summary of Data
• Reviewing Class Distribution
• Reviewing Correlation between Attributes
• Reviewing Skew of Attribute Distribution
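Each check above maps to a one-line pandas call; the tiny dataset below is hypothetical:

```python
import pandas as pd

# Hypothetical dataset: five samples, two attributes, one class column.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38],
    "income": [40, 55, 80, 72, 60],
    "class":  [0, 1, 1, 1, 0],
})

print(df.head())                   # looking at raw data
print(df.shape)                    # dimensions of the data
print(df.describe())               # statistical summary
print(df["class"].value_counts())  # class distribution
print(df.corr())                   # correlation between attributes
print(df.skew())                   # skew of each attribute's distribution
```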
UNDERSTANDING DATA WITH VISUALIZATION
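A minimal sketch of what this step usually looks like, assuming matplotlib and pandas (the data and plot choices are illustrative, not from the slides): univariate plots such as histograms show each attribute on its own, while multivariate plots such as scatter plots show interactions between attributes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical dataset for illustration.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29, 44],
    "income": [40, 55, 80, 72, 60, 45, 70],
})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
df["age"].plot.hist(ax=axes[0], title="Univariate: histogram of age")
df.plot.scatter(x="age", y="income", ax=axes[1],
                title="Multivariate: age vs income")
fig.savefig("eda.png")
```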
PREPARING DATA
• Raw data is rarely in a form ML algorithms can use directly, which makes data
preparation one of the most important steps in the ML process. Data
preparation may be defined as the procedure that makes our dataset more
appropriate for the ML process.
• Data pre-processing techniques:
• Scaling
• Normalization
• L1 normalization
• L2 normalization
• Binarization
• Standardization
• Data labeling
SCALING
• Data rescaling makes sure that attributes
are on the same scale.
• Generally, attributes are rescaled into the
range of 0 to 1. ML algorithms like
gradient descent and k-nearest
neighbors require scaled data.
• We can rescale the data with the help
of the MinMaxScaler class of the scikit-
learn Python library.
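A minimal MinMaxScaler sketch with hypothetical values; after fitting, each column spans exactly 0 to 1:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two attributes on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 600.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)  # each column rescaled to [0, 1]
```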
BINARIZATION
• We can use a binary threshold to make our data
binary. Values above the threshold are
converted to 1 and values below it are
converted to 0.
• For example, if we choose a threshold of 0.5,
then dataset values above it become 1 and
values below it become 0. That is why we call
it binarizing, or thresholding, the data.
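This maps directly to scikit-learn's Binarizer class; the values below are hypothetical (note that scikit-learn maps values exactly equal to the threshold to 0):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.2, 0.7],
              [0.5, 0.9],
              [0.1, 0.4]])

# Values strictly above 0.5 become 1; the rest become 0.
X_bin = Binarizer(threshold=0.5).fit_transform(X)
```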
STANDARDIZATION
• Standardization transforms data attributes to a Gaussian distribution.
• It shifts the mean and rescales the SD (standard deviation) to a standard Gaussian
distribution with a mean of 0 and an SD of 1.
• This technique is useful for ML algorithms like linear regression and logistic regression
that assume a Gaussian distribution in the input dataset and produce better results
with rescaled data.
• We can standardize the data (mean = 0, SD = 1) with the help
of the StandardScaler class of the scikit-learn Python library.
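A minimal StandardScaler sketch on a hypothetical single attribute; the transformed column has mean 0 and SD 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical attribute with mean 25 and a non-unit spread.
X = np.array([[10.0], [20.0], [30.0], [40.0]])

X_std = StandardScaler().fit_transform(X)  # mean -> 0, SD -> 1
```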
DATA LABELING
• It is also very important that the data sent to ML algorithms is properly labeled.
• For example, in classification problems the data carries many labels in the form of
words, numbers, etc.
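One common way to turn word labels into the numbers ML algorithms expect is scikit-learn's LabelEncoder; the labels below are hypothetical:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["cat", "dog", "cat", "bird"]  # hypothetical word labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)       # words -> integers (sorted alphabetically)
decoded = encoder.inverse_transform(encoded)  # integers -> words
```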
DATA FEATURE SELECTION
• The performance of a machine learning model is directly proportional to the
quality of the data features used to train it.
• Using relevant data features can increase the accuracy
of your ML model, especially for linear and logistic regression.
• The following are some of the benefits of automatic feature selection
before modeling the data −
• It reduces overfitting.
• It increases the accuracy of the ML model.
• It reduces the training time.
FEATURE SELECTION TECHNIQUES
• The following are automatic feature selection techniques that we can use to
model ML data in Python −
• Univariate Selection
• Recursive Feature Elimination
• Principal Component Analysis (PCA)
• Feature Importance
UNIVARIATE SELECTION
• This feature selection technique uses statistical tests to select the features
having the strongest relationship with the output variable.
• We can implement univariate feature selection with the help of the
SelectKBest class of the scikit-learn Python library.
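A minimal SelectKBest sketch using the chi-squared test; the feature matrix is hypothetical and non-negative, as chi2 requires:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features; the third column mirrors the class.
X = np.array([[1, 9, 0],
              [2, 8, 0],
              [8, 1, 1],
              [9, 2, 1]])
y = np.array([0, 0, 1, 1])

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)
```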
RECURSIVE FEATURE ELIMINATION
• As the name suggests, the RFE (recursive feature elimination) technique
removes attributes recursively and builds the model with the remaining
attributes.
• We can implement RFE with the help of the RFE class
of the scikit-learn Python library.
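A minimal RFE sketch; the synthetic dataset and the choice of LogisticRegression as the estimator are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 100 samples, 8 features, only 3 of them informative.
X, y = make_classification(n_samples=100, n_features=8,
                           n_informative=3, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
# rfe.support_ marks the kept features; rfe.ranking_ gives the elimination order.
```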
PRINCIPAL COMPONENT ANALYSIS (PCA)
• PCA, generally called a data reduction technique, is very useful here
(strictly, it is feature extraction rather than selection), as it uses linear algebra
to transform the dataset into a compressed form.
• We can implement PCA with the help of the PCA class of the
scikit-learn Python library. We can choose the number of principal components in the
output.
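A minimal PCA sketch on the built-in Iris dataset (an assumption for illustration), compressing 4 features down to 2 principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

pca = PCA(n_components=2)         # choose the number of components kept
X_reduced = pca.fit_transform(X)  # dataset compressed to 2 columns
```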
FEATURE IMPORTANCE
• As the name suggests, the feature importance technique is used to choose the
important features.
• It uses a trained supervised classifier to score features. We can
implement this technique with the help of the ExtraTreesClassifier
class of the scikit-learn Python library.
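A minimal sketch using the ensemble ExtraTreesClassifier from scikit-learn (a single-tree ExtraTreeClassifier also exists; the ensemble version is the usual choice here), fitted on the Iris dataset for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

data = load_iris()

model = ExtraTreesClassifier(n_estimators=50, random_state=0)
model.fit(data.data, data.target)

# One score per feature; higher means more important. Scores sum to 1.
importances = model.feature_importances_
```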