Introduction to Machine Learning
Syllabus
Overview of Human Learning and Machine Learning, Types of Machine Learning, Applications of Machine Learning, Tools and Technology for Machine Learning
Contents
1.1 Overview of Human Learning
1.2 Overview of Machine Learning
1.3 Types of Machine Learning
1.4 Applications of Machine Learning
1.5 Tools and Technology for Machine Learning
1.6 Fill in the Blanks
1.7 Multiple Choice Questions
1.1 Overview of Human Learning
* Learning is the process of acquiring new understanding, knowledge, behaviours, skills, values, attitudes and preferences. The learning process happens when you observe a phenomenon and recognize a pattern.
Learning is a phenomenon and process which has manifestations of various
aspects. Learning process includes gaining of new symbolic knowledge and
development of cognitive skills through instruction and practice. It is also
discovery of new facts and theories through observation and experiment.
* All human learning is observing something, identifying a pattern, building a theory (model) to explain this pattern and testing this theory to check if it fits most or all observations.
Fig. 1.1.1 shows human learning.
Fig. 1.1.1 Human learning
* Both human as well as machine learning generate knowledge, one residing in the brain, the other residing in the machine.
* The human learning process varies from person to person. Once a learning process is set into the minds of people, it is difficult to change it.
Fig. 1.1.2 shows relation between human and machine learning.
Fig. 1.1.2 Relation between human and machine learning
Types of human learning
* Human learning takes place in the following ways :
1. Self-learning : Humans try many times, learning after multiple attempts, some being unsuccessful.
2. Knowledge gained from an expert : We build our own notion indirectly, based on what we have learnt from the expert in the past.
3. Learning directly from an expert : Somebody who is an expert in the subject directly teaches us.
* Humans acquire knowledge through experience, either directly or shared by others. Humans begin learning by memorizing. After a few years, one realizes that the mere capability to memorize is not intelligence.
* In humans, learning speed depends on the individual; in machines, learning speed depends on the algorithm selected and the volume of examples exposed to it.
Difference between Human and Machine Learning
Human learning | Machine learning
Humans acquire knowledge through experience, either directly or shared by others. | Machines acquire knowledge through experience shared in the form of past data.
Model-free and model-based mechanisms can be found in human learning. |
Observation => Learning => Skill | Data => Machine Learning => Skill
1.2 Overview of Machine Learning
* Machine Learning (ML) is a sub-field of Artificial Intelligence (AI) which concerns itself with developing computational theories of learning and building learning machines.
* Machine Learning Definition : A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
* Knowledge-based learning in machine learning :
Data => Machine Learning => Skill
Why is Machine Learning important ?
* Machine learning is programming computers to optimize a performance criterion using example data or past experience. Application of machine learning methods to large databases is called data mining.
* It is very hard to write programs that solve problems like recognizing a human face. We do not know what program to write because we do not know how our brain does it. Instead of writing a program by hand, it is possible to collect lots of examples that specify the correct output for a given input.
* A machine learning algorithm takes these examples and produces a program that does the job. The program produced by the learning algorithm may look very different from a typical hand-written program. It may contain millions of numbers. If we do it right, the program works for new cases as well as the ones we trained it on.
* The main goal of machine learning is to devise learning algorithms that do the learning automatically without human intervention or assistance. The machine learning paradigm can be viewed as "programming by example." Another goal is to develop computational models of the human learning process and perform computer simulations.
* The goal of machine learning is to build computer systems that can adapt and learn from their experience.
* An algorithm is used to solve a problem on a computer. An algorithm is a sequence of instructions that should be carried out to transform the input to output. For example, addition of four numbers is carried out by giving the four numbers as input to the algorithm; the output is the sum of all four numbers. For the same task, there may be various algorithms. We are interested in finding the most efficient one, requiring the least number of instructions or memory. For some tasks, however, we do not have an algorithm.
* Machine learning algorithms figure out how to perform important tasks by generalizing from examples.
* Machine Learning provides business insight and intelligence. Decision makers are provided with greater insights into their organizations. This adaptive technology is being used by global enterprises to gain a competitive edge.
* Machine learning algorithms discover the relationships between the variables of a system (input, output and hidden) from direct samples of the system.
* Following are some of the reasons :
1. Some tasks cannot be defined well, except by examples. For example : Recognizing people.
2. Relationships and correlations can be hidden within large amounts of data. To
solve these problems, machine learning and data mining may be able to find
these relationships.
3. Human designers often produce machines that do not work as well as desired
in the environments in which they are used.
4. The amount of knowledge available about certain tasks might be too large for explicit encoding by humans.
5. Environments change from time to time.
6. New knowledge about tasks is constantly being discovered by humans.
* Machine learning also helps us find solutions of many problems in computer
vision, speech recognition and robotics. Machine learning uses the theory of
statistics in building mathematical models, because the core task is making
inference from a sample.
How Machines Learn ?
« Machine learning typically follows three phases :
1. Training : A training set of examples of correct behavior is analyzed and some representation of the newly learnt knowledge is stored. This is often in some form of rules.
2. Validation : The rules are checked and, if necessary, additional training is given.
Sometimes additional test data are used, but instead, a human expert may validate
the rules, or some other automatic knowledge - based component may be used.
The role of the tester is often called the opponent.
3. Application : The rules are used in responding to some new situation.
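* These three phases can be sketched in a few lines of code. The following is a minimal illustration using scikit-learn and its bundled Iris data set; the decision tree model and the split ratio are assumptions of the sketch, not something the text prescribes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Phase 1 - Training : analyze examples of correct behavior and store the learnt rules.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Phase 2 - Validation : check the rules on data held back from training.
print("Validation accuracy :", accuracy_score(y_val, model.predict(X_val)))

# Phase 3 - Application : use the rules to respond to a new situation.
print("Prediction for a new sample :", model.predict([[5.1, 3.5, 1.4, 0.2]]))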
Fig. 1.2.1
How do Machines Learn ?
* The machine learning process is divided into three parts : Data input, abstraction and generalization.
* Fig. 1.2.2 shows the machine learning process.
* Data input : Information is used for future decision making.
* Abstraction : Input data is represented in a broader way through the underlying algorithm.
* Generalization : It forms the framework for making decisions.
* Machine learning is a form of Artificial Intelligence (AI) that teaches computers to think in a similar way to how humans do : learning and improving upon past experiences. It works by exploring data and identifying patterns, and involves minimal human intervention.
Fig. 1.2.2 Machine learning process
Abstraction
* During the machine learning process, knowledge is fed in the form of input data. The collected data is raw data; it cannot be used directly for processing.
* A model, in the machine learning paradigm, is the summarized knowledge representation of the raw data. The model may be in any one of the following forms :
1. Mathematical equations.
2. Specific data structures like trees.
3. Logical groupings of similar observations.
4. Computational blocks.
* Choice of the model used to solve a specific learning problem is a human task. Some of the parameters are as follows :
a) Type of problem to be solved.
b) Nature of the input data.
c) Problem domain.
Well Posed Learning Problem
Definition : A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks
in T, as measured by P, improves with experience E.
* A (machine learning) problem is well-posed if a solution to it exists, if that solution is unique, and if that solution depends on the data / experience but is not sensitive to (reasonably small) changes in the data / experience.
* Identify the three features as follows :
1. Class of tasks
2. Measure of performance to be improved
3. Source of experience
* What are T, P, E ? How do we formulate a machine learning problem ?
* A Robot Driving Learning Problem :
1. Task T : Driving on public, 4-lane highway using vision sensors.
2. Performance measure P : Average distance traveled before an error (as judged by a human overseer).
3. Training experience E : A sequence of images and steering commands
recorded while observing a human driver.
* A Handwriting Recognition Learning Problem :
1. Task T : Recognizing and classifying handwritten words within images.
2. Performance measure P : Percent of words correctly classified.
3. Training experience E : A database of handwritten words with given
classifications.
* Text Categorization Problem :
1. Task T : Assign a document to its content category.
2. Performance measure P : Precision and Recall.
3. Training experience E : Example pre-classified documents.
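* The T, P, E formulation can also be checked empirically. The following is a rough sketch (assuming scikit-learn and its small bundled digits data set) in which task T is digit classification, performance P is test accuracy and experience E is the number of training examples; P should rise as E grows.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# More experience E (training examples) should improve performance P (accuracy).
for n in (50, 200, 800):
    clf = KNeighborsClassifier().fit(X_train[:n], y_train[:n])
    print(n, "examples -> accuracy", round(clf.score(X_test, y_test), 3))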
1.3 Types of Machine Learning
* Learning is constructing or modifying representations of what is being experienced. To learn means to get knowledge of something by study, experience or being taught.
* Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data, such as data from sensors or databases.
Machine learning is usually divided into three types : Supervised, unsupervised
and reinforcement learning.
Why do machine learning ?
1. To understand and improve efficiency of human learning.
2. Discover new things or structure that is unknown to humans.
3. Fill in skeletal or incomplete specifications about a domain.
Machine learning
├─ Supervised learning : Classification, Regression
├─ Unsupervised learning : Clustering, Association analysis
└─ Reinforcement learning
Fig. 1.3.1 Types of machine learning
Supervised Learning
* Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. The task of the supervised learner is to predict the output behaviour of a system for any set of input values, after an initial training phase.
* In supervised learning, the network is trained by providing it with input and matching output patterns. These input-output pairs are usually provided by an external teacher.
* Human learning is based on past experiences. A computer does not have experiences. A computer system learns from data, which represent some "past experiences" of an application domain.
* To learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not-approved, and high-risk or low-risk : this task is commonly called supervised learning, classification or inductive learning.
* Training data includes both the input and the desired results. For some examples
the correct results (targets) are known and are given in input to the model during
the learning process. The construction of a proper training, validation and test set
is crucial. These methods are usually fast and accurate.
* Have to be able to generalize : Give the correct results when new data are given
in input without knowing a priori the target.
* In supervised learning, each example is a pair consisting of an input object and a desired output value.
* A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier or a regression function. Fig. 1.3.2 shows the supervised learning process.
Fig. 1.3.2 Supervised learning process
* The learned model helps the system perform the task better as compared to no learning.
* Each input vector requires a corresponding target vector.
Training Pair = (Input Vector, Target Vector)
Fig. 1.3.3
* Supervised learning denotes a method in which some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected according to the magnitude of the error in the way defined by the learning algorithm.
* Supervised learning is further divided into methods which use reinforcement or error correction. The perceptron learning algorithm is an example of supervised learning with reinforcement.
* In order to solve a given problem of supervised learning, the following steps are performed :
1. Find out the type of training examples.
2. Collect a training set.
3. Determine the input feature representation of the learned function.
4. Determine the structure of the learned function and corresponding learning
algorithm.
5. Complete the design and then run the learning algorithm on the collected
training set.
6. Evaluate the accuracy of the learned function. After parameter adjustment
and learning, the performance of the resulting function should be measured
on a test set that is separate from the training set.
Classification
* Classification predicts categorical labels (classes); prediction models continuous-valued functions. Classification is considered to be supervised learning.
* Classification classifies data based on the training set and the values in a classifying attribute, and uses it in classifying new data. Prediction means modelling continuous-valued functions, i.e., predicting unknown or missing values.
© Preprocessing of the data in preparation for classification and prediction can
involve data cleaning to reduce noise or handle missing values, relevance analysis
to remove irrelevant or redundant attributes, and data transformation, such as
generalizing the data to higher level concepts or normalizing data.
+ Fig. 1.3.4 shows the classification.
Aim : To predict categorical class labels for new samples.
Input : Training set of samples, each with a class label.
Output : Classifier is based on the training set and the class labels.
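* A compact sketch of this aim / input / output view, using a tiny hypothetical training set (the ages, incomes and approval labels below are invented for illustration, and the naive Bayes classifier is just one possible choice) :

from sklearn.naive_bayes import GaussianNB

# Input : training samples, each with a class label (hypothetical values).
X_train = [[25, 20000], [35, 60000], [45, 80000], [22, 25000]]   # [age, income]
y_train = ["not-approved", "approved", "approved", "not-approved"]

# Output : a classifier built from the training set and the class labels.
clf = GaussianNB().fit(X_train, y_train)

# Aim : predict the categorical class label for a new sample.
print(clf.predict([[30, 55000]]))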
Fig. 1.3.4 Classification
Prediction is similar to classification. It constructs a model and uses the model to
predict unknown or missing value.
* Classification is the process of finding a model that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on
the analysis of a set of training data.
* Classification and prediction may need to be preceded by relevance analysis,
which attempts to identify attributes that do not contribute to the classification or
prediction process.
+ Numeric prediction is the task of predicting continuous values for given input. For
example, we may wish to predict the salary of a college employee with 15 years of
work experience, or the potential sales of a new product given its price.
Some of the classification methods like back-propagation, support vector machines,
and k-nearest-neighbor classifiers can be used for prediction.
Regression
* For an input x, if the output is continuous, this is called a regression problem. For
example, based on historical information of demand for tooth paste in your
supermarket, you are asked to predict the demand for the next month.
* Regression is concerned with the prediction of continuous quantities. Linear
regression is the oldest and most widely used predictive model in the field of
machine learning. The goal is to minimize the sum of the squared errors to fit a
straight line to a set of data points.
«For regression tasks, the typical accuracy metrics are Root Mean Square Error
(RMSE) and Mean Absolute Percentage Error (MAPE). These metrics measure the
distance between the predicted numeric target and the actual numeric answer.
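* Both metrics are easy to compute directly. A short sketch with NumPy (the target and prediction values are illustrative) :

import numpy as np

y_true = np.array([250.0, 300.0, 180.0])   # actual numeric answers
y_pred = np.array([240.0, 320.0, 170.0])   # predicted numeric targets

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))              # Root Mean Square Error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0   # Mean Absolute Percentage Error

print("RMSE :", round(rmse, 2))
print("MAPE :", round(mape, 2), "%")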
Regression Line
* Least squares : The least squares regression line is the line that makes the sum of
squared residuals as small as possible. Linear means “straight line".
+ Regression line is the line which gives the best estimate of one variable from the
value of any other given variable.
© The regression line gives the average relationship between the two variables in
mathematical form.
«For two variables X and Y, there are always two lines of regression.
* Regression line of X on Y : Gives the best estimate for the value of X for any specific given value of Y :
X = a + bY
where a = X-intercept, b = slope of the line, X = dependent variable and Y = independent variable.
* Regression line of Y on X : Gives the best estimate for the value of Y for any specific given value of X :
Y = a + bX
where a = Y-intercept, b = slope of the line, Y = dependent variable and X = independent variable.
* By using the least squares method (a procedure that minimizes the vertical
deviations of plotted points surrounding a straight line) we are able to construct a
best fitting straight line to the scatter diagram points and then formulate a
regression equation in the form of :Machine Leaming 1-13 Introduction to Machine Leaming
%
Pepdaton
Population Riotiors
y- intercept nh Potdial
change :
| chan yt x8
fs ete
hangs x et erate
=Y- intercept
Fig. 1.3.5
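* As a rough sketch, the least-squares line of Y on X can be fitted with NumPy (the data points are illustrative; np.polyfit returns the slope b and then the intercept a of the best-fitting line) :

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

b, a = np.polyfit(x, y, deg=1)     # degree-1 fit : slope first, then intercept
print("Y = %.2f + %.2f X" % (a, b))

# The fitted line minimizes the sum of squared residuals.
residuals = y - (a + b * x)
print("Sum of squared residuals :", round(float(np.sum(residuals ** 2)), 4))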
* Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear regression model, the variable of interest (the "dependent" variable) is predicted from k other variables (the "independent" variables) using a linear equation. If Y denotes the dependent variable and X1, ..., Xk are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation :
Yt = β0 + β1 X1t + β2 X2t + ... + βk Xkt + εt
where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
Fig. 1.3.6 Linear regression
* In a regression tree the idea is this : Since the target variable does not have classes, we fit a regression model to the target variable using each of the independent variables. Then, for each independent variable, the data is split at several split points.
* At each split point, the “error” between the predicted value and the actual values
is squared to get a “Sum of Squared Errors (SSE)". The split point errors across the
variables are compared and the variable/point yielding the lowest SSE is chosen
as the root node/split point. This process is recursively continued.
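* A bare-bones sketch of one such split search in plain Python (the x and y values are illustrative) : every candidate split point is scored by the combined SSE of its two sides, and the lowest-SSE split wins.

def sse(values):
    # Sum of squared errors around the mean prediction for this group.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

x = [1, 2, 3, 4, 5, 6]                   # one independent variable, sorted
y = [5.0, 5.2, 5.1, 9.8, 10.1, 10.3]     # continuous target values

best_sse, best_split = min(
    (sse(y[:i]) + sse(y[i:]), x[i]) for i in range(1, len(x))
)
print("Best split : x <", best_split, "with SSE =", round(best_sse, 3))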
* The error function measures how much our predictions deviate from the desired answers :
Mean-squared error : Jn = (1/n) Σi (di − yi)²
where di is the desired answer and yi the prediction for the i-th example.
* Multiple linear regression is an extension of linear regression, which allows a
response variable, y, to be modeled as a linear function of two or more predictor
variables.
Evaluating a Regression Model
* Assume we want to predict a car's price using some features such as dimensions,
horsepower, engine specification, mileage etc. This is a typical regression problem,
where the target variable (price) is a continuous numeric value.
* We can fit a simple linear regression model that, given the feature values of a
certain car, can predict the price of that car. This regression model can be used to
score the same dataset we trained on. Once we have the predicted prices for all of
the cars, we can evaluate the performance of the model by looking at how much
the predictions deviate from the actual prices on average.
Advantages :
a. Training a linear regression model is usually much faster than methods such as
neural networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients you can infer how predictor variables affect the target outcome.
Assessing Performance of Regression - Error Measures
‘+ The training error is the mean error over the training sample. The test error is the
expected prediction error over an independent test sample.
* Fig. 1.3.7 shows the relationship between training set and test set.
© Unlike decision trees, regression trees and model trees are used for prediction. In
regression trees, each leaf stores a continuous-valued prediction. In model trees,
each leaf holds a regression model.
Unsupervised Learning
* The model is not provided with the correct results during the training. It can be used to cluster the input data into classes on the basis of their statistical properties only, followed by cluster significance analysis and labeling.
* The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes. All similar input patterns are grouped together as clusters.
* If a matching pattern is not found, a new cluster is formed. There is no error feedback.
* An external teacher is not used; learning is based upon only local information. It is also referred to as self-organization.
* They are called unsupervised because they do not need a teacher or supervisor to label a set of training examples. Only the original data is required to start the analysis.
* In contrast to supervised learning, unsupervised or self-organized learning does not require an external teacher. During the training session, the neural network receives a number of different input patterns, discovers significant features in these patterns and learns how to classify input data into appropriate categories.
* Unsupervised learning algorithms aim to learn rapidly and can be used in real-time. Unsupervised learning is frequently employed for data clustering, feature extraction etc.
* Another mode of learning, called recording learning by Zurada, is typically employed for associative memory networks. An associative memory network is designed by recording several ideal patterns into the network's stable states.
Clustering
© Clustering of data is a method by which large sets of data are grouped into
clusters of smaller sets of similar data. Clustering can be considered the most
important unsupervised learning problem.
© A cluster is therefore a collection of objects which are “similar” between them and
are “dissimilar” to the objects belonging to other clusters. Fig. 1.3.8 shows cluster.
Fig. 1.3.8 Cluster
«In this case we easily identify the 4 clusters into which the data can be divided;
the similarity criterion is distance : two or more objects belong to the same cluster
if they are “close” according to a given distance (in this case geometrical distance).
This is called distance-based clustering.
+ Clustering means grouping of data or dividing a large data set into smaller data
sets of some similarity.
+ A clustering algorithm attempts to find natural groups of components or data
based on some similarity. Also, the clustering algorithm finds the centroid of a
group of data sets.
+ To determine cluster membership, most algorithms evaluate the distance between
a point and the cluster centroids. The output from a clustering algorithm is
basically a statistical description of the cluster centroids with the number of
components in each cluster.
© Cluster centroid : The centroid of a cluster is a point whose parameter values are
the mean of the parameter values of all the points in the clusters. Each cluster has
a well defined centroid.
* Distance : The distance between two points is taken as a common metric to assess the similarity among the components of a population. The commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as :
d(p, q) = sqrt( Σi (pi − qi)² )
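* For instance, a direct translation of this formula (the points p and q are illustrative) :

import math

p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)

# Euclidean distance : square root of the summed squared differences.
d = math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
print(d)   # 5.0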
* The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering ? It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.
Clustering analysis helps construct meaningful partitioning of a large set of objects.
Cluster analysis has been widely used in numerous applications, including pattern
recognition, data analysis, image processing, etc.
Clustering algorithms may be classified as listed below :
1. Exclusive clustering
2. Overlapping clustering
3. Hierarchical clustering
4. Probabilistic clustering
* A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
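* As a minimal sketch of distance-based clustering, k-means with scikit-learn on six toy points that form two obvious groups (the choice k = 2 is an assumption of the example) :

from sklearn.cluster import KMeans

points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],    # first natural group
          [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]]    # second natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels :", km.labels_)            # which cluster each point joined
print("Centroids      :", km.cluster_centers_)   # mean point of each cluster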
Examples of clustering applications :
1. Marketing : Help marketers discover distinct groups in their customer bases and then use this knowledge to develop targeted marketing programs.
2. Land use : Identification of areas of similar land use in an earth observation database.
3. Insurance : Identifying groups of motor insurance policy holders with a high average claim cost.
4. Urban planning : Identifying groups of houses according to their house type, value and geographical location.
5. Seismology : Observed earthquake epicenters should be clustered along continent faults.
Reinforcement Learning
* The user gets immediate feedback in supervised learning and no feedback in unsupervised learning. In reinforcement learning, the feedback is a delayed scalar reward.
* Reinforcement learning is learning what to do and how to map situations to actions. The learner is not told which actions to take. Fig. 1.3.9 shows the concept of reinforcement learning.
* Reinforcement learning deals with agents that must sense and act upon their environment. It combines classical Artificial Intelligence and machine learning techniques.
* It allows machines and software agents to automatically determine the ideal behavior within a specific context, in order to maximize performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.
* The two most important distinguishing features of reinforcement learning are trial-and-error search and delayed reward.
* With reinforcement learning algorithms an agent can improve its performance by using the feedback it gets from the environment. This environmental feedback is called the reward signal.
* Based on accumulated experience, the agent needs to learn which action to take in a given situation in order to obtain a desired long term goal. Essentially, actions that lead to long term rewards need to be reinforced. Reinforcement learning has connections with control theory, Markov decision processes and game theory.
= Example of Reinforcement Learning : A mobile robot decides whether it
should enter a new room in search of more trash to collect or start trying to
find its way back to its battery recharging station. It makes its decision based
on how quickly and easily it has been able to find the recharger in the past.
Fig. 1.3.9 Reinforcement learning
Elements of Reinforcement Learning
* Reinforcement learning elements are as follows :
1. Policy  2. Reward function
3. Value function  4. Model of the environment
* Fig. 1.3.10 shows the elements of reinforcement learning.
* Policy : A policy defines the learning agent's behavior for a given time period. It is a mapping from perceived states of the environment to actions to be taken when in those states.
* Reward function : A reward function is used to define the goal in a reinforcement learning problem. It maps each perceived state of the environment to a single number.
Fig. 1.3.10 Elements of reinforcement learning
Value function : Value functions specify what is good in the long run. The value
of a state is the total amount of reward an agent can expect to accumulate over
the future, starting from that state.
Model of the environment : Models are used for planning.
* Credit assignment problem : Reinforcement learning algorithms learn to generate an internal value for the intermediate states as to how good they are in leading to the goal.
* The learning decision maker is called the agent. The agent interacts with the environment that includes everything outside the agent.
* The agent has sensors to decide on its state in the environment and takes an action that modifies its state.
* The reinforcement learning problem model is an agent continuously interacting with an environment. The agent and the environment interact in a sequence of time steps. At each time step t, the agent receives the state of the environment and a scalar numerical reward for the previous action, and then selects an action.
Reinforcement Learning is a technique for solving Markov Decision Problems.
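* A tiny tabular Q-learning sketch makes the state / action / reward loop concrete. The five-state corridor below, with a reward only at the right end, is entirely hypothetical; Q-learning itself is one standard algorithm for such Markov decision problems.

import random

N_STATES, ACTIONS = 5, (-1, +1)          # corridor positions; move left or right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1        # learning rate, discount, exploration rate

for episode in range(200):
    s = 0
    while s != N_STATES - 1:             # goal state : right end of the corridor
        # Epsilon-greedy choice : mostly exploit, sometimes explore (trial and error).
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0   # delayed scalar reward
        # Update the value estimate toward reward plus discounted future value.
        Q[(s, a)] += alpha * (reward + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

print("Learned action per state :",
      {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)})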
+ Reinforcement learning uses a formal framework defining the interaction between
a learning agent and its environment in terms of states, actions, and rewards. This
framework is intended to be a simple way of representing essential features of
the artificial intelligence problem.
Difference between Supervised, Unsupervised and Reinforcement Learning

Supervised learning | Unsupervised learning | Reinforcement learning
Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. | For unsupervised learning, typically either the target variable is unknown or has only been recorded for too small a number of cases. | Reinforcement learning is learning what to do and how to map situations to actions. The learner is not told which actions to take.
Supervised learning deals with two main tasks : regression and classification. | Unsupervised learning deals with clustering and associative rule mining problems. | Reinforcement learning deals with exploitation or exploration, Markov decision processes, policy learning, deep learning and value learning.
The input data in supervised learning is labelled data. | Unsupervised learning uses unlabelled data. | The data is not predefined in reinforcement learning.
Learns by using labelled data. | Trained using unlabelled data without any guidance. | Works on interacting with the environment.
Maps the labelled inputs to the known outputs. | Understands patterns and discovers the output. | Follows the trial and error method.
1.4 Applications of Machine Learning
Examples of successful applications of machine learning :
1. Learning to recognize spoken words.
2. Learning to drive an autonomous vehicle.
3. Learning to classify new astronomical structures.
4. Learning to play world-class backgammon.
5. Spoken language understanding : within the context of a limited domain,
determine the meaning of something uttered by a speaker to the extent that it
can be classified into one of a fixed set of categories.
Face Recognition
* We perform the face recognition task effortlessly; every day we recognize our friends, relatives and family members. We also recognize them by looking at photographs.
In photographs, they appear in different poses, hair styles, background light, with makeup and without makeup.
* We do it subconsciously and cannot explain how we do it. Because we can't explain how we do it, we can't write an algorithm for it.
* A face has some structure. It is not a random collection of pixels; it is a symmetric structure. It contains predefined components like nose, mouth, eyes and ears. Every person's face is a pattern composed of a particular combination of these features. By analyzing sample face images of a person, a learning program captures the pattern specific to that person and uses it to recognize whether a new real face or new image belongs to this specific person or not.
* Machine learning algorithm creates an optimized model of the concept being
learned based on data or past experience.
Healthcare :
* With the advent of wearable sensors and devices that use data to assess the health of a patient in real time, ML is becoming a fast-growing trend in healthcare.
* Sensors in wearables provide real-time patient information, such as overall health condition, heartbeat, blood pressure and other vital parameters.
* Doctors and medical experts can use this information to analyse the health
condition of an individual, draw a pattern from the patient history and predict the
occurrence of any ailments in the future.
«The technology also empowers medical experts to analyze data to identify trends
that facilitate better diagnoses and treatment.
Financial services :
* Companies in the financial sector are able to identify key insights in financial data
as well as prevent any occurrences of financial fraud, with the help of machine
learning technology.
‘* The technology is also used to identify opportunities for investments and trade.
‘© Usage of cyber surveillance helps in identifying those individuals or institutions
which are prone to financial risk and take necessary actions in time to prevent
fraud.
1.5 Tools and Technology for Machine Learning
Python
‘© Python is a high-level scripting language which can be used for a wide variety of
text processing, system administration and internet-related tasks.
Python is a true object-oriented language and is available on a wide variety of
platforms.
* Python was developed in the early 1990s by Guido van Rossum, then at CWI in Amsterdam and currently at CNRI in Virginia. Python 3.0 was released in the year 2008.
Python statements do not need to end with a special character. Python relies on
modules, that is, self-contained programs which define a variety of functions and
data types.
* A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended. Within a module, the module's name (as a string) is available as the value of the global variable __name__.
* If a module is executed directly, however, the value of the global variable __name__ will be "__main__".
Modules can contain executable statements aside from definitions. These are
executed only the first time the module name is encountered in an import
statement as well as if the file is executed as a script.
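* A minimal sketch of this behaviour (the file name mymodule.py is hypothetical) :

# mymodule.py

def greet(name):
    return "Hello " + name

# Runs the first time the module is imported and also when executed as a script.
print("Loading mymodule, __name__ =", __name__)

if __name__ == "__main__":
    # Reached only when the file is run directly : python mymodule.py
    print(greet("World"))

Importing it with import mymodule prints the loading line with __name__ set to "mymodule"; running the file directly sets __name__ to "__main__" and also prints the greeting.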
Integrated Development Environment (IDE) is the basic interpreter and editor
environment that you can use along with Python. This typically includes an editor
for creating and modifying programs, a translator for executing programs and a
program debugger. A debugger provides a means of taking control of the
execution of a program to aid in finding program errors.
Python is most commonly translated by use of an interpreter. It provides the very
useful ability to execute in interactive mode. The window that provides this
interaction is referred to as the Python shell.
* Python supports two basic modes : Normal mode and interactive mode.
* Normal mode : The normal mode is the mode where the scripted and finished .py files are run in the Python interpreter. This mode is also called script mode.
Interactive mode is a command line shell which gives immediate feedback for each
statement, while running previously fed statements in active memory.
= Start the Python interactive interpreter by typing python with no arguments at
the command line.
= To access the Python shell, open the terminal of your operating system and
then type “python”. Press the enter key and the python shell will appear.
C:\Windows\system32> python
Python 3.6.0 [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
* >>> indicates that the Python shell is ready to execute and send your commands to the Python interpreter. The result is displayed on the Python shell as soon as the Python interpreter interprets the command.
* For example, to print the text "Hello World", we can type the following :
>>> print("Hello World")
Hello World
>>>
* In script mode, a file must be created and saved before executing the code to get results. In interactive mode, the result is returned immediately after pressing the enter key.
In script mode, you are provided with a direct way of editing your code. This is
not possible in interactive mode.
© A variable is a way of referring to a memory location used by a computer
program.
* A variable is a symbolic name for this physical location. This memory location
contains values, like numbers, text or more complicated types.
* A variable is a name that refers to a value. The equal (=) operator is used to assign a value to a variable.
© Python's data types include : Numbers, strings, lists, dictionaries, tuples and files.
© Python has no additional commands to declare a variable. As soon as the value is
assigned to it, the variable is declared.
* Rules for variables are as follows :
a. Special characters are not allowed.
b. Variables are case sensitive.
c. Variables can only contain alphanumeric characters and underscores.
d. Variable names must start with a character, not with a number.
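* A short sketch of assignment and the data types listed above; each variable is declared simply by assigning to it (the values are illustrative) :

count = 10                          # number
name = "Asha"                       # string
marks = [72, 85, 90]                # list
record = {"id": 1, "dept": "CS"}    # dictionary
point = (3, 4)                      # tuple

print(type(count), type(name), type(marks), type(record), type(point))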
Features of Python programming
1. Python is a high-level, interpreted, interactive and object-oriented scripting language.
2. It is simple and easy to learn.
3. It is portable.
4. Python is a free and open source programming language.
5. Python can perform complex tasks using a few lines of code.
6. Python can run equally on different platforms such as Windows, Linux, UNIX, Macintosh etc.
7. It provides a vast range of libraries for various fields such as machine learning, web development and also for scripting.
Advantages of Python
« Ease of programming.
* Minimizes the time to develop and maintain code.
‘* Modular and object-oriented.
Large community of users.
* A large standard and user-contributed library.
Disadvantages of Python
+ Interpreted and therefore slower than compiled languages.
* Decentralized with packages.
R Programming Language
R is a free software environment for statistical computing and graphics. It
compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
* R is often used for statistical computing and graphical presentation to analyse and visualize data.
* To use a function in a package, the package needs to be loaded in memory.
* The command for this is library( ), for example : library(affy).
R is case sensitive, so take care when typing in the commands. Multiple
commands can be written on the same line.
* Commands can have many arguments. These are always given inside the brackets. Numeric (1, 2, 3 ...) or logical (T/F) values and names of existing objects are given for the arguments without quotes, but string values, such as file names, are always put inside quotes.
+ For example : mas5(dat3, normalize = T, analysis = “absolute").
Vectors and matrices in R are two ways to work with a collection of objects.
Lists provide a third method. Unlike a vector or a matrix a list can hold different
kinds of objects. One entry in a list may be a number, while the next is a matrix,
while a third is a character string.
Statistical functions of R usually return the result in the form of lists. So we must
know how to unpack a list using the $ symbol.
MATLAB
MATLAB is a programming language developed by MathWorks. It started out as
a matrix programming language where linear algebra programming was simple. It
can be run both under interactive sessions and as a batch job.
MATLAB is a high-performance language for technical computing. It integrates
computation, visualization and programming environment.
MATLAB is an interactive system whose basic data element is an array that does
not require dimensioning.
The name MATLAB stands for matrix laboratory. MATLAB was originally written
to provide easy access to matrix software developed by the LINPACK and
EISPACK projects, which together represent the state-of-the-art in software for
matrix computation.
The MATLAB system consists of five main parts :
1. The MATLAB language. This is a high-level matrix/array language with control
flow statements, functions, data structures, input/output and object-oriented
programming features.
2. The MATLAB working environment. This is the set of tools and facilities that
you work with as the MATLAB user or programmer. It includes facilities for
managing the variables in your workspace and importing and exporting data.
3. Handle Graphics. This is the MATLAB graphics system. It includes high-level
commands for two-dimensional and three-dimensional data visualization,
image processing, animation and presentation graphics.
4. The MATLAB mathematical function library. This is a vast collection of
computational algorithms ranging from elementary functions like sum, sine,
cosine and complex arithmetic, to more sophisticated functions like matrix
inverse, matrix eigenvalues, Bessel functions and fast Fourier transforms.
5. The MATLAB Application Program Interface (API). This is a library that allows
you to write C and Fortran programs that interact with MATLAB.
1.6 Fill in the Blanks
Q.1 Machine learning is a sub-field of ______ which concerns with developing computational theories of learning and building learning machines.
Q.2 ______ learning in which the network is trained by providing it with input and matching output patterns.
Q.3 Both human as well as machine learning generate knowledge, one residing in the ______, the other residing in the ______.
Q.4 Humans acquire ______ through experience either directly or shared by others.
Q.5 Supervised learning and unsupervised learning are the types of ______.
Q.6 Python is a true ______ language and is available on a wide variety of platforms.
Q.7 MATLAB is a programming language developed by ______.
Q.8 Vectors and matrices in R are two ways to work with a collection of ______.
Q.9 Machine learning algorithms discover the relationships between the variables of a system from direct ______ of the system.
Q.10 Human learning is based on the past ______.
Q.11 A ______ learning algorithm analyses the training data and produces an inferred function, which is called a classifier or a regression function.
Q.12 Supervised learning deals with two main tasks : ______ and ______.
Q.13 Unsupervised learning uses ______ data.
Q.14 CART stands for ______.
Q.15 ______ can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation.
Q.16 ______ learning deals with agents that must sense and act upon their environment. It combines classical artificial intelligence and machine learning techniques.
Q.17 With reinforcement learning algorithms an agent can improve its performance by using the feedback it gets from the environment. This environmental feedback is called the ______.
Q.18 Supervised learning is also called ______ learning.
Q.19 Unsupervised learning is also called ______ learning.
Q.20 When we are trying to predict a categorical or nominal variable, the problem is known as a ______ problem.
Q.21 When we are trying to predict a real-valued variable, the problem falls under the category of ______.
1.7 Multiple Choice Questions
Q.1 A computer program is said to learn from ______ with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
a) training  b) experience
c) testing  d) algorithm
Q.2 Jarvis-Patrick clustering algorithm is a ______ clustering technique.
a) grid based  b) graph based
c) density based  d) all of these
Q.3 Which of the following is a hierarchical clustering method ?
a) Agglomerative clustering  b) Divisive clustering
c) PAM  d) A and B
Q.4 The k-means algorithm is sensitive to ______ because an object with an extremely large value may substantially distort the distribution of data.
a) outliers  b) text data
c) boosting  d) cluster
Q.5 The ______ hierarchical clustering method works by grouping data objects into a tree of clusters.
a) PAM  b) Density-based method
c) Hierarchical  d) Grid-based method
Q.6 In DIANA, all of the objects are used to form ______ initial cluster.
a) one  c) four
Q.7 If the clustering process is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called a ______.
a) dendrogram  b) nearest-neighbor clustering algorithm
c) minimal spanning tree algorithm  d) single-linkage algorithm
Q.8 Which of the following is NOT a type of cluster ?
a) Well-separated clusters  b) Prototype-based clusters
c) Contiguity-based clusters  d) DBSCAN clusters
Q.9 Shared nearest neighbors is a ______ clustering.
a) density-based  b) well-separated
c) contiguity based  d) graph based

Q.10 Unsupervised learning deals with ______ and ______ mining problems.
a) classification, regression  b) clustering, classification
c) clustering, associative rule  d) label, unlabelled data

Q.11 ______ learning deals with two main tasks regression and classification.
a) Reinforcement  b) Deep
c) Unsupervised  d) Supervised

Q.12 The individual tuples making up the training set are referred to as ______ and are selected from the database under analysis.
a) learning tuples  b) training tuples
c) samples  d) database

Q.13 Machine learning is inherently a ______ field.
a) Inter disciplinary  b) Multi disciplinary
c) Single  d) None

Q.14 ______ methods have been used to train computer-controlled vehicles to steer correctly when driving on a variety of road types.
a) Machine learning  b) Data mining
c) Neural networks  d) Robotics

Q.15 The individual tuples making up the training set are referred to as ______ and are selected from the database under analysis.
a) training tuples  b) learning tuples
c) samples  d) database

Q.16 Training a perceptron is based on ______.
a) supervised learning technique  b) unsupervised learning
c) reinforced learning  d) stochastic learning
Q.17 List the elements of reinforcement learning.
a) Policy  b) Reward function
c) Value function  d) All of these
Answer Keys for Fill in the Blanks
Q.1 artificial intelligence   Q.2 Supervised   Q.3 brain, machine
Q.4 knowledge   Q.5 machine learning   Q.6 object-oriented
Q.7 MathWorks   Q.8 objects   Q.9 samples
Q.10 experiences   Q.11 supervised   Q.12 Regression, Classification
Q.13 unlabelled   Q.14 Classification And Regression Tree   Q.15 Concept learning
Q.16 Reinforcement   Q.17 reward signal   Q.18 predictive
Q.19 descriptive   Q.20 classification   Q.21 regression
Answer Keys for Multiple Choice Questions
Preparing to Model
Syllabus
Machine Learning activities, Types of data in Machine Learning, Structures of data, Data quality
and remediation, Data Pre-Processing : Dimensionality reduction, Feature subset selection.
Contents
2.1 Machine Learning Activities
2.2 Types of Data in Machine Learning
2.3 Structures of Data
2.4 Data Quality and Remediation
2.5 Data Pre-Processing
2.6 Fill in the Blanks
2.7 Multiple Choice Questions
2.1 Machine Learning Activities
* Following are the typical preparation activities for a model :
a) Understand the types of input data
b) Find potential issues in the data
c) Identify the nature and quality of data
d) Find out the relationships between data
e) Apply pre-processing
* Input data is divided into two parts : Training data and testing data.
* Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice in machine learning, when evaluating an algorithm, is to split the data at hand into two sets : a training set, on which we learn data properties, and a testing set, on which we test these properties.
* In training data, labels are assigned to the data. In test data, the labels are unknown and not given. The training data consist of a set of training examples.
The real aim of supervised learning is to do well on test data that is not known
during learning. Choosing the values for the parameters that minimize the loss
function on the training data is not necessarily the best policy.
* The training error is the mean error over the training sample. The test error is the expected prediction error over an independent test sample.
* The problem is that training error is not a good estimator for test error. Training error can be reduced by making the hypothesis more sensitive to training data, but this may lead to overfitting and poor generalization.
* Training set : A set of examples used for learning, where the target value is known.
Test set : It is used only to assess the performances of a classifier. It is never used
during the training process so that the error on the test set provides an unbiased
estimate of the generalization error.
Training data is the knowledge about the data source which we use to construct
the classifier.
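* A small sketch of the split, and of why training error is not a good estimator of test error (scikit-learn with its Iris data; the deliberately unrestricted tree is an assumption chosen to show overfitting) :

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# A very flexible hypothesis can drive the training error close to zero...
model = DecisionTreeClassifier().fit(X_train, y_train)
print("Training accuracy :", model.score(X_train, y_train))

# ...but only the held-out test set gives an unbiased estimate of generalization.
print("Test accuracy     :", model.score(X_test, y_test))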
* Fig. 2.1.1 shows the four step process of machine learning.
Step 1 - Preparing to model : Understand the types of input data; find potential issues in the data; identify the nature and quality of data; find out the relationships between data; apply pre-processing.
Step 2 - Learning : Data partitioning; model selection; cross-validation.
Step 3 - Performance evaluation : Examine model performance; visualize performance.
Step 4 - Performance improvement : Tuning the model; ensembling.
Fig. 2.1.1 Four step process of machine learning
2.2 Types of Data in Machine Learning
* A data set is a collection of related records or information. The information may be on some entity or some subject area.
* It is a collection of data objects and their attributes. Attributes capture the basic characteristics of an object.
© Each row of a data set is called a record. Each data set also has multiple
attributes, each of which gives information on a specific characteristic.
* Following is an example of a data set :
Emp-ID | Name | Department | Age
* For example, in the data set on Emp, there are four attributes namely Emp-ID,
Name, Department and Age, each of which understandably is a specific
characteristic about the employee entity.
* Attributes can also be termed as features, variables, dimensions or fields. A row or record represents a point in the four-dimensional data space, as each row has specific values for each of the four attributes or features.
Qualitative and Quantitative Data
* Data can broadly be divided into the following two types :
1. Qualitative data
2. Quantitative data
Data
├─ Qualitative / Categorical : Nominal, Ordinal
└─ Quantitative / Numeric : Interval, Ratio
Fig. 2.2.1 Types of data
Qualitative data :
* Qualitative data provides information about the quality of an object or information which cannot be measured. Qualitative data cannot be expressed as a number. Data that represent nominal scales, such as gender, economic status or religious preference, are usually considered to be qualitative data.
* Qualitative data is data concerned with descriptions, which can be observed but cannot be computed. Qualitative data is also called categorical data. Qualitative data can be further subdivided into two types as follows :
1. Nominal data
2. Ordinal data
Nominal data
* Nominal data is the first level of measurement scale, in which the numbers serve as "tags" or "labels" to classify or identify the objects.
* Nominal data usually deals with non-numeric variables or with numbers that do not have any value. While developing statistical models, nominal data are usually transformed before building the model.
* It is also known as a categorical variable.
Characteristics of nominal data :
1. A nominal data variable is classified into two or more categories. In this measurement mechanism, the answer should fall into either of the classes.
2. It is qualitative. The numbers are used here to identify the objects.
3. The numbers don't define the object characteristics. The only permissible aspect of numbers in the nominal scale is "counting".
* Examples :
1. Gender : Male, Female, Other.
2. Hair color : Brown, Black, Blonde, Red, Other.
Ordinal data
* Ordinal data is a variable in which the value of the data is captured from an
ordered set, which is recorded in the order of magnitude.
* Ordinal represents the “order.” Ordinal data is known as qualitative data or
categorical data. It can be grouped, named and also ranked.
© Characteristics of the ordinal data :
a) The ordinal data shows the relative ranking of the variables.
b) It identifies and describes the magnitude of a variable.
¢) Along with the information provided by the nominal scale, ordinal scales give
the rankings of those variables.
d) The interval properties are not known.
e) The surveyors can quickly analyze the degree of agreement concerning the identified order of variables.
* Examples :
a) University ranking : 1st, 9th, 87th, ...
b) Socioeconomic status : Poor, middle class, rich.
c) Level of agreement : Yes, maybe, no.
d) Time of day : Dawn, morning, noon, afternoon, evening, night.
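* Because nominal values carry no order while ordinal values do, the two are usually transformed differently before modelling. A sketch with pandas (the column values are hypothetical) :

import pandas as pd

df = pd.DataFrame({
    "hair_color": ["brown", "black", "blonde"],   # nominal : no order
    "agreement": ["no", "maybe", "yes"],          # ordinal : ordered categories
})

# Nominal : one-hot encode, since the categories cannot be ranked.
print(pd.get_dummies(df["hair_color"]))

# Ordinal : map to ranks that preserve the order.
print(df["agreement"].map({"no": 0, "maybe": 1, "yes": 2}))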
Quantitative data
© Quantitative data is the one that focuses on numbers and mathematical
calculations and can be calculated and computed.
* Quantitative data are anything that can be expressed as a number, or quantified. Examples of quantitative data are scores on achievement tests, number of hours of study, or weight of a subject. These data may be represented by ordinal, interval or ratio scales and lend themselves to most statistical manipulation.
* There are two types of quantitative data : Interval data and ratio data.
Interval data :
• Interval data corresponds to a variable in which the value is chosen from an interval set.
• It is defined as a quantitative measurement scale in which the difference between two variables is meaningful. In other words, the variables are measured in an exact manner, not in a relative way; the presence of zero is arbitrary.
• Characteristics of interval data :
a) The interval data is quantitative as it can quantify the difference between the values.
b) It allows calculating the mean and median of the variables.
c) To understand the difference between the variables, you can subtract the values of the variables.
d) The interval scale is the preferred scale in statistics as it helps to assign numerical values to arbitrary assessments such as feelings, calendar types, etc.
• Examples :
1. Celsius temperature.
2. Fahrenheit temperature.
3. Time on a clock with hands.
Ratio data :
• Any variable for which the ratios can be computed and are meaningful is called ratio data.
• It is a type of variable measurement scale. It allows researchers to compare the differences or intervals. The ratio scale has a unique feature : it possesses the character of the origin or zero point.
• Characteristics of ratio data :
a) The ratio scale has the feature of an absolute zero.
b) It doesn't have negative numbers, because of its zero-point feature.
c) It affords unique opportunities for statistical analysis. The variables can be added, subtracted, multiplied and divided. Mean, median and mode can be calculated using the ratio scale.
d) Ratio data has unique and useful properties. One such feature is that it allows unit conversions, like kilogram - calories, gram - calories, etc.
• Examples : Age, weight, height, ruler measurements, number of children.
Difference between Qualitative and Quantitative Data

Qualitative data :
• Qualitative data provides information about the quality of an object or information which cannot be measured.
• Types : Nominal data and Ordinal data.
• Narratives often make use of adjectives and other descriptive words to refer to data on appearance, color, texture, and other qualities.
• They are descriptive rather than numerical in nature.
• For example : The team is well prepared. The leaf feels waxy. The river is peaceful.

Quantitative data :
• Quantitative data relates to information about the quantity of an object; hence it can be measured.
• Types : Interval data and Ratio data.
• Measures quantities such as length, size, amount, price, and even duration.
• Expressed in numerical form.
• For example : The team has 7 players. The leaf weighs 2 ounces. The river is 25 miles long.
Structures of Data
• A data dictionary is a centralized repository of metadata. Metadata is data about data.
• A data dictionary is a repository of names, definitions and attributes that provides contextual information about data. A data dictionary traditionally refers to a database dictionary, metadata repository or business glossary. It primarily focuses on the meaning or definition of all columns in a data table.
• In case the data dictionary is not available, we need to use the standard library functions of the machine learning tool that we are using to get these details, as sketched below.
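• For example, with pandas (assuming the data has been loaded into a DataFrame; the file name below is hypothetical), the standard library functions info() and describe() report the column names, data types and summary statistics that a data dictionary would otherwise document :

import pandas as pd

df = pd.read_csv("employees.csv")      # hypothetical data file

df.info()                              # column names, data types, non-null counts
print(df.describe(include="all"))      # per-column summary statistics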
Exploring Numerical Data
• The two most effective plots to explore numerical data are the box plot and the histogram.
1) Understanding central tendency :
• Central tendency is a descriptive summary of a dataset through a single value that reflects the center of the data distribution.
• To understand the nature of numeric variables, we can apply the measures of central tendency of data, i.e. mean and median.
Mean :
• Let $x_1, x_2, x_3, \ldots, x_n$ be the set of $n$ values of the variate; then the arithmetic mean (or mean) is given as,
$$m = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
Median :
• Let the values of the variable be arranged in ascending order of magnitude. Then the median is the middle item if the number of values is odd, and the mean of the two middle terms if the number of values is even.
• Median is the mid-value that divides the total frequency into two equal parts.
• Example : Below is a data set of pizza prices in two cities. Find the mean and median for both cities.
City A (New Delhi) : 1, 2, 3, 3, 4, 5, 6, 7, 9, 11, 66
City B (Lucknow) : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Solution :

Mean of New Delhi pizza price $= \frac{1+2+3+3+4+5+6+7+9+11+66}{11} = \frac{117}{11} \approx 10.64$

Mean of Lucknow pizza price $= \frac{1+2+3+4+5+6+7+8+9+10}{10} = \frac{55}{10} = 5.5$

Median of New Delhi pizza price : $N = 11$ is odd, so the median is the $\left(\frac{N+1}{2}\right)^{\text{th}} = \left(\frac{11+1}{2}\right)^{\text{th}} = 6^{\text{th}}$ observation.
Here the 6th observation is 5, so Median = 5.

Median of Lucknow pizza price : $N = 10$ is even, so the median is the mean of the 5th and 6th observations $= \frac{5+6}{2} = 5.5$.
Median = 5.5
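• These hand computations can be verified with Python's built-in statistics module; a quick sketch using the two price lists above :

import statistics

delhi   = [1, 2, 3, 3, 4, 5, 6, 7, 9, 11, 66]
lucknow = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(statistics.mean(delhi), statistics.median(delhi))      # approx. 10.64 and 5
print(statistics.mean(lucknow), statistics.median(lucknow))  # 5.5 and 5.5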
• The mean has one main disadvantage : it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set, being especially small or large in numerical value. For example, consider the wages of staff at a factory below :
Staff : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Salary : 15K, 18K, 16K, 14K, 15K, 15K, 12K, 17K, 90K, 95K
• The mean salary for these ten staff is $30.7K. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the $12K to $18K range.
• The mean is being skewed by the two large salaries. Therefore, in this situation, we would like to have a better measure of central tendency. As we will find out later, taking the median would be a better measure of central tendency in this situation.
2) Understanding data spread :
• Definition : It is the scatteredness or spread of data about an average value.
• It gives an idea about how individual values differ from the central value, i.e. whether they are closely packed around the central value or widely scattered away from it.

Fig. 2.3.1 Measures of dispersion : Variance, Range, Standard deviation, Coefficient of variation
• The magnitude of the variation is called dispersion.
• Fig. 2.3.1 shows the measures of dispersion.
Variance :
• The second central moment is called variance. It is given as,
$$\sigma^2 = \operatorname{Var}(X) = E[(X-m)^2] = \int (x-m)^2 f(x)\,dx = E[X^2] - m^2$$
• Variance can also be given as $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - m)^2$, where $N$ is the number of values. If $A$ is an assumed mean and $d_i = x_i - A$ the deviations from it, then the mean can be computed as $m = A + \frac{\sum d_i}{N}$.
Standard deviation :
• Standard deviation is the measure of spread of the values of $X$ relative to the mean value.
• Standard deviation of a data set is measured as : $\sigma = \sqrt{\text{Variance}} = \sqrt{\sigma^2}$.
• A larger value of variance or standard deviation indicates more dispersion in the data, and vice versa.
GERBER) Consiter the data values of two attributes.
Abtrabate 1 values : 4, £6, 48, 45, $7 Calculate variance
Solution =2
Machine Leaming 2-94
_ 442 4467 +48? 4.452 4472
3
_ 193642116+ 2304 + 2025+ 2209 (
5
= ao? o25
Preparing to Modal
H+ 464 48445447)
5
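• The same result can be confirmed with the statistics module, which provides the population variance and standard deviation directly :

import statistics

values = [44, 46, 48, 45, 47]
print(statistics.pvariance(values))   # 2, matching the hand computation above
print(statistics.pstdev(values))      # about 1.414, the square root of the variance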
Difference between standard deviation and variance

Standard deviation :
• Standard deviation is a measure of dispersion of the values of a data set from their mean.
• It is a common term in statistical theory to calculate central tendency.
• It measures the absolute variability of the dispersion.
• It is calculated by taking the square root of the variance.
• The standard deviation is symbolized by the lower-case Greek letter sigma, "$\sigma$" : $\sigma = \sqrt{\sum (x - M)^2 / n}$, where $M$ = mean, $x$ = a value in the data set and $n$ = number of values.
• Used in the finance sector as a measure of market volatility.

Variance :
• It is the statistical measure of how far the numbers are spread in a data set from their average.
• Variance is primarily used in statistical probability distributions to measure volatility from the mean.
• It helps determine the size of the data spread.
• It is calculated by taking the average of the squared deviation of each value in the data set from the mean.
• The notation for the variance of a variable is sigma squared : $\sigma^2 = \sum (x - M)^2 / n$, where $M$ = mean, $x$ = each value in the data set and $n$ = number of values in the data set.
• Used in asset allocation.
Plotting and Exploring Numerical Data
1. Box plots
• The box plot is a useful graphical display for describing the behaviour of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles. If the lower quartile is $Q_1$ and the upper quartile is $Q_3$, then the difference $(Q_3 - Q_1)$ is called the interquartile range or IQR.
• The box plot is also called a whisker plot. It shows data using the middle value of the data and the quartiles, or 25 % divisions of the data.
• The box plot shows the five-number summary of a set of data : minimum, lower quartile, median, upper quartile and maximum.

Fig. 2.3.2 Box plot : whiskers, lower quartile ($Q_1$), median, upper quartile ($Q_3$) and the interquartile range
Example : Construct a box plot for the following data :
12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25

Solution :
Step 1 : Arrange the data in ascending order :
5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53
Step 2 : Find the median, lower quartile and upper quartile.
Median (middle value) = 22
Lower quartile (middle value of the lower half) = 12
Upper quartile (middle value of the upper half) = 36
Step 3 : Draw a number line that will include the smallest and the largest data values.
Step 4 : Draw three vertical lines at the lower quartile (12), the median (22) and the upper quartile (36), just above the number line.
Step 5 : Join the lines for the lower quartile and the upper quartile to form a box.
Step 6 : Draw a line from the smallest value (5) to the left side of the box, and a line from the right side of the box to the biggest value (53).
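• The same plot can be produced programmatically; a minimal sketch with matplotlib (note that a library's quartile convention may differ slightly from the middle-of-each-half method used above) :

import matplotlib.pyplot as plt

data = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25]

fig, ax = plt.subplots()
ax.boxplot(data, vert=False)   # box-and-whisker plot of the same data
ax.set_xlabel("Value")
plt.show()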
Histogram :
• In a histogram, the data are grouped into ranges (e.g. 10 - 19, 20 - 29) and then plotted as connected bars. Each bar represents a range of data.
• The width of each bar is proportional to the width of each category, and the height is proportional to the frequency or percentage of that category.
• Fig. 2.3.3 shows the distributions a histogram can reveal.

Fig. 2.3.3 (a) Normal distribution, (b) Bimodal distribution, (c) Right-skewed distribution, (d) Left-skewed distribution, (e) Random distribution
1. A normal distribution : In a normal distribution, points on one side of the average are as likely to occur as on the other side of the average.
2. A bimodal distribution : In a bimodal distribution, there are two peaks. In a bimodal distribution, the data should be separated and analyzed as separate normal distributions.
3. A right-skewed distribution : A right-skewed distribution is also called a positively skewed distribution. In a right-skewed distribution, a large number of data values occur on the left side with a fewer number of data values on the right side. A right-skewed distribution usually occurs when the data has a range boundary on the left-hand side of the histogram. For example, a boundary of 0.
4. A left-skewed distribution : A left-skewed distribution is also called a negatively skewed distribution. In a left-skewed distribution, a large number of data values occur on the right side with a fewer number of data values on the left side. A left-skewed distribution usually occurs when the data has a range boundary on the right-hand side of the histogram. For example, a boundary such as 100.
5. A random distribution : A random distribution lacks an apparent pattern and has several peaks. In a random distribution histogram, it can be the case that different data properties were combined. Therefore, the data should be separated and analyzed separately.
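• A histogram is equally easy to draw; a minimal sketch with matplotlib (the data below is synthetic, generated only to produce a roughly normal shape) :

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)   # synthetic, roughly normal data

plt.hist(values, bins=range(0, 101, 10))   # group data into ranges (0-9, 10-19, ...)
plt.xlabel("Value range")
plt.ylabel("Frequency")
plt.show()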
Exploring Relationship between Variables
Scatter plot :
• It displays a collection of points for a set of data, limited to two variables. It is also called a scatter diagram or X-Y graph.
• While working with statistical data, it is often observed that there are connections between sets of data. For example, the mass and height of persons are related : the taller the person, the greater his/her mass.
• To find out whether or not two sets of data are connected, scatter diagrams can be used. Fig. 2.3.4 shows a scatter diagram.
Fig. 2.3.4 Scatter diagram (children's age vs. height)
• The scatter diagram shows the relationship between children's age and height. A scatter diagram is a tool for analyzing the relationship between two variables. One variable is plotted on the horizontal axis and the other is plotted on the vertical axis.
• The pattern of their intersecting points can graphically show relationship patterns. Commonly, a scatter diagram is used to prove or disprove cause-and-effect relationships.
• While a scatter diagram shows relationships, it does not by itself prove that one variable causes the other. In addition to showing possible cause-and-effect relationships, a scatter diagram can show that two variables are driven by a common cause that is unknown, or that one variable can be used as a surrogate for the other.
Two-way cross-tabulations
• A two-way cross-tabulation is also called a cross-tab or contingency table. It is used to understand the relationship of two categorical attributes in a concise way, as sketched below.
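• With pandas, a contingency table is a one-liner; a hypothetical sketch with two invented categorical columns :

import pandas as pd

df = pd.DataFrame({
    "gender":   ["Male", "Female", "Female", "Male", "Female"],
    "response": ["Yes", "No", "Yes", "Yes", "No"],
})

# Counts of every (gender, response) combination in one concise table.
print(pd.crosstab(df["gender"], df["response"]))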
Data Quality and Remediation
• Data remediation is the process of cleansing, organizing and migrating data so that it is properly protected and best serves its intended purpose.

Data Quality
• Data which has the right quality helps to achieve better prediction accuracy in the case of supervised learning. Data quality problems include :
1. Certain data elements without a value, or data with a missing value.
2. Data elements having values surprisingly different from the other elements, which we term outliers.
• There are multiple factors which lead to these data quality issues :
a) Incorrect sample set selection
b) Errors in data collection
• Measuring data quality levels can help organizations identify data errors that need to be resolved and assess whether the data in their IT systems is fit to serve its intended purpose.
Data Remediation
• Outliers are data elements with an abnormally high value which may impact prediction accuracy, especially in regression models.
• An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
• First quartile ($Q_1$) : The first quartile is the value where 25 % of the values are smaller than $Q_1$ and 75 % are larger.
• Third quartile ($Q_3$) : The third quartile is the value where 75 % of the values are smaller than $Q_3$ and 25 % are larger.
• Outlier detection is the process of detecting and subsequently excluding outliers from a given set of data.
• Fig. 2.4.1 shows outlier detection. Here $O_1$ and $O_2$ appear to be outliers, separated from the rest.
Fig. 2.4.1 Outlier detection
• An outlier may be defined as a piece of data or observation that deviates drastically from the given norm or average of the data set. An outlier may be caused simply by chance, but it may also indicate measurement error or that the given data set has a heavy-tailed distribution.
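• A widely used rule of thumb (not from the text itself) flags values lying more than 1.5 × IQR below $Q_1$ or above $Q_3$; a minimal sketch reusing the factory wages from the earlier example :

import statistics

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]   # in thousands

q1, _, q3 = statistics.quantiles(salaries, n=4)   # quartile cut points
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([s for s in salaries if s < low or s > high])   # [90, 95] : the two outliers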
Handling missing values
• In a data set, one or more data elements may have missing values in multiple records.
• Such dirty data affects the mining procedure and leads to unreliable and poor output. Therefore, some data cleaning routines are important.
How to handle noisy data in data mining ?
• The following methods are used for handling noisy data (methods 3 - 5 are sketched in code below) :
1. Ignore the tuple : Usually done when the class label is missing. This method is not good unless the tuple contains several attributes with missing values.
2. Fill in the missing value manually : It is time-consuming and not suitable for a large data set with many missing values.
3. Use a global constant to fill in the missing value : Replace all missing attribute values by the same constant.
4. Use the attribute mean to fill in the missing value : For example, suppose that the average salary of staff is Rs. 65,000/-. Use this value to replace the missing value for salary.
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value.
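• Methods 3 - 5 map directly onto pandas; a short, hypothetical sketch (the column names are invented for illustration) :

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dept":   ["A", "A", "B", "B"],
    "salary": [60000, np.nan, 70000, np.nan],
})

# Method 3 : fill with a global constant.
df["salary_const"] = df["salary"].fillna(0)

# Method 4 : fill with the attribute mean.
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# Method 5 : fill with the attribute mean of the same class (here, department).
df["salary_by_dept"] = df["salary"].fillna(
    df.groupby("dept")["salary"].transform("mean"))

print(df)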
Data Pre-Processing
• Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. The aim is to reduce the data size, find the relations between data and normalize them. Data pre-processing is a proven method of resolving such issues and prepares raw data for further processing.
• Data captured from various sources is not pure. It contains some noise and is called dirty or incomplete data. In such data there are lacking attribute values, lacking certain attributes of interest, or only aggregate data. For example : occupation = " " (a blank value).
• Noisy data contains errors or outliers. For example : Salary = "-10".
• Inconsistent data contains discrepancies in codes or names. For example : Age = "51", Birthday = "03/08/1998".
• Incomplete, noisy and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons.
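• A few lines of pandas can surface exactly these problems; a hypothetical sketch mirroring the examples above :

import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", " ", "teacher"],
    "salary":     [55000, -10, 48000],
})

# Incomplete data : treat blank strings as missing values.
df["occupation"] = df["occupation"].replace(" ", pd.NA)

# Noisy data : flag impossible values such as negative salaries for review.
print(df[df["salary"] < 0])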
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables.
• Most machine learning and data mining techniques may not be effective for high-dimensional data. Query accuracy and efficiency degrade rapidly as the dimension increases.
• The "dimensionality" simply refers to the number of features (i.e. input variables) in your dataset.
• When the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models. This is called the "curse of dimensionality", and it is especially relevant for clustering algorithms that rely on distance calculations.
• Dimensionality reduction can be divided into feature selection and feature extraction.
• It reduces the time and storage space required. Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model.
• There are many methods to perform dimension reduction :
1. Missing values : While exploring data, if we encounter missing values, what do we do ? Our first step should be to identify the reason, then impute the missing values or drop the variables using appropriate methods. But what if we have too many missing values ? Should we impute them or drop the variables ?
2. Low variance : Think of a scenario where we have a constant variable in our data set; it carries no information and can be dropped.
3. Decision trees : They can be used as an ultimate solution to tackle multiple challenges like missing values, outliers and identifying significant variables.
4. Random forest : Similar to the decision tree is the random forest.
5. High correlation : Dimensions exhibiting higher correlation can lower the performance of the model. Moreover, it is not good to have multiple variables carrying similar information or variation, also known as "multicollinearity".
Advantages of dimensionality reduction
• It helps in data compression, and hence reduced storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.

Disadvantages of dimensionality reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not sufficient to define the dataset.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.
Principal Component Analysis
* If the original data can be reconstructed from the compressed data without any
loss of information, the data reduction is called lossless. If, instead, we can
reconstruct only an approximation of the original data, then the data reduction is
called lossy.
* Lossy dimensionality reduction methods are Principal Components Analysis (PCA)
and wavelet transforms.
• The aim of Principal Component Analysis (PCA) is to reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables, that retains most of the sample's information and is useful for the compression and classification of data.
* In PCA, it is assumed that the information is carried in the variance of the
features, that is, the higher the variation in a feature, the more information that
feature carries.
• Hence, PCA employs a linear transformation that is based on preserving the most variance in the data using the least number of dimensions.
• The most common approach for dimensionality reduction is known as Principal Component Analysis (PCA). PCA is a statistical technique to convert a set of correlated variables into a set of transformed, uncorrelated variables called principal components. The principal components are a linear combination of the original variables.
• A Discrete Wavelet Transform (DWT) is a transform that decomposes a given signal into a number of sets, where each set is a time series of coefficients describing the time evolution of the signal in the corresponding frequency band.
• Another commonly used technique for dimensionality reduction is Singular Value Decomposition (SVD).
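• A minimal PCA sketch with scikit-learn, assuming the data is already numeric (the input here is synthetic, and the features are standardized first, since PCA is driven by variance) :

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))              # synthetic data : 100 samples, 10 features

X_std = StandardScaler().fit_transform(X)   # put all features on a comparable scale
pca = PCA(n_components=2)                   # keep the two highest-variance directions
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)        # fraction of variance each component keeps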
Feature Subset Selection
* A good feature representation is central to achieving high performance in any
machine learning task.
• Consider an example of text categorization. Assume that we need to train a model for classifying a given document as spam or not spam. If we represent a document as a bag of words, the feature space consists of a vocabulary of all unique words present in all the documents in the training set.
* For a collection of 100,000 to 1,000,000 documents, we can easily expect hundreds
of thousands of features. If we further extend this document model to include all
possible bigrams and trigrams, we could easily get over a million features.
• A feature tree is a tree such that each internal node is labelled with a feature, and each edge emanating from an internal node is labelled with a literal. The set of literals at a node is called a split.
• Each leaf of the tree represents a logical expression, which is the conjunction of literals encountered on the path from the root of the tree to the leaf. The extension of that conjunction is called the instance space segment associated with the leaf.
• Two features are redundant if they are highly correlated, regardless of whether they are correlated with the task or not.
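• Such redundancy can be detected directly from the correlation matrix; a minimal sketch with pandas (the data is synthetic and the 0.95 threshold is an arbitrary illustrative choice) :

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "f1": a,
    "f2": 2 * a + rng.normal(scale=0.01, size=100),   # nearly a copy of f1
    "f3": rng.normal(size=100),
})

corr = df.corr().abs()
for i in corr.columns:
    for j in corr.columns:
        if i < j and corr.loc[i, j] > 0.95:
            print(f"redundant pair : {i}, {j} (|r| = {corr.loc[i, j]:.3f})")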
Feature construction and transformation
• Feature construction involves transforming a given set of input features to generate a new set of more powerful features, which can then be used for prediction.
• Feature construction methods may be applied to pursue two distinct goals : reducing data dimensionality and improving prediction performance.
• Steps :
1. Start with an initial feature space $F_0$.
2. Transform $F_0$ to construct a new feature space $F_N$.
3. Select a subset of features $F_r$ from $F_N$.
4. If some terminating criterion is achieved, stop; otherwise set $F_0 = F_r$ and go back to step 2.
5. $F_r$ is the newly constructed feature space.
• The initial feature space $F_0$ consists of manually constructed features that often encode some basic domain knowledge.
• The task of constructing appropriate features is often highly application-specific and labour-intensive. Thus, building automated feature construction methods that require minimal user effort is challenging. In particular, we want methods that :
1. Generate a set of features that help improve prediction accuracy.
2. Are computationally efficient.
3. Are generalizable to different classifiers.
4. Allow for easy addition of domain knowledge.
• Genetic programming is an evolutionary algorithm-based technique that starts with a population of individuals, evaluates them based on some fitness function, and constructs a new population by applying a set of mutation and crossover operators on high-scoring individuals and eliminating the low-scoring ones.
In the feature construction paradigm, genetic programming is used to derive a
new feature set from the original one.
• Individuals are often tree-like representations of features; the fitness function is usually based on the prediction performance of the classifier trained on these features, while the operators can be application-specific.
The method essentially performs a search in the new feature space and helps
generate a high performing subset of features. The newly generated features may
often be more comprehensible and intuitive than the original feature set, which
makes GP-related methods well-suited for such tasks.
• In decision trees, the model explicitly selects features that are highly correlated with the label. In particular, by limiting the depth of the decision tree, one can at least hope that the model will be able to throw away irrelevant features.
• In the case of K-nearest neighbours, the situation is perhaps worse. Since KNN weighs each feature just as much as every other feature, the introduction of irrelevant features can completely mess up KNN prediction.
Feature extraction is a process that extracts a set of new features from the original
features through some functional mapping.
• Transformation studies ways of mapping original attributes to new features. Different mappings can be employed to extract features.
• In general, the mappings can be categorized into linear or nonlinear transformations. One could categorize transformations along two dimensions : linear and labeled, linear and non-labeled, nonlinear and labeled, nonlinear and non-labeled.
Feature selection
• Feature selection is a process that chooses a subset of features from the original features so that the feature space is optimally reduced according to a certain criterion.
• Feature selection is a critical step in the feature construction process. In text categorization problems, some words simply do not appear very often.
• Perhaps the word "groovy" appears in exactly one training document, which is positive. Is it really worth keeping this word around as a feature ? It is a dangerous endeavour, because it is hard to tell with just one training example whether it is really correlated with the positive class or is just noise.
• You could hope that your learning algorithm is smart enough to figure it out, or you could just remove it.
• There are three general classes of feature selection algorithms : filter methods, wrapper methods and embedded methods (a filter-style example is sketched below).
• The role of feature selection is as follows :
1. To reduce the dimensionality of the feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
• Feature selection algorithms are as follows :
1. Instance-based approaches : There is no explicit procedure for feature subset generation. Many small data samples are sampled from the data. Features are weighted according to their roles in differentiating instances of different classes for a data sample. Features with higher weights can be selected.
2. Nondeterministic approaches : Genetic algorithms and simulated annealing are also used in feature selection.
3. Exhaustive (complete) approaches : Branch and bound evaluates estimated accuracy, and ABB checks an inconsistency measure that is monotonic. Both start with a full feature set until the pre-set bound cannot be maintained.
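• As a concrete instance of a filter-style method, scikit-learn's VarianceThreshold drops features whose variance falls below a threshold, which also handles the constant-variable case noted earlier; a minimal sketch :

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 0.1],
              [0.0, 1.0, 0.2],
              [0.0, 3.0, 0.1],
              [0.0, 2.5, 0.3]])              # the first feature is constant (zero variance)

selector = VarianceThreshold(threshold=0.0)  # drop features with zero variance
X_new = selector.fit_transform(X)
print(X_new.shape)                           # (4, 2) : the constant feature is gone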
Fill in the Blanks

Q.1 ______ set is a collection of related records or information.
Q.2 Each row of a data set is called a ______.
Q.3 Qualitative data is also called ______ data.
Q.4 ______ data provides information about the quality of an object or information which cannot be measured.
Q.5 Dimensionality reduction helps in reducing irrelevance and ______ in features.
Q.6 Dimensionality reduction refers to the techniques of reducing the dimensionality of a data set by creating new attributes by combining the original ______.
Q.7 Lossy dimensionality reduction methods are ______ and wavelet transforms.
Q.8 An ______ is an observation that lies an abnormal distance from other values in a random sample from a population.
Q.9 Exploration of numerical data can be best done using ______ and ______.
Q.10 Data can be broadly divided into ______ data and ______ data.
Multiple Choice Questions

Q.1 Data can be broadly divided into ______.
a) qualitative data
b) quantitative data
c) qualitative and quantitative data
d) ratio data

Q.2 Feature selection tries to eliminate features which are ______.
a) rich
b) irrelevant
c) redundant
d) relevant

Q.3 Principal component analysis is used for ______.
a) dimensionality enhancement
b) LU decomposition
c) QR decomposition
d) dimensionality reduction

Q.4 Which of the following methods is used to perform dimension reduction ?
a) Missing values
b) Decision tree
c) Random forest
d) All of these
Answer Keys for Fill in the Blanks

Q.1 Data    Q.2 record    Q.3 categorical    Q.4 Qualitative    Q.5 redundancy
Q.6 attributes    Q.7 PCA    Q.8 outlier    Q.9 box plots, histograms    Q.10 qualitative, quantitative