
1

Wollo University, Kombolcha Institute of Technology

Department of Software Engineering

Fundamentals of Machine Learning

By Ashenafi Workie (MSc.)
Major chapter outlines

Chapter 1: Introduction to Machine Learning
Chapter 2: Classification based Supervised Learning
Chapter 3: Regression based Supervised Learning
Chapter 4: Unsupervised Learning
Chapter 5: Reinforcement Learning
Chapter 6: Advanced Machine Learning

3
Classification

▪ Classification is a task that requires a machine learning algorithm to learn
how to assign a class label to examples from the problem domain.

4
Classification

▪ For example, if we are given the task of classifying a given bunch of fruits
and vegetables by category, the fruits are grouped together and the
vegetables are grouped together.

6
Real world examples

7
Real world examples

8
Prediction Error (overfitting)

Overfitting: Good performance on the training data, but poor
generalization to other data. A model with high variance and
little bias will overfit the target.

9
Prediction Error (underfitting)
Underfitting: Poor performance on the training data and poor
generalization to other data. A model that exhibits small
variance and high bias will underfit the target.


11
Measure the performance of a model

12
Measure the performance of a model

13
Confusion matrix

14
Confusion matrix

15
Confusion matrix
▪ A true positive is an outcome where the model correctly predicts the
positive class.
▪ A true negative is an outcome where the model correctly predicts the
negative class.
▪ A false positive is an outcome where the model incorrectly predicts the
positive class.
▪ A false negative is an outcome where the model incorrectly predicts the
negative class.

16
Confusion matrix


17
1. Accuracy

Accuracy is one metric for evaluating classification models. Informally,


accuracy is the fraction of predictions our model got right. Formally, accuracy
has the following definition:

For binary classification, accuracy can also be calculated in terms of positives and
negatives as follows:
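(The slide's formula images are not reproduced here; the standard definitions, stated for completeness, are:)

Accuracy = Number of correct predictions / Total number of predictions

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives.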

18
Accuracy
Let's try calculating accuracy for the following model that classified 100 tumors as malignant
(the positive class) or benign (the negative class):
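The counts from the slide's figure are not reproduced here. As a minimal sketch with hypothetical counts (illustrative values only, not the slide's actual figure), accuracy can be computed directly from the four confusion-matrix cells in Python:

# Hypothetical counts for 100 tumor predictions (illustrative only).
tp, fp = 1, 1    # malignant predicted as malignant / benign predicted as malignant
tn, fn = 90, 8   # benign predicted as benign / malignant predicted as benign

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.91 -> 91 of the 100 predictions were correct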

19
2. Precision
Precision attempts to answer the following question:
What proportion of positive identifications was actually correct?
Precision is defined as follows:
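(The slide's formula image is not reproduced here; the standard definition is:)

Precision = TP / (TP + FP)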

20
3. Recall
Recall attempts to answer the following question:
What proportion of actual positives was identified correctly?
Mathematically, recall is defined as follows:
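(The slide's formula image is not reproduced here; the standard definition is:)

Recall = TP / (TP + FN)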

21
3. Recall

22
Stages of ML
❖ The first key step in preparing to explore and exploit AI (ML) is
to understand the basic stages involved.

23
Stages of ML

▪ Machine Learning Tasks and Subtasks:

24
ML challenges

▪ It requires considerable data and compute power.
▪ It requires knowledgeable data science specialists or teams.
▪ It adds complexity to the organization's data integration strategy. (data-driven culture)
▪ Learning AI (ML) algorithms is challenging without an advanced math background.
▪ The context of data often changes. (private data vs. public data)
▪ Algorithmic bias, privacy and ethical concerns may be overlooked.

25
Data collection and preparations

26
Data collection and preparations
❖ The training set and the test set typically represent 80%/20% or 70%/30%
of the data. (cross validation)
❖ The test set is ensured to be input data grouped together with verified
correct outputs, generally by human verification.

Common splits: 70/30, 80/20, 85/15, 90/10
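A minimal sketch of producing such a split, assuming scikit-learn and NumPy are available and using a toy dataset in place of real features and labels:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy feature vectors
y = np.array([0, 1] * 5)           # verified correct labels

# test_size=0.2 gives the 80/20 split; 0.3 would give 70/30.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))   # 8 2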
27
Data collection and preparations
▪ The most successful AI projects are those that integrate a data collection
strategy during the service/product life-cycle.
▪ It must be built into the core product itself.
▪ Basically, every time a user engages with the product/service, you want
to collect data from the interaction.
▪ The goal is to use this constant new data flow to improve your
product/service.

28
Data collection and preparations
❖ Solving the right problem:
❖ Understand the purpose for a model.
❖ Ask about who, what, when, where and why?
❖ Is the problem viable for machine learning (AI)?

29
Classification with K Nearest Neighbors(KNN)

30
K Nearest Neighbors(KNN)

▪ kNN is easy to grasp (understand and implement) and very effective
(a powerful tool).
▪ The model for kNN is the entire training dataset.

▪ Pros: high accuracy, insensitive to outliers, no assumptions about the data.
▪ Cons: computationally expensive, requires a lot of memory.
▪ Works with: numeric values, nominal values. (classification and regression)

31
K Nearest Neighbors(KNN)

▪ We have an existing set of example data (the training set).
▪ We know what class each piece of the data should fall into.
▪ When we are given a new piece of data without a label, we compare it
to every piece of existing data.
▪ We then take the most similar pieces of data (the nearest neighbors)
and look at their labels, as sketched in the example below.
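A minimal sketch of these steps in Python (the training points and the movie-style labels below are illustrative, not the lecture's exact data):

import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=3):
    # Distance from the query to every piece of existing data (Euclidean).
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]        # indices of the k closest points
    votes = Counter(y_train[nearest])          # labels of the nearest neighbors
    return votes.most_common(1)[0][0]          # majority vote wins

X_train = np.array([[3, 104], [2, 100], [99, 5], [98, 2]])   # (kicks, kisses)
y_train = np.array(['romance', 'romance', 'action', 'action'])
print(knn_classify(np.array([18, 90]), X_train, y_train, k=3))  # 'romance'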

32
K Nearest Neighbors(KNN)

▪ Classifying movies into romance or action movies.


▪ The number of kisses and kicks in each movie (features)

▪ Now, you find a movie you haven’t seen yet and want to know if
it’s a romance movie or an action movie.
▪ To determine this, we’ll use the kNN algorithm.

33
K Nearest Neighbors(KNN)

▪ We find the movie in question and see how many


kicks and kisses it has.

▪ Classifying movies by plotting the # kicks and kisses in each movie

34
K Nearest Neighbors(KNN)
Example

Movies with the # of kicks and # of kisses, along with their class
35


K Nearest Neighbors(KNN)
Example
▪ We don’t know what type of movie the question mark movie is.
▪ First, we calculate the distance to all the other movies.

▪ Distance between each movie and the unknown movie


36
K Nearest Neighbors(KNN)
Example

37
K Nearest Neighbors(KNN)

▪ Advantages:
▪ It remembers the training data
▪ Fast (no learning time)
▪ Simple and straightforward
▪ Downsides:
▪ No generalization
▪ Overfitting (noise)
▪ Computationally expensive for large datasets

38
K Nearest Neighbors(KNN)
▪ Given:
▪ Training data D = {(xi, yi)}
▪ Distance metric d(q, x): domain knowledge is important
▪ Number of neighbors K: domain knowledge is important
▪ Query point q

▪ KNN = {i : d(q, xi) is among the k smallest distances}

▪ Return:
▪ Classification: a vote of the yi.
▪ Regression: the mean of the yi.

39
K Nearest Neighbors(KNN)

❖ The similarity measure depends on the type of the data:

❖ Real-valued data: Euclidean distance
❖ Categorical or binary data: Hamming distance (p-norm with p = 0)

X1  X2    y
 1   6    7
 2   4    8
 3   7   16
 6   8   44
 7   1   50
 8   4   68
Q = (4, 2), y = ??

❖ d()         k     Average
❖ Euclidean   1-NN  _______
❖             3-NN  _______
❖ Manhattan   1-NN  _______
❖             3-NN  _______

40
K Nearest Neighbors(KNN)

Regression example (ED = squared Euclidean distance to Q)

X1  X2    y   ED
 1   6    7   25
 2   4    8    8
 3   7   16   26
 6   8   44   40
 7   1   50   10
 8   4   68   20
Q = (4, 2), y = ???

❖ d()         k     Average
❖ Euclidean   1-NN  8
❖             3-NN  42
❖ Manhattan   1-NN  _______
❖             3-NN  _______

Euclidean: ED = (X1i - q1)^2 + (X2i - q2)^2

41
K Nearest Neighbors(KNN)

Regression example (mD = Manhattan distance to Q)

X1  X2    y   mD
 1   6    7    7
 2   4    8    4
 3   7   16    6
 6   8   44    8
 7   1   50    4
 8   4   68    6
Q = (4, 2), y = ???

▪ d()         k     Average
▪ Euclidean   1-NN  _______
▪             3-NN  _______
▪ Manhattan   1-NN  29
▪             3-NN  35.5

Manhattan: mD = |X1i - q1| + |X2i - q2|
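A short sketch that reproduces the squared-Euclidean answers above. Note that the Manhattan answers on the slide (29 and 35.5) average all tied neighbors, which this simple argsort-based version does not attempt:

import numpy as np

X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]])
y = np.array([7, 8, 16, 44, 50, 68])
q = np.array([4, 2])

def knn_regress(distances, k):
    nearest = np.argsort(distances)[:k]   # indices of the k smallest distances
    return y[nearest].mean()              # kNN regression: average their y values

squared_euclidean = ((X - q) ** 2).sum(axis=1)   # 25, 8, 26, 40, 10, 20
manhattan = np.abs(X - q).sum(axis=1)            # 7, 4, 6, 8, 4, 6

print(knn_regress(squared_euclidean, k=1))   # 8.0
print(knn_regress(squared_euclidean, k=3))   # 42.0 -> (8 + 50 + 68) / 3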
42
K Nearest Neighbors (KNN) Bias
▪ Preference bias?
▪ Our belief about what makes a good hypothesis:
▪ Locality: near points are similar (distance function / domain)
▪ Smoothness: averaging
▪ All features matter equally
▪ Best practices for data preparation (a rescaling sketch follows below):
▪ Rescale data: normalizing the data to the range [0, 1] is a good idea.
▪ Address missing data: exclude or impute the missing values.
▪ Lower dimensionality: kNN is suitable for lower-dimensional data.
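A minimal sketch of the rescaling step (min-max normalization to [0, 1]; it does not guard against constant columns):

import numpy as np

def rescale_0_1(X):
    X = X.astype(float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)   # each feature column mapped into [0, 1]

X = np.array([[3, 104], [2, 100], [99, 5], [98, 2]])
print(rescale_0_1(X))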

43
K Nearest Neighbors(KNN)
▪ What is needed to select a kNN model?
▪ How to measure closeness of neighbors.
▪ A correct value for K.

▪ d(x, q) = Euclidean, Manhattan, weighted, etc.
▪ The choice of the distance function matters.
▪ K value:
▪ K = n (the average of all the data; no need for a query)
▪ K = n with a weighted average [locally weighted regression]

44
Classification with Decision Tree(DT)

45
Decision Tree(DT)
▪ Decision trees: splitting datasets one feature at a time.
▪ The decision tree is one of the most commonly used classification
techniques.
▪ It has decision blocks (rectangles) and termination blocks (ovals).
▪ The right and left arrows are called branches.

▪ The kNN algorithm can do a great job of classification, but it doesn't lead to
any major insight about the data.

46
Decision Tree(DT)

47
Decision Tree(DT)

48
Decision Tree(DT)

▪ The best part of the DT (decision tree) algorithm is that humans can
easily understand the data.
▪ The DT algorithm:
▪ Takes a set of data (training examples).
▪ Builds a decision tree (model), and draws it.

▪ The tree can also be re-represented as a set of if-then rules to improve
human readability.
▪ The DT does a great job of distilling data into knowledge.
▪ It takes a set of unfamiliar data and extracts a set of rules.
▪ DT is often used in expert system development.
49
Decision Tree(DT)
▪ The pros and cons of DT:
▪ Pros:
▪ Computationally cheap to use,
▪ Easy for humans to understand the learned results,
▪ Missing values OK (robust to errors),
▪ Can deal with irrelevant features.

▪ Cons:
▪ Prone to overfitting.
▪ Works with: numeric values, nominal values.
50
Decision Tree(DT)
▪ Appropriate problems for DT learning:
▪ Instances are represented by attribute-value pairs (a fixed set of
attributes and their values),
▪ The target function has discrete output values,
▪ Disjunctive descriptions may be required,
▪ The training data may contain errors,
▪ The training data may contain missing attribute values.

51
Decision Tree(DT)
▪ The mathematics that DT uses to split the dataset comes from
information theory:
▪ The first decision you need to make is which feature to use to split the data.
▪ You need to try every feature and measure which split gives the best result.
▪ Then split the dataset into subsets.
▪ The subsets then traverse down the branches of the decision node.
▪ If the data on a branch is all of the same class, stop; otherwise, repeat the splitting.

52
Decision Tree(DT)

Figure 2: Pseudo-code for the splitting function

53
Decision Tree(DT)

▪ We would like to classify the following animals into two classes:


▪ Fish and not Fish

Table 1: Marine animal data

54
Decision Tree(DT)
▪ We need to decide whether we should split the data based on the first feature
or the second feature:
▪ The aim is to better organize the unorganized data.
▪ One way to do this is to measure the information.
▪ Measure the information before and after the split.

▪ Information theory is a branch of science concerned with quantifying
information.
▪ The change in information before and after the split is known as the
information gain.

55
Decision Tree(DT)
▪ The split with the highest information gain is the best option.
▪ The measure of information of a set is known as the Shannon entropy,
or simply entropy.

▪ The change in information before and after the split is known as the
information gain.

56
Decision Tree(DT)

▪ To calculate entropy, you need the expected value of the information
over all possible values of our class.
▪ This is given by the following, where n is the number of classes:
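(The slide's formula image is not reproduced here; the standard Shannon entropy is:)

H = - Σ (i = 1..n) p(xi) · log2 p(xi)

where p(xi) is the proportion of examples belonging to class i.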

57
Decision Tree(DT)
▪ The higher the entropy, the more mixed up the data is.
▪ Another common measure of disorder in a set is the Gini impurity:
the probability of choosing an item from the set times the probability
of that item being misclassified.

▪ The steps (an entropy sketch follows below):
▪ Calculate the Shannon entropy of a dataset.
▪ Split the dataset on a given feature.
▪ Choose the best feature to split on.
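A minimal sketch of the first step in Python (computing the Shannon entropy of a list of class labels):

import math
from collections import Counter

def shannon_entropy(labels):
    total = len(labels)
    counts = Counter(labels)   # how often each class occurs
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy(['yes', 'yes', 'no', 'no', 'no']))  # ~0.971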

58
Decision Tree(DT)

▪ Recursively building the tree:
▪ Start with the dataset and split it based on the best attribute to split on.
▪ The data traverses down the branches of the tree to another node.
▪ This node then splits the data again (recursively).
▪ Stop under the following conditions: we run out of attributes, or the
instances in a branch all belong to the same class (a recursive sketch
follows below).
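A compact sketch of this recursion, repeating the entropy helper so the block runs on its own; it picks the best feature by information gain, splits, and recurses until a stopping condition holds. The toy rows and labels at the end are illustrative, not necessarily the lecture's table:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_feature(rows, labels):
    # Feature index whose split gives the largest information gain.
    base, gains = entropy(labels), []
    for f in range(len(rows[0])):
        remainder = 0.0
        for v in set(r[f] for r in rows):
            sub = [labels[i] for i, r in enumerate(rows) if r[f] == v]
            remainder += len(sub) / len(labels) * entropy(sub)
        gains.append(base - remainder)
    return gains.index(max(gains))

def create_branch(rows, labels):
    if len(set(labels)) == 1:                        # all instances share one class
        return labels[0]
    if len(rows[0]) == 0:                            # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, labels)
    tree = {f: {}}
    for v in set(r[f] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[f] == v]
        sub_rows = [rows[i][:f] + rows[i][f + 1:] for i in idx]   # drop the used feature
        tree[f][v] = create_branch(sub_rows, [labels[i] for i in idx])
    return tree

rows = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 1]]   # two binary features (toy data)
labels = ['yes', 'yes', 'no', 'no', 'no']
print(create_branch(rows, labels))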

59
60
Decision Tree(DT)

61
Decision Tree(DT)

Table 2: Example training sets

62
Decision Tree(DT)

Figure 3: Data path while splitting

63
Decision Tree(DT)
▪ ID3 uses the information gain measure to select among the
candidate attributes.
▪ Start with the dataset and split it based on the best attribute to split on.
▪ Given a collection S containing positive and negative examples
of some target concept:

▪ The entropy of S relative to this Boolean classification is:

▪ Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

64
Decision Tree(DT)
▪ Example:
▪ The target attribute is PlayTennis. (yes/no)

Table 3: Example training sets
65


Decision Tree(DT)

72
Decision Tree(DT)

73
Decision Tree(DT)

74
Decision Tree(DT)

❖ The DT can be expressed using the following expression:

❖ (Outlook = Sunny ∧ Humidity = Normal) → Yes
❖ ∨ (Outlook = Overcast) → Yes
❖ ∨ (Outlook = Rain ∧ Wind = Weak) → Yes

75
Decision Tree(DT)

❖ Suppose S is a collection of 14 examples of some Boolean
concept, including 9 positive and 5 negative examples.
❖ Then the entropy of S relative to this Boolean classification is:
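(The slide's worked calculation is not reproduced here; the standard result for this example is:)

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940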

76
Decision Tree(DT)

▪ Note that the entropy is 0 if all members of S belong to the same class.
▪ For example, if all the members are positive (p+ = 1), then p- = 0, and
▪ Entropy(S) = -1·log2(1) - 0·log2(0) = 0
▪ Note that the entropy is 1 when the collection contains an equal number
of positive and negative examples.
▪ If the collection contains unequal numbers of positive and negative
examples, the entropy is between 0 and 1.

77
Decision Tree(DT)
▪ Suppose S is a collection of training-example days described by the
attribute Wind (with values Weak and Strong).
▪ Information gain is the measure used by ID3 to select the best
attribute at each step in growing the tree.
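(The slide's formula image is not reproduced here; the standard definition used by ID3 is:)

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

where Sv is the subset of S for which attribute A has value v.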

78
Decision Tree(DT)
▪ Information gain of the two attributes: Humidity and Wind.

79
Decision Tree(DT)
▪ ID3 determines the information gain for each
attribute (Outlook, Temperature, Humidity and Wind),
▪ then selects the one with the highest information gain.
▪ The information gain values for all four attributes are:
▪ Gain(S, Outlook) = 0.246
▪ Gain(S, Humidity) = 0.151
▪ Gain(S, Wind) = 0.048
▪ Gain(S, Temperature) = 0.029

▪ Outlook provides greater information gain than the
other attributes.

80
Decision Tree(DT)
▪ According to the information gain measure, the Outlook attribute is selected as the
root node.
▪ Branches are created below the root for each of its possible values. (Sunny,
Overcast, and Rain)

❖ The partially learned decision tree resulting from the first step of ID3:

81
Decision Tree(DT)

82
Decision Tree(DT)
❖ The Overcast descendant has only positive examples and therefore becomes a leaf node
with classification Yes.
❖ The other two nodes will be expanded by selecting the attribute with the highest
information gain relative to the new subsets.
❖ Decision tree learning can be:
❖ Classification tree: finite set of values for the target variable
❖ Regression tree: continuous-valued target variable

❖ There are many specific DT algorithms:
❖ ID3 (Iterative Dichotomiser 3)
❖ C4.5 (successor of ID3)
❖ CART (Classification and Regression Tree)
❖ CHAID (Chi-squared Automatic Interaction Detector)
❖ MARS: extends DT to handle numerical data better

83
Decision Tree(DT)
❖ Different DT algorithms use different metrics for measuring the "best attribute":
❖ Information gain: used by ID3, C4.5 and C5.0
❖ Gini impurity: used by CART
❖ ID3 in terms of its search space and search strategy:
❖ ID3's hypothesis space of all decision trees is a complete space of finite, discrete-valued
functions.
❖ ID3 maintains only a single current hypothesis as it searches through the space of
decision trees.

❖ ID3 in its pure form performs no backtracking in its search. (post-pruning the decision
tree)
❖ ID3 uses all training examples at each step in the search to make statistically based
decisions about how to refine its current hypothesis. (much less sensitive to errors)

84
Decision Tree(DT)
❖ Inductive bias in DT learning (ID3):
❖ Inductive bias is the set of assumptions the learner makes.
❖ ID3 selects in favor of shorter trees over longer ones. (breadth-first approach)
❖ It selects trees that place the attributes with the highest information gain closest to the root.
❖ Issues in DT learning:
❖ How deeply to grow the decision tree
❖ Handling continuous attributes
❖ Choosing an appropriate attribute selection measure
❖ Handling training data with missing attribute values
❖ Handling attributes with differing costs, and
❖ Improving computational efficiency
❖ ID3 has been extended to address most of these issues in C4.5.

85
Decision Tree(DT)
❖ Avoiding overfitting the data:
❖ Noisy data and too few training examples are problems.
❖ Overfitting is a practical problem for DT and many other learning algorithms.
❖ Overfitting was found to decrease the accuracy of the learned tree by 10-25%.

❖ Approaches to avoid overfitting:
❖ Stop growing the tree before it overfits. (direct but less practical)
❖ Allow the tree to overfit, and then post-prune it. (the most successful in practice)

❖ Incorporating continuous-valued attributes:
❖ In the initial definition of ID3, the attributes and the target value must take discrete sets of values.
❖ The attributes tested in the decision nodes of the tree must be discrete-valued:
❖ Create a new Boolean attribute by thresholding the continuous value, or
❖ use multiple intervals rather than just two.
86
Decision Tree(DT)

❖ Alternative measures for selecting attributes:

❖ Information gain favors attributes with many values.
❖ One alternative measure that has been used successfully is the
gain ratio.
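(The slide shows no formula; for reference, the standard definitions are:)

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) = - Σ (i = 1..c) (|Si| / |S|) · log2(|Si| / |S|)

where S1 ... Sc are the subsets of S produced by partitioning on the c values of attribute A. The split information term penalizes attributes with many values.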

87
Assignment II

❖ Answer the following questions by considering the set of
training examples given below.

88
Assignment II

❖ (a) What is the entropy of this collection of training examples with respect to the target function classification?
❖ (b) What is the information gain of a2 relative to these training examples?

89
End ….

90
