
Machine Learning-Classification

Classification is a data analysis technique that builds models, known as classifiers, to predict categorical labels based on historical data. The process involves two main steps: constructing a classification model using training data and then evaluating its accuracy with test data before applying it to new data. Various classification methods, such as decision trees and Bayesian classifiers, are discussed, along with their applications in fields like fraud detection and medical diagnosis.

Uploaded by

22b81a05y0.2
Classification: Basic Concepts

Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. Such analysis can help provide us with a better understanding of the data at large. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Most algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large amounts of disk-resident data. Classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.

We start off by introducing the main ideas of classification in Section 8.1. In the rest of this chapter, you will learn the basic techniques for data classification, such as how to build decision tree classifiers (Section 8.2), Bayesian classifiers (Section 8.3), and rule-based classifiers (Section 8.4). Section 8.5 discusses how to evaluate and compare different classifiers. Various measures of accuracy are given, as well as techniques for obtaining reliable accuracy estimates. Methods for increasing classifier accuracy are presented in Section 8.6, including cases for when the data set is class imbalanced (i.e., where the main class of interest is rare).

8.1 Basic Concepts

We introduce the concept of classification in Section 8.1.1. Section 8.1.2 describes the general approach to classification as a two-step process. In the first step, we build a classification model based on previous data. In the second step, we determine if the model's accuracy is acceptable, and if so, we use the model to classify new data.

8.1.1 What Is Classification?

A bank loans officer needs analysis of her data to learn which loan applicants are “safe” and which are “risky” for the bank. A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy a new computer. A medical researcher wants to analyze breast cancer data to predict which one of three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict class (categorical) labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data. These categories can be represented by discrete values, where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatment regimes.

Suppose that the marketing manager wants to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a class label. This model is a predictor. Regression analysis is a statistical methodology that is most often used for numeric prediction; hence the two terms tend to be used synonymously, although other methods for numeric prediction exist. Classification and numeric prediction are the two major types of prediction problems. This chapter focuses on classification.

8.1.2 General Approach to Classification

“How does classification work?” Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data).
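
The two-step process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuples are toy loan data in the spirit of the loan application example, and the "learner" is a simple lookup table that predicts the majority label seen for each attribute combination (falling back to the overall majority label). It is a stand-in chosen for brevity, not one of the classification algorithms covered in this chapter, and the names `learn`, `classify`, and `accuracy` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy labeled tuples (assumed for illustration): ((age, income), loan_decision).
TRAINING = [
    (("youth", "low"), "risky"),
    (("youth", "low"), "risky"),
    (("middle_aged", "high"), "safe"),
    (("middle_aged", "low"), "risky"),
    (("senior", "low"), "safe"),
    (("senior", "medium"), "safe"),
    (("middle_aged", "high"), "safe"),
]
TEST = [
    (("senior", "low"), "safe"),
    (("middle_aged", "low"), "risky"),
    (("middle_aged", "high"), "safe"),
]


def learn(training):
    """Learning step: build a model from labeled training tuples.

    The model is a (lookup table, default label) pair. This is a
    deliberately naive learner used only to show the workflow.
    """
    votes = defaultdict(Counter)
    overall = Counter()
    for attributes, label in training:
        votes[attributes][label] += 1
        overall[label] += 1
    table = {attrs: counts.most_common(1)[0][0] for attrs, counts in votes.items()}
    default = overall.most_common(1)[0][0]
    return table, default


def classify(model, attributes):
    """Classification step: predict the class label for one attribute vector."""
    table, default = model
    return table.get(attributes, default)


def accuracy(model, labeled_tuples):
    """Fraction of labeled tuples whose predicted label matches the true label."""
    correct = sum(classify(model, attrs) == label for attrs, label in labeled_tuples)
    return correct / len(labeled_tuples)


model = learn(TRAINING)                                  # step 1: learning
test_accuracy = accuracy(model, TEST)                    # check the model on held-out tuples
new_prediction = classify(model, ("youth", "medium"))    # step 2: classify unseen data
```

Keeping `learn` and `classify` as separate functions mirrors the separation between the two steps: the model returned by the first is the only thing the second needs.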
The process is shown for the loan application data of Figure 8.1. (The data are simplified for illustrative purposes. In reality, we may expect many more attributes to be considered.)

In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels. A tuple, $X$, is represented by an $n$-dimensional attribute vector, $X = (x_1, x_2, \ldots, x_n)$, depicting $n$ measurements made on the tuple from $n$ database attributes, respectively, $A_1, A_2, \ldots, A_n$. [1] Each tuple, $X$, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical (or nominal) in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are randomly sampled from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects. [2]

[1] Each attribute represents a “feature” of $X$. Hence, the pattern recognition literature uses the term feature vector rather than attribute vector. In our discussion, we use the term attribute vector, and in our notation, any variable representing a vector is shown in bold italic font; measurements depicting the vector are shown in italic font (e.g., $X = (x_1, x_2, x_3)$).

[2] In the machine learning literature, training tuples are commonly referred to as training samples. Throughout this text, we prefer to use the term tuples instead of samples.
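
As a rough illustration of these definitions, the sketch below represents each database tuple as a record, separates the attribute vector $X$ from the class label attribute, and draws training tuples by random sampling. The record layout, the applicant names, and the helpers `split_tuple` and `sample_training_tuples` are assumptions made for this example; in particular, treating `name` as an identifier rather than as one of the attributes is a choice made here, not something the text prescribes.

```python
import random
from typing import NamedTuple


class LoanTuple(NamedTuple):
    """One database tuple; field names are assumed for illustration."""
    name: str           # identifier, not used as an attribute here
    age: str            # attribute A1
    income: str         # attribute A2
    loan_decision: str  # class label attribute (discrete, unordered)


# Hypothetical database under analysis.
DATABASE = [
    LoanTuple("Applicant 1", "youth", "low", "risky"),
    LoanTuple("Applicant 2", "middle_aged", "high", "safe"),
    LoanTuple("Applicant 3", "senior", "medium", "safe"),
    LoanTuple("Applicant 4", "middle_aged", "low", "risky"),
    LoanTuple("Applicant 5", "senior", "low", "safe"),
]


def split_tuple(row):
    """Return (attribute vector X, class label) for one database tuple."""
    return (row.age, row.income), row.loan_decision


def sample_training_tuples(database, k, seed=0):
    """Randomly sample k training tuples without replacement.

    A fixed seed keeps the sketch reproducible; real use would vary it.
    """
    rng = random.Random(seed)
    return rng.sample(database, k)


training = sample_training_tuples(DATABASE, k=3)
X, label = split_tuple(training[0])  # X = (x1, x2), label is the class
```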
(a) Training data, analyzed by a classification algorithm:

name            age           income    loan_decision
Sandy Jones     youth         low       risky
Bill Lee        youth         low       risky
Caroline Fox    middle_aged   high      safe
Rick Field      middle_aged   low       risky
Susan Lake      senior        low       safe
Claire Phips    senior        medium    safe
Joe Smith       middle_aged   high      safe

Classification rules:

IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low THEN loan_decision = risky
...

(b) Test data:

name            age           income    loan_decision
Juan Bello      senior        low       safe
Sylvia Crest    middle_aged   low       risky
Anne Yee        middle_aged   high      safe

Figure 8.1 The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple belongs). It contrasts with unsupervised learning (or
