Classification: Basic Concepts
Classification is a form of data analysis that extracts models describing important data classes.
Such models, called classifiers, predict categorical (discrete, unordered) class labels. For
example, we can build a classification model to categorize bank loan applications as either
safe or risky. Such analysis can help provide us with a better understanding of the data at
large. Many classification methods have been proposed by researchers in machine learning,
pattern recognition, and statistics. Most algorithms are memory resident, typically
assuming a small data size. Recent data mining research has built on such work, developing
scalable classification and prediction techniques capable of handling large amounts of
disk-resident data. Classification has numerous applications, including fraud detection,
target marketing, performance prediction, manufacturing, and medical diagnosis.
We start off by introducing the main ideas of classification in Section 8.1. In the
rest of this chapter, you will learn the basic techniques for data classification such as
how to build decision tree classifiers (Section 8.2), Bayesian classifiers (Section 8.3), and
rule-based classifiers (Section 8.4). Section 8.5 discusses how to evaluate and compare
different classifiers. Various measures of accuracy are given as well as techniques for
obtaining reliable accuracy estimates. Methods for increasing classifier accuracy are
presented in Section 8.6, including cases for when the data set is class imbalanced (i.e.,
where the main class of interest is rare).
8.1 Basic Concepts
We introduce the concept of classification in Section 8.1.1. Section 8.1.2 describes the
general approach to classification as a two-step process. In the first step, we build a
classification model based on previous data. In the second step, we determine if the model's
accuracy is acceptable, and if so, we use the model to classify new data.
8.1.1 What Is Classification?
A bank loans officer needs analysis of her data to learn which loan applicants are “safe”
and which are “risky” for the bank. A marketing manager at AllElectronics needs data
analysis to help guess whether a customer with a given profile will buy a new computer.
A medical researcher wants to analyze breast cancer data to predict which one of three
specific treatments a patient should receive. In each of these examples, the data analysis
task is classification, where a model or classifier is constructed to predict class (categorical)
labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the
marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data.
These categories can be represented by discrete values, where the ordering among values
has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments
A, B, and C, where there is no ordering implied among this group of treatment regimes.
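The point that such codes carry no order can be illustrated with a minimal Python sketch. The mapping `TREATMENT_CODES` and the helper names are hypothetical, chosen only for illustration; the codes 1, 2, and 3 follow the example in the text.

```python
# Hypothetical mapping from class labels to integer codes (1, 2, 3 as in the text).
TREATMENT_CODES = {"treatment A": 1, "treatment B": 2, "treatment C": 3}


def encode(label):
    """Return the integer code that stands in for a class label."""
    return TREATMENT_CODES[label]


def same_class(label_1, label_2):
    """Equality is the only meaningful comparison between categorical codes."""
    return encode(label_1) == encode(label_2)


def one_hot(label):
    """Indicator vector for a label; it exposes no order or distance between classes."""
    return [1 if label == known else 0 for known in TREATMENT_CODES]


print(encode("treatment B"))
print(same_class("treatment A", "treatment C"))
print(one_hot("treatment B"))
```

A comparison such as `encode(a) < encode(b)` would execute, but its result would say nothing about the treatments, which is why an indicator vector is often a safer representation when a learning method treats its inputs as numbers.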
Suppose that the marketing manager wants to predict how much a given customer
will spend during a sale at AllElectronics. This data analysis task is an example of numeric
prediction, where the model constructed predicts a continuous-valued function, or
ordered value, as opposed to a class label. This model is a predictor. Regression analysis
is a statistical methodology that is most often used for numeric prediction; hence the
two terms tend to be used synonymously, although other methods for numeric prediction
exist. Classification and numeric prediction are the two major types of prediction
problems. This chapter focuses on classification.
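The contrast between the two problem types can be sketched in a few lines of Python. Everything here is an illustrative assumption: the income and spending figures are made up, the predictor is a one-variable least-squares line (one common form of regression), and the classifier is a simple hypothetical threshold rule rather than a learned model.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_spend(income, slope, intercept):
    """Numeric prediction: the output is a continuous, ordered value."""
    return slope * income + intercept


def classify_buyer(income, threshold=50.0):
    """Classification: the output is one of a fixed set of unordered labels."""
    return "yes" if income >= threshold else "no"


# Made-up data: income (in thousands) and amount spent during a sale.
incomes = [20.0, 40.0, 60.0, 80.0]
spend = [100.0, 200.0, 300.0, 400.0]

slope, intercept = fit_line(incomes, spend)
print(predict_spend(50.0, slope, intercept))  # a number
print(classify_buyer(50.0))                   # a class label
```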
8.1.2 General Approach to Classification
“How does classification work?” Data classification is a two-step process, consisting of a
learning step (where a classification model is constructed) and a classification step (where
the model is used to predict class labels for given data). The process is shown for the
loan application data of Figure 8.1. (The data are simplified for illustrative purposes.
In reality, we may expect many more attributes to be considered.)
In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm
builds the classifier by analyzing or “learning from” a training set made up of database
tuples and their associated class labels. A tuple, X, is represented by an n-dimensional
attribute vector, $\boldsymbol{X} = (x_1, x_2, \dots, x_n)$, depicting n measurements made on the tuple
from n database attributes, respectively, $A_1, A_2, \dots, A_n$.¹ Each tuple, X, is assumed to
belong to a predefined class as determined by another database attribute called the class
label attribute. The class label attribute is discrete-valued and unordered. It is categorical
(or nominal) in that each value serves as a category or class. The individual tuples
making up the training set are referred to as training tuples and are randomly sampled
from the database under analysis. In the context of classification, data tuples can be
referred to as samples, examples, instances, data points, or objects.²
¹Each attribute represents a “feature” of X. Hence, the pattern recognition literature uses the term
feature vector rather than attribute vector. In our discussion, we use the term attribute vector, and in our
notation, any variable representing a vector is shown in bold italic font; measurements depicting the
vector are shown in italic font (e.g., $\boldsymbol{X} = (x_1, x_2, x_3)$).
²In the machine learning literature, training tuples are commonly referred to as training samples.
Throughout this text, we prefer to use the term tuples instead of samples.
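The attribute-vector view of a training tuple described above can be sketched as a small Python data structure. The type name `LabeledTuple`, the helper `sample_training_set`, and the fixed random seed are assumptions made for illustration; the attribute values follow the loan data of Figure 8.1.

```python
import random
from typing import NamedTuple, Tuple

# Database attributes A1, A2 (so n = 2); the class label attribute is kept separate.
ATTRIBUTES = ("age", "income")


class LabeledTuple(NamedTuple):
    x: Tuple[str, ...]  # attribute vector X = (x1, ..., xn)
    label: str          # value of the class label attribute (loan_decision)


# A tiny stand-in for the database under analysis.
DATABASE = [
    LabeledTuple(("youth", "low"), "risky"),
    LabeledTuple(("youth", "low"), "risky"),
    LabeledTuple(("middle_aged", "high"), "safe"),
    LabeledTuple(("middle_aged", "low"), "risky"),
    LabeledTuple(("senior", "low"), "safe"),
    LabeledTuple(("senior", "medium"), "safe"),
    LabeledTuple(("middle_aged", "high"), "safe"),
]


def sample_training_set(database, size, seed=0):
    """Randomly sample training tuples without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(database, size)


training_set = sample_training_set(DATABASE, 5)
for t in training_set:
    print(dict(zip(ATTRIBUTES, t.x)), "->", t.label)
```

Keeping the label apart from the attribute vector mirrors the text's distinction between the n measured attributes and the class label attribute.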
(a) Learning: training data → classification algorithm → classification rules

Training data
name            age           income    loan_decision
Sandy Jones     youth         low       risky
Bill Lee        youth         low       risky
Caroline Fox    middle_aged   high      safe
Rick Field      middle_aged   low       risky
Susan Lake      senior        low       safe
Claire Phips    senior        medium    safe
Joe Smith       middle_aged   high      safe

Classification rules
IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low THEN loan_decision = risky

(b) Classification: test data → classification rules; new data → loan decision

Test data
name            age           income    loan_decision
Juan Bello      senior        low       safe
Sylvia Crest    middle_aged   low       risky
Anne Yee        middle_aged   high      safe

New data: (John Henry, middle_aged, low); loan decision? risky
Figure 8.1 The data classification process: (a) Learning: Training data are analyzed by a classification
algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is
represented in the form of classification rules. (b) Classification: Test data are used to estimate
the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can
be applied to the classification of new data tuples.
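The two-step process of Figure 8.1 can be sketched in Python as follows. The rules and test tuples are taken from the figure; the order in which the rules are tried, the fallback class used when no rule fires, the acceptance threshold, and all function names are assumptions made for illustration only.

```python
def classify(age, income, default="safe"):
    """Apply the Figure 8.1 rules in order; `default` is an assumed fallback class."""
    if age == "youth":
        return "risky"
    if income == "high":
        return "safe"
    if age == "middle_aged" and income == "low":
        return "risky"
    return default  # assumption: no rule fired, so fall back to a default class


# Test tuples from Figure 8.1(b): (age, income, known loan_decision).
TEST_DATA = [
    ("senior", "low", "safe"),
    ("middle_aged", "low", "risky"),
    ("middle_aged", "high", "safe"),
]

ACCEPTABLE_ACCURACY = 0.8  # hypothetical threshold; the text does not specify one


def accuracy(test_data):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(1 for age, income, label in test_data
                  if classify(age, income) == label)
    return correct / len(test_data)


def classify_if_acceptable(age, income, test_data):
    """Classify a new tuple only when the estimated accuracy is acceptable."""
    if accuracy(test_data) < ACCEPTABLE_ACCURACY:
        return None
    return classify(age, income)


print(accuracy(TEST_DATA))
print(classify_if_acceptable("middle_aged", "low", TEST_DATA))
```

Three test tuples are far too few for a reliable accuracy estimate; the sketch only shows where the estimate sits between learning the rules and applying them to new data.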
Because the class label of each training tuple is provided, this step is also known as
supervised learning (i.e., the learning of the classifier is “supervised” in that it is told
to which class each training tuple belongs). It contrasts with unsupervised learning (or