CHAPTER I : INTRODUCTION TO DATA MINING
This chapter provides an introduction to the field of data mining. We will explain its purpose,
applications and basic concepts.
1.1 WHAT IS DATA-MINING ?
Data mining is a discipline at the crossroads of statistics and information technology
(databases, artificial intelligence, learning, etc.) whose aim is to discover knowledge in a vast
amount of data.
There are several definitions of data mining. In this book, we will limit ourselves to these two
definitions.
Definition 1 (Fayyad et al. [1] ) :
"Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data”.
In french : Le datamining est "l’extraction d’informations intéressantes (non triviales,
implicites, préalablement inconnues et potentiellement utiles) à partir de grandes bases de
données".
Definition 2 (David J. Hand [2] ) :
"Data Mining is the discovery of interesting, unexpected, or valuable structures in large data
sets”.
There's a metaphor that sums up the purpose of data mining: there are treasures or nuggets
hidden under mountains of data that can be discovered using specialised tools. ([3]).
The emergence of data mining has been supported by :
- The evolution of Data bases towards business intelligence with data warehouses.
- The appearance of increasingly large databases in several areas: credit card transactions,
telephone calls, supermarket invoices: where terabytes of data are collected automatically.
- Greater availability of data thanks to networks (intranet and internet).
- Development of data mining models and software.
1.2 GOALS OF DATA MINING
Data mining looks for two types of structure in data: models and patterns [1].
Pattern: is a characteristic structure possessed by a small number of observations: a group of
high-value customers, or on the contrary, high-risk customers.
Methods for extracting patterns can include: classification, Principal Component Analysis
(PCA), association rules, clustering, etc.
Models: a model is an overall summary of the relationships between variables, enabling
phenomena to be understood and forecasts to be made.
                                                10
1.3 AREAS OF APPLICATION
Data mining is a field of interest for three types of people:
- Scientists: to understand certain phenomena.
- Analysts: to produce reports for decision-makers.
- Decision-makers: to support decision-making.
Data mining applications are emerging in many different areas such as: e-commerce, health
care, finance, telecommunications, … etc.
- E-commerce: sales companies use data mining to improve customer relationship
management. The aim is not just to find out "how many customers bought this product during
this period", but to find out "what their profile is", "what other products will interest them"
and "when will they be interested again".
- Health care: In this field, data mining can be used to analyse risk factors and combinations
of symptoms in order to detect diseases at an early stage and develop treatment strategies.
- Finance: Financial institutions (banks) use data mining to assess the creditworthiness and
default risk of borrowers and to support credit decisions.
- Telecommunications: Analysis of customer behavioural data in order to identify possible
factors of customer dissatisfaction and take targeted measures to build customer loyalty.
- Fraud detection: To tackle the problem of fraud, organisations (customs, tax departments,
...., etc.) are using data mining to exploit massive volumes of data and identify these risks. By
implementing a fraud detection system, suspicious behaviour can be quickly identified and
proactive action taken.
1.5 DATA MINING TASKS:
Data mining is a set of techniques dedicated to different tasks generally grouped into two
main categories: descriptive and predictive tasks [2]. Tasks in the first category aim to to
describe phenomena or trends in the data, while those in the second of the second class are
concerned with estimating the future values of variables by taking into account other
historical values.
The tasks related to data mining are [5]:
1) Classification
2) Regression
3) Prediction
4) Analysis of association rules
5) Cluster analysis
6) Description.
Each task is explained below, with an example.
                                                 11
1.5.1 Classification:
Classification is the most common task in data mining, and one that seems to be a human
obligation. In order to understand our daily lives, we are constantly being classified,
categorised and evaluated [5].
Classification involves studying the characteristics of a new object in order to assign it a
predefined class. The objects to be classified are generally records in a dataset, and
classification involves updating each record by determining a class field. The classification
task is characterised by a precise class definition and a set of previously classified examples.
The aim is to create a model that can be applied to unclassified data in order to classify it.
Some examples of the use of classification tasks in the fields of research and commerce are as
follows [5]:
    Determining whether the use of a credit card is fraudulent.
    Diagnosing whether a certain disease is present. [ 5]
    Determining which telephone numbers correspond to fax machines.
    Determining which telephone lines are used for Internet access.
1.5.2 Regression:
Regression (or estimation) is similar to classification except that the output variable is
numeric rather than categorical. Depending on the other fields in the record, estimation
involves filling in a missing value in a particular field. For example, we want to estimate the
systolic blood pressure reading of a patient in a hospital, based on the patient's age, gender,
body mass index and blood sodium level. The relationship between the systolic pressure and
the other data will provide an estimation model. This model can then be applied to other cases
[5].
Some examples of the use of estimation tasks in research and commerce are as follows [5]:
- Estimate the number of children in a family.
- Estimating the amount of money that a family of four members chosen. Randomly will
spend on going back to school.
- Estimating the value of a piece of real estate.
1.5.3 Prediction:
Prediction is the same as classification and estimation, except that in prediction records are
classified according to predicted (estimated) criteria (or values). The main reason that
prediction differs from classification and estimation is that in the creation of the predictive
model the temporal relationship between the input variables and the output variables is taken
into account. [5]
Some examples of the use of prediction tasks in research and commerce are as follows:
- Predicting the price of shares on the stock market over the next three months.
- Predicting the World Cup football champion based on a comparison of team statistics.
                                                 12
1.5.4 Association Rules Analysis:
Association rules analysis consists of determining which attributes "go together". The most
common task in the business world, where it is called affinity analysis or market basket
analysis, is association research to measure the relationship between two or more attributes.
Association rules are of the form "If antecedent, then consequent."
Some examples of the use of association rule analysis tasks in the fields of research and
commerce are as follows:
-Find out which products in a supermarket are bought together and which are never bought
together.
-Determine the proportion of cases in which a new drug can generate dangerous effects [5].
1.5.5 Cluster analysis:
Cluster analysis or clustering is the grouping of records or observations into classes of similar
objects. A cluster is a collection of records that are similar to each other and different from
those in other clusters.
The difference between clustering and classification is that in clustering there are no outgoing
variables. The clustering task does not classify, estimate or predict the value of an output
variable. Instead, clustering algorithms aim to segment the entire data set into relatively
homogeneous sub-groups. They maximise homogeneity within each group and minimise it
between them [5].
Clustering algorithms can be applied in a variety of areas, such as :
- Discovering groups of customers with similar behaviours.
- Classifying plants and animals given their characteristics.
- Segmenting observations of epicentres to identify dangerous areas [5].
1.5.6 Description :
Sometimes the aim of datamining is simply to describe what is happening in a complicated
dataset by explaining the relationships that exist in the data, in order to gain the best possible
understanding of the individuals, products and processes present in the dataset. A good
description of behaviour often implies a good explanation of it. For example, in American
society the simple description, "women support the Democratic Party more than men", can
provoke a great deal of interest and promote studies by journalists, sociologists, economists
and political specialists [5].
1.6 KNOWLEDGE DISCOVERY AND DATA MINING:
KDD (Knowledge Discovery from Data) is often presented as an integrated set of tools which
involves knowledge extraction. The process can be summarised as follows (fig 1.1) :
                        Fig 1.1 KDD (Knowledge Discovery from data) process
                                                13
Data mining is therefore a stage in the process of extracting knowledge (knowledge discovery
from data), which consists of applying data analysis algorithms" [4] (fig 1.2).
                                  Fig 1.2 Data-mining and KDD [4]
Here is a brief explanation of the stages in the process [4-5]:
- Understanding the field of application: An understanding of the field of application and
relevant prior knowledge, and identifying the objective of the KDD process from the
customer's point of view.
- Creating a dataset (selection): Selecting a dataset, or focusing on a subset of variables or
data samples, on which discovery is to be performed.
- Data cleaning and pre-processing: Basic operations such as removing noise where necessary,
gathering the information needed to model or account for noise, deciding on strategies for
managing data fields, taking account of time sequence information and known changes (may
take up 60% of the effort).
- Choice of data mining model (classification, consolidation, regression, association,
clustering).
- Choice of data mining algorithm(s): Selection of the algorithm to be used to search for
patterns in the data. This includes choosing the algorithms and parameters that may be
appropriate (for example, algorithms for categorical data are different from algorithms for
vectors over reals) and matching a particular data mining method with the general criteria of
the KDD process.
- Data Mining: Searching for interesting patterns: This involves implementing algorithms to
search for and extract patterns of interest in a particular form of representation or set of such
representations, for example: classification rules or trees, regression, clustering, etc.
Sometimes we have to perform the previous steps several times correctly to get the desired
result.
                                                14
- Pattern evaluation and presentation (visualisation, transformation, removal of redundant
patterns, etc).
- Use of the extracted knowledge
1.7 CONCEPT OF LEARNING
Learning is part of our lives. Every day we acquire new knowledge, skills, understanding,
values and preferences. This can be done explicitly by taking courses (to learn a new
language) or workshops (to learn a new trade). We can also learn implicitly and repeatedly,
for example by watching the news or talking to colleagues about their experiences.
A.Ambrose [6] defines learning like this :
Definition 3 : learning
Learning as a process that leads to change, which occurs as a result of experience and
increases the potential for improved performance and future learning.
This definition means that Learning is a process, not a product, and it involves change in
knowledge, beliefs, behaviours, or attitudes [6].
In data mining, models use learning to implement and train algorithms on large amount of
data so that they can make predictions new data.
There are two types of learning: supervised and unsupervised.
1.7.1 Supervised Learning
In the case of supervised learning, the robustness of the algorithm will depend on the
accuracy of its training. An algorithm that learns from supervised content produces an internal
map that can be reused to classify new quantities of data.
For example, if an algorithm detects faces, a user will have to show it what is a face and what
is not so that it can learn and predict whether the next photos are or are not. In short, the
algorithm learns from examples, and in this case the examples need to be labelled to ensure
that it learns effectively.
1.7.2 Unsupervised Learning
In the case of unsupervised learning, there is no need for human intervention, as the algorithm
will work out for itself how to differentiate a face from a landscape by looking for similarities.
Since an algorithm cannot simply know what constitutes a face, the unsupervised method will
partition and classify the data into homogeneous groups ("clustering").
Fig 1.3 summarises the main differences between supervised and unsupervised learning.
                                               15
                                Supervised learning                      Unsupervised learning
   Methodology                   Need for an expert                       No need for an expert
       Data             Need for training data labelled by the          No need for labelled data
                                       expert
 Learning process       Learning is based on the data labelled      Learning is based on unlabelled
                         by the expert to build a prediction        data by looking for similarities
                                       model.                            between these data.
    Advantages            The results obtained are generally           No need for labelled data. So
                            reliable (high accuracy score)                   less expensive.
  Disadvantages            Preparing the labelled data sets            The results are generally less
                         required for learning is very costly             reliable than those of
                                   (time, money).                        unsupervised learning
   Examples of            SVM, KNN, Random-Forest, …                       Kmeans, Dbscan, …
    methods
                         Fig 1.3 Supervised/Unsupervised learning comparison
1.8 DATA AND DATA MINING
Data preparation and selection play a very important role in data mining. The performance of
the projected systems depends on it. Data is organised into data sets. There are three types of
data set handled in data mining: learning data sets, test data sets and validation data sets.
A dataset is a collection of data. It is a coherent set of data that can be presented in different
formats: numerical, textual, video, image or sound data. The dataset is an important part of
machine learning. It is used to teach a model to perform a task or make a prediction.
Several    websites   offer      free       datasets      of     all      kinds,      for    example:
https://www.kaggle.com/datasets.
For the purposes of this course, we are going to use Ross Quinlan's dataset (table 1.1), to
experiment with the different data mining models to be presented.
                    Table 1.1 Quinlan Dataset « weather » (R.Quinlan, 1993)
                    #     Sky Temperature (in F) Humidity Wind                 Play
                    1    sunny      75             70     Yes                  Yes
                    2    sunny      80             90     Yes                   No
                    3    sunny      85             85      No                   No
                    4    sunny      72             95      No                   No
                    5    sunny      69             70      No                  Yes
                    6   overcast    72             90     Yes                  Yes
                    7   overcast    83             78      No                  Yes
                    8   overcast    64             65     Yes                  Yes
                    9   overcast    81             75      No                  Yes
                                                 16
                  #      Sky    Temperature (in F) Humidity Wind       Play
                  10    rainy         71             80     Yes         No
                  11    rainy         65             70     Yes         No
                  12    rainy         75             80      No        Yes
                  13    rainy         68             80      No        Yes
                  14    rainy         70             96      No        Yes
Quinlan's table contains data on the opinions of a group of 14 tennis players. These players
were asked whether they would be willing to play in certain weather conditions (sky
conditions, temperature, humidity and wind). Their opinion (Yes or No) appears in the last
column of the table. Quinlan introduced this table to present the decision tree model, but this
dataset is currently used in several data mining models.
1.8.1 Training Dataset :
The purpose of the training dataset is to teach a machine learning model to make a prediction
or perform a task. In supervised learning, the dataset is made up of one or more variables
(attributes) and an output variable (or target). The aim is to teach the model to correlate the
two.
During this phase, the scientist will adjust the model parameters on the basis of the
comparison between the results generated and the expected result.
1.8.2 Validation dataset :
In machine learning, the purpose of the validation dataset is to validate the architecture of a
learning model. At each iteration of training, it is used to adjust the model. In the case of
classification, the validation dataset can also be used to compare the behaviour of several
types of classifiers with a view to selecting the one with the best performance.
1.8.3 Test Dataset :
As the name suggests, the purpose of the test dataset is to evaluate the final performance of a
machine learning model that has been trained. The predictions obtained are compared with
those expected.
1.8.4 Validation and Cross-validation:
Often, a dataset is divided into two parts: a training part (which may contain 70-80%), and a
test part (containing the remaining rows, i.e. 30-20%).
Validation consists of asking whether, if the model built using the training part (70-80% of
data), will be able to predict the 30-20% of data representing the test data (fig 1.4).
                                              17
                                     Fig 1.4 Validation principle
It is important to note that the 70-30 percent split is not an immutable rule. We can also
choose 80-20 or 75-25 percent, or whatever. But in all cases, the training part must be at least
equal to 70 percent.
Cross-validation consists of repeating the operation described above several times, changing
the content of the training set (70%) and the test set (30%) each time. The aim is to obtain the
best performance from the model to be built.
In concrete terms, cross-validation works as follows [7]:
   1. Split the data into k folds.
   2. Train the model on k-1 of the folds (i.e., train the model on all of the folds, except
      one).
   3. Evaluate the model on the kth holdout fold by computing performance on a metric.
   4. Rotate the folds, and repeat steps 2 and 3, with a new holdout fold. Repeat steps 2 and
      3 until all of the k folds have been used as the holdout fold exactly 1 time.
   5. Average the model performance across all iterations.
Fig 1.5 gives an illustration of a cross-validation with 5 folds.
                                  Fig 1.5 Cross-validation principle
                                                 18
1.9 EVALUATION MEASURES
In the fields of data mining, pattern recognition, information retrieval and automatic
classification, the performance of a method/system is measured by: precision, recall and F-
score. These measures are defined below.
1.9.1 Precision
Precision is one of the most widely used measures of a system's performance. It measures a
system's ability to produce the right predictions. Let's introduce it with an example.
Example: A database contains 80 medical images corresponding to 80 patients. It is assumed
that 40 of these images concern ‘sick’ patients and the other 40 concern ‘healthy’ patients. A
query is run on an automatic diagnostic system to "search for all images concerning sick
patients ". Let's suppose that the system returns 25 images, of which only 20 are relevant (true
positive cases: the patients are really sick), and 5 are not (false positive cases: the patient are
not sick).
We can present the results using what is known as a confusion matrix (which will be defined
below). This is a 2x2 matrix where the rows represent the actual data (exact class data) and
the columns represent the prediction results (fig 1.6).
                                                           Predicted
                                               Positive                Negative
                             Positive    TP (True Positive) FN (False Negative)
                    Actual
                             Negative FP (False Positive) TN (True Negative)
                               Fig 1.6 True-False and Positive-Negative results
Let's define the terms of the matrix:
    TP (True Positive): Predicted class and actual class are both positive.
    TN (True Negative): Predicted class and actual class are both negative
    FP (False Positive): Predicted class is positive but actual class is negative (in other
     words, there is a prediction error).
    FN (False Negative): Predicted class is negative but actual class is positive (there is a
     prediction error.
In the above, terms True | False tells us whether the prediction was right or wrong. The
second part Positive | Negative tells us what was the prediction.
Based on these considerations, precision can be defined as follows:
Definition : Precision
Precision is defined as the proportion of correct (relevant) responses returned out of all the
responses returned (correct or false).
                                                                                               (1)
                                                     19
Using the notation introduced above (TP, TN, FP and FN), we can define Precision in another
equivalent way:
                                                                                       (2)
In our example (medical images), since we have 25 responses produced by the system, 20 of
which are correct and 20 incorrect, we can calculate the precision as follows:
1.9.2 Recall
Recall is another important measure. It measures the completeness of a system's responses.
Has the system produced all the expected correct results?. Here is its definition:
Definition : Recall
Recall is defined as the proportion of correct(relevant) responses proposed out of all correct
and expected responses.
                                                                                       (3)
Using the notation introduced above (TP, TN, FP and FN), we can define Recall in another
equivalent way:
                                                                                       (4)
In our previous example (medical images), we have 20 correct responses, but the all-correct
expected responses are 40. So, recall is equal to:
1.9.3 F-score
Taken separately, measures of precision and recall are not sufficient to evaluate system
performance. That's why another measure has been proposed: the F-score. Here is its
definition:
Definition : Fscore
The f-score measurement combines precision and recall measurements. It is equal to :
                                                                                        (5)
Generally, we take & equal to 1, which is referred to as the F1score. F1 score is harmonic
mean of Precision and recall, which gives the same importance to precision as recall.
In our example (medical images), Recall is equal to :
                                              20
1.9.4 Noise and Silence
Noisy data are data with a large amount of additional meaningless information called noise . it
is calculated as follows :
                                       Noise = 1-Precision.
Conversely, the expression calculated by the following formula is called silence:
                                       Silence = 1 – Recall.
In our example (medical images), noise is equal to (1-0.8) or 20%. Silence is equal to (1-0.5)
or 50%.
1.9.5 Confusion Matrix
The confusion matrix is a summary of the prediction results for a particular classification
problem. It compares the actual data for a target variable with those predicted by a model. The
right and wrong predictions are recorded in a table and broken down by class, allowing them
to be compared with defined values.
In the confusion matrix, each row indicates the number of occurrences of a reference (or real)
class. The columns represent the number of occurrences of an estimated class. Each column
therefore contains a class predicted by the Machine Learning algorithm as well as the rows of
the actual classes.
Example: We used a dataset of 12 photos representing trees. The trees are of three classes: fir
(in French “sapin”), oak (“chêne”) and olive (“Olivier”). There are 4 photos for each class of
tree. A computerised recognition system was used: the photos in the dataset were entered one
by one, and the system gave its prediction of the class of tree: fir, oak or olive. The results are
recorded in the following confusion matrix:
                              Tableau 2 : Example of confusion matrix
         Predicted class
                                   Fir tree (sapin)   Oak tree (chêne)    Olive tree (olivier)
            Real class
          Fir tree (sapin)                1                    1                   2
         Oak tree (chêne)                 3                    0                   1
       Olive tree (olivier)               0                    1                   3
This matrix can be read as follows:
- Remember that the real (exact) classes correspond to the rows, and the estimated (or
predicted) classes correspond to the columns.
- Values on the diagonal correspond to correct results: the predicted class is identical to the
actual class.
- Values outside the diagonal correspond to incorrect predictions. For example, a value of 2 in
the matrix means that the system has assigned 2 photos to the 'Olive Tree' class, which would
normally belong to the 'Fir Tree' class.
The following metrics can be calculated from the matrix:
                                                 21
- The precision for a class is equal to the value on the diagonal divided by the total column.
The precision for "Fir tree" is 1/4 (25%), for "Oak tree" is 0/2 (0%) and for "Olive tree" is 3/6
(75%).
- The recall for a class is equal to the value on the diagonal divided by the row total. The
recall for "Fir tree" is ¼ (25%) , for "Oak tree" is 0/4 (0%), for "Olive tree" is 3/4 (75%).
- The overall precision of the system is equal to the average of the class precisions,
(25+0+50)/3, i.e. 25%.
- The overall recall of the system is equal to the average of the class recalls: (25+0+75)/3, i.e.
33.33%.
The presentation of the confusion matrix introduces two other widely used performance
measures: Accuracy and Error rate.
Definition : Accuracy
Accuracy is equal to the number of correct predictions out of the total number of predictions.
                                                                                    (6)
In our example, Accuracy is equal to : 4/12, i.e 33.33%.
Definition : Error rate
The error rate is equal to the number of incorrect predictions out of the total number of
predictions.
                                                                                    (7)
which can also be written more easily:
                                                                                    (8)
In our example, the error rate: is equal to i.e. 8/12 = 66.67%.
CONCLUSION OF THE CHAPTER
This chapter was an opportunity to present the field of data mining, its objectives and areas of
application. Basic concepts were defined, such as learning, with its two types: supervised and
unsupervised, and also the performance measures used in the literature.
In the next few chapters, we will present some of the most commonly used models in data
mining.
EXERCISES
…….
                                               22