Training Data Selection for Support Vector Machines
Jigang Wang, Predrag Neskovic, and Leon N Cooper
Institute for Brain and Neural Systems,
Physics Department,
Brown University, Providence RI 02912, USA
jigang@brown.edu, pedja@brown.edu, Leon_Cooper@brown.edu,
http://physics.brown.edu/physics/researchpages/Ibns/index.html
Abstract. In recent years, support vector machines (SVMs) have become
a popular tool for pattern recognition and machine learning. Training
an SVM involves solving a constrained quadratic programming problem,
which requires large memory and enormous amounts of training
time for large-scale problems. In contrast, the SVM decision function is
fully determined by a small subset of the training data, called support
vectors. Therefore, it is desirable to remove from the training set the data
that is irrelevant to the final decision function. In this paper we propose
two new methods that select a subset of data for SVM training. Using
real-world datasets, we compare the effectiveness of the proposed data
selection strategies in terms of their ability to reduce the training set size
while maintaining the generalization performance of the resulting SVM
classifiers. Our experimental results show that a significant amount of
training data can be removed by our proposed methods without degrad-
ing the performance of the resulting SVM classifiers.
1 Introduction
Support vector machines (SVMs), introduced by Vapnik and coworkers in the
structural risk minimization (SRM) framework [1–3], have gained wide accep-
tance due to their solid statistical foundation and good generalization perfor-
mance that has been demonstrated in a wide range of applications.
Training an SVM involves solving a constrained quadratic programming (QP)
problem, which requires large memory and takes enormous amounts of training
time for large-scale applications [4]. On the other hand, the SVM decision
function depends only on a small subset of the training data, called support
vectors. Therefore, if one knows in advance which patterns correspond to the
support vectors, the same solution can be obtained by solving a much smaller
QP problem that involves only the support vectors. The problem is then how
to select training examples that are likely to be support vectors. Recently, there
has been considerable research on data selection for SVM training. For example,
Shin and Cho proposed a method that selects patterns near the decision
boundary based on their neighborhood properties [5]. In [6–8], k-means clustering
is employed to select patterns from the training set. In [9], Zhang and King
proposed a β-skeleton algorithm to identify support vectors. In [10], Abe and Inoue
used the Mahalanobis distance to estimate boundary points. In the reduced SVM
(RSVM) setting, Lee and Mangasarian chose a subset of training examples using
random sampling [11]. In [12], it was shown that uniform random sampling is
the optimal robust selection scheme in terms of several statistical criteria.
This work is partially supported by ARO under grant W911NF-04-1-0357. Jigang
Wang is supported by a dissertation fellowship from Brown University.
In this paper, we introduce two new data selection methods for SVM training.
The first method selects training data based on a statistical confidence measure
that we will describe later. The second method uses the minimal distance from
a training example to the training examples of a different class as a criterion
to select patterns near the decision boundary. This method is motivated by
the geometrical interpretation of SVMs based on the (reduced) convex hulls.
To understand how effective these strategies are in terms of their ability to
reduce the training set size while maintaining the generalization performance, we
compare the results obtained by the SVM classifiers trained with data selected
by these two new methods, by random sampling, and by the data selection
method that is based on the distance from a training example to the desired
optimal separating hyperplane. Our comparative study shows that a significant
amount of training data can be removed from the training set by our methods
without degrading the performance of the resulting SVM classifier. We also find
that, despite its simplicity, random sampling performs well and often provides
results comparable to those obtained by the method based on the desired SVM
outputs. Furthermore, in our experiments, we find that incorporating the class
distribution information in the training set often improves the efficiency of the
data selection methods.
The remainder of the paper is organized as follows. In section 2, we give a
brief overview of support vector machines for classification and the correspond-
ing training problem. In section 3, we present the two new methods that select
subsets of training examples for training SVMs. In section 4 we report the exper-
imental results on several real-world datasets. Concluding remarks are provided
in section 5.
2 Related Background
Given a set of training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and
$y_i \in \{-1, 1\}$, support vector machines seek to construct an optimal separating
hyperplane by solving the following quadratic optimization problem:
$$\min_{w,b} \; \frac{1}{2}\langle w, w\rangle + C \sum_{i=1}^{n} \xi_i \tag{1}$$
subject to the constraints:
$$y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i \quad \forall i = 1, \ldots, n, \tag{2}$$
where ξi ≥ 0 for i = 1, . . ., n are slack variables introduced to handle the non-
separable case [2]. The constant C > 0 is a parameter that controls the trade-off
between the separation margin and the number of training errors. Using the
Lagrange multiplier method, one can easily obtain the following Wolfe dual form
of the primal quadratic programming problem:
$$\min_{\alpha_i,\, i=1,\ldots,n} \; \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle - \sum_{i=1}^{n} \alpha_i \tag{3}$$
subject to
$$0 \le \alpha_i \le C \quad \forall i = 1, \ldots, n \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{4}$$
Solving the dual problem, one obtains the multipliers αi , i = 1, . . ., n, which give
w as an expansion
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i. \tag{5}$$
According to the Karush-Kuhn-Tucker (KKT) optimality conditions, we have
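$$\begin{aligned}
\alpha_i = 0 \;&\Rightarrow\; y_i(\langle w, x_i\rangle + b) \ge 1 \text{ and } \xi_i = 0 \\
0 < \alpha_i < C \;&\Rightarrow\; y_i(\langle w, x_i\rangle + b) = 1 \text{ and } \xi_i = 0 \\
\alpha_i = C \;&\Rightarrow\; y_i(\langle w, x_i\rangle + b) \le 1 \text{ and } \xi_i \ge 0.
\end{aligned}$$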
Therefore, only αi that correspond to training examples xi which lie either on
the margin or inside the margin area are non-zero. All the remaining αi are zero
and the corresponding training examples are irrelevant to the final solution.
Knowing the normal vector $w$, the bias term $b$ can be determined from the
KKT conditions $y_i(\langle w, x_i\rangle + b) = 1$ for $0 < \alpha_i < C$. This subsequently leads to
the linear decision function $f(x) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i \langle x, x_i\rangle + b\right)$.
In practice, linear decision functions are generally not rich enough for pattern
separation. To allow for more general decision surfaces, one can apply the kernel
trick by replacing the inner products $\langle x_i, x_j\rangle$ in the dual problem with suitable
kernel functions $k(x_i, x_j)$. Effectively, support vector machines implicitly map
training vectors $x_i$ in $\mathbb{R}^d$ to feature vectors $\Phi(x_i)$ in some high-dimensional
feature space $\mathbb{F}$ such that inner products in $\mathbb{F}$ are defined as
$\langle \Phi(x_i), \Phi(x_j)\rangle = k(x_i, x_j)$. Consequently, the optimal hyperplane in the
feature space $\mathbb{F}$ represents a nonlinear decision function of the form
$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i k(x, x_i) + b\right). \tag{6}$$
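As an illustration of Eq. (6), the following minimal Python sketch evaluates the kernel decision function for a Gaussian kernel. It is not the implementation used in this paper: the function names, the width parameter gamma, and the convention of returning +1 when the sum is exactly zero are assumptions made for the example, and the multipliers in the toy usage are chosen by hand rather than obtained from a QP solver.

```python
import math


def gaussian_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2); gamma is an assumed width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_function(x, support_x, support_y, alphas, b, kernel=gaussian_kernel):
    """Evaluate Eq. (6): sgn(sum_i alpha_i * y_i * k(x, x_i) + b)."""
    s = sum(a * y * kernel(x, xi)
            for a, y, xi in zip(alphas, support_y, support_x)) + b
    # Assumed tie-breaking convention: a sum of exactly zero maps to +1.
    return 1 if s >= 0 else -1


# Toy usage with hand-picked (not optimized) multipliers.
support_x = [(0.0, 0.0), (2.0, 0.0)]
support_y = [1, -1]
alphas = [1.0, 1.0]
label_near_positive = decision_function((0.1, 0.0), support_x, support_y, alphas, 0.0)
label_near_negative = decision_function((1.9, 0.0), support_x, support_y, alphas, 0.0)
```

Only examples with non-zero multipliers contribute to the sum, which is why the function takes the support vectors rather than the whole training set.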
To train an SVM classifier, one therefore needs to solve the dual quadratic
programming problem (3) under the constraints (4). For a small training set,
standard QP solvers, such as CPLEX, LOQO, MINOS and Matlab QP routines,
can be readily used to obtain the solution. However, for a large training set, they
quickly become intractable because of the large memory requirements and the
enormous amounts of training time involved. To alleviate the problem, a number
of solutions have been proposed by exploiting the sparsity of the SVM solution
and the KKT conditions.
The first such solution, known as chunking [13], uses the fact that only the
support vectors are relevant for the final solution. At each step, chunking solves
a QP problem that consists of all non-zero Lagrange multipliers αi from the last
step and some of the αi that violate the KKT conditions. The size of the QP
problem varies but finally equals the number of non-zero Lagrange multipliers.
At the last step, the entire set of non-zero Lagrange multipliers is identified and
the QP problem is solved. Another solution, proposed in [14], solves the large
QP problem by breaking it down into a series of smaller QP sub-problems. This
decomposition method is justified by the observation that solving a sequence of
QP sub-problems that always contain at least one training example that violates
the KKT conditions will eventually lead to the optimal solution. Recently, a
method called sequential minimal optimization (SMO) was proposed by Platt
[15], which approaches the problem by iteratively solving a QP sub-problem of
size 2. The key idea is that a QP sub-problem of size 2 can be solved analytically
without invoking a quadratic optimizer. This method has been reported to be
several orders of magnitude faster than the classical chunking algorithm.
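To make the analytic size-2 step concrete, the sketch below shows one joint update of two multipliers in the commonly stated form of SMO. It is a simplified illustration, not the implementation used in our experiments: the function name and argument layout are assumptions, E1 and E2 denote the errors u_i − y_i of the raw (unthresholded) SVM output u_i, and the bias update and the heuristics for choosing the pair are omitted.

```python
def smo_pair_update(a1, a2, y1, y2, E1, E2, K11, K12, K22, C):
    """One analytic update of two multipliers that keeps sum_i alpha_i * y_i fixed.

    E1, E2 are errors (raw SVM output minus label) and Kij are kernel values.
    Returns the new (a1, a2); the inputs are returned unchanged when no
    progress is possible in this simplified sketch.
    """
    # Box constraints 0 <= alpha <= C restricted to the equality-constraint line.
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    eta = K11 + K22 - 2.0 * K12  # curvature of the objective along the line
    if eta <= 0 or low >= high:
        return a1, a2
    a2_new = a2 + y2 * (E1 - E2) / eta      # unconstrained optimum
    a2_new = min(high, max(low, a2_new))    # clip to the feasible segment
    a1_new = a1 + y1 * y2 * (a2 - a2_new)   # restore the equality constraint
    return a1_new, a2_new


# Toy usage: two examples of opposite class, both multipliers starting at zero.
new_a1, new_a2 = smo_pair_update(0.0, 0.0, 1, -1, -1.0, 1.0, 1.0, 0.0, 1.0, C=1.0)
```

Because the update needs only a handful of arithmetic operations, no numerical QP routine is invoked in the inner loop.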
All the above training methods make use of the whole training set. However,
according to the KKT optimality conditions, the final separating hyperplane is
fully determined by the support vectors. In many real-world applications, the
number of support vectors is expected to be much smaller than the total number
of training examples. Therefore, the speed of SVM training will be significantly
improved if only the set of support vectors is used for training, and the solution
will be exactly the same as if the whole training set were used.
In theory, one has to solve the full QP problem in order to identify the sup-
port vectors. However, it is easy to see that the support vectors are training
examples that are close to decision boundaries. Therefore, if there exists a com-
putationally efficient way to find a small set of training data such that with high
probability it contains the desired support vectors, the speed of SVM training
will be improved without degrading the generalization performance. The size
of the reduced training set can still be larger than the set of desired support
vectors. However, as long as its size is much smaller than the size of the total
training set, the SVM training speed will be significantly improved because most
SVM training algorithms scale quadratically with the training set size on many
problems [4]. In the next section, we propose two new data selection strategies to
explore this possibility.
3 Training Data Selection for Support Vector Machines
3.1 Data Selection based on Confidence Measure
A good heuristic for identifying boundary points is the number of training ex-
amples that are contained in the largest sphere centered at a training example
without covering an example of a different class.
Centered at each training example xi , let us draw a sphere that is as large as
possible without covering a training example of a different class and count the
number of training examples that fall inside the sphere. We denote this number
by N (xi). Obviously, the larger the number N (xi ), the more training examples
(of the same class as xi ) will be scattered around xi, the less likely xi will be close
to the decision boundary, and the less likely xi will be a support vector. Hence,
this number can be used as a criterion to decide which training examples should
belong to the reduced training set. For each training example xi, we compute
N(xi), sort the training data by this value, and choose the examples with the
smallest N(xi) as the reduced training set. It can be shown that N(xi) is related
to the statistical confidence that can be associated with the class label yi of the
training example xi. For this reason, we call this data selection scheme
confidence measure-based training set selection.
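The counting procedure described above can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the code used in our experiments: it uses the Euclidean distance, counts only examples strictly inside the sphere, does not count the center itself, and the helper names and the rounding of the subset size are choices made for the example. Every class is assumed to have at least one example of a different class.

```python
import math


def confidence_counts(X, y):
    """N(x_i): number of other training examples strictly inside the largest
    sphere centered at x_i that covers no example of a different class."""
    counts = []
    for i, xi in enumerate(X):
        dists = [math.dist(xi, xj) for xj in X]
        # Radius = distance to the nearest example with a different label.
        radius = min(d for d, yj in zip(dists, y) if yj != y[i])
        counts.append(sum(1 for j, d in enumerate(dists) if j != i and d < radius))
    return counts


def select_smallest(scores, fraction):
    """Indices of the ceil(fraction * n) examples with the smallest scores."""
    k = math.ceil(fraction * len(scores))
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


# Toy usage: six points on a line, three per class.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y = [1, 1, 1, -1, -1, -1]
counts = confidence_counts(X, y)
reduced = select_smallest(counts, 0.25)
```

In the toy data the two examples adjacent to the class boundary receive the smallest counts and are therefore selected first.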
3.2 Data Selection based on Hausdorff Distance
Our second data selection strategy is based on the Hausdorff distance. In the
separable case, it has been shown that the optimal SVM separating hyperplane
is identical to the hyperplane that bisects the line segment which connects the
two closest points of the convex hulls of the positive and of the negative training
examples [16, 17]. The problem of finding the two closest points in the convex
hulls can be formulated as
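$$\min_{z^+, z^-} \; \|z^+ - z^-\|^2 \tag{7}$$
subject to
$$z^+ = \sum_{i: y_i = 1} \alpha_i x_i \quad \text{and} \quad z^- = \sum_{i: y_i = -1} \alpha_i x_i, \tag{8}$$
where $\alpha_i \ge 0$ satisfies the constraints $\sum_{i: y_i = 1} \alpha_i = 1$ and $\sum_{i: y_i = -1} \alpha_i = 1$.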
Based on this geometrical interpretation, the support vectors are the training
examples that are vertices of the convex hulls that are closest to the convex hull
of the training examples from the opposite class. For the non-separable case, a
similar result holds by replacing the convex hulls with the reduced convex hulls
[16, 17]. Therefore, a good heuristic that can be used to determine whether a
training example is likely to be a support vector is the distance to the convex
hull of the training examples of the opposite class. Computing the distance from
a training example xi to the convex hull of the training examples of the opposite
class involves solving a smaller quadratic programming problem. To simplify
the computation, the distance from a training example to the closest training
examples of the opposite class can be used as an approximation. We denote the
minimal distance as
$$d(x_i) = \min_{j: y_j \ne y_i} \|x_i - x_j\|, \tag{9}$$
which is also the Hausdorff distance between the training example xi and the set
of training examples that belong to a different class. To select a subset of training
examples, we sort the training set according to d(xi) and select examples with
the smallest Hausdorff distances d(xi) as the reduced training set. This method
will be referred to as the Hausdorff distance-based selection method.
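A minimal Python sketch of the criterion in Eq. (9) follows. It is a brute-force illustration rather than the code used in our experiments; the function name is hypothetical, the Euclidean norm is assumed, and every example is assumed to have at least one example of a different class.

```python
import math


def opposite_class_distance(X, y):
    """d(x_i) of Eq. (9): distance to the closest example with a different label."""
    return [
        min(math.dist(xi, xj) for xj, yj in zip(X, y) if yj != yi)
        for xi, yi in zip(X, y)
    ]


# Toy usage: six points on a line, three per class.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y = [1, 1, 1, -1, -1, -1]
d = opposite_class_distance(X, y)
# Keep the two examples with the smallest distances as the reduced training set.
reduced = sorted(range(len(X)), key=d.__getitem__)[:2]
```

The pairwise search costs on the order of n^2 distance evaluations, which is the price paid for avoiding the QP problem that the exact distance to the convex hull would require.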
3.3 Data Selection based on Random Sampling and Desired SVM Outputs
To study the effectiveness of the proposed data selection strategies, we compare
them to two other strategies. One is random sampling and the other is a data
selection strategy based on the distance from the training examples to the desired
separating hyperplane.
The random sampling strategy simply selects a small portion of the training
data to form the reduced training set uniformly at random. This method is
straightforward to implement and requires no extra computation. The other
data selection strategy we compare our methods to is implemented as follows.
Given the training set and the parameter setting, we solve the full QP problem
to obtain the desired separating hyperplane. Then for each training example
xi, we compute its distance to the desired separating hyperplane as:
$$f(x_i) = y_i\left(\sum_{j=1}^{n} \alpha_j y_j k(x_i, x_j) + b\right). \tag{10}$$
Note that Eq. (10) has taken into account the class information and training
examples that are misclassified by the desired separating hyperplane will have
negative distances. According to the KKT conditions, support vectors are train-
ing examples that have relatively small values of distance f(xi ). We sort the
training examples according to their distances to the separating hyperplane and
select a subset of training examples with the smallest distances as the reduced
training set. This strategy, although impractical because one needs to solve the
full QP problem first, is ideal for comparison purposes as the distance from
a training example to the desired separating hyperplane provides the optimal
criterion for selecting the support vectors.
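Given a solved SVM, the score in Eq. (10) can be computed as in the following Python sketch. The function name and the representation of the kernel matrix as a list of rows are assumptions made for the example; the multipliers in the toy usage are the hand-derived hard-margin solution for four points on a line with a linear kernel, not output from our SMO implementation.

```python
def margin_scores(K, y, alphas, b):
    """Eq. (10): f(x_i) = y_i * (sum_j alpha_j * y_j * k(x_i, x_j) + b).

    K is the precomputed kernel (Gram) matrix given as a list of rows.
    """
    n = len(y)
    return [
        y[i] * (sum(alphas[j] * y[j] * K[i][j] for j in range(n)) + b)
        for i in range(n)
    ]


# Toy usage: 1-D points with a linear kernel k(x, z) = x * z.
points = [-2.0, -1.0, 1.0, 2.0]
y = [-1, -1, 1, 1]
K = [[p * q for q in points] for p in points]
alphas = [0.0, 0.5, 0.5, 0.0]  # hand-derived solution: w = 1, b = 0
scores = margin_scores(K, y, alphas, 0.0)
# Examples with the smallest scores are selected first.
order = sorted(range(len(scores)), key=scores.__getitem__)
```

The two examples with non-zero multipliers obtain the score 1, in agreement with the KKT conditions, while the remaining examples score higher and would be selected last.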
4 Results and Discussion
In this section we report experimental results on several real-world datasets from
the UCI Machine Learning repository [18]. The SVM training algorithm was
implemented based on the SMO method. For all datasets, Gaussian kernels were
used and the generalization error of the SVMs was estimated using the 5-fold
cross-validation method. For each training set, according to the data selection
method used, a portion of the training set (ranging from 10 to 100 percent) was
selected as the reduced training set to train the SVM classifier. The error rate
reported is the average error rate of the resulting SVM classifiers on the test sets
over the 5 iterations. Due to the space limit, only results on three datasets will
be presented.
Note that when the data selection method is based on the desired SVM
outputs, the SVM training procedure has to be run twice in each iteration. First,
an SVM classifier is trained with the full training set to obtain the desired
separating hyperplane. Then a portion of the training examples in the training
set is selected to form the reduced training set based on their distances to the
desired separating hyperplane (see Eq. (10)). Finally, a second SVM classifier
is trained with the reduced training set.
Given a training set and a particular data selection criterion, there are two
ways to form the reduced training set. One can either select training examples
regardless of which classes they belong to or select training examples from each
class separately while maintaining the class distribution. It was found in our
experiments that selecting training examples from each class separately often
improves the classification accuracy of the resulting SVM classifiers. Therefore,
we only report results in this case.
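The class-wise variant can be sketched as follows. This is an illustration under assumptions, not the code used in our experiments: the function name is hypothetical, lower scores are taken to mean "select first" (as for N(xi), d(xi), and f(xi)), and the per-class subset size is rounded up.

```python
import math
from collections import defaultdict


def select_per_class(scores, y, fraction):
    """Keep the given fraction of each class, taking the lowest scores first."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    selected = []
    for idx in by_class.values():
        k = math.ceil(fraction * len(idx))  # assumed rounding rule
        selected.extend(sorted(idx, key=lambda i: scores[i])[:k])
    return sorted(selected)


# Toy usage: an unbalanced set in which one class has uniformly lower scores.
scores = [0.1, 0.2, 0.3, 0.4, 5.0, 6.0]
y = [1, 1, 1, 1, -1, -1]
reduced = select_per_class(scores, y, 0.5)
```

A class-blind selection of the three lowest scores in this toy example would return only examples of class 1, whereas the class-wise rule keeps the original 2:1 class ratio in the reduced set.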
Table 1 shows the error rates of SVMs on the Wisconsin Breast Cancer
dataset when trained with the reduced training sets of various sizes selected by
the four different data selection methods. This dataset consists of 683 examples
from two classes (excluding the 16 examples with missing attribute values). Each
example has 8 attributes. The size of the training set in each iteration is 547 and
the size of the test set is 136. The average number of support vectors is 238.6,
which is 43.62% of the training set size.
Table 1. Error rates (%) of SVMs on the Breast Cancer dataset when trained with reduced
training sets of various sizes
Percent   Confidence   Hausdorff   Random   SVM
10        34.26        5.44        5.44     33.38
20        4.12         7.65        5.15     4.56
30        3.53         5.59        4.71     3.97
40        3.82         5.44        5.00     3.68
50        3.82         5.44        5.00     3.82
60        3.97         5.15        4.41     3.97
70        3.97         4.85        4.12     3.97
80        4.12         4.85        4.26     3.97
90        3.82         4.56        4.41     3.82
100       3.82         3.82        3.82     3.82
From Table 1 one can see that a significant amount of data can be removed
from the training set without degrading the performance of the resulting SVM
classifier. When more than 10% of the training data is selected, the confidence-based
data selection method outperforms the Hausdorff distance-based method
and random sampling. Its performance is actually as good as that of the method
based on the desired SVM outputs. The method based on the Hausdorff distance
gives the worst results. When the data reduction rate is high, e.g., when only
10 percent of the training data is selected, the results obtained by the Hausdorff
distance-based method and random sampling are much better than those based on
the confidence measure and the desired SVM outputs.
Table 2 shows the corresponding results obtained on the BUPA Liver dataset,
which consists of 345 examples, with each example having 6 attributes. The sizes
of the training and test sets in each iteration are 276 and 69, respectively. The
average number of support vectors is 222.2, which is 80.51% of the size of the
training sets. Interestingly, as we can see, the method based on the desired
SVM outputs has the worst overall results. When less than 80% of the data is
selected for training, the Hausdorff distance-based method and random sampling
have similar performance and outperform the methods based on the confidence
measure and the desired SVM outputs.
Table 2. Error rates (%) on the BUPA Liver dataset
Percent   Confidence   Hausdorff   Random   SVM
10        42.90        39.71       39.13    63.19
20        44.06        38.55       33.33    62.90
30        41.16        33.62       33.33    51.01
40        40.00        33.62       30.43    45.80
50        40.00        33.62       31.30    42.61
60        35.94        32.75       32.75    42.32
70        33.91        33.33       32.17    37.68
80        31.01        31.88       32.46    32.46
90        31.59        30.72       33.04    31.30
100       31.30        31.30       31.30    31.30
Table 3 provides the results on the Ionosphere dataset, which has a total of
351 examples, with each example having 34 attributes. The sizes of the training
and test sets in each iteration are 281 and 70, respectively. The average number
of support vectors is 159.8, which is 56.87% of the size of the training sets. From
Table 3 we see that the data selection method based on the desired SVM outputs
gives the best results when more than 20% of the data is selected. When more
than 50% of the data is selected, the results of the confidence-based method are
very close to the best achievable results. However, when the reduction rate is
high, the performance of random sampling is the best. The Hausdorff distance-
based method has the worst overall results.
Table 3. Error rates (%) on the Ionosphere dataset
Percent   Confidence   Hausdorff   Random   SVM
10        26.29        35.71       16.29    33.14
20        21.43        25.71       11.14    22.57
30        18.57        24.00       8.57     6.86
40        11.43        24.00       8.00     6.00
50        7.43         21.43       7.14     5.71
60        6.00         18.86       7.14     5.71
70        5.71         16.00       6.57     6.00
80        5.14         10.29       6.00     6.00
90        6.00         6.57        6.00     5.71
100       5.71         5.71        5.71     5.71
An interesting finding of the experiments is that the performance of the
SVM classifiers deteriorates significantly when the reduction rate is high, e.g.,
when the size of the reduced training set is much smaller than the number of
the desired support vectors. This is especially true for data selection strategies
that are based on the desired SVM outputs and the proposed heuristics. On the
other hand, the effect is less significant for random sampling, as we have seen
that random sampling usually has better relative performance at higher data
reduction rates. From a theoretical point of view, this is not surprising because
when only a subset of the support vectors is chosen as the reduced training set,
there is no guarantee that the solution of the reduced QP problem will still be
the same. In fact, if the reduction rate is high and the criterion is based on
the desired SVM outputs or the proposed heuristics, the reduced training set
is likely to be dominated by ’outliers’, therefore leading to worse classification
performance. To overcome this problem, we can remove those training examples
that lie far inside the margin area since they are likely to be ’outliers’. For the
data selection strategy based on the desired SVM outputs, it means that we can
discard part of the training data that has extremely small values of the distance
to the desired separating hyperplane (see Eq. (10)). For the methods based on
the confidence measure and Hausdorff distance, we can similarly discard the part
of the training data that has extremely small values of N (xi ) and the Hausdorff
distance.
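The remedy described above amounts to skipping the lowest-ranked examples before selecting the reduced training set. The Python sketch below illustrates the idea for any of the three criteria; the function name, the parameter trim_fraction, its default value, and the rounding rules are hypothetical choices for the example, since the fraction of discarded data is not specified here.

```python
import math


def select_with_trimming(scores, fraction, trim_fraction=0.05):
    """Drop the trim_fraction lowest-scoring examples (likely outliers), then
    keep the next ceil(fraction * n) examples in increasing order of score."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    skip = math.floor(trim_fraction * n)  # number of likely outliers to discard
    keep = math.ceil(fraction * n)        # size of the reduced training set
    return order[skip:skip + keep]


# Toy usage: the first example has a strongly negative score, i.e., it is
# misclassified by a wide margin and is treated as a likely outlier.
scores = [-3.0, 0.2, 0.5, 0.9, 1.0, 1.4, 2.0, 2.5]
reduced = select_with_trimming(scores, 0.25, trim_fraction=0.125)
```

With trim_fraction set to zero the function reduces to the plain smallest-score selection used for Tables 1 to 3.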
In Table 4 we show the results of the proposed solution on the Breast Cancer
dataset. Comparing Tables 1 and 4, it is easy to see that, when only a very small
subset of the training data (compared to the number of the desired support vec-
tors) is selected for SVM training, removing training patterns that are extremely
close to the decision boundary according to the confidence measure or accord-
ing to the underlying SVM outputs significantly improves the performance of
the resulting SVM classifiers. The effect is less obvious for the methods based
on the Hausdorff distance and random sampling. Similar results have also been
observed on other datasets but are not reported here due to the space limit.
Table 4. Error rates (%) on the Breast Cancer dataset
Percent   Confidence   Hausdorff   Random   SVM
10        5.74         7.94        5.88     4.56
20        4.26         5.59        4.71     4.71
30        4.12         5.44        4.71     4.71
40        4.12         5.15        4.85     4.56
50        4.26         5.74        5.15     4.26
60        4.12         5.15        4.56     4.41
70        3.97         5.29        4.26     4.26
80        3.82         5.29        4.41     4.26
90        3.82         4.71        4.41     4.12
100       3.82         3.82        3.82     3.82
5 Conclusion
In this paper we presented two new data selection methods for SVM training.
To analyze their effectiveness in terms of their ability to reduce the training data
while maintaining the generalization performance of the resulting SVM classifiers,
we conducted a comparative study using several real-world datasets. More
specifically, we compared the results obtained by these two new methods with
the results of the simple random sampling scheme and the results obtained by the
selection method based on the desired SVM outputs. Through our experiments,
several important observations have been made: (1) In many applications, signif-
icant data reduction can be achieved without degrading the performance of the
SVM classifiers. For that purpose, the performance of the confidence measure-
based selection method is often comparable to or better than the performance of
the method based on the desired SVM outputs. (2) When the reduction rate is
high, some of the training examples that are ‘extremely’ close to the decision
boundary have to be removed in order to maintain the generalization performance of
the resulting SVM classifiers. (3) In spite of its simplicity, random sampling per-
forms consistently well, especially when the reduction rate is high. However, at
low reduction rates, random sampling performs noticeably worse compared to
the confidence measure-based method. (4) When conducting training data se-
lection, sampling training data from each class separately according to the class
distribution often improves the performance of the resulting SVM classifiers.
By directly comparing various data selection schemes with the scheme based
on the desired SVM outputs, we are able to conclude that the confidence measure
provides a criterion for training data selection that is almost as good as the
optimal criterion based on the desired SVM outputs. At high reduction rates, by
removing training data that are likely to be outliers, we boost the performance of
the resulting SVM classifiers. Random sampling performs consistently well in our
experiments, which is consistent with the results obtained by Syed et al. in [19]
and the theoretical analysis of Huang and Lee in [12]. The robustness of random
sampling at high reduction rates suggests that, although an SVM classifier is
fully determined by the support vectors, the generalization performance of an
SVM is less reliant on the choice of training data than it appears to be.
References
1. Boser, B. E., Guyon, I. M., Vapnik, V. N.: A training algorithm for optimal margin
classifiers. In: Haussler, D. (ed.): Proceedings of the 5th Annual ACM Workshop
on Computational Learning Theory (1992) 144–152
2. Cortes, C., Vapnik, V. N.: Support vector networks. Machine Learning. 20 (1995)
273–297
3. Vapnik, V. N.: Statistical Learning Theory. Wiley, New York, NY (1998)
4. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges,
C. J. C., Smola, A. J. (eds.): Advances in Kernel Methods - Support Vector Learn-
ing. MIT Press, Cambridge, MA (1999) 169–184
5. Shin, H. J., Cho, S. Z.: Fast pattern selection for support vector classifiers. In:
Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data
Mining. Lecture Notes in Artificial Intelligence (LNAI 2637) (2003) 376–387
6. Almeida, M. B., Braga, A. P., Braga, J. P.: SVM-KM: speeding SVMs learning
with a priori cluster selection and k-means. In: Proceedings of the 6th Brazilian
Symposium on Neural Networks (2000) 162–167
7. Zheng, S. F., Lu, X. F., Zheng, N. N., Xu, W. P.: Unsupervised clustering based re-
duced support vector machines. In: Proceedings of IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP) 2 (2003) 821–824
8. Koggalage, R., Halgamuge, S.: Reducing the number of training samples for fast
support vector machine classification. Neural Information Processing - Letters and
Reviews 2(3) (2004) 57–65
9. Zhang, W., King, I.: Locating support vectors via β-skeleton technique. In:
Proceedings of the International Conference on Neural Information Processing
(ICONIP) (2002) 1423–1427
10. Abe, S., Inoue, T.: Fast training of support vector machines by extracting boundary
data. In: Proceedings of the International Conference on Artificial Neural Networks
(ICANN) (2001) 308–313
11. Lee, Y. J., Mangasarian, O. L.: RSVM: Reduced support vector machines. In:
Proceedings of the First SIAM International Conference on Data Mining (2001)
12. Huang, S. Y., Lee, Y. J.: Reduced support vector machines: a statistical the-
ory. Technical report, Institute of Statistical Science, Academia Sinica, Taiwan.
http://www.stat.sinica.edu.tw/syhuang/ (2004)
13. Vapnik, V. N.: Estimation of Dependences Based on Empirical Data. Springer-
Verlag, Berlin (1982)
14. Osuna, E., Freund, R., Girosi, F.: Support vector machines: training and applica-
tions. A.I. Memo AIM - 1602, MIT A.I. Lab. (1996)
15. Platt, J.: Fast training of support vector machines using sequential minimal op-
timization. In: Schölkopf, B., Burges, C. J. C., Smola, A. J. (eds.): Advances in
Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999)
185–208
16. Bennett, K. P., Bredensteiner, E. J.: Duality and geometry in SVM classifiers. In:
Proceedings of 17th International Conference on Machine Learning. (2000) 57–64
17. Crisp, D. J., Burges, C. J. C.: A geometric interpretation of nu-svm classifiers.
Advances in Neural Information Processing Systems. 12 (1999)
18. Blake, C. L., Merz, C. J.: UCI Repository of machine learning databases.
http://www.ics.uci.edu/~mlearn/MLRepository.html (1998)
19. Syed, N. A., Liu, H., Sung, K. K.: A study of support vectors on model independent
example selection. In: Proceedings of the Workshop on Support Vector Machines
at the International Joint Conference on Artificial Intelligence. (1999)