Support Vector Machines for Histogram-Based Image Classification

Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik

Manuscript received January 21, 1999; revised April 30, 1999. The authors are with the Speech and Image Processing Services Research Laboratory, AT&T Labs-Research, Red Bank, NJ 07701 USA.

Abstract— Traditional classification approaches generalize poorly on image classification tasks, because of the high dimensionality of the feature space. This paper shows that support vector machines (SVM's) can generalize well on difficult image classification problems where the only features are high-dimensional histograms. Heavy-tailed RBF kernels of the form $K(\mathbf{x}, \mathbf{y}) = e^{-\rho \sum_i |x_i^a - y_i^a|^b}$ with $a \le 1$ and $b \le 2$ are evaluated on the classification of images extracted from the Corel stock photo collection and shown to far outperform traditional polynomial or Gaussian radial basis function (RBF) kernels. Moreover, we observed that a simple remapping of the input $x_i \to x_i^a$ improves the performance of linear SVM's to such an extent that it makes them, for this problem, a valid alternative to RBF kernels.

Index Terms— Corel, image classification, image histogram, radial basis functions, support vector machines.

I. INTRODUCTION

LARGE collections of images are becoming available to the public, from photo collections to Web pages or even video databases. To index or retrieve them is a challenge which is the focus of many research projects (for instance IBM's QBIC [1]). A large part of this research work is devoted to finding suitable representations for the images, and retrieval generally involves comparisons of images. In this paper, we choose to use color histograms as an image representation because of the reasonable performance that can be obtained in spite of their extreme simplicity [2]. Using this histogram representation, our initial goal is to perform generic object classification with a "winner takes all" approach: find the one category of object that is the most likely to be present in a given image.

From classification trees to neural networks, there are many possible choices for what classifier to use. The support vector machine (SVM) approach is considered a good candidate because of its high generalization performance without the need to add a priori knowledge, even when the dimension of the input space is very high. Intuitively, given a set of points which belong to either one of two classes, a linear SVM finds the hyperplane leaving the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. According to [3], this hyperplane minimizes the risk of misclassifying examples of the test set.

This paper follows an experimental approach, and its organization unfolds as increasingly better results are obtained through modifications of the SVM architecture. Section II provides a brief introduction to SVM's. Section III describes the image recognition problem on Corel photo images. Section IV compares SVM and KNN-based recognition techniques which are inspired by previous work. From these results, Section V explores novel techniques, by either selecting the SVM kernel or remapping the input, that provide high image recognition performance with low computational requirements.

II. SUPPORT VECTOR MACHINES
A. Optimal Separating Hyperplanes

We give in this section a very brief introduction to SVM's. Let $\{(\mathbf{x}_i, y_i)\}_{1 \le i \le N}$ be a set of training examples, each example $\mathbf{x}_i \in \mathbb{R}^d$, $d$ being the dimension of the input space, belonging to a class labeled by $y_i \in \{-1, 1\}$. The aim is to define a hyperplane which divides the set of examples such that all the points with the same label are on the same side of the hyperplane. This amounts to finding $\mathbf{w}$ and $b$ so that

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 0, \quad i = 1, \ldots, N.$  (1)

If there exists a hyperplane satisfying (1), the set is said to be linearly separable. In this case, it is always possible to rescale $\mathbf{w}$ and $b$ so that

$\min_{1 \le i \le N} y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$

i.e., so that the distance between the closest point and the hyperplane is $1/\|\mathbf{w}\|$. Then, (1) becomes

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \ldots, N.$  (2)

Among the separating hyperplanes, the one for which the distance to the closest point is maximal is called the optimal separating hyperplane (OSH). Since the distance to the closest point is $1/\|\mathbf{w}\|$, finding the OSH amounts to minimizing $\|\mathbf{w}\|^2$ under constraints (2). The quantity $2/\|\mathbf{w}\|$ is called the margin, and thus the OSH is the separating hyperplane which maximizes the margin. The margin can be seen as a measure of the generalization ability: the larger the margin, the better the generalization is expected to be [4], [5].

Since $\|\mathbf{w}\|^2$ is convex, minimizing it under linear constraints (2) can be achieved with Lagrange multipliers. If we denote by $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_N)$ the nonnegative Lagrange multipliers associated with constraints (2), our optimization problem amounts to maximizing

$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$  (3)

with $\alpha_i \ge 0$ and under constraint $\sum_{i=1}^{N} y_i \alpha_i = 0$. This can be achieved by the use of standard quadratic programming methods [6].

Once the vector $\boldsymbol{\alpha}^0 = (\alpha_1^0, \ldots, \alpha_N^0)$ solution of the maximization problem (3) has been found, the OSH $(\mathbf{w}_0, b_0)$ has the following expansion:

$\mathbf{w}_0 = \sum_{i=1}^{N} \alpha_i^0 y_i \mathbf{x}_i.$  (4)

The support vectors are the points for which $\alpha_i^0 > 0$; they satisfy (2) with equality. Considering the expansion (4) of $\mathbf{w}_0$, the hyperplane decision function can thus be written as

$f(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{N} \alpha_i^0 y_i \, \mathbf{x}_i \cdot \mathbf{x} + b_0 \right).$  (5)

B. Linearly Nonseparable Case

When the data is not linearly separable, we introduce slack variables $(\xi_1, \ldots, \xi_N)$ with $\xi_i \ge 0$ [7] such that

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, N$  (6)

to allow the possibility of examples that violate (2). The purpose of the variables $\xi_i$ is to allow misclassified points, which have their corresponding $\xi_i > 1$. Therefore $\sum_i \xi_i$ is an upper bound on the number of training errors. The generalized OSH is then regarded as the solution of the following problem: minimize

$\frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{N} \xi_i$  (7)

subject to constraints (6) and $\xi_i \ge 0$. The first term is minimized to control the learning capacity as in the separable case; the purpose of the second term is to control the number of misclassified points. The parameter $C$ is chosen by the user, a larger $C$ corresponding to assigning a higher penalty to errors.

SVM training requires fixing $C$ in (7), the penalty term for misclassifications. When dealing with images, most of the time, the dimension of the input space is large ($\ge 1000$) compared to the size of the training set, so that the training data is generally linearly separable. Consequently, the value of $C$ has in this case little impact on performance.
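To make the soft-margin formulation (6), (7) concrete, the following minimal sketch trains a linear SVM on synthetic data with a large error penalty $C$, in the spirit of the experiments reported later. It uses scikit-learn as a stand-in for the solver of [17], [18]; the data, dimensions, and parameter values are illustrative assumptions, not the paper's.

```python
# Illustrative sketch only: a soft-margin linear SVM in the sense of (6)-(7),
# with scikit-learn standing in for the solver of [17], [18].
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Toy two-class data in place of image histograms (hypothetical).
X = np.vstack([rng.randn(50, 4096) + 0.5, rng.randn(50, 4096) - 0.5])
y = np.hstack([np.ones(50), -np.ones(50)])

# C is the error penalty of (7); a large C pushes toward full separability,
# which is usually reachable for high-dimensional histogram data.
clf = SVC(kernel="linear", C=100.0)
clf.fit(X, y)

# Decision function (5): sign of w0.x + b0, expanded over the support vectors.
print(clf.decision_function(X[:5]))
print("number of support vectors:", clf.n_support_.sum())
```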
C. Nonlinear Support Vector Machines

The input data is mapped into a high-dimensional feature space through some nonlinear mapping $\Phi$ chosen a priori [8]. In this feature space, the OSH is constructed. If we replace $\mathbf{x}$ by its mapping in the feature space $\Phi(\mathbf{x})$, (3) becomes

$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j).$

If we have $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$, then only $K$ is needed in the training algorithm and the mapping $\Phi$ is never explicitly used. Conversely, given a symmetric positive kernel $K(\mathbf{x}, \mathbf{y})$, Mercer's theorem [3] indicates that there exists a mapping $\Phi$ such that $K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$.

Once a kernel $K$ satisfying Mercer's condition has been chosen, the training algorithm consists of maximizing

$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j)$  (8)

and the decision function becomes

$f(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{N} \alpha_i y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right).$  (9)

D. Multiclass Learning

SVM's are designed for binary classification. When dealing with several classes, as in object recognition and image classification, one needs an appropriate multiclass method. Different possibilities include the following.

• Modify the design of the SVM, as in [9], in order to incorporate the multiclass learning directly in the quadratic solving algorithm.

• Combine several binary classifiers: "one against one" [10] applies pairwise comparisons between classes, while "one against the others" [11] compares a given class with all the others put together.

According to a comparison study [9], the accuracies of these methods are almost the same. As a consequence, we chose the one with the lowest complexity, which is "one against the others."

In the "one against the others" algorithm, $m$ hyperplanes are constructed, where $m$ is the number of classes. Each hyperplane separates one class from the other classes. In this way, we get $m$ decision functions of the form (5). The class of a new point $\mathbf{x}$ is given by $\arg\max_k (\mathbf{w}_k \cdot \mathbf{x} + b_k)$, i.e., the class whose decision function has the largest value.

We made the assumption that every point has a single label. Nevertheless, in image classification, an image may belong to several classes as its content is not unique. It would be possible to make multiclass learning more robust, and to extend it to handle multilabel classification problems, by using error-correcting codes [12]. This more complex approach has not been experimented with in this paper.
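As an illustration of the "one against the others" scheme, the sketch below trains one binary SVM per class and assigns a new point to the class with the largest decision function, as in (5). Any binary SVM could be plugged in; scikit-learn and the toy data are assumptions of this sketch, not the paper's implementation.

```python
# Sketch of the "one against the others" scheme of Section II-D.
import numpy as np
from sklearn.svm import SVC

def train_one_vs_others(X, labels, **svm_params):
    """Train one binary SVM per class (class k against all the others)."""
    classes = np.unique(labels)
    machines = {}
    for k in classes:
        y_k = np.where(labels == k, 1.0, -1.0)
        machines[k] = SVC(**svm_params).fit(X, y_k)
    return classes, machines

def predict_one_vs_others(classes, machines, X):
    """Assign each point to the class with the largest decision function (5)."""
    scores = np.column_stack([machines[k].decision_function(X) for k in classes])
    return classes[np.argmax(scores, axis=1)]

# Hypothetical usage on toy data standing in for histograms.
rng = np.random.RandomState(0)
X = rng.rand(90, 16)
labels = np.repeat([0, 1, 2], 30)
classes, machines = train_one_vs_others(X, labels, kernel="rbf", C=100.0)
print(predict_one_vs_others(classes, machines, X[:5]))
```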
III. THE DATA AND ITS REPRESENTATION

Among the many possible features that can be extracted from an image, we restrict ourselves to ones which are global and low-level (the segmentation of the image into regions, objects, or relations is not in the scope of the present paper). The simplest way to represent an image is to consider its bitmap representation. Assuming the sizes of the images in the database are fixed to $h \times w$ ($h$ for the height and $w$ for the width), the input data for the SVM are vectors of size $h \times w$ for grey-level images and $3 \times h \times w$ for color images. Each component of the vector is associated with a pixel in the image. Some major drawbacks of this representation are its large size and its lack of invariance with respect to translations. For these reasons, our first choice was the histogram representation, which is described presently.

A. Color Histograms

In spite of the fact that the color histogram technique is a very simple and low-level method, it has shown good results in practice [2], especially for image indexing and retrieval tasks, where feature extraction has to be as simple and as fast as possible. Spatial features are lost, meaning that spatial relations between parts of an image cannot be used. This also ensures full translation and rotation invariance.

A color is represented by a three-dimensional vector corresponding to a position in a color space. This leaves us to select the color space and the quantization steps in this color space. As a color space, we chose the hue-saturation-value (HSV) space, which is in bijection with the red–green–blue (RGB) space. The reason for the choice of HSV is that it is widely used in the literature.

HSV is attractive in theory. It is considered more suitable since it separates the color components (HS) from the luminance component (V) and is less sensitive to illumination changes. Note also that distances in the HSV space correspond to perceptual differences in color in a more consistent way than in the RGB space. However, this does not seem to matter in practice. All the experiments reported in this paper use the HSV space. For the sake of comparison, we selected a few experiments and used the RGB space instead of the HSV space, while keeping the other conditions identical: the impact of the choice of the color space on performance was found to be minimal compared to the impact of the other experimental conditions (choice of the kernel, remapping of the input). An explanation for this fact is that, after quantization into bins, no information about the color space is used by the classifier.

The number of bins per color component has been fixed to 16, and the dimension of each histogram is $16^3 = 4096$ (a sketch of this histogram computation is given at the end of this section). Some experiments with a smaller number of bins have been undertaken, but the best results have been reached with 16 bins. We have not tried to increase this number, because it is computationally too intensive. It is preferable to compute the histogram from the highest spatial resolution available. Subsampling the image too much results in significant losses in performance. This may be explained by the fact that, by subsampling, the histogram loses its sharp peaks, as pixel colors turn into averages (aliasing).

B. Selecting Classes of Images in the Corel Stock Photo Collection

The Corel stock photo collection consists of a set of photographs divided into about 200 categories, each one with 100 images. For our experiments, the original 200 categories have been reduced using two different labeling approaches.

In the first one, named Corel14, we chose to keep the categories defined by Corel. For the sake of comparison, we chose the same subset of categories as [13], which are: air shows, bears, elephants, tigers, Arabian horses, polar bears, African specialty animals, cheetahs-leopards-jaguars, bald eagles, mountains, fields, deserts, sunrises-sunsets, night scenes. It is important to note that we had no influence on the choices made in Corel14: the classes were selected by [13], and the examples illustrating a class are the 100 images we found in a Corel category. In [13], some images which were visually deemed inconsistent with the rest of their category were removed. In the results reported in this paper, we use all 100 images in each category and kept many obvious outliers: see, for instance, in Fig. 2, the "polar bear alert" sign which is considered to be an image of a polar bear. With 14 categories, this results in a database of 1400 images. Note that some Corel categories come from the same batch of photographs: a system trained to classify them may only have to classify color and exposure idiosyncrasies.

In an attempt to avoid these potential problems and to move toward a more generic classification, we also defined a second labeling approach, Corel7, in which we designed our own seven categories: airplanes, birds, boats, buildings, fish, people, vehicles. The number of images in each category varies from 300 to 625, for a total of 2670 samples. For each category, images were hand-picked from several original Corel categories. For example, the airplanes category includes images of air shows, aviation photography, fighter jets, and WW-II planes. The representation of what is an airplane is then more general. Table I shows the origin of the images for each category.
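A minimal sketch of the 16 × 16 × 16 HSV histogram described in Section III-A follows. It assumes an RGB image stored as a floating point array with values in [0, 1] and borrows matplotlib's rgb_to_hsv for the color space conversion; the paper does not prescribe any particular implementation.

```python
# Sketch of the 16 x 16 x 16 HSV histogram of Section III-A (4096 bins).
import numpy as np
from matplotlib.colors import rgb_to_hsv

def hsv_histogram(rgb_image, bins_per_channel=16):
    """Return the flattened HSV color histogram of an RGB image in [0, 1]."""
    hsv = rgb_to_hsv(rgb_image.reshape(-1, 3))          # N x 3, values in [0, 1]
    hist, _ = np.histogramdd(hsv,
                             bins=(bins_per_channel,) * 3,
                             range=[(0, 1)] * 3)
    return hist.ravel()                                  # 16**3 = 4096 components

# Hypothetical usage on a random "image"; components sum to the pixel count.
image = np.random.rand(64, 96, 3)
h = hsv_histogram(image)
print(h.shape, h.sum())   # (4096,) and 64 * 96 = 6144
```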
IV. SELECTING THE KERNEL

A. Introduction

The design of the SVM classifier architecture is very simple and mainly requires the choice of the kernel (the only other parameter is $C$). Nevertheless, it has to be chosen carefully, since an inappropriate kernel can lead to poor performance. There are currently no techniques available to "learn" the form of the kernel; as a consequence, the first kernels investigated were borrowed from the pattern recognition literature. The kernel products between input vectors $\mathbf{x}$ and $\mathbf{y}$ are

$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p$

which results in a classifier with a polynomial decision function, and

$K(\mathbf{x}, \mathbf{y}) = e^{-\rho \|\mathbf{x} - \mathbf{y}\|^2}$

which gives a Gaussian radial basis function (RBF) classifier. In the Gaussian RBF case, the number of centers (number of support vectors), the centers themselves (the support vectors), the weights, and the threshold are all produced automatically by the SVM training and give excellent results compared to RBF's trained with non-SVM methods [14].

Fig. 1. Corel14: each row includes images from the following seven categories: air shows, bears, Arabian horses, night scenes, elephants, bald eagles, cheetahs-leopards-jaguars.

Encouraged by the positive results obtained with the Gaussian RBF kernel, we looked at generalized forms of RBF kernels

$K(\mathbf{x}, \mathbf{y}) = e^{-\rho\, d(\mathbf{x}, \mathbf{y})}$

where $d(\mathbf{x}, \mathbf{y})$ can be chosen to be any distance in the input space. In the case of images as input, the $L_2$ norm seems to be quite meaningful. But as histograms are discrete densities, more suitable comparison functions exist, especially the $\chi^2$ function, which has been used extensively for histogram comparisons [15]. We use here a symmetrized approximation of $\chi^2$:

$d_{\chi^2}(\mathbf{x}, \mathbf{y}) = \sum_i \frac{(x_i - y_i)^2}{x_i + y_i}.$

It is not known if the resulting kernel satisfies Mercer's condition.¹ Another obvious alternative is the $L_1$ distance, which gives a Laplacian RBF:

$d_{L_1}(\mathbf{x}, \mathbf{y}) = \sum_i |x_i - y_i|.$

¹ It is still possible to apply the SVM training procedure to kernels that do not satisfy Mercer's condition. What is no longer guaranteed is that the optimal hyperplane maximizes some margin in a hidden space.

Fig. 2. Corel14: each row includes images from the following seven categories: tigers, African specialty animals, mountains, fields, deserts, sunrises-sunsets, polar bears.

B. Experiments

The first series of experiments is designed to roughly assess the performance of the aforementioned input representations and SVM kernels on our two Corel tasks. The 1400 examples of Corel14 were divided into 924 training examples and 476 test examples. The 2670 examples of Corel7 were split into 1375 training and 1295 test examples. The SVM error penalty parameter $C$ was set to 100, which can be considered in most cases as "large." However, in this series of experiments, this parameter setting was found to enforce full separability for all types of kernels except the linear one. In the case of the RBF kernels, the values of $\rho$ were selected heuristically. More rigorous procedures will be described in the second series of experiments.

Table II shows very similar results for both the RGB and HSV histogram representations and also, with HSV histograms, similar behaviors between Corel14 and Corel7. The "leap" in performance does not happen, as normally expected, merely by using RBF kernels, but with the proper choice of metric within the RBF kernel. Laplacian or $\chi^2$ RBF kernels reduce the Gaussian RBF error rate from around 30% down to 15–20%.
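The generalized RBF kernels of Section IV-A can be used with a generic SVM solver through a precomputed Gram matrix. The sketch below implements the $L_1$ (Laplacian) and symmetrized $\chi^2$ distances; scikit-learn, the toy histograms, and the small epsilon guarding against empty bins in the $\chi^2$ denominator are assumptions of this sketch, not part of the paper.

```python
# Sketch of the generalized RBF kernels of Section IV-A, K(x, y) = exp(-rho * d(x, y)),
# with the L1 (Laplacian) and symmetrized chi-squared distances, fed to an SVM
# through a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

def laplacian_distance(X, Y):
    # d(x, y) = sum_i |x_i - y_i|
    return np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2)

def chi2_distance(X, Y, eps=1e-10):
    # Symmetrized chi-squared approximation: sum_i (x_i - y_i)^2 / (x_i + y_i)
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + eps   # eps guards empty bins (not in the paper)
    return (num / den).sum(axis=2)

def rbf_gram(X, Y, distance, rho=1.0):
    return np.exp(-rho * distance(X, Y))

# Hypothetical usage: histograms as rows of X_train / X_test.
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(40, 4096), np.repeat([1, -1], 20)
X_test = rng.rand(5, 4096)

clf = SVC(kernel="precomputed", C=100.0)
clf.fit(rbf_gram(X_train, X_train, chi2_distance), y_train)
print(clf.predict(rbf_gram(X_test, X_train, chi2_distance)))
```

Passing the Gram matrix explicitly keeps the solver unchanged while allowing histogram-specific metrics that are not built into standard kernels.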
This improved performance is not only due to the choice of the appropriate metric, but also to the good generalization ability of SVM's.

TABLE I. Hand-labeled categories used with the Corel database.

TABLE II. Error rates using the following kernels: linear, polynomial of degree 2, Gaussian RBF, Laplacian RBF, and $\chi^2$ RBF.

TABLE III. Error rates with KNN.

To demonstrate this, we conducted some experiments of image histogram classification with a K-nearest neighbors (KNN) algorithm using the $L_2$ and $\chi^2$ distances; $\chi^2$ gave the best results. Table III presents the results. As expected, the $\chi^2$ distance is better suited to histogram comparisons; the $\chi^2$-based SVM is still roughly twice as good as the $\chi^2$-based KNN.

We also did some experiments using the pixel image as input to SVM classifiers, with 96 × 64 images. Except in the linear case, the convergence of the support vector search process was problematic, often finding a hyperplane where every sample is a support vector. The error rate never dropped below 45%.

The same database has been used by [13] with a decision tree classifier, and the error rate was about 50%, similar to the 47.7% error rate obtained with the traditional combination of an HSV histogram and a KNN classifier. The 14.7% error rate obtained with the Laplacian or $\chi^2$ RBF represents a nearly four-fold reduction.

One partial explanation for the superior performance of $\chi^2$ or Laplacian RBF kernels comes from the specific nature of the histogram representation. Let us start with an example: in many images, the largest coordinate in the histogram vector corresponds to the blue of the sky. A small shift in the color of the sky, which does not affect the nature of the object to be recognized (for instance plane or bird), results in a large $L_2$ distance.

Suppose a $p$-pixel bin in the histogram accounts for a single uniform color region in the image (with histogram $\mathbf{x}$). A small change of color in this region can move the pixels to a neighboring bin, resulting in a slightly different histogram $\mathbf{x}'$. If we assume that this neighboring bin was empty in the original histogram $\mathbf{x}$, the kernel values are

$K_{\mathrm{Gaussian}}(\mathbf{x}, \mathbf{x}') = e^{-\rho\, 2p^2}, \quad K_{\mathrm{Laplacian}}(\mathbf{x}, \mathbf{x}') = e^{-\rho\, 2p}, \quad K_{\chi^2}(\mathbf{x}, \mathbf{x}') = e^{-\rho\, 2p}.$

The kernel has a linear exponential decay in the Laplacian and $\chi^2$ cases, while it has a quadratic exponential decay in the Gaussian case.

V. KERNEL DESIGN VERSUS INPUT REMAPPING

The experiments performed in the previous section show that non-Gaussian RBF kernels with exponential decay rates that are less than quadratic can lead to remarkable SVM classification performances on image histograms. This section explores two ways to reduce the decay rate of RBF kernels. It shows that one of them amounts to a simple remapping of the input, in which case the use of the kernel trick is not always necessary.

A. Non-Gaussian RBF Kernels

We introduce kernels of the form

$K(\mathbf{x}, \mathbf{y}) = e^{-\rho\, d_{a,b}(\mathbf{x}, \mathbf{y})}$ with $d_{a,b}(\mathbf{x}, \mathbf{y}) = \sum_i |x_i^a - y_i^a|^b.$

The decay rate around zero is given by $b$. In the case of Gaussian RBF kernels, $b = 2$: decreasing the value of $b$ would provide for a slower decay. A data-generating interpretation of RBF's is that they correspond to a mixture of local densities (generally Gaussian): in this case, lowering the value of $b$ amounts to using heavy-tailed distributions. Such distributions have been observed in speech recognition, and improved performances have been obtained by moving from $b = 2$ (Gaussian) to $b = 1$ (Laplacian) or even $b = 0.5$ (sublinear) [16].
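A direct transcription of the heavy-tailed kernel $K(\mathbf{x}, \mathbf{y}) = e^{-\rho\, d_{a,b}(\mathbf{x}, \mathbf{y})}$ is sketched below, together with a check of its decay on the single-bin shift example of Section IV. The numerical values are illustrative assumptions; $a = 1$, $b = 2$ recovers the Gaussian RBF, $a = 1$, $b = 1$ the Laplacian, and $b = 0.5$ the sublinear case.

```python
# Sketch of the heavy-tailed RBF kernels of Section V-A:
#   K(x, y) = exp(-rho * sum_i |x_i^a - y_i^a|^b)
# Inputs are assumed to be nonnegative histograms.
import numpy as np

def heavy_tailed_gram(X, Y, a=0.25, b=1.0, rho=1.0):
    Xa, Ya = X ** a, Y ** a                        # input remapping x_i -> x_i^a
    d = np.abs(Xa[:, None, :] - Ya[None, :, :]) ** b
    return np.exp(-rho * d.sum(axis=2))

# Hypothetical check of the decay when a p-pixel bin moves to an empty
# neighboring bin, as in the single-bin shift example above (here p = 9, a = 1).
x = np.array([[9.0, 0.0]])
x_shifted = np.array([[0.0, 9.0]])
for b in (2.0, 1.0, 0.5):
    print(b, heavy_tailed_gram(x, x_shifted, a=1.0, b=b, rho=0.01))
```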
Note that if we assume that histograms are often distributed around zero (only a few bins have nonzero values), decreasing the value of $a$ should have roughly the same impact as lowering $b$.²

² An even more general type of kernel is $K(\mathbf{x}, \mathbf{y}) = e^{-\rho\, d_{a,b,c}(\mathbf{x}, \mathbf{y})}$ with $d_{a,b,c}(\mathbf{x}, \mathbf{y}) = \left( \sum_i |x_i^a - y_i^a|^b \right)^c$. Decreasing the value of $c$ does not improve performance as much as decreasing $a$ and $b$, and significantly increases the number of support vectors.

Fig. 3. Corel7: each row includes images from the following categories: airplanes, birds, boats, buildings, fish, people, cars.

The choice of $a$ has no impact on Mercer's condition, as it amounts to a change of input variables. The kernel $e^{-\rho d_b}$ satisfies Mercer's condition if and only if $0 \le b \le 2$ ([4], page 434).

B. Nonlinear Remapping of the Input

The exponentiation of each component of the input vector by $a$ does not have to be interpreted in terms of kernel products. One can see it as the simplest possible nonlinear remapping of the input that does not affect the dimension. The following gives us reasons to believe that exponentiation may improve robustness with respect to changes in scale. Imagine that the histogram component $x_{\mathrm{col}}$ is caused by the presence of color col in some object. If we increase the size of the object by some scaling factor $s$, the number of pixels is multiplied by $s^2$, and $x_{\mathrm{col}}$ is multiplied by the same factor. The $a$-exponentiation could lower this quadratic scaling effect to a more reasonable $s^{2a}$, with $a < 1$. An interesting case is $a = 0$, which transforms all the components which are not zero to one (we assume that $0^0 = 0$).

C. Experimental Setup

To avoid a combinatorial explosion of kernel/remapping combinations, it is important to restrict the number of kernels we try. We chose three types of RBF kernels: Gaussian ($b = 2$), Laplacian ($b = 1$), and sublinear ($b = 0.5$). As a basis for comparison, we also kept the linear SVM's. For the reasons stated in Section III-A, the only image representation we consider here is the 16 × 16 × 16 HSV histogram.

Our second series of experiments attempts to define a rigorous procedure to choose $C$ and $\rho$. Because we are only testing linear and RBF kernels, we can reduce these two choices to one, a multiplicative renormalization of the input data.

In the case of RBF kernels, we observed experimentally that full separability was always desirable on both Corel7 and Corel14. As a consequence, $C$ has to be chosen large enough compared to the diameter of the sphere containing the input data (for RBF kernels, $K(\mathbf{x}, \mathbf{x}) = 1$, so the distance between $\Phi(\mathbf{x})$ and $\Phi(\mathbf{y})$ is $\sqrt{2 - 2K(\mathbf{x}, \mathbf{y})}$, which is always smaller than 2.0). However, RBF kernels still do not specify what value to choose for $\rho$. With proper renormalization of the input data, we can set $\rho = 1$, since

$\rho \sum_i |x_i^a - y_i^a|^b = \sum_i |(\lambda x_i)^a - (\lambda y_i)^a|^b$ with $\lambda = \rho^{1/(ab)}.$

In the linear case, the diameter of the data depends on the way it is normalized. The choice of $C$ is equivalent to the choice of a multiplicative factor $\lambda$ for the input data. If, in (6), we replace $\mathbf{x}_i$ with $\lambda \mathbf{x}_i$ and $\mathbf{w}$ with $\mathbf{w}/\lambda$, (7) becomes

minimize $\frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C\lambda^2 \sum_{i=1}^{N} \xi_i.$  (10)

Similar experimental conditions are applied to both Corel7 and Corel14. Each category is divided into three sets, each containing one third of the images, used as training, validation, and test sets. For each value of the input renormalization, support vectors are obtained from the training set and tested on the validation set. The renormalization for which we obtain the best result is then used to obtain a set of support vectors from both the training and the validation sets.
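The remapping of Section V-B combined with a linear SVM, together with the selection of the renormalization factor $\lambda$ on a validation set described above, can be sketched as follows. The value $a = 0.25$ echoes the setting reported to work best later in the paper; scikit-learn's LinearSVC, the candidate $\lambda$ values, and the random stand-in data are assumptions of this sketch.

```python
# Sketch of the input remapping x_i -> x_i^a of Section V-B with a linear SVM,
# and selection of the input renormalization lambda on a validation set (Section V-C).
import numpy as np
from sklearn.svm import LinearSVC

def remap(X, a=0.25):
    if a == 0:
        return (X > 0).astype(float)   # the paper takes 0**0 = 0 (binarized bins)
    return X ** a

def select_lambda(X_tr, y_tr, X_val, y_val, lambdas=(0.01, 0.1, 1.0, 10.0)):
    best, best_err = None, np.inf
    for lam in lambdas:
        # Scaling the input by lam with fixed C is equivalent to changing C, cf. (10).
        clf = LinearSVC(C=1.0).fit(lam * X_tr, y_tr)
        err = np.mean(clf.predict(lam * X_val) != y_val)
        if err < best_err:
            best, best_err = lam, err
    return best

# Hypothetical usage with random stand-ins for histograms.
rng = np.random.RandomState(0)
X, y = rng.rand(120, 4096), np.repeat([1, -1], 60)
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]
lam = select_lambda(remap(X_tr), y_tr, remap(X_val), y_val)
print("selected input scale:", lam)
```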
Each Corel image contains $M$ usable pixels: the 4096 histogram vector components range from 0 to $M$ and sum up to $M$. They were renormalized by a multiplicative factor $\lambda$ sampled very sparsely (a few powers of ten). Usually, the optimal values are 10 or 1. Nonoptimal values increase the error rate by values ranging from 0.5% to 5%. This very sparse sampling rate was found to be sufficient for all kernels except Gaussian RBF's. In the latter case, we chose a finer sampling: $\lambda \in \{0.0025, 0.01, 0.04, 0.16, 1, 10\}$.

The final performance result is measured on the test set. To obtain more test samples, we applied this procedure three times, each time with a different test set: the number of testing samples is then the total number of data (1400 for Corel14 and 2670 for Corel7). On Corel14, each of the three training sessions used 933 examples and required between 52 and 528 support vectors to separate one class from the others. On Corel7, each of the three training sessions used 1780 examples and required between 254 and 1008 support vectors to separate one class from the others. The algorithms and the software used to train the SVM's were designed by Osuna [17], [18].

D. Computation Requirements

We also measured the computation required to classify one image histogram. Since the number of cycles required to perform an operation depends on the machine, we count the three main types of operations we find in our SVM classifiers.

flt: basic floating point operation such as the multiply-add or the computation of the absolute value of the difference between two vector components. This is the central operation of the kernel dot product. This operation can be avoided if both components are zero, but we assume that verifying this condition usually takes more time than the operation itself. The computation of $(x_i - y_i)^2$ can be reduced to a multiply-add, as $(x_i - y_i)^2 = x_i^2 - 2x_i y_i + y_i^2$ and $x_i^2$ and $y_i^2$ can be computed in advance.

sqrt: square root.

exp: exponential.

TABLE IV. Average error rates on Corel14. Each column corresponds to a different kernel. The first line reports the average number of support vectors required for the full recognizer (i.e., 14 "one against the others" SVM classifiers). The next lines report the error rates using nonlinear input remappings (exponentiation by a).

TABLE V. Average error rates on Corel7.

Except in the sublinear RBF case, the number of flt is the dominating factor. In the linear case, the decision function (5) allows the support vectors to be linearly combined: there is only one flt per class and component. In the RBF case, there is one flt per class, component, and support vector. Because of the normalization by 7 × 4096, the number that appears in the table equals the number of support vectors. Fluctuations of this number are mostly caused by changes in the input normalization $\lambda$.

In the sublinear RBF case, the number of sqrt is dominating. A sqrt is in theory required for each component of the kernel product: this is the number we report. It is a pessimistic upper bound, since computations can be avoided for components with value zero.
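The accounting above can be turned into a small back-of-the-envelope calculation; the support vector count used below is a hypothetical placeholder, not one of the paper's measurements.

```python
# Back-of-the-envelope recognition cost, following the accounting of Section V-D:
# one flt per class and component for the linear SVM; one flt per class,
# component, and support vector for RBF SVM's.
N_CLASSES = 7          # Corel7: one "one against the others" SVM per class
N_COMPONENTS = 4096    # 16 x 16 x 16 HSV histogram

def flt_linear():
    # Support vectors are folded into a single weight vector per class, eq. (5).
    return N_CLASSES * N_COMPONENTS

def flt_rbf(sv_per_class):
    # One kernel evaluation per support vector, each costing one flt per component.
    return N_CLASSES * N_COMPONENTS * sv_per_class

print(flt_linear())                    # 28672
print(flt_rbf(500) / flt_linear())     # ratio ~ number of support vectors (500, hypothetical)
```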
E. Observations

The analysis of Tables IV–VI shows the following characteristics that apply consistently to both Corel14 and Corel7.

TABLE VI. Computational requirements for Corel7, reported as the number of operations for the recognition of one example, divided by 7 × 4096.

• As anticipated, decreasing $a$ has roughly the same impact as decreasing $b$ (compare columns to lines on both Tables IV and V).

• For both Corel14 and Corel7, the best performance is achieved with $b = 1$ and $a = 0.25$.

• For histogram classification, Gaussian RBF kernels are hardly better than linear SVM's and require around NSV (number of support vectors) times more computations at recognition time.

• Sublinear RBF kernels are no better than Laplacian RBF kernels (provided that $a < 1$) and are too computationally intensive: a time-consuming square root is required for the nonzero components of every support vector.

• For the practical use of RBF kernels, memory requirements may also be an issue. A full floating point representation of 5000 support vectors, each with 4096 components, requires 80 megabytes of memory.

• Reducing $a$ to 0.25 makes linear SVM's a very attractive solution for many applications: their error rate is only 30% higher than that of the best RBF-based SVM, while their computational and memory requirements are several orders of magnitude smaller than for the most efficient RBF-based SVM.

• Experiments with $a = 0$ yield surprisingly good results and show that what is important about a histogram bin is not its value, but whether it contains any pixel at all. Note that in this case, Gaussian, Laplacian, and sublinear RBF's are exactly equivalent.

• The input space has 4096 dimensions: this is high enough to enforce full separability in the linear case. However, when optimizing the input renormalization with the validation set, a solution with training misclassifications was preferred (around 1% error in the case of Corel14 and 5% error in the case of Corel7).

Table VII presents the class-confusion matrix corresponding to the use of the Laplacian kernel on Corel7 with $a = 0.25$ and $b = 1.0$ (these values yield the best results for both Corel7 and Corel14). The most common confusions happen between birds and airplanes, which is consistent.

TABLE VII. Class-confusion matrix for a = 0.25 and b = 1.0. For example, row (1) indicates that of the 386 images of the airplanes category, 341 have been correctly classified, 22 have been classified as birds, seven as boats, four as buildings, and 12 as vehicles.

VI. SUMMARY

In this paper, we have shown that it is possible to push the classification performance obtained on image histograms to surprisingly high levels, with error rates as low as 11% for the classification of 14 Corel categories and 16% for a more generic set of objects. This is achieved without any other knowledge about the task than the fact that the input is some sort of color histogram or discrete density.

This extremely good performance is due to the superior generalization ability of SVM's in high-dimensional spaces, to the use of heavy-tailed RBF's as kernels, and to nonlinear transformations applied to the histogram bin values. We studied how the choice of the distance used in an RBF kernel affects performance on histogram classification, and found Laplacian RBF kernels to be superior to the standard Gaussian RBF kernels. As a nonlinear transformation of the bin values, we used $a$-exponentiation with $a$ ranging from 1 down to 0. In the case of RBF kernels, the lowering of $a$ and of $b$ have similar effects, and their combined influence yields the best performance.

The lowering of $a$ improves the performance of linear SVM's to such an extent that it makes them a valid alternative to RBF kernels, giving comparable performance for a fraction of the computational and memory requirements. This suggests a new strategy for the use of SVM's when the dimension of the input space is extremely high.
Rather than introducing kernels intended to make this dimension even higher, which may not be useful, it is recommended to first try nonlinear transformations of the input components in combination with linear SVM's. The computations may be orders of magnitude faster and the performances comparable.

This work can be extended in several ways. Higher-level spatial features can be added to the histogram features. Allowing for the detection of multiple objects in a single image would make this classification-based technique usable for image retrieval: an image would be described by the list of objects it contains. Histograms are used to characterize other types of data than images and can be used, for instance, for fraud detection applications. It would be interesting to investigate whether the same type of kernel brings the same gains in performance.

REFERENCES

[1] W. Niblack, R. Barber, W. Equitz, M. Flickner, D. Glasman, D. Petkovic, and P. Yanker, "The QBIC project: Querying images by content using color, texture, and shape," SPIE, vol. 1908, pp. 173–187, Feb. 1993.

[2] M. Swain and D. Ballard, "Indexing via color histograms," Int. J. Comput. Vision, vol. 7, pp. 11–32, 1991.

[3] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[4] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[5] P. Bartlett and J. Shawe-Taylor, "Generalization performance of support vector machines and other pattern classifiers," in Advances in Kernel Methods—Support Vector Learning. Cambridge, MA: MIT Press, 1998.

[6] M. Bazaraa and C. M. Shetty, Nonlinear Programming. New York: Wiley, 1979.

[7] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, pp. 1–25, 1995.

[8] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th ACM Wkshp. Comput. Learning Theory, Pittsburgh, PA, July 1992, pp. 144–152.

[9] J. Weston and C. Watkins, "Multiclass support vector machines," Univ. London, U.K., Tech. Rep. CSD-TR-98-04, 1998.

[10] M. Pontil and A. Verri, "Support vector machines for 3-D object recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, June 1998.

[11] V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter, "Comparison of view-based object recognition algorithms using realistic 3D models," in Artificial Neural Networks—ICANN'96, Berlin, Germany, 1996, pp. 251–256.

[12] R. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," in Proc. 1998 Wkshp. Comput. Learning Theory, 1998.

[13] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Color- and texture-based image segmentation using EM and its application to image querying and classification," submitted to IEEE Trans. Pattern Anal. Machine Intell., 1998.

[14] B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik, "Comparing support vector machines with Gaussian kernels to radial basis function classifiers," Massachusetts Inst. Technol., A.I. Memo 1599, 1996.

[15] B. Schiele and J. L. Crowley, "Object recognition using multidimensional receptive field histograms," in ECCV'96, 4th European Conf. Comput. Vision, vol. I, 1996, pp. 610–619.

[16] S. Basu and C. A. Micchelli, "Parametric density estimation for the classification of acoustic feature vectors in speech recognition," in Nonlinear Modeling: Advanced Black-Box Techniques, J. A. K. Suykens and J. Vandewalle, Eds. Boston, MA: Kluwer, 1998.
[17] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," in IEEE CVPR'97, Puerto Rico, June 17–19, 1997.

[18] E. Osuna, R. Freund, and F. Girosi, "An improved training algorithm for support vector machines," in IEEE NNSP'97, Amelia Island, FL, Sept. 24–26, 1997.

Olivier Chapelle received the B.Sc. degree in computer science from the Ecole Normale Supérieure de Lyon, France, in 1998. He is currently studying for the M.Sc. degree in computer vision at the Ecole Normale Supérieure de Cachan, France. He was a visiting scholar in the MOVI computer vision team at INRIA, Grenoble, France, in 1997 and at AT&T Research Labs, Red Bank, NJ, in the summer of 1998. For the last three months, he has been working at AT&T Research Laboratories with V. Vapnik in the field of machine learning. His research interests include learning theory, computer vision, and support vector machines.

Patrick Haffner received the bachelor's degree from Ecole Polytechnique, Paris, France, in 1987 and from Ecole Nationale Supérieure des Télécommunications (ENST), Paris, France, in 1989. He received the Ph.D. degree in speech and signal processing from ENST in 1994. In 1988 and 1990, he worked with A. Waibel on the design of the TDNN and the MS-TDNN architectures at ATR, Japan, and Carnegie Mellon University, Pittsburgh, PA. From 1989 to 1995, as a Research Scientist for CNET/France-Télécom in Lannion, France, he developed connectionist learning algorithms for telephone speech recognition. In 1995, he joined AT&T Bell Laboratories and worked on the application of optical character recognition and transducers to the processing of financial documents. In 1997, he joined the Image Processing Services Research Department at AT&T Labs-Research. His research interests include statistical and connectionist models for sequence recognition, machine learning, speech and image recognition, and information theory.

Vladimir N. Vapnik, for a photograph and biography, see this issue, p. 999.