Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms

Chris Thornton, Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown
Department of Computer Science, University of British Columbia
201-2366 Main Mall, Vancouver BC, V6T 1Z4, Canada
{cwthornt, hutter, hoos, kevinlb}@cs.ubc.ca

ABSTRACT
Many different machine learning algorithms exist; taking into account each algorithm's hyperparameters, there is a staggeringly large number of possible alternatives overall. We consider the problem of simultaneously selecting a learning algorithm and setting its hyperparameters, going beyond previous work that addresses these issues in isolation. We show that this problem can be addressed by a fully automated approach, leveraging recent innovations in Bayesian optimization. Specifically, we consider a wide range of feature selection techniques (combining 3 search and 8 evaluator methods) and all classification approaches implemented in WEKA, spanning 2 ensemble methods, 10 meta-methods, 27 base classifiers, and hyperparameter settings for each classifier. On each of 21 popular datasets from the UCI repository, the KDD Cup 09, variants of the MNIST dataset and CIFAR-10, we show classification performance often much better than using standard selection/hyperparameter optimization methods. We hope that our approach will help non-expert users to more effectively identify machine learning algorithms and hyperparameter settings appropriate to their applications, and hence to achieve improved performance.

Keywords
model selection, hyperparameter optimization, WEKA

1. INTRODUCTION
Increasingly, users of machine learning tools are non-experts who require off-the-shelf solutions. The machine learning community has much aided such users by making available a wide variety of sophisticated learning algorithms and feature selection methods through open source packages, such as WEKA [13] and PyBrain [22]. Each of these packages asks a user to make two kinds of choices: selecting a learning algorithm and customizing it by setting its hyperparameters (which also control any feature selection being performed). It can be challenging to make the right choice when faced with these degrees of freedom, leaving many users to select algorithms based on reputation or intuitive appeal, and/or to leave hyperparameters set to default values. Of course, this approach can yield performance far worse than that of the best method and hyperparameter settings.

This suggests a natural challenge for machine learning: given a dataset, to automatically and simultaneously choose a learning algorithm and set its hyperparameters to optimize empirical performance. We dub this the combined algorithm selection and hyperparameter optimization problem (short: CASH); we formally define it in Section 3. Despite the practical importance of this problem, we are surprised to find no evidence that it has previously been considered in the literature. A likely explanation is that the combined space of learning algorithms and their hyperparameters is very challenging to search: the response function is noisy and the space is high dimensional, involves both categorical and continuous choices, and contains hierarchical dependencies (e.g., the hyperparameters of a learning algorithm are only meaningful if that algorithm is chosen; the algorithm choices in an ensemble method are only meaningful if that ensemble method is chosen; etc.).
In contrast, we do note that there has been considerable past work separately addressing model selection [e.g., 1, 5, 6, 7, 9, 20, 21, 29] and hyperparameter optimization [e.g., 2, 3, 4, 12, 25, 27]. In what follows, we demonstrate that CASH can be viewed as a single hierarchical hyperparameter optimization problem, in which even the choice of algorithm itself is considered a hyperparameter. We also show that — based on this problem formulation — recent Bayesian optimization methods can obtain high quality results in reasonable time and with minimal human effort.

After discussing some preliminaries (Section 2), we define the CASH problem and discuss methods for tackling it (Section 3). We then define a concrete CASH problem encompassing the full range of classifiers and feature selectors in the open source package WEKA (Section 4), and show that a search in the combined space of algorithms and hyperparameters yields better-performing models than standard algorithm selection/hyperparameter optimization methods (Section 5). More specifically, we show that the recent Bayesian optimization procedures TPE [3] and SMAC [14] find combinations of algorithms and hyperparameters that often outperform existing baseline methods, especially on large datasets.

2. PRELIMINARIES
This work focuses on classification problems: learning a function f : X → Y with finite Y. A learning algorithm A maps a set {d_1, ..., d_n} of training data points d_i = (x_i, y_i) ∈ X × Y to such a function, which is often expressed via a vector of model parameters. Most learning algorithms A further expose hyperparameters λ ∈ Λ, which change the way the learning algorithm A_λ itself works. For example, hyperparameters are used to describe a description-length penalty, the number of neurons in a hidden layer, the number of data points that a leaf in a decision tree must contain to be eligible for splitting, etc. These hyperparameters are typically optimized in an "outer loop" that evaluates the performance of each hyperparameter configuration using cross-validation.

2.1 Model Selection
Given a set of learning algorithms A and a limited amount of training data D = {(x_1, y_1), ..., (x_n, y_n)}, the goal of model selection is to determine the algorithm A* ∈ A with optimal generalization performance. Generalization performance is estimated by splitting D into disjoint training and validation sets D_train^(i) and D_valid^(i), learning functions f_i by applying A* to D_train^(i), and evaluating the predictive performance of these functions on D_valid^(i). This allows the model selection problem to be written as

    A^* \in \operatorname{argmin}_{A \in \mathcal{A}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{valid}}\bigr),

where L(A, D_train^(i), D_valid^(i)) is the loss (here: misclassification rate) achieved by A when trained on D_train^(i) and evaluated on D_valid^(i). We use k-fold cross-validation [18], which splits the training data into k equal-sized partitions D_valid^(1), ..., D_valid^(k), and sets D_train^(i) = D \ D_valid^(i) for i = 1, ..., k.¹

¹ K-fold cross-validation is not the only available method for estimating generalization performance. We also experimented with the technique of repeated random subsampling validation [18], with similar results.
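To make this objective concrete, the following is a minimal sketch of model selection by k-fold cross-validation, written against a generic learner interface. The fit/predict methods and the dictionary of candidate learner factories are assumptions made purely for illustration; this is not the WEKA-based implementation used in the paper.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal validation folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cv_misclassification(learner_factory, X, y, k=10):
    """Mean misclassification rate of one learner over k train/validation splits."""
    folds = kfold_indices(len(y), k)
    losses = []
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = learner_factory()           # fresh, untrained learner A
        model.fit(X[train], y[train])       # assumed fit/predict interface
        losses.append(np.mean(model.predict(X[valid]) != y[valid]))
    return float(np.mean(losses))

def select_model(candidates, X, y, k=10):
    """Model selection: return the candidate name with the lowest k-fold CV loss."""
    scores = {name: cv_misclassification(f, X, y, k) for name, f in candidates.items()}
    return min(scores, key=scores.get), scores
```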
2.2 Hyperparameter Optimization
The problem of optimizing the hyperparameters λ ∈ Λ of a given learning algorithm A is conceptually similar to that of model selection. Some key differences are that hyperparameters are often continuous, that hyperparameter spaces are often high dimensional, and that we can exploit correlation structure between different hyperparameter settings λ_1, λ_2 ∈ Λ. Given n hyperparameters λ_1, ..., λ_n with domains Λ_1, ..., Λ_n, the hyperparameter space Λ is a subset of the cross-product of these domains: Λ ⊂ Λ_1 × ··· × Λ_n. This subset is often strict, such as when certain settings of one hyperparameter render other hyperparameters inactive. For example, the parameters determining the specifics of the third layer of a deep belief network are not relevant if the network depth is set to one or two. Likewise, the parameters of a support vector machine's polynomial kernel are not relevant if we use a different kernel instead.

More formally, following [15], we say that a hyperparameter λ_i is conditional on another hyperparameter λ_j if λ_i is only active when hyperparameter λ_j takes values from a given set V_i(j) ⊆ Λ_j; in this case we call λ_j a parent of λ_i. Conditional hyperparameters can in turn be parents of other conditional hyperparameters, giving rise to a tree-structured space [3] or, in some cases, a directed acyclic graph (DAG) [15]. Given such a structured space Λ, the (hierarchical) hyperparameter optimization problem can be written as

    \lambda^* \in \operatorname{argmin}_{\lambda \in \Lambda} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A_\lambda, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{valid}}\bigr).

3. COMBINED ALGORITHM SELECTION AND HYPERPARAMETER OPTIMIZATION (CASH)
Given a set of algorithms A = {A^(1), ..., A^(k)} with associated hyperparameter spaces Λ^(1), ..., Λ^(k), we define the combined algorithm selection and hyperparameter optimization problem (CASH) as computing

    A^*_{\lambda^*} \in \operatorname{argmin}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A^{(j)}_\lambda, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{valid}}\bigr). \qquad (1)

We note that this problem can be reformulated as a single combined hierarchical hyperparameter optimization problem with parameter space Λ = Λ^(1) ∪ ··· ∪ Λ^(k) ∪ {λ_r}, where λ_r is a new root-level hyperparameter that selects between algorithms A^(1), ..., A^(k). The root-level parameters of each subspace Λ^(i) are made conditional on λ_r being instantiated to A^(i).

In principle, Problem (1) can be tackled in various ways. A promising approach is Bayesian Optimization [8], and in particular Sequential Model-Based Optimization [SMBO; 14], a versatile stochastic optimization framework that can work explicitly with both categorical and continuous hyperparameters, and that can exploit hierarchical structure stemming from conditional parameters. SMBO (outlined in Algorithm 1) first builds a model M_L that captures the dependence of the loss function L on hyperparameter settings λ (line 1 in Algorithm 1). It then iterates the following steps: use M_L to determine a promising candidate configuration of hyperparameters λ to evaluate next (line 3); evaluate the loss c of λ (line 4); and update the model M_L with the new data point (λ, c) thus obtained (lines 5–6).

Algorithm 1 SMBO
1: initialise model M_L; H ← ∅
2: while time budget for optimization has not been exhausted do
3:     λ ← candidate configuration from M_L
4:     compute c = L(A_λ, D_train^(i), D_valid^(i))
5:     H ← H ∪ {(λ, c)}
6:     update M_L given H
7: end while
8: return λ from H with minimal c

In order to select its next hyperparameter configuration λ using model M_L, SMBO uses a so-called acquisition function a_{M_L} : Λ → R, which uses the predictive distribution of model M_L at arbitrary hyperparameter configurations λ ∈ Λ to quantify (in closed form) how useful knowledge about λ would be. SMBO then simply maximizes this function over Λ to select the most useful configuration λ to evaluate next.
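A minimal sketch of the SMBO loop of Algorithm 1 is shown below, written against a generic surrogate-model interface; the propose/observe method names are assumptions for illustration, not the actual APIs of SMAC or TPE.

```python
import time

def smbo(surrogate, evaluate_loss, time_budget_s):
    """Generic SMBO loop (Algorithm 1): propose, evaluate, update, repeat.

    surrogate     -- object with propose() -> config and observe(config, loss),
                     a hypothetical interface standing in for the model M_L
    evaluate_loss -- function mapping a configuration to its cross-validated loss
    time_budget_s -- wall-clock budget in seconds
    """
    history = []                             # H, the set of (lambda, c) pairs
    deadline = time.time() + time_budget_s
    while time.time() < deadline:            # line 2: budget not exhausted
        config = surrogate.propose()         # line 3: maximize acquisition over the space
        loss = evaluate_loss(config)         # line 4: cross-validated loss of A_lambda
        history.append((config, loss))       # line 5: H <- H ∪ {(lambda, c)}
        surrogate.observe(config, loss)      # line 6: refit/update M_L
    return min(history, key=lambda t: t[1])  # line 8: best configuration seen
```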
Several prominent acquisition functions exist [17, 23, 26] that all aim to automatically trade off exploitation (locally optimizing hyperparameters in regions known to perform well) versus exploration (trying hyperparameters in a relatively unexplored region of the space) in order to avoid premature convergence. In this work, we use one of the most prominent acquisition functions, the positive expected improvement (EI) attainable over an existing given error rate c_min [23]. Let c(λ) denote the error rate of hyperparameter configuration λ. Then, the positive improvement function over c_min is defined as

    I_{c_{\min}}(\lambda) := \max\{c_{\min} - c(\lambda),\, 0\}.

Of course, we do not know c(λ). We can, however, compute its expectation with respect to the current model M_L:

    \mathbb{E}_{M_L}\bigl[I_{c_{\min}}(\lambda)\bigr] := \int_{-\infty}^{c_{\min}} \max\{c_{\min} - c,\, 0\} \cdot p_{M_L}(c \mid \lambda)\, dc. \qquad (2)

One main difference between existing SMBO algorithms lies in the model class they employ. We now review the two SMBO algorithms whose models can handle hierarchical hyperparameters and that are thus suitable for the task of combined algorithm selection and hyperparameter optimization.

3.1 Sequential Model-based Algorithm Configuration (SMAC)
Sequential model-based algorithm configuration [SMAC; 14] supports a variety of models p(c | λ) to capture the dependence of the loss function c on hyperparameters λ, including approximate Gaussian processes and random forests. In this paper we use random forest models, since they tend to perform well with discrete and high-dimensional input data. SMAC handles conditional parameters by instantiating inactive conditional parameters in λ to default values for model training and prediction. This allows the individual decision trees to include splits of the kind "is hyperparameter λ_i active?", allowing them to focus on active hyperparameters. While random forests are not usually treated as probabilistic models, SMAC obtains a predictive mean µ_λ and variance σ_λ² of p(c | λ) as frequentist estimates over the predictions of its individual trees for λ; it then models p_{M_L}(c | λ) as a Gaussian N(µ_λ, σ_λ²).

SMAC uses the expected improvement criterion defined in Equation (2), instantiating c_min to the error rate of the best hyperparameter configuration measured so far. Under SMAC's predictive distribution p_{M_L}(c | λ) = N(µ_λ, σ_λ²), this expectation can be computed by the closed-form expression

    \mathbb{E}_{M_L}\bigl[I_{c_{\min}}(\lambda)\bigr] = \sigma_\lambda \cdot \bigl[u \cdot \Phi(u) + \varphi(u)\bigr], \qquad \text{where } u = \frac{c_{\min} - \mu_\lambda}{\sigma_\lambda},

and ϕ and Φ denote the probability density function and cumulative distribution function of a standard normal distribution, respectively [17].
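As a quick sanity check of the closed-form EI above, the expression σ_λ·[uΦ(u) + ϕ(u)] can be evaluated directly and compared against a Monte Carlo estimate of Equation (2) under the Gaussian predictive distribution. The sketch below is illustrative only; the numeric inputs are made up.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, c_min):
    """Closed-form EI for SMAC's Gaussian predictive distribution N(mu, sigma^2)."""
    if sigma <= 0:
        return max(c_min - mu, 0.0)
    u = (c_min - mu) / sigma
    return sigma * (u * norm.cdf(u) + norm.pdf(u))

# Monte Carlo check of Equation (2): E[max(c_min - c, 0)] with c ~ N(mu, sigma^2).
mu, sigma, c_min = 0.12, 0.03, 0.10       # hypothetical predicted error and incumbent error
samples = np.random.default_rng(0).normal(mu, sigma, size=1_000_000)
mc_estimate = np.mean(np.maximum(c_min - samples, 0.0))
print(expected_improvement(mu, sigma, c_min), mc_estimate)  # the two values should agree closely
```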
SMAC is designed for robust optimization under noisy function evaluations, and as such implements special mechanisms to keep track of its best known configuration and to assure high confidence in its estimate of that configuration's performance. This robustness against noisy function evaluations can be exploited in combined algorithm selection and hyperparameter optimization, since the function to be optimized in Equation (1) is a mean over a set of loss terms (each corresponding to one pair of D_train^(i) and D_valid^(i) constructed from the training set). A key idea in SMAC is to make progressively better estimates of this mean by evaluating these terms one at a time, thus trading off accuracy against computational cost.

In order for a new configuration to become a new incumbent, it must outperform the previous incumbent in every comparison made: considering only one fold, two folds, and so on up to the total number of folds previously used to evaluate the incumbent. (Furthermore, every time the incumbent survives such a comparison, it is evaluated on a new fold, up to the total number available, meaning that the number of folds used to evaluate the incumbent grows over time.) This means that a poorly performing configuration can be discarded after considering as little as a single fold. Finally, SMAC also implements a diversification mechanism to achieve robust performance even when its model is misled: every second configuration is selected at random. Because of the evaluation procedure just described, the overhead required by this safeguard is less than it might appear.

3.2 Tree-structured Parzen Estimator (TPE)
While SMAC models p(c | λ) explicitly, the Tree-structured Parzen Estimator [TPE; 3] uses separate models for p(c) and p(λ | c). Specifically, it models p(λ | c) as one of two density estimates, conditional on whether c is greater or less than a given threshold value c*:

    p(\lambda \mid c) = \begin{cases} \ell(\lambda), & \text{if } c < c^*, \\ g(\lambda), & \text{if } c \ge c^*. \end{cases} \qquad (3)

Here, c* is chosen as the γ-quantile of the losses TPE obtained so far (where γ is an algorithm parameter with a default value of γ = 0.15), ℓ(·) is a density estimate learned from all previous hyperparameter configurations λ with corresponding loss smaller than c*, and g(·) is a density estimate learned from all previous hyperparameter configurations λ with corresponding loss greater than or equal to c*. Intuitively, this creates a probabilistic density estimator ℓ(·) for hyperparameters that appear to do 'well', and a different density estimator g(·) for hyperparameters that appear 'poor' with respect to the threshold. Bergstra et al. [3] showed that the expected improvement E_{M_L}[I_{c_min}(λ)] from Equation (2) is proportional to a quantity that can be computed in closed form from γ, g(λ), and ℓ(λ):

    \mathbb{E}\bigl[I_{c_{\min}}(\lambda)\bigr] \propto \left( \gamma + \frac{g(\lambda)}{\ell(\lambda)} \cdot (1 - \gamma) \right)^{-1}.

TPE maximizes this expression by generating many candidate hyperparameter configurations at random and picking the λ with the smallest value of g(λ)/ℓ(λ).

The density estimators ℓ(·) and g(·) have a hierarchical structure with discrete, continuous, and conditional variables reflecting the hyperparameters and their dependence relationships. For each node in this tree structure, a 1-D Parzen estimator is created to model the density of the node's corresponding hyperparameter. For a given hyperparameter configuration λ that is being added to either ℓ or g, only the 1-D estimators corresponding to active hyperparameters in λ are updated. For continuous hyperparameters, these 1-D estimators are constructed by placing density in the form of a Gaussian at each observed hyperparameter value λ_i, with standard deviation set to the larger of the distances to each point's left and right neighbour. Discrete hyperparameters are estimated with probabilities proportional to the number of times that a particular choice occurred in the set of observations. To evaluate a candidate hyperparameter configuration λ's probability estimate, TPE starts at the root of the tree and descends into the leaves by following paths that only use active hyperparameters. At each node in this traversal, the probability of the corresponding hyperparameter is computed according to its 1-D estimator, and the individual probabilities are combined on a pass back up to the root of the tree. Note that this means that TPE assumes independence for hyperparameters that do not appear together along any path from the tree's root to one of its leaves.
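Below is a minimal one-dimensional sketch of TPE's selection rule: split the observations at the γ-quantile, fit 'good' and 'bad' density estimates, and pick the candidate minimizing g(λ)/ℓ(λ). It uses simple Gaussian kernel density estimates and ignores the tree structure and conditional variables described above, so it illustrates the rule rather than TPE's actual implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest_1d(observed_lambdas, observed_losses, bounds, gamma=0.15,
                   n_candidates=100, seed=0):
    """Pick the next value of a single continuous hyperparameter, TPE-style.
    Assumes both the 'good' and 'bad' groups contain at least two distinct values."""
    lam = np.asarray(observed_lambdas, dtype=float)
    loss = np.asarray(observed_losses, dtype=float)
    c_star = np.quantile(loss, gamma)                 # threshold separating good/bad
    good, bad = lam[loss < c_star], lam[loss >= c_star]
    ell = gaussian_kde(good)                          # l(.): density of good configs
    g = gaussian_kde(bad)                             # g(.): density of bad configs
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(*bounds, size=n_candidates)
    scores = g(candidates) / np.maximum(ell(candidates), 1e-12)
    return candidates[np.argmin(scores)]              # smallest g/l corresponds to largest EI
```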
4. AUTO-WEKA
To demonstrate the feasibility of an automatic approach to solving the CASH problem, we built a tool, Auto-WEKA, that solves this problem for all classification algorithms and feature selectors/evaluators implemented in the WEKA package [13]. Note that while we have focused on classification algorithms in WEKA, there is no obstacle to extending our approach to other settings.

Table 1 provides a list of all 39 WEKA classification algorithms. Of these models, 27 are considered 'base' classifiers (which can be used independently), 10 of the remaining classifiers are meta-methods (which take a single base classifier and its parameters as an input), and the final 2 ensemble classifiers can take any number of base classifiers as input. We allowed the meta-methods to use any base classifier with any hyperparameter settings, and allowed ensemble methods to use up to five base classifiers, again with any hyperparameter settings. Not all classifiers are applicable on all datasets (e.g., due to a classifier's inability to handle missing data). For a given dataset, our Auto-WEKA implementation automatically only considers the subset of applicable classifiers.

Table 1: Classifiers in Auto-WEKA. Classifiers marked with * are meta-methods, which in addition to their own parameters take one 'base' classifier and its parameters. Classifiers marked with + are ensemble methods that take as input up to 5 'base' classifiers and their parameters. Categorical and Numeric refer to the number of hyperparameters of each kind for each classifier.

Classifier                        Categorical  Numeric
Bayes Net                         2            0
Naive Bayes                       2            0
Naive Bayes Multinomial           0            0
Gaussian Process                  3            6
Linear Regression                 2            1
Logistic Regression               0            1
Single-Layer Perceptron           5            2
Stochastic Gradient Descent       3            2
SVM                               4            6
Simple Linear Regression          0            0
Simple Logistic Regression        2            1
Voted Perceptron                  1            2
KNN                               4            1
K-Star                            2            1
Decision Table                    4            0
RIPPER                            3            1
M5 Rules                          3            1
1-R                               0            1
PART                              2            2
0-R                               0            0
Decision Stump                    0            0
C4.5 Decision Tree                6            2
Logistic Model Tree               5            2
M5 Tree                           3            1
Random Forest                     2            3
Random Tree                       4            4
REP Tree                          2            3
Locally Weighted Learning*        3            0
AdaBoost M1*                      2            2
Additive Regression*              1            2
Attribute Selected*               2            0
Bagging*                          1            2
Classification via Regression*    0            0
LogitBoost*                       4            4
MultiClass Classifier*            3            0
Random Committee*                 0            1
Random Subspace*                  0            2
Voting+                           1            0
Stacking+                         0            0

Table 2 provides a list of WEKA's 3 feature search methods, as well as its 8 feature evaluators, and their respective numbers of subparameters (up to 5 for search; up to 4 for evaluators). To perform feature selection, a search method is combined with a feature evaluator, and the subparameters of both of them need to be instantiated. Feature selection is run as a pre-processing phase before building any classifier.

Table 2: Feature search/evaluator methods in Auto-WEKA. Methods marked with * are search methods, which require one feature evaluator that is used to determine the importance of a feature. Categorical and Numeric refer to the number of hyperparameters of each method.

Feature Method                    Categorical  Numeric
Best First*                       1            1
Greedy Stepwise*                  3            2
Ranker*                           0            1
CFS Subset Eval                   2            0
Pearson Correlation Eval          0            0
Gain Ratio Eval                   0            0
Info Gain Eval                    2            0
1-R Eval                          1            2
Principal Components Eval         2            2
RELIEF Eval                       1            2
Symmetrical Uncertainty Eval      1            0
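To illustrate how a feature-selection choice composes a search method with an evaluator (each carrying its own subparameters) as a pre-processing step, here is a schematic sketch. The configuration keys, parameter names, and the rank_features function are hypothetical placeholders for the chosen WEKA search/evaluator pair, not WEKA's API.

```python
# Hypothetical encoding of one feature-selection configuration: a search method
# paired with an evaluator, each with its own subparameters.
feat_sel_config = {
    "search":    {"name": "GreedyStepwise", "params": {"direction": "forward",
                                                       "threshold": 0.01}},
    "evaluator": {"name": "InfoGainEval",   "params": {"binarize": False}},
}

def apply_feature_selection(X, y, config, rank_features):
    """Pre-processing step: keep only the attributes chosen by the configured
    search/evaluator pair, then hand the reduced data to the classifier.
    rank_features(X, y, config) is a stand-in for running the chosen WEKA
    combination; it is assumed to return the indices of the selected columns."""
    selected = rank_features(X, y, config)
    return X[:, selected], selected
```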
The algorithms in Tables 1 and 2 have a wide variety of hyperparameters, which take on values from continuous intervals, from ranges of integers, and from other discrete sets. We associated either a uniform or log-uniform prior with each numerical parameter, depending on its semantics. For example, we set a log-uniform prior for the ridge regression penalty, and a uniform prior for the maximum depth of a tree in a random forest. Auto-WEKA works with continuous hyperparameter values directly; nevertheless, to give a sense of the size of the hypothesis space we studied, we note that discretizing hyperparameter domains to a maximum of 10 values each would give rise to over 10^47 hyperparameter settings. We emphasize that this space is much larger than a simple union of the base learners' hypothesis spaces (whose size is roughly 10^8), since the ensemble methods allow up to 5 independent base learners, giving rise to a space with roughly (10^8)^5 = 10^40 elements. The feature selection part gives rise to another independent decision between roughly 10^6 choices, and several parameters on the meta and ensemble level contribute another order of magnitude to the total size of Auto-WEKA's hypothesis space.

Auto-WEKA can be understood as a single learning algorithm with a highly conditional parameter space, as depicted in Figure 1. Auto-WEKA has two top-level Boolean parameters. The first, is_base, selects between single base classifiers and ensemble or meta-classifiers. If is_base is true, then the parameter base determines which of the 27 base classifiers is to be used. If is_base is false, then class indicates either an ensemble or a meta-classifier. If class is a meta-classifier, then the parameter meta_base is chosen to be one of the 27 base classifiers. If class is an ensemble classifier, an additional parameter num_classes is an integer chosen from {1, ..., 5}; base_i parameters are then selected according to the value of num_classes, each of which again selects one of the 27 base classifiers to use. For each *base* parameter, conditional hyperparameters for every model are attached. The second top-level Boolean parameter, feat_sel, indicates whether one of the feature selection methods is to be applied. If feat_sel is false, then Auto-WEKA passes the unmodified dataset to the classifier. If it is true, then feat_search and feat_eval select the feature search method and feature evaluator to be used, together with their conditional subparameters.

[Figure 1: Auto-WEKA's parameter space. Top: first top-level Boolean, concerning Auto-WEKA's classification methods; the triangular items represent a parameter that selects one of the 27 base classifiers and adds conditional classifier hyperparameters accordingly. Bottom: second top-level Boolean, concerning Auto-WEKA's feature selection methods; feat_search selects the choice of feature search method, and feat_eval selects the choice of feature evaluator.]
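The sketch below illustrates, in simplified form, how a configuration can be drawn from this conditional structure: parameters lower in the tree are only instantiated when their parent values make them active. The names mirror the description above, but the classifier lists are abbreviated and the per-classifier hyperparameters are omitted, so this is an illustration of the structure rather than Auto-WEKA's actual implementation.

```python
import random

BASE_CLASSIFIERS = ["NaiveBayes", "SVM", "RandomForest", "C4.5", "KNN"]   # 27 in Auto-WEKA; abbreviated
META_CLASSIFIERS = ["AdaBoostM1", "Bagging", "RandomSubspace"]            # 10 in Auto-WEKA; abbreviated
ENSEMBLE_CLASSIFIERS = ["Voting", "Stacking"]                             # the 2 ensemble methods
FEAT_SEARCH = ["BestFirst", "GreedyStepwise", "Ranker"]
FEAT_EVAL = ["CfsSubsetEval", "InfoGainEval", "ReliefEval"]               # 8 in Auto-WEKA; abbreviated

def sample_configuration(rng=random):
    """Draw one configuration from the (simplified) conditional space."""
    config = {"is_base": rng.random() < 0.5, "feat_sel": rng.random() < 0.5}
    if config["is_base"]:
        config["base"] = rng.choice(BASE_CLASSIFIERS)           # active only if is_base is true
    else:
        cls = rng.choice(META_CLASSIFIERS + ENSEMBLE_CLASSIFIERS)
        config["class"] = cls
        if cls in META_CLASSIFIERS:
            config["meta_base"] = rng.choice(BASE_CLASSIFIERS)  # one base learner for the meta-method
        else:
            k = rng.randint(1, 5)                               # num_classes in {1, ..., 5}
            config["num_classes"] = k
            config["bases"] = [rng.choice(BASE_CLASSIFIERS) for _ in range(k)]
    if config["feat_sel"]:
        config["feat_search"] = rng.choice(FEAT_SEARCH)
        config["feat_eval"] = rng.choice(FEAT_EVAL)
    return config
```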
This results in a very wide tree that captures the full hierarchical nature of the model hyperparameters, and allows the creation of a single hyperparameter optimization problem with four hierarchical layers and a total of 786 parameters. Auto-WEKA is agnostic to the choice of optimizer, so we defined two Auto-WEKA variants, leveraging SMAC and TPE, respectively – the two Bayesian optimization algorithms that can handle hierarchical parameter spaces (see Sections 3.1 and 3.2).² We made both of these Auto-WEKA versions available to the public at www.cs.ubc.ca/labs/beta/Projects/autoweka/ and are committed to providing support for their widespread use in practice.

Both TPE and SMAC have their own parameters that influence their performance (such as TPE's choice of the γ-quantile indicating 'good' or 'bad' performance, or the parameters of SMAC's random forest model). In Auto-WEKA, we used the defaults for these meta-hyperparameters, as set by the authors (we assume further small improvements may be possible by meta-hyperparameter optimization, but a separate process with a meta-training/validation set split would be required to guard against over-fitting, and we did not attempt this).

Finally, we note that both TPE and SMAC are randomized algorithms and are thus expected to produce different results based on the random seed provided. As demonstrated in [16], this allows for trivial, yet effective parallelization of the optimization process: simply perform k independent runs of the optimization method in parallel and select the result of the run with the lowest cross-validation error to return.³ We verified experimentally that the more parallel runs, the faster this process identifies high-quality configurations; nevertheless, we restricted the version of Auto-WEKA studied here to run only 4 parallel processes, in order to study a setting typical for commonly used workstations.

² We thank the authors of SMAC and TPE for giving us access to their respective implementations.
³ Other, more sophisticated methods for the parallelization of Bayesian optimization exist [16, 3, 10, 25], but to date, there is no empirical evidence that these methods outperform the simple approach we use here when the cost of evaluating hyperparameter configurations varies across the space.
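As a rough illustration of this parallelization scheme (independent runs with different seeds, keeping the one with the lowest cross-validation error), consider the following sketch; run_optimizer stands in for one complete SMAC or TPE run and is an assumption, not part of Auto-WEKA's interface.

```python
from concurrent.futures import ProcessPoolExecutor

def best_of_k_runs(run_optimizer, n_runs=4, max_workers=4):
    """Launch n_runs independent optimizer runs (one per seed) and return the
    result of the run with the lowest cross-validation error.

    run_optimizer(seed) -> (config, cv_error)   # hypothetical interface
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_optimizer, range(n_runs)))
    return min(results, key=lambda r: r[1])
```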
5. EVALUATING AUTO-WEKA
We now describe an experimental study of the performance that can be achieved by Auto-WEKA on various datasets. After specifying our experimental setup, we demonstrate the importance of addressing the algorithm selection and CASH problems, and establish baselines for them (Section 5.2). We then demonstrate Auto-WEKA's ability to search its enormous hyperparameter space effectively to find algorithms and hyperparameters with low cross-validation error (Section 5.3). Then, we analyse its test performance and address concerns regarding overfitting (Section 5.4). Finally, we provide a synopsis of the classifiers and feature search/evaluator methods Auto-WEKA chose in our experiments (Section 5.5).

5.1 Experimental Setup
We evaluated Auto-WEKA on 21 prominent benchmark datasets (see Table 3): 15 sets from the UCI repository [11]; the 'convex', 'MNIST basic' and 'rotated MNIST with background images' tasks used in [4]; the appetency task from the KDD Cup '09; and two versions of the CIFAR-10 image classification task [19] (CIFAR-10-Small is a subset of CIFAR-10 in which only the first 10 000 training data points are used rather than the full 50 000). For datasets with a predefined training/test split, we used that split. Otherwise, we randomly split the dataset into 70% training and 30% test data. The test data was never seen by any optimization method; it was only used once, in an offline analysis stage, to evaluate the models found by the various optimization methods. We denote datasets with at least 10 000 training data points as 'large' and all others as 'small'.

Table 3: Datasets used. Num. Discr. and Num. Cont. refer to the number of discrete and continuous attributes of elements in the dataset, respectively.

Name              Num. Discr.  Num. Cont.  Num. Classes  Num. Training  Num. Test
Dexter            20 000       0           2             420            180
GermanCredit      13           7           2             700            300
Dorothea          100 000      0           2             805            345
Yeast             0            8           10            1 038          446
Amazon            10 000       0           49            1 050          450
Secom             0            591         2             1 096          471
Semeion           256          0           10            1 115          478
Car               6            0           4             1 209          519
Madelon           500          0           2             1 820          780
KR-vs-KP          37           0           2             2 237          959
Abalone           1            7           28            2 923          1 254
Wine Quality      0            11          11            3 425          1 469
Waveform          0            40          3             3 500          1 500
Gisette           5 000        0           2             4 900          2 100
Convex            0            784         2             8 000          50 000
CIFAR-10-Small    3 072        0           10            10 000         10 000
MNIST Basic       0            784         10            12 000         50 000
Rot. MNIST + BI   0            784         10            12 000         50 000
Shuttle           9            0           7             43 500         14 500
KDD09-Appetency   190          40          2             35 000         15 000
CIFAR-10          3 072        0           10            50 000         10 000

All of our experiments were run on Linux machines with Intel Xeon X5650 six-core processors running at 2.66 GHz. Each classification job was limited to 3GB of RAM; if training a classifier ever exceeded this memory limit, the classifier job was terminated, returning a misclassification rate of 100%. An additional 1GB of RAM was allocated for the SMBO method. While these limits are somewhat arbitrary, we believe them to be reasonably close to the resource limitations faced by any user of machine learning algorithms. We also limited the training time for each evaluation of a learning algorithm on each fold, to ensure that the optimization method had a chance to explore the search space. Once this training budget for a fold is consumed, Auto-WEKA sends an interrupt to the learning algorithm to terminate as soon as possible, and the (partially) trained model is then evaluated on the validation set to determine the error estimate for the fold. This timeout was set to 150 minutes for classification and 15 minutes for feature search and evaluation in our experiments.⁴

⁴ In preliminary experiments, only few models exceeded this timeout for the datasets studied here. [24] presents a promising approach for using runtime predictions in the expected improvement calculation to automatically drive the search away from excessively expensive models. We plan to incorporate this approach into future versions of Auto-WEKA.

For each dataset, we ran Auto-WEKA with each hyperparameter optimization algorithm with a total time budget of 30 hours. For each method, we performed 25 runs of this process with different random seeds and then – in order to simulate parallelization on a typical workstation – used bootstrap sampling to repeatedly select 4 random runs and report the performance of the one with the best cross-validation performance. In early experiments, we observed a few cases in which Auto-WEKA's SMBO method picked hyperparameters that had excellent training performance, but turned out to generalize poorly. To enable Auto-WEKA to detect such cases of overfitting, we partitioned its training set into two subsets: 70% for use inside the SMBO method, and 30% of validation data that is only looked at after the SMBO method has finished.
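The bootstrap protocol just described (simulating 4 parallel runs out of the 25 performed) can be made concrete with the following sketch; it assumes, purely for illustration, that each of the 25 runs is summarized by a (cross-validation error, test error) pair.

```python
import numpy as np

def simulate_parallel_runs(cv_errors, test_errors, k=4, n_bootstrap=100_000, seed=0):
    """Repeatedly draw k of the available runs, keep the one with the lowest
    cross-validation error, and report the median of its test error."""
    cv = np.asarray(cv_errors)
    test = np.asarray(test_errors)
    rng = np.random.default_rng(seed)
    picks = rng.integers(0, len(cv), size=(n_bootstrap, k))    # k runs per bootstrap sample
    best = picks[np.arange(n_bootstrap), np.argmin(cv[picks], axis=1)]
    return float(np.median(test[best]))
```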
5.2 Algorithm Selection and CASH: Baseline Methods
Auto-WEKA aims to aid non-expert users of machine learning techniques. The simplest approach for selecting a classifier that is widely adopted amongst non-experts is to use a classifier merely based on its popularity or intuitive appeal, without any empirical consideration of alternatives. To quantify the difference such a choice can make, we consider the 39 WEKA classifiers (each with default hyperparameter settings) for each dataset, train each on the training set, and measure its accuracy on the test set. For each dataset, the second and third columns in Table 4 present the best and worst "oracle performance" of these classifiers on the test set. We observe that the gap between the best and worst classifier was huge, e.g., misclassification rates of 4.93% vs. 99.24% on the Dorothea dataset. Even when the set of classifiers was restricted to a few popular ones (we considered neural networks, random forests, SVMs, AdaBoost, C4.5 decision trees, logistic regression, and KNN), this gap still exceeded 20% on 14 out of the 21 datasets. Furthermore, there was no single method that achieved good performance across all datasets: every method was at least 22% worse than the best for at least one dataset. We conclude that some form of algorithm selection is essential for achieving good performance.

A straightforward algorithm selection method is to perform exhaustive 10-fold cross-validation on the training set and to return the classifier with the smallest average misclassification error across folds. We will refer to this method applied to the set of 39 WEKA classifiers as Ex-Def; it is the best choice that can be made among the 39 WEKA classifiers (with their default hyperparameters) based on an exhaustive cross-validation, and it will serve as a baseline for comparing an optimal solution to the algorithm selection problem in WEKA to our solution for the CASH problem in WEKA.

More experienced users of machine learning algorithms would not only select between a fixed set of default algorithms, but would also consider different hyperparameter settings — for example by performing a grid search over the hyperparameter space of a single classifier (as, e.g., implemented in WEKA⁵). Since different learning algorithms perform well for different problems, more experienced users would optimally also want to consider different hyperparameter settings for more than one learning algorithm. Therefore, a stronger baseline we will use is an approach that — in addition to the 39 WEKA default classifiers — considers various hyperparameter settings for all of WEKA's 27 base classifiers. More precisely, this baseline considers a grid of hyperparameter settings for each of these 27 base classifiers, and performs a random search [4] in the union of these grids (plus the 39 WEKA default classifiers). We refer to this baseline as Random Grid and note that — as an optimization approach in the joint space of algorithms and hyperparameter settings — it is a simple CASH algorithm. We executed this Random Grid search for all our datasets in parallel, using 400 CPU hours on average per dataset (at least 120 hours for each). Table 4 (columns 4 and 5) shows the best and worst "oracle performance" on the test set across these Random Grid classifiers.

⁵ This is implemented in WEKA's CVParameterSelection class; see weka.wikispaces.com/Optimizing+parameters.
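A minimal sketch of this Random Grid baseline (random search over the union of per-classifier grids, plus the default configurations) might look as follows; the grid contents and the evaluate_cv function are illustrative assumptions, and a practical implementation would sample grid points lazily rather than enumerating the full Cartesian products.

```python
import itertools
import random

def random_grid_search(grids, defaults, evaluate_cv, n_evals, seed=0):
    """Random search [4] over the union of per-classifier hyperparameter grids.

    grids       -- {classifier_name: {param_name: [candidate values]}}
    defaults    -- list of (classifier_name, {}) default configurations
    evaluate_cv -- function (classifier_name, params) -> 10-fold CV error
    """
    rng = random.Random(seed)
    pool = list(defaults)
    for name, grid in grids.items():                      # enumerate every grid point
        keys = sorted(grid)
        for values in itertools.product(*(grid[k] for k in keys)):
            pool.append((name, dict(zip(keys, values))))
    sampled = rng.sample(pool, min(n_evals, len(pool)))   # random subset of the union
    scored = [(cfg, evaluate_cv(*cfg)) for cfg in sampled]
    return min(scored, key=lambda t: t[1])
```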
Comparing these performances to the default performance, we note that in most cases even WEKA's best default algorithm could be improved by selecting better hyperparameter settings, sometimes rather substantially so: e.g., in the rotated MNIST with background images task, random grid search offered a 6% improvement over Ex-Def. We conclude that choosing hyperparameter settings appropriately can lead to substantial differences in performance, and that because of this even a relatively simple (albeit computationally expensive) approach to CASH can outperform algorithm selection by itself.

Table 4: Performance on both 10-fold cross-validation and test data. Random Grid Search was run once for an average of 400 hours per dataset; for SMAC and TPE, we performed 25 runs of 30 hours each. We report results as median percent error rate across 100 000 bootstrap samples simulating 4 parallel runs. Ex-Def is deterministic. Test error rates are determined by training the selected model/hyperparameters on the entire 70% training data and computing the accuracy on the previously unused 30% test data. SC denotes correlation coefficients (see Section 5.4). Column groups: "Oracle" gives the best and worst test-set performance among the Ex-Def and Random Grid classifiers; "CV" gives 10-fold cross-validation error; "Test" gives test error. "Grid" abbreviates Random Grid; TPE and SMAC are the two Auto-WEKA variants.

Dataset          | Oracle Ex-Def Best | Oracle Ex-Def Worst | Oracle Grid Best | Oracle Grid Worst | CV Ex-Def | CV Grid | CV TPE | CV SMAC | Test Ex-Def | Test Grid | Test TPE | Test SMAC | SC TPE | SC SMAC
Dexter           | 7.78 | 52.78 | 5.00 | 58.33 | 10.20 | 7.48 | 9.90 | 5.48 | 8.89 | 5.00 | 9.44 | 7.22 | 0.82 | 0.25
GermanCredit     | 26.00 | 38.00 | 25.00 | 63.67 | 22.45 | 22.45 | 21.43 | 19.59 | 27.33 | 27.33 | 27.67 | 28.33 | 0.31 | 0.20
Dorothea         | 4.93 | 99.24 | 4.93 | 99.24 | 6.03 | 6.03 | 6.93 | 5.52 | 6.96 | 6.96 | 6.96 | 6.38 | 0.95 | 0.40
Yeast            | 40.00 | 68.99 | 37.53 | 68.99 | 39.43 | 38.87 | 35.03 | 36.27 | 40.45 | 40.90 | 41.12 | 40.45 | 0.36 | 0.49
Amazon           | 28.44 | 99.33 | 28.44 | 99.33 | 43.94 | 43.94 | 48.43 | 48.30 | 28.44 | 28.44 | 37.56 | 37.56 | 0.92 | 0.97
Secom            | 7.87 | 14.26 | 7.66 | 40.64 | 6.25 | 6.12 | 6.25 | 5.34 | 8.09 | 8.30 | 7.87 | 7.87 | -0.10 | -0.56
Semeion          | 8.18 | 92.45 | 6.08 | 92.45 | 6.52 | 6.52 | 6.91 | 4.86 | 8.18 | 8.18 | 8.18 | 5.03 | 0.84 | 0.73
Car              | 0.77 | 29.15 | 0.19 | 31.66 | 2.71 | 1.54 | 0.94 | 0.71 | 0.77 | 0.19 | 0.00 | 0.58 | 0.12 | 0.75
Madelon          | 17.05 | 50.26 | 17.05 | 51.03 | 25.98 | 24.26 | 24.26 | 20.87 | 21.38 | 20.77 | 20.77 | 21.15 | 0.44 | 0.43
KR-vs-KP         | 0.31 | 48.96 | 0.21 | 51.04 | 0.89 | 0.70 | 0.45 | 0.32 | 0.31 | 0.52 | 0.52 | 0.31 | 0.22 | 0.32
Abalone          | 73.18 | 84.04 | 72.55 | 89.23 | 73.33 | 72.45 | 72.20 | 71.76 | 73.18 | 72.79 | 72.71 | 73.02 | 0.15 | 0.10
Wine Quality     | 36.35 | 60.99 | 36.08 | 81.62 | 38.94 | 37.28 | 35.94 | 34.74 | 37.51 | 36.08 | 33.56 | 33.70 | 0.73 | 0.85
Waveform         | 14.27 | 68.80 | 14.20 | 68.80 | 12.73 | 12.73 | 12.57 | 11.71 | 14.40 | 14.40 | 14.20 | 14.40 | 0.36 | 0.26
Gisette          | 2.52 | 50.91 | 2.38 | 50.91 | 3.62 | 3.27 | 3.70 | 2.42 | 2.81 | 2.38 | 2.57 | 2.24 | 0.69 | 0.79
Convex           | 25.96 | 50.00 | 25.96 | 50.57 | 28.68 | 28.50 | 29.04 | 24.70 | 25.96 | 26.76 | 25.45 | 22.05 | 0.98 | 0.84
CIFAR-10-Small   | 65.91 | 90.00 | 64.54 | 90.00 | 66.59 | 65.11 | 57.97 | 57.76 | 65.91 | 64.54 | 56.65 | 55.93 | 0.93 | 0.80
MNIST Basic      | 5.19 | 88.75 | 3.79 | 88.75 | 5.12 | 4.00 | 13.64 | 3.64 | 5.19 | 3.79 | 18.03 | 3.56 | 1.00 | 0.87
Rot. MNIST + BI  | 63.14 | 88.88 | 57.28 | 90.96 | 66.15 | 59.75 | 73.04 | 59.61 | 63.14 | 58.16 | 69.86 | 55.84 | 0.50 | 0.95
Shuttle          | 0.0138 | 20.8414 | 0.0069 | 20.8414 | 0.0328 | 0.0263 | 0.0230 | 0.0230 | 0.0138 | 0.0276 | 0.0069 | 0.0069 | 0.60 | 0.73
KDD09-Appetency  | 1.74 | 6.97 | 1.64 | 54.08 | 1.88 | 1.88 | 1.88 | 1.75 | 1.75 | 1.77 | 1.74 | 1.74 | 0.89 | 1.00
CIFAR-10         | 64.27 | 90.00 | 64.27 | 90.00 | 65.54 | 65.54 | 66.68 | 63.21 | 64.27 | 64.27 | 64.80 | 62.39 | 0.33 | 0.69

5.3 Results for Cross-Validation Performance
With 786 hierarchical hyperparameters, Auto-WEKA's combined algorithm/hyperparameter space is very complex. We now study how effectively SMAC and TPE could search this space to optimize 10-fold cross-validation performance, and compare their performance to that of the Ex-Def and Random Grid methods defined in the previous section. The middle portion of Table 4 reports the results. First, we note that random grid search over the hyperparameters of all base classifiers yielded better results than Ex-Def in 14/21 cases (and tied in the remaining seven), which underlines the importance of not only choosing the right algorithm but also of setting its hyperparameters well. However, we note that this performance of random grid search is based on a very large time budget of an average of 400 CPU hours per dataset (650 CPU hours on average for each of the large datasets), making it a somewhat unrealistic alternative in practice. In contrast, Auto-WEKA was only run for 4 × 30 CPU hours per dataset, but still yielded substantially better performance than random grid search, outperforming it in 20/21 cases (and performing worse in one⁶). Comparing the two Auto-WEKA variants, SMAC outperformed TPE in 19/21 cases, with one tie. We note that sometimes Auto-WEKA's performance improvements over the other methods were substantial, with relative reductions of the cross-validation error rate exceeding 15% in 12/21 cases. We conclude that by searching Auto-WEKA's combined algorithm/hyperparameter space, we can effectively find models with much better cross-validation performance than by grid search over WEKA's base classifiers.

⁶ For the Amazon dataset, WEKA's default implementation of support vector machines yielded a very strong error rate of 44%, which was below that of SMAC's median performance. One of SMAC's 25 runs actually reached an error rate of 36%, indicating that Auto-WEKA would be competitive given more time.

5.4 Results for Test Performance
The results just shown demonstrate that Auto-WEKA is effective at optimizing its given objective function; however, this is not sufficient to allow us to conclude that it fits models that generalize well. As the hypothesis space of a machine learning algorithm grows, so does its potential for overfitting.
The use of cross-validation substantially increases Auto-WEKA's robustness against overfitting, but since its hypothesis space is much larger than that of standard classification algorithms, it is important to carefully study whether (and to what extent) overfitting poses a problem. To evaluate generalization, we determined a combination of algorithm and hyperparameter settings A_λ by running Auto-WEKA as before (cross-validating on the training set), trained A_λ on the entire training set, and then evaluated the resulting model on the test set.

The right portion of Table 4 reports the test performance obtained with all methods. Broadly speaking, similar trends held as for cross-validation performance: random grid search performed better than Ex-Def, and Auto-WEKA in turn outperformed random grid search. However, the performance differences were less pronounced: random grid search only yielded better results than Ex-Def in 9/21 cases, with 6/21 ties and 6/21 cases in which Ex-Def performed better. Auto-WEKA continued to outperform random grid search and Ex-Def in 15/21 cases, with 3 ties and 3 losses. Notably, Auto-WEKA performed best on all of the 11 largest datasets; we attribute this to the fact that the risk of overfitting decreases with dataset size. Sometimes, Auto-WEKA's performance improvements over the other methods were substantial, with relative reductions of the test error rate exceeding 15% in 5/21 cases. Comparing the different Auto-WEKA variants, SMAC outperformed TPE on 12 datasets and tied on 3, with TPE performing better on 6. Even when compared to the unrealistic "oracle best" Random Grid classifier (which has access to the test set!), Auto-WEKA found algorithms/hyperparameters with a smaller error on 9/21 datasets, tied on 2, and was outperformed on the remaining 10. For the 10 largest datasets, it performed better in 8 cases, tied in 1, and lost in only 1.

As mentioned in our experimental setup, Auto-WEKA only used 70% of its training set during the optimization of cross-validation performance, reserving the remaining 30% for assessing the risk of overfitting. At any point in time, Auto-WEKA's SMBO method keeps track of its incumbent (the hyperparameter configuration with the lowest cross-validation error rate seen so far). After its SMBO method has finished, Auto-WEKA extracts a trajectory of these incumbents from it and computes their generalization performance on the withheld 30% validation data. It then computes the Spearman rank correlation coefficient between the sequence of training performances (evaluated by the SMBO method through cross-validation) and this generalization performance.
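As an illustration of this check, the correlation between the incumbent trajectory's cross-validation errors and its withheld-validation errors can be computed as follows; this is a sketch using SciPy, not Auto-WEKA's own code, and the example numbers are made up.

```python
from scipy.stats import spearmanr

def overfitting_signal(cv_errors, heldout_errors):
    """Spearman rank correlation between the incumbents' cross-validation errors
    and their errors on the withheld 30% validation data. Values near 1 suggest
    that improving the CV objective also improved generalization; low values
    flag possible overfitting."""
    rho, _pvalue = spearmanr(cv_errors, heldout_errors)
    return rho

# Example: a trajectory of 5 incumbents from one hypothetical run.
print(overfitting_signal([0.30, 0.25, 0.22, 0.20, 0.19],
                         [0.33, 0.29, 0.27, 0.26, 0.25]))  # 1.0: the rankings agree perfectly
```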
The rightmost columns in Table 4 (labelled SC) show the average correlation coefficient for each run of Auto-WEKA. We note a general trend: as the absolute gap between cross-validation and test performance grows, this correlation coefficient decreases. The GermanCredit dataset is a good example where Auto-WEKA can signal that it only has low confidence in how well its chosen hyperparameters will generalize. We do note, however, that this weak signal has to be used with caution: there is no guarantee that large correlation coefficients yield a small gap and vice versa.

5.5 Classifiers Selected by Auto-WEKA
Figure 2 shows the distribution of classifiers chosen by our two Auto-WEKA variants (aggregated across runs and datasets; both TPE and SMAC produced similar results when considered individually). We note that no single classifier clearly dominated the others: the most frequently used classifiers (random forests, the single-layer perceptron, and SVMs) were only selected in roughly 12% of all cases each, and most classifiers were selected in at least a few percent of the cases. Furthermore, the selected methods differed considerably between the large and small datasets, demonstrating the need for dataset-specific methods; for example, the large datasets benefitted more from meta-methods than the small ones.

[Figure 2: Distribution of chosen classifiers across the small and large datasets, aggregated across TPE and SMAC, ranked by their frequency of being selected (y-axis: frequency in %). Meta-methods are marked by a * suffix, ensemble methods by a + suffix.]

A more detailed investigation of the top two meta-methods in Figure 3 (left) shows which base methods were chosen. Note that AdaBoostM1 frequently used the single-layer perceptron on the small datasets, but never for the large ones, while the REP tree was highly popular for the large datasets. In the Random Subspace method, the two most prominent base methods were naive Bayes and the decision table. It is interesting to note that these two methods, as well as the REP tree frequently selected by AdaBoost, were not often selected as a base classifier on their own. This underlines the importance of searching Auto-WEKA's entire parameter space instead of, e.g., restricting one's attention to a small number of favourite base classifiers.

[Figure 3: Left: distribution of chosen base classifiers for the two most frequently selected meta-methods, AdaBoostM1 and Random Subspace. Right: distribution of chosen feature search and evaluator methods (y-axis: frequency in %). Both plots are aggregated across TPE and SMAC, ranked by frequency of being selected; None indicates that no feature selection was performed.]

Figure 3 (right) provides a breakdown of the feature search and evaluation methods Auto-WEKA selected.
Overall, it used these feature selection methods more often on the smaller datasets than on the larger ones, and when it did use a feature selection method it clearly favored the Ranker search method. All feature evaluators were used with roughly the same frequency for small datasets; in contrast, when Auto-WEKA performed feature selection for a large dataset it clearly favored the information gain evaluator. We note that Auto-WEKA's data-dependent choices (based on its internal cross-validation evaluation) allow it to use feature selection as a regularization method for small datasets, while at the same time using all features to construct more complex hypotheses for large datasets.

6. CONCLUSION AND FUTURE WORK
In this work, we have shown that the daunting problem of combined algorithm selection and hyperparameter optimization (short: CASH) can be solved by a practical, fully automated tool. This is made possible by the use of recent Bayesian optimization techniques that iteratively build models of the algorithm/hyperparameter landscape and leverage these models to identify new points in the space that deserve investigation. We built a tool, Auto-WEKA, that utilizes the full range of classification algorithms in WEKA and makes it easy for non-experts to build high-quality classifiers for given application scenarios. An extensive empirical comparison on 21 prominent datasets showed that Auto-WEKA often outperformed standard algorithm selection/hyperparameter optimization methods, especially on large datasets. We empirically compared two different optimizers for searching Auto-WEKA's 786-dimensional parameter space and in the end recommend an Auto-WEKA variant based on the Bayesian optimization method SMAC [14]. We have written a freely downloadable software package to make Auto-WEKA easy for end users to access; it is available at www.cs.ubc.ca/labs/beta/Projects/autoweka/.

We see several promising avenues for future work. First, Auto-WEKA still shows larger improvements in cross-validation performance than in test performance, suggesting the investigation of more sophisticated methods for detecting and avoiding overfitting than our simple correlation-based approach. Second, we see potential value in extending our current approach to allow parameter sharing between classifiers used within ensemble methods, likely increasing their chance of being selected by Auto-WEKA. Finally, we could use our approach as an inner loop for training ensembles of machine learning algorithms by iteratively adding algorithms with maximal marginal contribution (this idea is conceptually related to the Hydra approach for constructing algorithm selectors [28]).

References
[1] M. Adankon and M. Cheriet. Model selection for the LS-SVM. Application to handwriting recognition. Pattern Recognition, 42(12):3264–3270, 2009.
[2] Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.
[3] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Proc. of NIPS-11, 2011.
[4] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. JMLR, 13:281–305, 2012.
[5] A. Biem. A model selection criterion for classification: Application to HMM topology optimization. In Proc. of ICDAR-03, pages 104–108. IEEE, 2003.
[6] H. Bozdogan. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370, 1987.
[7] P. Brazdil, C. Soares, and J. Da Costa. Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3):251–277, 2003.
[8] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report UBC TR-2009-23 and arXiv:1012.2599v1, Department of Computer Science, University of British Columbia, 2009.
[9] O. Chapelle, V. Vapnik, and Y. Bengio. Model selection for small sample regression. Machine Learning, 2001.
[10] T. Desautels, A. Krause, and J. Burdick. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proc. of ICML-12, 2012.
[11] A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL: http://archive.ics.uci.edu/ml. University of California, Irvine, School of Information and Computer Sciences.
[12] X. Guo, J. Yang, C. Wu, C. Wang, and Y. Liang. A novel LS-SVMs hyper-parameter selection based on particle swarm optimization. Neurocomputing, 71(16):3211–3215, 2008.
[13] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[14] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proc. of LION-5, pages 507–523, 2011.
[15] F. Hutter, H. Hoos, K. Leyton-Brown, and T. Stützle. ParamILS: an automatic algorithm configuration framework. JAIR, 36(1):267–306, 2009.
[16] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Parallel algorithm configuration. In Proc. of LION-6, 2012.
[17] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
[18] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proc. of IJCAI-95, pages 1137–1145, 1995.
[19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[20] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Proc. of NIPS-94, pages 59–66, 1994.
[21] A. McQuarrie and C. Tsai. Regression and Time Series Model Selection. World Scientific, 1998.
[22] T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. JMLR, 2010.
[23] M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models. In N. Flournoy, W. Rosenberger, and W. Wong, editors, New Developments and Applications in Experimental Design, volume 34, pages 11–25. Institute of Mathematical Statistics, Hayward, California, 1998.
[24] J. Snoek, H. Larochelle, and R. Adams. Opportunity cost in Bayesian optimization. In NIPS Workshop on Bayesian Optimization, Sequential Experimental Design, and Bandits, 2011. Published online.
[25] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. of NIPS-12, 2012.
[26] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. of ICML-10, pages 1015–1022, 2010.
[27] V. Strijov and G. Weber. Nonlinear regression model generation using hyperparameter optimization. Computers & Mathematics with Applications, 60(4):981–988, 2010.
[28] L. Xu, H. H. Hoos, and K. Leyton-Brown. Hydra: Automatically configuring algorithms for portfolio-based selection. In Proc. of AAAI-10, pages 210–216, 2010.
[29] P. Zhao and B. Yu. On model selection consistency of Lasso. JMLR, 7:2541–2563, Dec. 2006.