Introduction to Hyperparameter Optimization and AutoML
Ondřej Sotolář, xsotolar@fi.muni.cz
Faculty of Informatics, Masaryk University
October 17, 2022

Presentation Content
1. Introduction
2. Hyperparameter Optimization (HPO)
3. AutoML and Implementations
4. HPO and AutoML for NLP

Introduction

Machine learning is the science of getting computers to act without being explicitly programmed. (A. Ng)
Machine learning is concerned with predictive analytics, in contrast with many other fields that utilize statistical models [1].
Models (e.g. regression, classification, etc.) are the basic ML tools.
Exercise: What is the model here? (see figure)

Motivation (figure)

AutoML: HPO, CASH (Combined Algorithm Selection and Hyperparameter optimization) (figure)

Model Parameters vs. Hyperparameters I

It is critical to understand the difference between model parameters and hyperparameters. Model parameters are optimized during training, typically via loss minimization; they are the output of the training. Examples:
- the coefficients θ of a linear model f(x) = θᵀx
- the splits and terminal nodes of a decision-tree-based model

Model Parameters vs. Hyperparameters II

In contrast, hyperparameters (HPs) are not determined during training. They must be specified before the training; they are an input of the training. Hyperparameters can in principle influence any structural property of a model or any computational part of the training process. Examples:
- Tree: the maximum depth of the tree
- k nearest neighbours: the number of neighbours k and the distance measure
- Neural networks: the number and type of layers, activation functions, regularization, and many more

Types of Hyperparameters

- Real-valued parameters, e.g. the minimal error improvement in a tree to accept a split
- Integer parameters, e.g. the neighbourhood size k for k-NN
- Categorical parameters, e.g. the split criterion for classification trees
- Hierarchical dependencies between hyperparameters: cast as a nested problem or as multi-criteria optimization

Definition: We summarize all hyperparameters we want to tune over in a vector λ ∈ Λ, where Λ is the space of all possible hyperparameter value combinations. It has as many dimensions as there are hyperparameters.

Manual Search Strategies I: Grid search & Random search (Figure from [2])

Manual Search Strategies II

Exercise: Given this 2D loss surface (see figure), would you prefer grid search or random search?

Bayesian Optimization: Intuition & True Function

Intuition (gold-mining example from [3]): we want to find the richest spot along a line, but drilling is expensive. We consider the real (unknown) distribution as the True Function (or Ground Truth).

Bayesian Optimization: Active Learning, Surrogate Model, and Acquisition Functions

We need a surrogate model to approximate the True Function (figure on the next slide from [3]). Active learning minimizes sampling costs while maximizing performance via uncertainty reduction: it proposes sampling at the point where the surrogate model's uncertainty is highest, with the variance used as the measure of uncertainty. Furthermore, we need an acquisition function for selecting the location in the configuration space where the next sample will be taken. Examples of acquisition functions: Probability of Improvement (PI), Expected Improvement (EI), Upper/Lower Confidence Bound, Thompson sampling, etc.
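To make the loop concrete, the following is a minimal sketch (not code from the lecture) of Bayesian optimization on a toy 1-D problem: a Gaussian-process surrogate from scikit-learn is refit after every sample, and Expected Improvement selects the next point. The toy objective, the Matérn kernel, and all numeric settings are illustrative assumptions.

# Minimal Bayesian-optimization sketch: GP surrogate + Expected Improvement.
# The toy objective stands in for an expensive validation-loss evaluation.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # expensive black-box function we want to minimize (toy example)
    return np.sin(3 * x) + 0.1 * x ** 2

def expected_improvement(candidates, gp, best_y, xi=0.01):
    # EI for minimization: expected amount by which we improve over best_y
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    improvement = best_y - mu - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3, 1))            # initial random design
y = objective(X).ravel()
grid = np.linspace(-3, 3, 500).reshape(-1, 1)  # candidate locations

for _ in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                                 # refit the surrogate
    ei = expected_improvement(grid, gp, y.min())
    x_next = grid[np.argmax(ei)].reshape(1, -1)  # maximize the acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())  # evaluate the true function

print("best x:", X[np.argmin(y)].item(), "best y:", y.min())

In HPO, the toy objective would be replaced by a full training-and-validation run for a configuration λ, which is exactly why only a small number of evaluations is affordable.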
Bayesian Optimization with a Gaussian Process

(Figure sequence from [3]: the GP surrogate and the acquisition function updated over successive samples of the true function.)

A (Very Minimal) Formal Definition of a Gaussian Process

A function f(x) is generated by a Gaussian process G if, for any finite set of inputs {x_1, ..., x_n}, the associated vector of function values has a Gaussian distribution:
    f = (f(x_1), ..., f(x_n)) ∼ N(m, K), with m_i = m(x_i) and K_{ij} = k(x_i, x_j),
where m(x) is called the mean function and k(·, ·) is called the covariance function. Finally, we denote a GP by
    f(x) ∼ G(m(x), k(x, x')).

Covariance Function: Intuition

The covariance function controls the shape of the drawn functions (figure from [4]). Consider two extreme cases where the function values are:
- strongly correlated:
    K = \begin{pmatrix} 1 & 0.99 & \cdots & 0.99 \\ 0.99 & 1 & \cdots & 0.99 \\ \vdots & \vdots & \ddots & \vdots \\ 0.99 & 0.99 & \cdots & 1 \end{pmatrix}
- uncorrelated: K = I

Tree-Parzen Estimators I

Parzen density estimation is defined as
    p(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right),
where n is the number of samples x_i, p(x) is the estimated probability density at x, h is the scale (bandwidth) of the x_i, and K is the kernel (a product kernel in TPE). The main idea is to approximate the density by a mixture of kernels (figure from [5]).

Tree-Parzen Estimators II

- TPE uses two density estimators, one for good HP configurations, l(λ), and one for bad ones, g(λ).
- It iterates, and in each iteration the estimates are improved by moving samples between l and g.
- l and g are separated by a threshold (e.g. the best ≈15 % of configurations form l, the rest form g).
- The next configuration is chosen to maximize l(λ)/g(λ), which maximizes the expected improvement (EI).

Speedup Techniques for HPO: Multi-Fidelity Optimization, Successive Halving (SH)

1. Sample N configurations uniformly at random and evaluate them on the cheapest fidelity.
2. Keep the best half (or third) and move them to the next fidelity.
3. Repeat step 2 until the original (full) fidelity is reached. (Figure from [2])

Speedup Techniques for HPO: Extension to SH, Hyperband

What is the problem with SH? Some HP configurations might start slow yet reach the highest performance at higher fidelities. The solution is to run multiple copies of SH in parallel, each starting at a different cheapest fidelity. (Figure from [2])
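Below is a minimal sketch of successive halving; sample_config and train_and_score are hypothetical stand-ins for drawing a random configuration and for a partial training run under a given budget (e.g. epochs or subsample size), and the numeric defaults are illustrative.

# Minimal successive-halving sketch with hypothetical helpers.
import random

def sample_config():
    # hypothetical random HP configuration
    return {"lr": 10 ** random.uniform(-4, -1),
            "depth": random.randint(2, 10)}

def train_and_score(config, budget):
    # placeholder for an expensive partial training run;
    # returns a validation loss (lower is better)
    return random.random() / (budget * config["lr"] * config["depth"])

def successive_halving(n_configs=27, min_budget=1, max_budget=27, eta=3):
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_budget
    while True:
        # evaluate all surviving configurations at the current fidelity
        losses = [train_and_score(c, budget) for c in configs]
        order = sorted(range(len(configs)), key=lambda i: losses[i])
        if budget >= max_budget or len(configs) == 1:
            return configs[order[0]]
        # keep the best 1/eta and promote them to a higher fidelity
        configs = [configs[i] for i in order[:max(1, len(configs) // eta)]]
        budget = min(budget * eta, max_budget)

print(successive_halving())

Hyperband addresses the slow-starter problem by calling this routine several times with different trade-offs between the number of sampled configurations and the minimum budget per configuration.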
Speedup Techniques for HPO: BOHB

- BOHB combines the advantages of Bayesian Optimization and Hyperband.
- BOHB replaces the random selection of configurations at the beginning of each Hyperband iteration with a model.
- The model is a variant of the Tree-structured Parzen Estimator with a product kernel.

HPO Frameworks

Optuna, SMAC, Hyperopt, fastText autotune, and others.

import optuna

# Define a simple 2-dimensional objective function
# whose minimum value is -1 when (x, y) = (0, -1).
def objective(trial):
    x = trial.suggest_float("x", -100, 100)
    y = trial.suggest_categorical("y", [-1, 0, 1])
    return x**2 + y

if __name__ == "__main__":
    # Let us minimize the objective function above.
    study = optuna.create_study()
    study.optimize(objective, n_trials=10)
    print(study.best_params)

AutoML Frameworks

Auto-Sklearn process (figure from [6]).
Libraries and frameworks for AutoML: Auto-Sklearn, Auto-WEKA, Auto-Keras, {Google, Azure, Amazon} AutoML.

Auto-Sklearn Classification Case (figure from [6])

Auto-Sklearn: Classification Example

# example of auto-sklearn usage for a classification problem
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# define the search
model = AutoSklearnClassifier(
    time_left_for_this_task=5 * 60,
    per_run_time_limit=30,
    n_jobs=8)

# perform the search and evaluate
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))

HPO and AutoML for NLP

- FastText HPO (2017) [7]
- Wang et al. [8]: multi-stage HPO with SVMs and boosted regression trees. Published in 2015, before Hyperband (2017). It first uses SH, then standard BO; after running all fidelities, it outputs the best configuration from all HPs explored at all fidelities.
- TextNAS (2020) [9]
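As a concrete illustration of library-level HPO for NLP, here is a short sketch of fastText's automatic hyperparameter tuning ("autotune"). Note the assumptions: autotune is a later addition to the fastText library rather than the method of [7], and the file names are placeholders in fastText's supervised format ("__label__<label> <text>" per line).

# Hypothetical sketch of fastText autotune for a text classifier.
import fasttext

model = fasttext.train_supervised(
    input="cooking.train",
    autotuneValidationFile="cooking.valid",  # HPs are tuned against this split
    autotuneDuration=300,                    # search budget in seconds
)

# test() returns (number of examples, precision@1, recall@1)
n, precision, recall = model.test("cooking.valid")
print(n, precision, recall)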
Closing Remarks: Interesting Resources

- TWIML Podcast: AutoML Zero with Quoc Le [10]
- Distill articles on HPO and GPs [3, 11]
- Hutter's book [2]
- Documentation of Optuna, SMAC, Auto-Sklearn [12, 13, 14]

Bibliography

[1] Galit Shmueli. "To explain or to predict?" In: Statistical Science 25.3 (2010), pp. 289–310.
[2] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
[3] Apoorv Agnihotri and Nipun Batra. "Exploring Bayesian Optimization". In: Distill (2020). https://distill.pub/2020/bayesian-optimization. doi: 10.23915/distill.00026.
[4] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated ML. 2022. url: https://learn.ki-campus.org/courses/automl-luh2021.
[5] Parzen Estimators. 2022. url: https://stats.stackexchange.com/questions/244012/can-you-explain-parzen-window-kernel-density-estimation-in-laymans-terms.
[6] Matthias Feurer et al. "Efficient and robust automated machine learning". In: Advances in Neural Information Processing Systems 28 (2015).
[7] Armand Joulin et al. "FastText.zip: Compressing text classification models". In: arXiv preprint arXiv:1612.03651 (2016).
[8] Lidan Wang et al. "Efficient hyper-parameter optimization for NLP applications". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 2112–2117.
[9] Yujing Wang et al. "TextNAS: A neural architecture search space tailored for text representation". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 05. 2020, pp. 9242–9249.
[10] TWIML AI Podcast: AutoML Zero. 2022. url: https://www.youtube.com/watch?v=XesVQIjQb18&ab_channel=TheTWIMLAIPodcastwithSamCharrington.
[11] Jochen Görtler, Rebecca Kehlbeck, and Oliver Deussen. "A Visual Exploration of Gaussian Processes". In: Distill (2019). https://distill.pub/2019/visual-exploration-gaussian-processes. doi: 10.23915/distill.00017.
[12] Optuna. 2022. url: https://optuna.org/.
[13] SMAC. 2022. url: https://github.com/automl/SMAC3.
[14] AutoSklearn. 2022. url: https://automl.github.io/auto-sklearn/master/.

Thank You for Your Attention!