Introduction to Hyperparameter Optimization and AutoML
Ondřej Sotolář, xsotolar@fi.muni.cz
Faculty of Informatics, Masaryk University
October 17, 2022

Presentation Content
1. Introduction
2. Hyperparameter Optimization (HPO)
3. AutoML and Implementations
4. HPO and AutoML for NLP

Introduction

Machine learning is the science of getting computers to act without being explicitly programmed. (A. Ng)
Machine learning is concerned with predictive analytics, in contrast with many other fields that utilize statistical models [1].
Models (e.g. regression, classification, etc.) are the basic ML tools.
Exercise: What is the model here? (see figure)

Motivation (figure)

AutoML: HPO, CASH (Combined Algorithm Selection and Hyperparameter optimization) (figure)

Model Parameters vs. Hyperparameters I

It is critical to understand the difference between model parameters and hyperparameters. Model parameters are optimized during training, typically via loss minimization; they are the output of the training. Examples:
- the coefficients θ of a linear model f(x) = θᵀx
- the splits and terminal nodes of a decision-tree-based model

Model Parameters vs. Hyperparameters II

In contrast, hyperparameters (HPs) are not determined during training. They must be specified before the training; they are an input of the training. Hyperparameters can in principle influence any structural property of a model or any computational part of the training process. Examples:
- Tree: the maximum depth of the tree
- k nearest neighbours: the number of neighbours k and the distance measure
- Neural networks: the number and type of layers, activation functions, regularization, and many more

Types of Hyperparameters

- Real-valued parameters, e.g. the minimal error improvement in a tree to accept a split
- Integer parameters, e.g. the neighbourhood size k for k-NN
- Categorical parameters, e.g. the split criterion for classification trees
- Hierarchical dependencies between hyperparameters: cast as a nested problem or as multi-criteria optimization

Definition: We summarize all hyperparameters we want to tune over in a vector λ ∈ Λ, where Λ is the space of all possible hyperparameter value combinations. It has as many dimensions as there are hyperparameters.

Manual Search Strategies I: Grid search & Random search (Figure from [2])

Manual Search Strategies II

Exercise: Given this 2D loss surface (see figure), would you prefer grid search or random search?

Bayesian Optimization: Intuition & True Function

Intuition (gold-mining example from [3]): we want to find the richest spot along a line, but drilling is expensive. We consider the real (unknown) distribution as the True Function (or Ground Truth).

Bayesian Optimization: Active Learning, Surrogate Model, and Acquisition Functions

We need a surrogate model to approximate the True Function (figure on the next slide from [3]). Active learning minimizes sampling costs while maximizing performance via uncertainty reduction: it proposes sampling at the point where the surrogate model's uncertainty is highest, with the variance used as the measure of uncertainty. Furthermore, we need an acquisition function for selecting the location in the configuration space where the next sample will be taken. Examples of acquisition functions: Probability of Improvement (PI), Expected Improvement (EI), Upper/Lower Confidence Bound, Thompson sampling, etc.
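To make the loop concrete, the following is a minimal sketch (not code from the lecture) of Bayesian optimization on a toy 1-D problem: a Gaussian-process surrogate from scikit-learn is refit after every sample, and Expected Improvement selects the next point. The toy objective, the Matérn kernel, and all numeric settings are illustrative assumptions.

# Minimal Bayesian-optimization sketch: GP surrogate + Expected Improvement.
# The toy objective stands in for an expensive validation-loss evaluation.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # expensive black-box function we want to minimize (toy example)
    return np.sin(3 * x) + 0.1 * x ** 2

def expected_improvement(candidates, gp, best_y, xi=0.01):
    # EI for minimization: expected amount by which we improve over best_y
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    improvement = best_y - mu - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3, 1))            # initial random design
y = objective(X).ravel()
grid = np.linspace(-3, 3, 500).reshape(-1, 1)  # candidate locations

for _ in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                                 # refit the surrogate
    ei = expected_improvement(grid, gp, y.min())
    x_next = grid[np.argmax(ei)].reshape(1, -1)  # maximize the acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())  # evaluate the true function

print("best x:", X[np.argmin(y)].item(), "best y:", y.min())

In HPO, the toy objective would be replaced by a full training-and-validation run for a configuration λ, which is exactly why only a small number of evaluations is affordable.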
Bayesian Optimization with a Gaussian Process

(Figure sequence from [3]: the GP surrogate and the acquisition function updated over successive samples of the true function.)

A (Very Minimal) Formal Definition of a Gaussian Process

A function f(x) is generated by a Gaussian process G if, for any finite set of inputs {x_1, ..., x_n}, the associated vector of function values has a Gaussian distribution:
    f = (f(x_1), ..., f(x_n)) ∼ N(m, K), with m_i = m(x_i) and K_{ij} = k(x_i, x_j),
where m(x) is called the mean function and k(·, ·) is called the covariance function. Finally, we denote a GP by
    f(x) ∼ G(m(x), k(x, x')).

Covariance Function: Intuition

The covariance function controls the shape of the drawn functions (figure from [4]). Consider two extreme cases where the function values are:
- strongly correlated:
    K = \begin{pmatrix} 1 & 0.99 & \cdots & 0.99 \\ 0.99 & 1 & \cdots & 0.99 \\ \vdots & \vdots & \ddots & \vdots \\ 0.99 & 0.99 & \cdots & 1 \end{pmatrix}
- uncorrelated: K = I

Tree-Parzen Estimators I

Parzen density estimation is defined as
    p(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right),
where n is the number of samples x_i, p(x) is the estimated probability density at x, h is the scale (bandwidth) of the x_i, and K is the kernel (a product kernel in TPE). The main idea is to approximate the density by a mixture of kernels (figure from [5]).

Tree-Parzen Estimators II

- TPE uses two density estimators, one for good HP configurations, l(λ), and one for bad ones, g(λ).
- It iterates, and in each iteration the estimates are improved by moving samples between l and g.
- l and g are separated by a threshold (e.g. the best ≈15 % of configurations form l, the rest form g).
- The next configuration is chosen to maximize l(λ)/g(λ), which maximizes the expected improvement (EI).

Speedup Techniques for HPO: Multi-Fidelity Optimization, Successive Halving (SH)

1. Sample N configurations uniformly at random and evaluate them on the cheapest fidelity.
2. Keep the best half (or third) and move them to the next fidelity.
3. Repeat step 2 until the original (full) fidelity is reached. (Figure from [2])

Speedup Techniques for HPO: Extension to SH, Hyperband

What is the problem with SH? Some HP configurations might start slow yet reach the highest performance at higher fidelities. The solution is to run multiple copies of SH in parallel, each starting at a different cheapest fidelity. (Figure from [2])
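Below is a minimal sketch of successive halving; sample_config and train_and_score are hypothetical stand-ins for drawing a random configuration and for a partial training run under a given budget (e.g. epochs or subsample size), and the numeric defaults are illustrative.

# Minimal successive-halving sketch with hypothetical helpers.
import random

def sample_config():
    # hypothetical random HP configuration
    return {"lr": 10 ** random.uniform(-4, -1),
            "depth": random.randint(2, 10)}

def train_and_score(config, budget):
    # placeholder for an expensive partial training run;
    # returns a validation loss (lower is better)
    return random.random() / (budget * config["lr"] * config["depth"])

def successive_halving(n_configs=27, min_budget=1, max_budget=27, eta=3):
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_budget
    while True:
        # evaluate all surviving configurations at the current fidelity
        losses = [train_and_score(c, budget) for c in configs]
        order = sorted(range(len(configs)), key=lambda i: losses[i])
        if budget >= max_budget or len(configs) == 1:
            return configs[order[0]]
        # keep the best 1/eta and promote them to a higher fidelity
        configs = [configs[i] for i in order[:max(1, len(configs) // eta)]]
        budget = min(budget * eta, max_budget)

print(successive_halving())

Hyperband addresses the slow-starter problem by calling this routine several times with different trade-offs between the number of sampled configurations and the minimum budget per configuration.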
Speedup Techniques for HPO: BOHB

- BOHB combines the advantages of Bayesian Optimization and Hyperband.
- BOHB replaces the random selection of configurations at the beginning of each Hyperband iteration with a model.
- The model is a variant of the Tree-structured Parzen Estimator with a product kernel.

HPO Frameworks

Optuna, SMAC, Hyperopt, fastText autotune, and others.

import optuna

# Define a simple 2-dimensional objective function
# whose minimum value is -1 when (x, y) = (0, -1).
def objective(trial):
    x = trial.suggest_float("x", -100, 100)
    y = trial.suggest_categorical("y", [-1, 0, 1])
    return x**2 + y

if __name__ == "__main__":
    # Let us minimize the objective function above.
    study = optuna.create_study()
    study.optimize(objective, n_trials=10)
    print(study.best_params)

AutoML Frameworks

Auto-Sklearn process (figure from [6]).
Libraries and frameworks for AutoML: Auto-Sklearn, Auto-WEKA, Auto-Keras, {Google, Azure, Amazon} AutoML.

Auto-Sklearn Classification Case (figure from [6])

Auto-Sklearn: Classification Example

# example of auto-sklearn usage for a classification problem
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# define the search
model = AutoSklearnClassifier(
    time_left_for_this_task=5 * 60,
    per_run_time_limit=30,
    n_jobs=8)

# perform the search and evaluate
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))

HPO and AutoML for NLP

- FastText HPO (2017) [7]
- Wang et al. [8]: multi-stage HPO with SVMs and boosted regression trees. Published in 2015, before Hyperband (2017). It first uses SH, then standard BO; after running all fidelities, it outputs the best configuration from all HPs explored at all fidelities.
- TextNAS (2020) [9]
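As a concrete illustration of library-level HPO for NLP, here is a short sketch of fastText's automatic hyperparameter tuning ("autotune"). Note the assumptions: autotune is a later addition to the fastText library rather than the method of [7], and the file names are placeholders in fastText's supervised format ("__label__<label> <text>" per line).

# Hypothetical sketch of fastText autotune for a text classifier.
import fasttext

model = fasttext.train_supervised(
    input="cooking.train",
    autotuneValidationFile="cooking.valid",  # HPs are tuned against this split
    autotuneDuration=300,                    # search budget in seconds
)

# test() returns (number of examples, precision@1, recall@1)
n, precision, recall = model.test("cooking.valid")
print(n, precision, recall)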
Closing Remarks: Interesting Resources

- TWIML Podcast: AutoML Zero with Quoc Le [10]
- Distill articles on HPO and GPs [3, 11]
- Hutter's book [2]
- Documentation of Optuna, SMAC, Auto-Sklearn [12, 13, 14]

Bibliography

[1] Galit Shmueli. "To explain or to predict?" In: Statistical Science 25.3 (2010), pp. 289–310.
[2] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
[3] Apoorv Agnihotri and Nipun Batra. "Exploring Bayesian Optimization". In: Distill (2020). https://distill.pub/2020/bayesian-optimization. doi: 10.23915/distill.00026.
[4] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated ML. 2022. url: https://learn.ki-campus.org/courses/automl-luh2021.
[5] Parzen Estimators. 2022. url: https://stats.stackexchange.com/questions/244012/can-you-explain-parzen-window-kernel-density-estimation-in-laymans-terms.
[6] Matthias Feurer et al. "Efficient and robust automated machine learning". In: Advances in Neural Information Processing Systems 28 (2015).
[7] Armand Joulin et al. "FastText.zip: Compressing text classification models". In: arXiv preprint arXiv:1612.03651 (2016).
[8] Lidan Wang et al. "Efficient hyper-parameter optimization for NLP applications". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 2112–2117.
[9] Yujing Wang et al. "TextNAS: A neural architecture search space tailored for text representation". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 05. 2020, pp. 9242–9249.
[10] TWIML AI Podcast: AutoML Zero. 2022. url: https://www.youtube.com/watch?v=XesVQIjQb18&ab_channel=TheTWIMLAIPodcastwithSamCharrington.
[11] Jochen Görtler, Rebecca Kehlbeck, and Oliver Deussen. "A Visual Exploration of Gaussian Processes". In: Distill (2019). https://distill.pub/2019/visual-exploration-gaussian-processes. doi: 10.23915/distill.00017.
[12] Optuna. 2022. url: https://optuna.org/.
[13] SMAC. 2022. url: https://github.com/automl/SMAC3.
[14] AutoSklearn. 2022. url: https://automl.github.io/auto-sklearn/master/.

Thank You for Your Attention!