# Tutorial 11 – Hyper-parameters

In this tutorial we will take a look at hyper-parameters, how to tune hyper-parameters automatically, and Support Vector Machines.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set()  # make plots nicer

## Hyper-parameters
Hyper-parameters are part of almost any model. Unlike parameters, hyper-parameters are not optimized during model training (calling the `fit` method); instead, they stay fixed throughout training. Hyper-parameters are typically set during model construction by passing them as constructor arguments. We have used many hyper-parameters in previous tutorials; here are a few examples.

* `DecisionTreeClassifier`: `max_depth`, `criterion`
* `SimpleImputer`: `strategy`
* `KNNImputer`: `n_neighbors`
* `KNeighborsClassifier`: `n_neighbors`
* `KMeans`: `n_clusters`
* `DBSCAN`: `eps`, `min_samples`

As the list above shows, hyper-parameters are everywhere, even in preprocessing. Note that other sklearn classes also have hyper-parameters, which are set to sensible defaults when you do not explicitly pass any arguments to the constructor.
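
As a minimal sketch of the point above, the snippet below sets one hyper-parameter explicitly and then lists all of them, including the defaults, using the estimator's `get_params` method. The choice of `max_depth=3` is arbitrary and serves only as an illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Hyper-parameters are passed to the constructor; nothing is learned at this point.
tree = DecisionTreeClassifier(max_depth=3)

# get_params() returns every hyper-parameter, including defaults we did not set.
params = tree.get_params()
print(params["max_depth"])  # the value passed explicitly
print(params["criterion"])  # a default chosen by scikit-learn
print(sorted(params))       # names of all available hyper-parameters
```

The same `get_params` call works on any scikit-learn estimator or transformer, so it is a quick way to see what could be tuned.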

Hyper-parameters can have a profound impact on model performance. A model might be very sensitive to some hyper-parameters (i.e., produce very different results depending on the actual value) and insensitive to others. For example, changing `criterion` in `DecisionTreeClassifier` will usually produce similar results, but changing `max_depth` can result in a very different tree. A model can also be more sensitive to changes in one range of values than in another. For example, changing `max_depth` from 2 to 3 is a more radical change than changing it from 20 to 25.
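
One rough way to see this sensitivity is to count the leaves of trees trained with different `max_depth` values. The sketch below uses synthetic data from `make_classification` (an assumption made only for illustration; it is not the dataset used in this tutorial), and the names `X_demo`, `y_demo`, and `leaves_by_depth` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

leaves_by_depth = {}
for max_depth in (2, 3, 20, 25):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_demo, y_demo)
    # A binary tree of depth d can have at most 2**d leaves.
    leaves_by_depth[max_depth] = tree.get_n_leaves()

print(leaves_by_depth)
```

For shallow trees the depth limit directly caps the number of leaves, while for large limits the tree usually stops growing for other reasons (for example, pure leaves), so the limit may no longer matter.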

Unfortunately, there is no good way of knowing in advance which hyper-parameter values will work well. It all depends on the dataset, i.e., the number of examples, the number of features, feature distributions, separability, and other characteristics. You need to test different values and select the best-performing one.

**NOTE:** Tuning hyper-parameters is one of the last steps when building a machine learning model. It does not make sense to play with hyper-parameters if you plan to, for example, extract additional features later because then you would need to tune the hyper-parameters again!

Let's load the Pima Indians diabetes dataset and play with hyper-parameters.

In [None]:
diabetes = pd.read_csv("https://www.fi.muni.cz/ib031/datasets/diabetes.csv")

# Zeros in these columns stand for missing measurements, so mark them as NaN.
# Assigning the result back avoids chained `inplace` calls, which newer pandas versions may not apply.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
diabetes[zero_as_missing] = diabetes[zero_as_missing].replace(0, np.nan)

diabetes.dropna(inplace=True)
diabetes.reset_index(drop=True, inplace=True)

diabetes.head()

In [None]:
from sklearn.model_selection import train_test_split

diabetes_X, diabetes_y = diabetes.drop(columns="Outcome"), diabetes.Outcome

diabetes_train_X, diabetes_test_X, diabetes_train_y, diabetes_test_y = train_test_split(
    diabetes_X, diabetes_y, test_size=0.2, random_state=42
)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score


dtc = DecisionTreeClassifier()
dtc.fit(diabetes_train_X, diabetes_train_y)
# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
print(f1_score(diabetes_train_y, dtc.predict(diabetes_train_X), average=None))
print(f1_score(diabetes_test_y, dtc.predict(diabetes_test_X), average=None))

<div class="alert alert-block alert-warning"><h5><b>Exercise 1</b></h5></div>

Try different combinations of hyper-parameters and observe the `DecisionTreeClassifier`'s performance. Look into the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) to find which hyper-parameters and values are available. To which hyper-parameters is the model sensitive, and to which is it insensitive? **\*Write your findings as a comment below.\***

In [None]:
# TODO: your findings go here...

This is quite a tedious process, and there are a lot of hyper-parameter combinations. This means two things:
1. It would be good to automatically go through all the combinations of hyper-parameters and select the best one.
2. It would be good if the best hyper-parameters generalized well instead of merely fitting the data we have.

Luckily, both of them are covered by grid search with cross validation.

## Grid Search
The basic idea of grid search is very simple: it runs an exhaustive search by training and evaluating the model with every possible combination of predefined hyper-parameter values. It is computationally expensive because the number of combinations grows exponentially with the number of hyper-parameters explored. For example, say we would like to tune 5 different hyper-parameters, each with 4 possible values. The grid search will have to train and evaluate $4^5 = 1024$ models. That's a lot of models.
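
To make the combinatorial growth concrete, the sketch below enumerates such a grid with scikit-learn's `ParameterGrid`. The hyper-parameter names `param_0` to `param_4` and their values are placeholders invented for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# Five hypothetical hyper-parameters, each with four candidate values.
grid = {f"param_{i}": [1, 2, 3, 4] for i in range(5)}

combinations = ParameterGrid(grid)
n_combinations = len(combinations)

print(n_combinations)         # number of models to train without cross-validation
print(list(combinations)[0])  # one concrete combination, as a dict
```

Adding a sixth hyper-parameter with four values would multiply the count by four again, which is why grids are usually kept small.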

To reduce over-fitting, each hyper-parameter combination is evaluated with $k$-fold cross-validation. This ensures that the selected hyper-parameters perform well across different portions of the data and hopefully generalize well. The total number of models that need to be trained and evaluated is then $k$ times higher.
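
As a small sketch of what cross-validating a single combination looks like, the code below scores one fixed setting of a decision tree on $k = 5$ folds with `cross_val_score`. The synthetic data and the names `X_demo`, `y_demo`, and `fold_scores` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# One fixed hyper-parameter combination, trained and scored on 5 different folds.
fold_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X_demo,
    y_demo,
    cv=5,
    scoring="f1",
)

print(fold_scores)         # one F1 score per fold
print(fold_scores.mean())  # the average used to compare combinations
```

Grid search repeats this procedure for every combination in the grid and keeps the one with the best average score.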

Grid search is a viable option only on small datasets, with simple models, or for a few hyper-parameters. Say training the model takes around one second. Then the grid search from the example above with 5-fold CV will take at least $4^5 \cdot 5 \cdot 1 = 5120$ seconds (roughly an hour and a half).
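
The pieces above come together in `GridSearchCV`. The following is a minimal sketch on synthetic data with a `KNeighborsClassifier`; the model, the grid values, and the names `X_demo`, `y_demo`, and `search` are chosen purely for illustration and are not part of the exercises.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {
    "n_neighbors": [1, 5, 15],
    "weights": ["uniform", "distance"],
}

# 3 * 2 combinations, each evaluated on 5 folds, plus one final refit on all data.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated F1 score
```

After fitting, `search` behaves like the best model refitted on the whole training data, so `search.predict` can be used directly on a held-out test set.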

<div class="alert alert-block alert-warning"><h5><b>Exercise 2</b></h5></div>

Use [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with **\*10-fold\*** cross-validation to select the best hyper-parameter combination for the decision tree classifier from Exercise 1. Use the **\*F1\*** measure of the positive class for optimization, and finally evaluate the model with the best hyper-parameter combination using **\*F1\*** on the test set. Explore the following hyper-parameter values in the grid search.
* maximal depth: 3, 5, 10, 20, 50
* minimal number of samples in leaf: 1, 3, 5, 7, 10
* split criterion: Gini index, information gain (entropy)
* Cost-complexity pruning threshold: 0.0, 0.001, 0.01, 0.1, 0.3

Look into the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) or use the `get_params` method of `DecisionTreeClassifier` to get the hyper-parameter names.

The expected best hyper-parameter combination is `{'ccp_alpha': 0.0, 'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 5}` (there can be small variations due to randomness in fold generation and in the decision tree itself; `ccp_alpha` of 0.001 and `min_samples_leaf` of 7 are also correct results), and the expected F1 score of the positive class on the test set is $0.655$.

In [None]:
# TODO: your code goes here...

A decision tree is a fairly simple model, and tree depth is the single most important aspect of the model that can be changed through hyper-parameters. Let's take a look at a new model with plenty of hyper-parameters to choose from.

## Support Vector Machines (SVM)
SVM is a linear model that separates the classes so that the *margin* (the distance from the decision boundary to the closest example of either class) is maximal. A wide margin is good for generalization because uncommon observations that lie further from the typical class examples are still likely to be classified correctly. The margin is fully described by *support vectors* (the examples from each class "closest" to the decision boundary), and that is where the model got its name.
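
The idea can be checked on a tiny, linearly separable toy set. The sketch below is an illustration under assumptions: the data points are made up, and a very large `C` is used to approximate a hard margin. For a linear SVM with weight vector $\mathbf{w}$, the distance from the decision boundary to the closest example is $1 / \lVert \mathbf{w} \rVert$.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable data: class 0 on the left, class 1 on the right.
X_toy = np.array(
    [[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]]
)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM.
svm = SVC(kernel="linear", C=1e6)
svm.fit(X_toy, y_toy)

# Distance from the decision boundary to the closest example: 1 / ||w||.
margin = 1 / np.linalg.norm(svm.coef_)

print(svm.support_vectors_)  # the examples that determine the boundary
print(margin)
```

Here the closest points of the two classes are 2 units apart along the first axis, so the boundary should sit halfway between them with a margin close to 1.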

SVMs often use kernels to produce nonlinear decision boundaries. They rely on the so-called kernel trick, where computing a kernel function $\kappa(\mathbf{\tilde{x}_k}, \mathbf{\tilde{x}_l})$ is equivalent to computing the dot product $\phi(\mathbf{\tilde{x}_k}) \cdot \phi(\mathbf{\tilde{x}_l})$ of instances $\mathbf{\tilde{x}_k}$ and $\mathbf{\tilde{x}_l}$ transformed into a different feature space using $\phi$. The principle is similar to how we used `PolynomialFeatures` to engineer new features (i.e., transform instances into a different feature space) before applying linear regression. Kernels, however, do not require explicit computation of the transformed instances.
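
A small numerical check may help here. The sketch below uses the homogeneous polynomial kernel of degree 2, $\kappa(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$, for which an explicit feature map on 2-D inputs is $\phi(\mathbf{x}) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. The helper names `phi` and `poly2_kernel` and the two example vectors are hypothetical.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative helper)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z) ** 2."""
    return np.dot(x, z) ** 2


x_k = np.array([1.0, 2.0])
x_l = np.array([3.0, -1.0])

explicit = np.dot(phi(x_k), phi(x_l))  # dot product in the transformed space
implicit = poly2_kernel(x_k, x_l)      # same value, without ever computing phi

print(explicit, implicit)
```

Both values agree, but the kernel needs only one dot product in the original space. For kernels such as RBF the corresponding feature space is infinite-dimensional, so the explicit route is not even available.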

Typical kernels are the Gaussian kernel (Radial Basis Function, RBF) and the linear kernel. The RBF kernel replaces the dot product with the value of a Gaussian function of the distance between the two examples, while the linear kernel keeps the dot product unchanged. Other kernels exist, and you can even write your own. Choosing the right kernel is a bit of black magic (RBF is often a good choice), and this is where grid search comes in handy.
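
To see what these two kernels compute, the sketch below evaluates them on three made-up points with the pairwise kernel helpers from `sklearn.metrics.pairwise`. The RBF kernel there is $\exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert^2)$; the value `gamma=0.5` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])

# Linear kernel: plain dot products between all pairs of rows.
K_linear = linear_kernel(points)

# RBF kernel: exp(-gamma * squared Euclidean distance between rows).
K_rbf = rbf_kernel(points, gamma=0.5)

print(K_linear)
print(K_rbf.round(3))
```

The RBF values equal 1 on the diagonal (zero distance) and decay toward 0 as points get further apart, so the kernel acts as a similarity measure. In `SVC` the `gamma` hyper-parameter controls how fast this decay is.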

Apart from the kernel, it is also crucial to set an appropriate value of the `C` hyper-parameter, which governs the trade-off between bias and variance. Large values of `C` mean high variance, low bias, and possible over-fitting; small values of `C` mean low variance, high bias, and possible under-fitting.
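
A quick numerical way to observe the effect of `C` is to compare the number of support vectors and the training accuracy for a few values. The sketch below is only an illustration: it uses all four features of the Iris data bundled with scikit-learn, and the names `X_iris`, `y_iris`, and `summary` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_iris = StandardScaler().fit_transform(X_iris)

summary = {}
for C in (0.01, 1, 100):
    svm = SVC(kernel="rbf", C=C).fit(X_iris, y_iris)
    summary[C] = {
        # n_support_ holds the number of support vectors per class.
        "n_support_vectors": int(svm.n_support_.sum()),
        # Accuracy on the training data itself, not a generalization estimate.
        "train_accuracy": svm.score(X_iris, y_iris),
    }

for C, stats in summary.items():
    print(C, stats)
```

Small `C` typically leaves many examples inside the margin, so most of them become support vectors, while large `C` tends to fit the training data more tightly with fewer support vectors. Training accuracy alone cannot reveal over-fitting, which is why cross-validation is needed when tuning `C`.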

Let's examine how different kernels and values of C affect decision boundaries.

In [None]:
iris_df = sns.load_dataset("iris")
sns.scatterplot(data=iris_df, x="petal_width", y="sepal_width", hue="species")

We have the Iris dataset, and we will classify three species of iris flowers based on their sepal and petal widths.

In [None]:
from sklearn.preprocessing import StandardScaler

X, y = (
    StandardScaler().fit_transform(iris_df[["petal_width", "sepal_width"]]),
    iris_df["species"].astype("category").cat.codes,
)

In [None]:
from matplotlib.colors import ListedColormap
from sklearn.svm import SVC


def make_meshgrid(x, y, h=0.01):
    x_min, x_max = x.min() - 0.5, x.max() + 0.5
    y_min, y_max = y.min() - 0.5, y.max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(clf, xx, yy, ax, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    levels = list(np.unique(Z) - 0.5) + [Z.max() + 0.5]
    ax = plt if ax is None else ax
    ax.contourf(xx, yy, Z, levels=levels, **params)


def plot_svm(kernel, C, X, y, ax=None):
    svm = SVC(kernel=kernel, C=C)
    svm.fit(X, y)
    is_support = np.isin(np.arange(len(X)), svm.support_)

    xx, yy = make_meshgrid(X[:, 0], X[:, 1])
    plot_contours(svm, xx, yy, colors=sns.color_palette(), ax=ax, alpha=0.5)
    with sns.plotting_context("talk"):
        sns.scatterplot(
            x=X[:, 0],
            y=X[:, 1],
            hue=y,
            style=is_support,
            markers={False: "o", True: "X"},
            palette="muted",
            legend=False,
            ax=ax,
        )
    if ax is not None:
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(f"kernel={kernel}, C={C}")

In [None]:
fig, sub = plt.subplots(4, 5, figsize=(15, 12))

for i, kernel in enumerate(("linear", "rbf", "poly", "sigmoid")):
    for j, C in enumerate((0.01, 0.1, 1, 10, 100)):
        plot_svm(kernel, C, X, y, ax=sub[i][j])

In the figure above, each color represents a different species (class). The circles and crosses are training instances; the crosses are the support vectors. Each colored area is the region where points are predicted to belong to the class of that color.

You can see that each kernel's decision boundary is a bit different. The linear kernel has straight lines, RBF has nicely rounded "bubbles" around each species, the polynomial kernel is a bit more pointy, and the sigmoid kernel is a bit wild for higher `C` values. The bias-variance trade-off is very apparent: with `C` of 0.01 the bias is high and all kernels partition the space very similarly, while with `C` of 100 the variance is high. Notice how, for higher `C`, a single observation in the bottom left corner is able to shift the decision boundary away from itself.

Let's apply SVM to a real-life dataset.

In [None]:
shoppers = pd.read_csv(
    "https://www.fi.muni.cz/ib031/datasets/online_shoppers_intention.csv"
)
shoppers.head()

Here we have data about online shopping. Each row describes a session of a single user (shopper) along with metadata about the user. Features range from which other products the user has visited so far to which browser the user was using. The last column is a boolean variable indicating whether the user generated revenue for the e-shop, i.e., bought a product.

Some features are categorical and we need to convert them.

In [None]:
for column in ("Month", "Browser", "Region", "TrafficType", "VisitorType"):
    shoppers[column] = shoppers[column].astype("category")
shoppers.info()

In [None]:
shoppers.Revenue.value_counts()

The classes are quite imbalanced.
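
One common remedy, and the one Exercise 3 asks for, is the `class_weight="balanced"` setting. The sketch below shows how such weights are derived, using `compute_class_weight` on made-up labels (8 negative and 2 positive examples, an assumption for illustration only).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 8 examples of class 0 and 2 examples of class 1.
y_toy = np.array([0] * 8 + [1] * 2)

classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# "balanced" computes n_samples / (n_classes * count_of_class) for each class.
print(dict(zip(classes.tolist(), weights.tolist())))
```

The rare class receives a proportionally larger weight, so mistakes on it are penalized more heavily during training.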

Make a train-test split.

In [None]:
shoppers_X, shoppers_y = shoppers.drop(columns="Revenue"), shoppers.Revenue

shoppers_train_X, shoppers_test_X, shoppers_train_y, shoppers_test_y = train_test_split(
    shoppers_X, shoppers_y, test_size=0.2, random_state=42
)

<div class="alert alert-block alert-warning"><h5><b>Exercise 3</b></h5></div>

Use a [support vector machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) to classify user sessions that lead to buying a product (revenue). Do not forget to **\*one-hot encode categorical\*** features, **\*scale the rest\*** of the features, and set **\*class_weight\*** to "balanced" to account for imbalanced classes.

The expected F1 score on the test set is $0.92$ for sessions resulting in no revenue and $0.66$ for sessions ending with revenue.

In [None]:
# TODO: your code goes here...

<div class="alert alert-block alert-warning"><h5><b>Exercise 4</b></h5></div>

Run a [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with **\*5-fold CV\*** to tune the SVM's hyper-parameters from Exercise 3. Try different values of `C`, `gamma`, `kernel`, `degree`, or `coef0` and optimize the **\*F1\*** score. Evaluate the model with the best parameters using **\*F1 score\*** on the test set.

You should see a slight increase in the F1 score for both classes (in the 2nd decimal place), depending on what values you try. The computation will take a long time, so start with only a few values for a few hyper-parameters. You can set the parameter `n_jobs=-1` for `GridSearchCV` to speed up computation on multi-core processors.

In [None]:
# TODO: your code goes here...