# Tutorial 05 – Preprocessing & Instance Based Learning

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

## Preprocessing

Preprocessing is a crucial part of any machine learning project. Having clean, high-quality data is often more beneficial than using a complex state-of-the-art machine learning model on poor data. After all, machine learning is just statistics on steroids, and while you can do statistics on random noise, any conclusions will be worthless.

### Standardization and Scaling

This preprocessing step transforms feature values so that they have a "nicer" distribution. There is no universal definition of a nice distribution, but some models we will encounter later in the semester require, for example, features to be centered around zero and have a normal distribution. Centering and unit variance can be achieved through a transformation known as standardization, which we saw in tutorial 03. There we used [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) to transform each feature to have zero mean and unit standard deviation. Note that standardization is a linear transformation, so it does not change the shape of the distribution.
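As a quick illustration of what standardization does and does not do, here is a minimal sketch on a made-up, right-skewed feature (the exponential sample and the variable names are assumptions for illustration, not data from this tutorial). The mean and standard deviation become 0 and 1, but the skewness stays far from the value of 0 that a normal distribution would have.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature, stored as a single column
x = rng.exponential(scale=3.0, size=(1000, 1))

z = StandardScaler().fit_transform(x)

print(f"mean: {z.mean():.3f}, std: {z.std():.3f}")

# Skewness is the mean of the cubed standardized values;
# a normally distributed feature would give a value close to 0.
skewness = float(np.mean(z ** 3))
print(f"skewness after standardization: {skewness:.2f}")
```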

Other models (including KNN in this tutorial) require features to be of the same magnitude. Otherwise, features with high magnitudes will dominate and the model will be biased towards deciding only according to these features. `scikit-learn` provides [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) and [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) for scaling features to the same magnitude. This way the relative differences between examples are preserved, but the differences in magnitude between features are removed.

For example, suppose we have a dataset of cake recipes with the amount of flour in grams and the number of eggs. Then the difference between 400 g and 415 g of flour (roughly one tablespoon) would be the same for the model as the difference between 5 and 20 eggs. After scaling, the difference in flour content would be smaller than 0.1, but the difference in eggs would be around 1.
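The cake example can be sketched in a few lines. The four recipes below are made up for illustration (the column names `flour_g` and `eggs` are assumptions); the point is only to compare the raw and the min-max scaled differences between the two recipes mentioned above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical recipes; the values are invented for illustration
recipes = pd.DataFrame(
    {
        "flour_g": [250, 400, 415, 500],
        "eggs": [2, 5, 20, 3],
    }
)

scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(recipes), columns=recipes.columns
)

# Compare the recipe with 400 g / 5 eggs to the one with 415 g / 20 eggs
raw_diff = (recipes.iloc[2] - recipes.iloc[1]).abs()
scaled_diff = (scaled.iloc[2] - scaled.iloc[1]).abs()

print(raw_diff)     # both raw differences are 15
print(scaled_diff)  # flour: 15 / 250 = 0.06, eggs: 15 / 18 (about 0.83)
```

In raw units a distance-based model sees both differences as equally large; after scaling, the difference in eggs dominates, as it should.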

All scalers mentioned above are sensitive to outliers. So unless you remove outliers before scaling, you might want to use a more robust scaler. One such scaler is [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler), which scales values based on quantiles (by default, the median and the interquartile range).
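To see the effect of an outlier, the following sketch compares `MinMaxScaler` and `RobustScaler` on a tiny made-up feature with one extreme value. The data is hypothetical and both scalers are used with their default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical feature with a single large outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# Min-max scaling squeezes the regular values close to 0
print(np.round(minmax, 3))
# Robust scaling (median 3, interquartile range 2) keeps their spread
print(np.round(robust, 3))
```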

Another, more exotic variant of scaling is quantile scaling done by [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer). Here the feature values are transformed to match a desired output distribution. By default this is the uniform distribution on [0, 1]. This transformation is non-linear, meaning relative differences between feature values are not preserved. On the other hand, the value rank (or ordering) is preserved. This kind of transformation is useful if feature values are not normal but your model requires normally distributed feature values (set `output_distribution="normal"` in that case).
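A minimal sketch of quantile scaling on a made-up skewed feature follows. The exponential sample, `n_quantiles=100`, and the variable names are assumptions chosen for illustration. It checks the two properties described above: the output lies in [0, 1] and the ordering of the values is preserved.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature
x = rng.exponential(scale=3.0, size=(500, 1))

qt = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
)
u = qt.fit_transform(x).ravel()

# Sort by the original values and check that the transformed values
# never decrease (a tiny tolerance allows for floating-point noise).
order = np.argsort(x.ravel())
rank_preserved = bool(np.all(np.diff(u[order]) >= -1e-12))

print(f"min: {u.min():.2f}, max: {u.max():.2f}")
print(f"rank preserved: {rank_preserved}")
```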

Take a look at the [transformation examples](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#) in the `scikit-learn` documentation. You can also read more about their properties [here](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling).

### Missing Values

Missing values are a common phenomenon in real-life data. A measuring sensor can malfunction, people forget to fill out a few questions, or some information may be classified or censored to protect privacy. There is, however, a key distinction between the mechanisms that cause missing data, based on what data is actually missing.

#### Missing Completely at Random (MCAR)
In this case there is no relation between what is missing and the true value that would otherwise be recorded. A typical example is a faulty sensor that sometimes stops working for no apparent reason.

#### Missing at Random (MAR)
In this case there is a relation between what data is missing and what data is observed, but not the actual missing values. So, for example, if men are more likely to tell you their weight than women, weight is MAR.

#### Missing Not at Random (MNAR)
In this case there is a relation between what data is missing and the true missing values. Let's say we have a temperature sensor that stops working if the ambient temperature is above 35 °C. Data from this sensor is MNAR, and any analysis done on it will be biased towards temperatures below 35 °C, where the sensor is working and recording data.

You can read more on these mechanisms in this more statistical [blog post](https://www.theanalysisfactor.com/missing-data-mechanism/).

Knowing the mechanism behind missing values is crucial if we want to draw correct conclusions or train an unbiased machine learning model. While MCAR and MAR values can be imputed with varying difficulty, MNAR values are almost impossible to impute.

#### Practical examples
Let's create artificial missing values using the different mechanisms. We will use a simple dataset with data about three different species of iris flower.

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical(iris.target_names[iris.target])
iris_df.head()

To simplify things a bit, let's add missing values only to the petal length feature and always make 30 % of its values missing.

MCAR

In [None]:
np.random.seed(42)
na_indices = np.random.choice(len(iris_df), int(len(iris_df) * 0.3), replace=False)
iris_mcar = iris_df.copy()
iris_mcar.iloc[na_indices, 2] = np.nan

MAR, data is missing based on species (an observed variable) but not on the value of the feature itself

In [None]:
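np.random.seed(42)
# Sampling weights depend only on the observed species (codes 0, 1, 2),
# so setosa (code 0) never loses a value and virginica loses the most.
species_weights = iris_df["species"].cat.codes.astype(float)
na_indices = np.random.choice(
    len(iris_df),
    int(len(iris_df) * 0.3),
    replace=False,
    p=species_weights / species_weights.sum(),
)
iris_mar = iris_df.copy()
iris_mar.iloc[na_indices, 2] = np.nan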

MNAR, higher values of feature are more likely to be missing (4th power is just to exaggerate a bit)

In [None]:
np.random.seed(42)
na_indices = np.random.choice(
    len(iris_df),
    int(len(iris_df) * 0.3),
    replace=False,
    p=iris_df["petal length (cm)"] ** 4 / sum(iris_df["petal length (cm)"] ** 4),
)
iris_mnar = iris_df.copy()
iris_mnar.iloc[na_indices, 2] = np.nan

We can inspect the data both visually

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10, 10), sharex=True, sharey=True)
fig.suptitle("Comparison of different missing values")

for ax, title, plot_data in zip(
    axes.flatten(),
    ("No missing data", "MCAR", "MAR", "MNAR"),
    (iris_df, iris_mcar, iris_mar, iris_mnar),
):
    sns.scatterplot(
        data=plot_data,
        x="sepal width (cm)",
        y="petal length (cm)",
        hue="species",
        ax=ax,
    )
    ax.set_title(title)

and using statistics.

In [None]:
datasets = {
    "No missing data": iris_df,
    "MCAR": iris_mcar,
    "MAR": iris_mar,
    "MNAR": iris_mnar,
}

for title, data in datasets.items():
    print(title)
    # observed=True avoids a FutureWarning that newer pandas versions
    # raise when grouping by a categorical column
    print(data.groupby("species", observed=True)["petal length (cm)"].mean())

Notice the decrease in mean petal length in the MNAR case, caused by systematically missing data: higher values were more likely to be missing.

### What to do with missing data
There are two main approaches.

1. Discard any examples and/or features with missing data.
2. Impute missing data with some plausible values.

Option 1) is a bit crude and will leave us with less data to train the model. It is, however, an acceptable approach in cases where a) we have a lot of data and only a few missing values or b) the feature has mostly missing values.

Option 2) is more typical and aims at filling in the missing data (imputing) with values that are at least plausible. In tutorial 03 we used [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), which imputes values based on simple statistics (mean, median, mode) or a user-defined constant. This is fine as long as the features are independent. If the features are correlated or otherwise related, `SimpleImputer` might be too simple.

Luckily, `scikit-learn` provides two more sophisticated imputers – [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#) and [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#). `IterativeImputer` tries to model each feature as a function of the remaining features and predicts the missing values. `KNNImputer` looks at the `k` closest examples (neighbors) in the data and imputes the average of the feature values of these neighbors (uniformly weighted by default, optionally weighted by distance). Note that `IterativeImputer` is still a bit experimental and needs to be explicitly enabled by importing from `sklearn.experimental` (see below).
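The difference between the three imputers is easiest to see on a toy example. The sketch below uses a made-up array in which the second feature is exactly twice the first, so the hidden value would be 8. The data and the parameter choices (`n_neighbors=2`, `random_state=0`) are assumptions for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Hypothetical toy data: second feature = 2 * first feature.
# The value hidden behind NaN would be 8.
X = np.array(
    [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [5.0, 10.0]]
)

imputers = {
    "simple": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(random_state=0),
}

# Value that each imputer fills into row 3, column 1
imputed = {name: imp.fit_transform(X)[3, 1] for name, imp in imputers.items()}

for name, value in imputed.items():
    print(f"{name}: {value:.2f}")
```

`SimpleImputer` ignores the relation between the features and fills in the column mean (5.5), while the other two imputers use the first feature and end up at or near 8.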

<div class="alert alert-block alert-warning"><h5><b>Exercise 1</b></h5></div>

We would like to evaluate the performance of each imputer against the original values we removed. Use all **three imputers** to impute missing values in the **three datasets** with missing values created above. Next, evaluate each imputer's performance by **computing the RMSE** between the imputed values and the actual values in the original dataset, i.e., how close the imputed values are to the actual ones.

You can use the RMSE function from previous tutorials or the one from `scikit-learn`. Do not forget to deal with the categorical column `species` before applying the iterative and KNN imputers. Expected RMSE values are in the table below.

| Imputer | Dataset | RMSE |
|---------|---------|------|
|Simple   | MCAR    | 1.027|
|Iterative| MCAR    | 0.153|
|KNN      | MCAR    | 0.138|
|Simple   | MAR     | 1.044|
|Iterative| MAR     | 0.150|
|KNN      | MAR     | 0.157|
|Simple   | MNAR    | 1.242|
|Iterative| MNAR    | 0.185|
|KNN      | MNAR    | 0.217|
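As a reminder, RMSE is the square root of the mean squared difference between two sets of values. The sketch below is a plain NumPy helper applied to made-up numbers; the arrays and names are hypothetical. It shows two possible conventions, over the whole column and over the originally missing positions only. The exercise does not state which one the table uses, so treat the choice as an assumption to verify against the expected values.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical column: original values, the version with gaps, and an imputation
original = np.array([1.0, 2.0, 3.0, 4.0])
with_missing = np.array([1.0, np.nan, 3.0, np.nan])
imputed = np.array([1.0, 2.5, 3.0, 3.0])

# Convention A: compare the whole column
rmse_all = rmse(original, imputed)

# Convention B: compare only the positions that were missing
mask = np.isnan(with_missing)
rmse_missing_only = rmse(original[mask], imputed[mask])

print(f"whole column: {rmse_all:.3f}")
print(f"missing positions only: {rmse_missing_only:.3f}")
```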

In [None]:
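# Importing enable_iterative_imputer switches on the experimental
# IterativeImputer; it must come before the sklearn.impute import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401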
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# TODO: your code goes here...

### Class Balancing
It may happen that the classes we would like to predict are not equally represented in the training data. There are five strategies to handle imbalanced classes.
1. Up-sample the under-represented class(es).
2. Down-sample the over-represented class(es).
3. Add class weights to the cost function to penalize the model more for getting an under-represented class wrong.
4. Use a more sophisticated technique to create new examples of the under-represented classes by recombining existing ones.
5. Reformulate the problem as outlier detection, where the outliers are examples of the under-represented class(es).
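
As a small preview, strategies 1 and 3 can be sketched with utilities from `sklearn.utils`. The data below is randomly generated and purely hypothetical, with 90 examples of class 0 and 10 of class 1; this is one possible way to do it, not necessarily the approach used in later tutorials.

```python
import numpy as np
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
# Hypothetical imbalanced data: 90 examples of class 0, 10 of class 1
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

# Strategy 1: up-sample the minority class by sampling with replacement
X_min_up, y_min_up = resample(
    X[y == 1], y[y == 1], replace=True, n_samples=90, random_state=0
)
X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_balanced))

# Strategy 3: class weights inversely proportional to class frequency,
# computed as n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(weights)
```

Many `scikit-learn` classifiers accept such weights through a `class_weight` parameter, so the resampling step is not always necessary.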

We will return to class balancing in later tutorials.

### Feature engineering

This part of preprocessing covers the creation of new features from existing features (feature extraction), projecting features, and selecting only the subset of features that is most useful for the model (feature selection). We will take a more in-depth look at this part of preprocessing later in the semester.

## K Nearest Neighbors

We have used KNN as an imputation model, but it can also be used as a standalone model for predictions. It can be used both for classification (predicting the majority class among the neighbors) and regression (predicting a weighted average of the neighbors' values). KNN can be a useful model in cases where there are few training examples, especially if the training data is of high quality: the labels are correct and free of noise, and the training examples cover the typical cases for each class.
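The idea behind KNN fits into a few lines of NumPy. The following is a simplified sketch, not the `scikit-learn` implementation: the function `knn_predict`, the toy data, and the use of Euclidean distance with an unweighted mean for regression are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single example from its k nearest training examples."""
    # Euclidean distance from x_new to every training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority class among the neighbors
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' values
    return float(np.mean(y_train[nearest]))


# Hypothetical training data with two well separated groups
X_train = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
labels = np.array(["a", "a", "a", "b", "b", "b"])
targets = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

predicted_class = knn_predict(X_train, labels, np.array([0.5, 0.5]))
predicted_value = knn_predict(
    X_train, targets, np.array([5.5, 5.5]), task="regression"
)
print(predicted_class, predicted_value)
```

Note that there is no training step beyond storing the data, which is why KNN is called instance-based learning.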

We can demonstrate this in the next exercise. We would like to predict a person's temperament type based on their answers in a questionnaire. To do this we have a dataset with only four examples, one for each temperament type. These examples represent the answers of a "prototypical" person for each temperament type.

<div class="alert alert-block alert-warning"><h5><b>Exercise 2</b></h5></div>

1. Load a [temperament dataset](https://www.fi.muni.cz/ib031/datasets/temperament_prototypes.csv)
2. Train a [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) on it. Set the number of neighbors to 1.
3. Once trained, head over to the [online Eysenck temperament test](http://similarminds.com/eysenck.html) and fill out the test for yourself. Before submitting the test, record your answers in a data frame. The answers are encoded as integers between 0 and 4, where 0 means `Very Inaccurate` and 4 means `Very Accurate`.
4. Predict the class for yourself using the KNN model.
5. Submit the online test and compare the results. There should be a circular plot with a red dot marking your position on the results page.

Note: The order of questions in the online test is randomized. Make sure to record each answer in the correct column.

In [None]:
# TODO: your code goes here...

X_me = pd.DataFrame(
    {
        "Being in debt would not worry me.": # TODO: your answer,
        "I am at ease around others.": # TODO: your answer,
        "I am hypersensitive.": # TODO: your answer,
        "I am outgoing.": # TODO: your answer,
        "I am quiet around others.": # TODO: your answer,
        "I am very energetic.": # TODO: your answer,
        "I am very moody.": # TODO: your answer,
        "I am very talkative.": # TODO: your answer,
        "I am very tense.": # TODO: your answer,
        "I behave properly.": # TODO: your answer,
        "I can be egocentric.": # TODO: your answer,
        "I can be unsympathetic.": # TODO: your answer,
        "I enjoy being part of a group.": # TODO: your answer,
        "I enjoy meeting new people.": # TODO: your answer,
        "I enjoy social gatherings.": # TODO: your answer,
        "I fear for the worst.": # TODO: your answer,
        "I frequently feel frustrated.": # TODO: your answer,
        "I frequently feel guilty.": # TODO: your answer,
        "I frequently worry.": # TODO: your answer,
        "I have no trouble approaching people.": # TODO: your answer,
        "I know how to get people to have fun.": # TODO: your answer,
        "I like being in high energy environments.": # TODO: your answer,
        "I like to be intimidating.": # TODO: your answer,
        "I make friends easily.": # TODO: your answer,
        "I often feel lonely.": # TODO: your answer,
        "I prefer to go my own way than live by the rules.": # TODO: your answer,
        "I respect authority.": # TODO: your answer,
        "I sometimes feel extremely sad for no reason.": # TODO: your answer,
        "I suffer from anxiety.": # TODO: your answer,
        "I tend to be more comfortable with the known than the unknown.": # TODO: your answer,
        "I tend to be nervous.": # TODO: your answer,
        "I tend to brood on past mistakes.": # TODO: your answer,
        "I think people are overly cautious.": # TODO: your answer,
        "I try not to be rude to people.": # TODO: your answer,
        "I would like other people to be afraid of me.": # TODO: your answer,
        "I would rather play by the rules.": # TODO: your answer,

    },
    index=[0],
)

# TODO: predict your temperament type...

<div class="alert alert-block alert-danger"><h5><b>(Optional) Exercise 3</b></h5></div>

Let's use KNN to predict a continuous variable. We will try to predict how many miles per gallon (mpg) a car achieves based on its characteristics. The code for loading the dataset is already written. Your job is to do the preprocessing, fit the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) model, and then evaluate it using RMSE on the test data.

You can reach an RMSE of $2.135$ with a bit of tuning.

In [None]:
from sklearn.model_selection import train_test_split

mpg = sns.load_dataset("mpg")
# TODO: convert categorical features to dtype Categorical...
X, y = mpg.drop(columns=["mpg"]), mpg.mpg

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# TODO: make a pipeline, fit it, and evaluate it...