# Tutorial 09 – Baseline Model

In this tutorial, we look at baselines. We will use them to assess whether a more complex model is worth its added complexity, and then train a Naïve Bayes spam filter.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set()  # make plots nicer

## Baselines

A baseline is a general term for any threshold that we aim to beat with our machine learning model. It is often 1) a result obtained using a simple statistic of the label/target distribution or 2) a previously obtained result. Case 1) is typical when you are tackling a new problem for the first time and have no idea how any machine learning model will perform. Case 2) is typical when improving on previous results, e.g., results published in the literature. Note that in case 2) the baseline can come from an arbitrarily complex model, even a state-of-the-art neural network. The baseline is not always easy to beat.

Let's focus on case 1) for now and use the titanic dataset for demonstration.

In [None]:
titanic = sns.load_dataset("titanic")
# drop redundant columns
titanic = titanic.drop(columns=["embarked", "who", "class", "alive"])
titanic.head()

Add a new category for missing categorical values so we can later fit more complex models.

In [None]:
titanic["deck"] = titanic.deck.cat.add_categories("missing")
titanic["deck"] = titanic.deck.fillna("missing")
titanic["embark_town"] = titanic.embark_town.fillna("missing")
titanic["embark_town"] = titanic.embark_town.astype("category")

Split the dataset into training and test subsets.

In [None]:
from sklearn.model_selection import train_test_split

titanic_X, titanic_y = titanic.drop(columns="survived"), titanic.survived

titanic_train_X, titanic_test_X, titanic_train_y, titanic_test_y = train_test_split(
    titanic_X, titanic_y, test_size=0.2, random_state=42
)

The simplest baseline is to 'toss a fair coin' and predict labels entirely at random. This might be sensible if the classes in the dataset are balanced, i.e., the classes have roughly the same number of examples. In the case of imbalanced classes, it is better to adjust the probabilities so that predicted class labels are proportional to the number of examples of each class in the training set. If the classes are highly imbalanced, predicting the most frequent class label is a valid strategy.

`scikit-learn` has [Dummy estimators](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators) that do exactly this.
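To build intuition for what these three strategies should score, the expected accuracy of each can be worked out from the class proportions alone. The sketch below assumes a binary problem in which the training and test sets share the same positive-class proportion `p`. The helper name `expected_baseline_accuracy` and the value `p = 0.4` are illustrative choices, not taken from the tutorial's data.

```python
def expected_baseline_accuracy(p):
    """Expected accuracy of three dummy strategies for a binary problem.

    p is the proportion of the positive class, assumed to be identical in
    the training and test sets (an idealisation).
    """
    return {
        # A fair coin is right half of the time, regardless of p.
        "uniform": 0.5,
        # Predict positive with probability p: the prediction is correct when
        # it coincides with the true label, i.e. p*p + (1-p)*(1-p).
        "stratified": p**2 + (1 - p) ** 2,
        # Always predict the larger class.
        "most_frequent": max(p, 1 - p),
    }


for strategy, accuracy in expected_baseline_accuracy(0.4).items():
    print(f"{strategy}: {accuracy:.2f}")
```

Under these assumptions the most-frequent strategy never scores below the stratified one, which in turn never scores below 0.5, so in terms of accuracy it is the hardest of the three to beat.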

<div class="alert alert-block alert-warning"><h5><b>Exercise 1</b></h5></div>

Compute baseline ***accuracy on the test set*** of [simple (dummy) classifiers](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) that predict labels ***completely randomly***, ***proportional to a label distribution in training data***, and by ***predicting the most frequent*** label.

You should get the following accuracies (the exact numbers will differ due to randomness).
* completely random: $0.50 \pm 0.08$
* proportional: $0.52 \pm 0.07$
* most frequent: $0.59$

In [None]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="uniform")
dummy.fit(titanic_train_X, titanic_train_y)
print(
    "completely random (uniform)",
    round(dummy.score(titanic_test_X, titanic_test_y), 2),
)

dummy = DummyClassifier(strategy="stratified")
dummy.fit(titanic_train_X, titanic_train_y)
print(
    "proportional random (stratified)",
    round(dummy.score(titanic_test_X, titanic_test_y), 2),
)

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(titanic_train_X, titanic_train_y)
print(
    "most frequent label",
    round(dummy.score(titanic_test_X, titanic_test_y), 2),
)

---

Now let's fit a decision tree and see if it beats the baselines.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import KNNImputer

dt_pipeline = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(), ["sex"]),
        (OneHotEncoder(), ["deck", "embark_town"]),
        remainder="passthrough",
    ),
    KNNImputer(),
    DecisionTreeClassifier(max_depth=5),
)
dt_pipeline.fit(titanic_train_X, titanic_train_y)
round(dt_pipeline.score(titanic_test_X, titanic_test_y), 2)

The decision tree achieved an accuracy of 0.79, which is 0.2 higher than the best baseline (0.59). This is a very good result and a clear indication that the decision tree is worth using.

## Bayesian statistics
Before we get to Naïve Bayes, we first need to be familiar with Bayesian statistics. In Bayesian statistics, a probability expresses the degree of belief that some event will happen. This allows us to update our probability estimates based on new evidence (this process is called [Bayesian inference](https://en.wikipedia.org/wiki/Bayesian_inference)). The fundamental theorem of Bayesian statistics is Bayes' theorem.

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

We can use Bayes' theorem to update our belief of an event $A$ based on the observation of an event $B$. We start with a prior probability ($P(A)$) and after observing the event $B$, we update our belief and end up with a posterior probability ($P(A|B)$) of the event $A$ given the observation of the event $B$.

***The following exercise is not medical advice!***

Let's practice this type of inference on a simple example. Say you have been selected to participate in a random population screening to assess the prevalence of a disease, e.g., COVID-19, in the general population. You are tested for COVID-19 and the test is positive. How likely are you to actually have COVID-19?

<div class="alert alert-block alert-warning"><h5><b>Exercise 2</b></h5></div>

Calculate the posterior probability $P(C=True|T=+)$ if the prevalence of COVID-19 $P(C=True)$ is 0.001 (0.1 % of the population) and the test accuracy $P(T=+|C=True) = P(T=-|C=False)$ is 0.99 (99 %).

You can use the fact that the probability of a person testing positive is $P(T=+) = P(T=+|C=T) \cdot P(C=T) + P(T=+|C=F) \cdot P(C=F)$.

In [None]:
covid19_prevalence = 0.001  # P(C=True)
test_accuracy = 0.99  # P(T=+|C=True) = P(T=-|C=False)

# P(T=+) = P(T=+|C=T) P(C=T) + P(T=+|C=F) P(C=F)
P_T_p = test_accuracy * covid19_prevalence + (1 - test_accuracy) * (
    1 - covid19_prevalence
)

# P(C=True|T=+) = P(T=+|C=True) * P(C=True) / P(T=+)
posterior = test_accuracy * covid19_prevalence / P_T_p
round(posterior, 2)

---

Yes, it is really less than 10 %. You would need multiple tests (more evidence) for a more definitive result. The main reason is the small percentage of infected people. In other words, the prior $P(C=True)$ is tiny and you need several pieces of evidence to overcome it. This example also ignores any symptoms. A person with a fever and a cough would be much more likely to be infected, and this could be modeled as additional events just like test results.
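The effect of additional evidence can be sketched by applying Bayes' theorem repeatedly, using each posterior as the prior for the next test. This is only an illustration: it assumes that repeated test results are conditionally independent given the true infection status, which real tests need not satisfy, and the helper `update_on_positive` is a hypothetical name introduced here.

```python
def update_on_positive(prior, sensitivity, specificity):
    """Posterior P(C=True | T=+) after one positive test, via Bayes' theorem."""
    # P(T=+) = P(T=+|C=T) P(C=T) + P(T=+|C=F) P(C=F)
    evidence = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / evidence


belief = 0.001  # the prevalence serves as the initial prior
posteriors = []
for _ in range(3):
    # Assumption: each positive test is independent given the true status.
    belief = update_on_positive(belief, sensitivity=0.99, specificity=0.99)
    posteriors.append(belief)

for n_tests, posterior in enumerate(posteriors, start=1):
    print(f"after {n_tests} positive test(s): {posterior:.4f}")
```

Under these assumptions the belief rises from roughly 9 % after one positive test to roughly 91 % after two, and to more than 99 % after three.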

## Naïve Bayes
Naïve Bayes methods use exactly this Bayesian inference to classify an example based on its features (evidence). Using Bayesian inference, the probability of class $y$ is given by

$$ P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}
                                 {P(x_1, \dots, x_n)} $$
                                 
where $x_1, \dots, x_n$ are feature values. Naïve Bayes introduces the "naïve" assumption that all features are conditionally independent given the class $y$, resulting in the following probability.

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}
                                 {P(x_1, \dots, x_n)}$$
                                 
Since the denominator does not depend on the class whose probability we are estimating, we can simplify the calculation and predict the class with the largest numerator. This is the final formula used in Naïve Bayes classifiers.

$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y) $$
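The decision rule can be sketched in a few lines of plain Python. The priors and per-word likelihoods below are made-up numbers chosen for illustration (they are not estimated from any dataset), and `predict` is a hypothetical helper. Summing logarithms instead of multiplying probabilities is a common way to avoid numerical underflow, and it leaves the arg max unchanged.

```python
import math

# Hypothetical, hand-picked probabilities for illustration only.
priors = {"ham": 0.8, "spam": 0.2}  # P(y)
# P(word is present | y) for two binary features
likelihoods = {
    "ham": {"free": 0.05, "meeting": 0.30},
    "spam": {"free": 0.60, "meeting": 0.02},
}


def predict(present_words):
    """Return the class maximising log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        log_score = math.log(prior)
        for word, p_present in likelihoods[label].items():
            # P(x_i | y) is p_present if the word occurs, 1 - p_present otherwise.
            p = p_present if word in present_words else 1 - p_present
            log_score += math.log(p)
        scores[label] = log_score
    return max(scores, key=scores.get)


print(predict({"free"}))     # message containing only "free"
print(predict({"meeting"}))  # message containing only "meeting"
```

With these numbers a message containing only "free" is classified as `spam`, while one containing only "meeting" is classified as `ham`.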



There are different variants of Naïve Bayes classifiers based on how they model $P(x_i \mid y)$. You can read more about each of them in the [scikit-learn guide](https://scikit-learn.org/stable/modules/naive_bayes.html). Now let's use a Naïve Bayes classifier for spam detection in SMS messages.

In [None]:
spam_train = pd.read_csv("https://www.fi.muni.cz/ib031/datasets/spam_train.csv")
spam_test = pd.read_csv("https://www.fi.muni.cz/ib031/datasets/spam_test.csv")

In [None]:
spam_train_X, spam_train_y = spam_train.text, spam_train.type
spam_test_X, spam_test_y = spam_test.text, spam_test.type

In [None]:
spam_train_X.head()

We cannot use the classifier on raw strings; we first need to extract features from the text. A common method is bag-of-words, where we transform each string into a vector. Each element of this vector holds the number of times a given word occurs in the string. The example below should make it clear.

In [None]:
sample_sentences = [
    "the black cat",
    "the cat and dog",
    "black dog",
    "black dog black dog",
]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words = CountVectorizer().fit(sample_sentences)
print(bag_of_words.transform(sample_sentences).toarray())
print(bag_of_words.vocabulary_)

Here the first column corresponds to occurrences of the word "and", the second to "black", and so on.
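The same transformation can be sketched by hand to show what the vectorizer computes. This is a simplified stand-in rather than how `CountVectorizer` is implemented: it tokenises on whitespace, whereas `CountVectorizer` uses a regular expression by default. For the sample sentences the two approaches give the same columns.

```python
from collections import Counter

sentences = [
    "the black cat",
    "the cat and dog",
    "black dog",
    "black dog black dog",
]

# Simplification: lowercase and split on whitespace.
tokenised = [sentence.lower().split() for sentence in sentences]

# The vocabulary is the sorted set of all words; its order fixes the columns.
vocabulary = sorted({word for tokens in tokenised for word in tokens})

# One row per sentence, one count per vocabulary word.
vectors = []
for tokens in tokenised:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in vectors:
    print(row)
```

The last sentence repeats "black" and "dog", so its row contains the count 2 in those two columns and 0 elsewhere.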

<div class="alert alert-block alert-warning"><h5><b>Exercise 3</b></h5></div>

Use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [Multinomial Naïve Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) to train a spam classifier on the training set, and then evaluate the classifier on the test set using the `score` method.

You should get an accuracy of $0.986$.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

spam_filter_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter_pipeline.fit(spam_train_X, spam_train_y)
round(spam_filter_pipeline.score(spam_test_X, spam_test_y), 3)

---

This looks very good, but the dataset has unbalanced classes. Spam messages are far less common in the dataset.

<div class="alert alert-block alert-warning"><h5><b>Exercise 4</b></h5></div>

Train a baseline model that always predicts the most frequent class label `ham` (i.e., not spam) and evaluate it using the `score` method.

In [None]:
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(spam_train_X, spam_train_y)
round(dummy.score(spam_test_X, spam_test_y), 3)

---

As you can see, even this simple classifier obtained an accuracy of almost 87 %. This makes the Naïve Bayes classifier's results a bit less impressive, but they are still good.

Of course, accuracy is not the best metric for unbalanced data. It favors models that correctly classify the most common class regardless of their performance on the other class(es). The F1 measure is better suited for this job, as it combines precision and recall.
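To make the difference concrete, here is a minimal sketch with made-up counts (87 `ham` and 13 `spam` messages, not the real dataset) for a model that labels every message as `ham`. The helper `f1` is hypothetical, and it returns 0 when there are no true positives, which is a convention chosen here for the undefined case.

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up test set in which every message is predicted as ham.
n_ham, n_spam = 87, 13

accuracy = n_ham / (n_ham + n_spam)
# Treating ham as the positive class: all spam messages become false positives.
f1_ham = f1(tp=n_ham, fp=n_spam, fn=0)
# Treating spam as the positive class: no spam message is ever found.
f1_spam = f1(tp=0, fp=0, fn=n_spam)

print(f"accuracy={accuracy:.2f}, F1(ham)={f1_ham:.2f}, F1(spam)={f1_spam:.2f}")
# prints: accuracy=0.87, F1(ham)=0.93, F1(spam)=0.00
```

The accuracy looks respectable, yet the per-class F1 score for `spam` exposes that the model never detects a single spam message.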

In [None]:
from sklearn.metrics import f1_score

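# f1_score expects the true labels first, then the predictions;
# average=None returns one score per class.
print(
    "Naïve Bayes:",
    f1_score(spam_test_y, spam_filter_pipeline.predict(spam_test_X), average=None),
)
print("Dummy:", f1_score(spam_test_y, dummy.predict(spam_test_X), average=None))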

Using the F1 score to evaluate the spam filter and the dummy model gives a better picture. The classifier does a very good job of identifying `ham` messages, but we are mostly concerned with false positives (`ham` messages labeled as `spam`), which the filter might block so that the user never receives them.

In [None]:
(
    spam_train_X,
    spam_validation_X,
    spam_train_y,
    spam_validation_y,
) = train_test_split(spam_train_X, spam_train_y, test_size=0.2, random_state=42)

<div class="alert alert-block alert-danger"><h5><b>(Optional) Exercise 5</b></h5></div>

Experiment with various settings of [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) or try the [TF-IDF Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and various Naïve Bayes implementations. See which combination minimizes false positives on the validation dataset. Use a confusion matrix for this evaluation. After deciding on a pipeline, confirm that there are few false positives on the test set as well.
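As a reminder of how to read such a matrix, the sketch below builds one by hand from made-up labels. It follows the layout that scikit-learn's `confusion_matrix` uses, with rows for true labels and columns for predicted labels, both in sorted label order. The data and variable names are illustrative only.

```python
from collections import Counter

labels = ["ham", "spam"]  # sorted label order

# Made-up labels for illustration only.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "ham", "spam", "ham", "ham"]

# Count how often each (true, predicted) pair occurs.
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true labels, columns are predicted labels.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

false_positives = matrix[0][1]  # true ham, predicted spam
false_negatives = matrix[1][0]  # true spam, predicted ham

print(matrix)
print("false positives:", false_positives)
print("false negatives:", false_negatives)
```

With `spam` as the positive class, the entry in the first row and second column is the number of legitimate messages that would be blocked, which is the quantity this exercise asks you to minimize.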

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import ComplementNB, BernoulliNB

print("count, multinomial")
spam_filter_pipeline.fit(spam_train_X, spam_train_y)
print(
    confusion_matrix(spam_validation_y, spam_filter_pipeline.predict(spam_validation_X))
)

print("count, complement")
spam_filter_complement = make_pipeline(CountVectorizer(), ComplementNB())
spam_filter_complement.fit(spam_train_X, spam_train_y)
print(
    confusion_matrix(
        spam_validation_y, spam_filter_complement.predict(spam_validation_X)
    )
)

print("tf-idf, multinomial")
# use a separate name so the ComplementNB pipeline above is not overwritten
spam_filter_tfidf = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_filter_tfidf.fit(spam_train_X, spam_train_y)
print(
    confusion_matrix(spam_validation_y, spam_filter_tfidf.predict(spam_validation_X))
)

print("binary, bernoulli")
spam_filter_binary = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
spam_filter_binary.fit(spam_train_X, spam_train_y)
print(
    confusion_matrix(spam_validation_y, spam_filter_binary.predict(spam_validation_X))
)
# no false positives; few false negatives

print(confusion_matrix(spam_test_y, spam_filter_binary.predict(spam_test_X)))

---