# Tutorial 04 – Handling Various Data Modalities

Data often comes in various modalities, and it might not be possible to feed raw data directly to a machine learning algorithm. By modality, we mean the format and structure of data. In this tutorial, we will focus on the three most common modalities other than plain numerical values: time series, images, and text. There are other modalities as well, and data can also be multi-modal, i.e., combine multiple modalities such as text and an image in a social media post. However, the three mentioned modalities together with numeric values cover most of the situations you might encounter.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set()  # make plots nicer

## Time Series

A time series is a series of data points, i.e., a vector of numerical values, indexed by time. These values are often successive measurements taken from a sensor (e.g., hourly temperatures), and the series can be arbitrarily long. Time series in your dataset might have varying lengths (e.g., sensor readings until a failure that occurs at different times).

#### Examples
With the increasing number of sensors we put in every device, time series are quite prevalent in machine learning. Notable examples of time series are:
* stock market values (e.g. price of Apple stocks),
* weather data (e.g. temperatures),
* utilization of resources (e.g., CPU on a server or parking spaces),
* sound or voice recordings (literally pressure readings in time),
* signal processing (e.g. reading from radar antenna), 
* handwriting strokes on a touch screen or tablet,
* human movement or activity readings (e.g. motion capture data), and
* human body and health sensor data (e.g. heart rate during the day).

#### Tasks
We can perform standard machine learning tasks with time series just like with numerical values:
* regression (often called forecasting) – predict the next value(s) of a time series based on historical values (e.g. stock prices in the next second, or weather for the next day or week)
* classification – assign one or more predetermined labels to the time series (e.g. labels such as normal, heart attack, and arrhythmia for heart rate measurements)
* clustering – group together "similar" time series; note, however, that an influential paper argues that clustering of subsequences is meaningless (Keogh, E., Lin, J. Clustering of time-series subsequences is meaningless: implications for previous and future research. Knowl Inf Syst 8, 154–177 (2005). https://doi.org/10.1007/s10115-004-0172-7)

#### Why We Need Special Treatment
If all time series in our data set are of the same length, we could technically pass them to most general machine learning models directly, with every measurement as a distinct "feature". However, such treatment throws away the ordering of values and their relative positions, and some models may not be able to learn this implicitly.

Typical operations with time series are:
* extracting training examples (short time series) for regression from a single long time series,
* truncating and padding time series to the same length,
* extracting statistics from time series as features for ML model (e.g., mean, variance, and trends), and
* applying Fourier Transformations to decompose periodic trends.
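As a small illustration of the second operation in the list above, here is a minimal sketch of truncating and padding series to a common length. The helper name `pad_or_truncate`, the padding value of zero, and the sample readings are assumptions made for this example only.

```python
import numpy as np


def pad_or_truncate(series, length, pad_value=0.0):
    """Return a copy of `series` with exactly `length` values."""
    values = np.asarray(series, dtype=float)[:length]  # truncate if too long
    padded = np.full(length, pad_value)  # start from an all-padding vector
    padded[: len(values)] = values  # copy the real values to the front
    return padded


# Hypothetical sensor readings of different lengths.
readings = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0, 7.0, 8.0, 9.0]]
fixed = np.stack([pad_or_truncate(r, length=4) for r in readings])
print(fixed)
```

Padding with zeros is only one possible choice; depending on the data, repeating the last value or using the series mean may distort the signal less.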

As an example, we will use data from sensors in an office that measure ambient quantities (e.g. room temperature) and the goal will be to predict whether any worker is present in the office. Let's load the data and have a look.

In [None]:
occupancy_train = pd.read_csv(
    "https://www.fi.muni.cz/ib031/datasets/occupancy_train.csv"
)
occupancy_train.head()

In [None]:
occupancy_test = pd.read_csv("https://www.fi.muni.cz/ib031/datasets/occupancy_test.csv")

The data is already split into training and test datasets. With time series, it is crucial to train the model on data from the past and predict the future; otherwise you risk unintentional cheating by leaking information about the test set into training. Typically, you set a separation time stamp: all data before it is used for training and all data after it for testing. Let's check that this holds in our example.
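If your data is not split yet, a split by time stamp could look roughly like the following sketch. The data frame, its column names, and the chosen separation time stamp are made up for illustration and are unrelated to the occupancy dataset.

```python
import pandas as pd

# Hypothetical minute-by-minute readings; not the occupancy dataset.
readings = pd.DataFrame(
    {
        "date": pd.date_range("2015-02-04 17:00", periods=10, freq="min"),
        "value": range(10),
    }
)

split_time = pd.Timestamp("2015-02-04 17:07")  # assumed separation time stamp
train = readings[readings["date"] < split_time]
test = readings[readings["date"] >= split_time]

# Every training time stamp must precede every test time stamp.
print(train["date"].max() < test["date"].min())
```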

In [None]:
occupancy_train.date.iloc[-1]

In [None]:
occupancy_test.date.iloc[0]

Let's get a basic idea of how the dataset looks.

In [None]:
occupancy_train[["Temperature", "Humidity", "Occupancy"]].plot()

In [None]:
occupancy_train[["CO2", "Light"]].plot()

### Extracting Examples From a Single Time Series

This is an example of a single long time series that needs to be sliced into smaller time series so that we can build a model that can predict employee presence based on a few historical measurements. Let's say we would like to predict the presence from the last 5 measurements (which translates to roughly the last 5 minutes). So we need to transform a data frame

|    | date                |   Temperature |   Humidity |   Light |    CO2 |   HumidityRatio |   Occupancy |
|---:|:--------------------|--------------:|-----------:|--------:|-------:|----------------:|------------:|
|  1 | 2015-02-04 17:51:00 |         23.18 |    27.272  |   426   | 721.25 |      0.00479299 |           1 |
|  2 | 2015-02-04 17:51:59 |         23.15 |    27.2675 |   429.5 | 714    |      0.00478344 |           1 |
|  3 | 2015-02-04 17:53:00 |         23.15 |    27.245  |   426   | 713.5  |      0.00477946 |           1 |
|  4 | 2015-02-04 17:54:00 |         23.15 |    27.2    |   426   | 708.25 |      0.00477151 |           1 |
|  5 | 2015-02-04 17:55:00 |         23.1  |    27.2    |   426   | 704.5  |      0.00475699 |           1 |


into a single row

|    |   temperature-4 |   humidity-4 |   light-4 |   co2-4 |   humidity_ratio-4 |   temperature-3 |   humidity-3 |   light-3 |   co2-3 |   humidity_ratio-3 |   temperature-2 |   humidity-2 |   light-2 |   co2-2 |   humidity_ratio-2 |   temperature-1 |   humidity-1 |   light-1 |   co2-1 |   humidity_ratio-1 |   temperature0 |   humidity0 |   light0 |   co20 |   humidity_ratio0 |   label |
|---:|----------------:|-------------:|----------:|--------:|-------------------:|----------------:|-------------:|----------:|--------:|-------------------:|----------------:|-------------:|----------:|--------:|-------------------:|----------------:|-------------:|----------:|--------:|-------------------:|---------------:|------------:|---------:|-------:|------------------:|--------:|
|  0 |           23.18 |       27.272 |       426 |  721.25 |         0.00479299 |           23.15 |      27.2675 |     429.5 |     714 |         0.00478344 |           23.15 |       27.245 |       426 |   713.5 |         0.00477946 |           23.15 |         27.2 |       426 |  708.25 |         0.00477151 |           23.1 |        27.2 |      426 |  704.5 |        0.00475699 |       1 |

<div class="alert alert-block alert-warning"><h5><b>Exercise 1</b></h5></div>

Write a function `extract_time_series_features` that will perform the transformation described above. It will take a single parameter `window` that will be a pandas DataFrame with five rows, and the output will be a pandas Series with *flattened* data. Exclude columns `date` and `Occupancy`. The `label` is the value of `Occupancy` in the last row of `window`.

If you did everything correctly, calling `pd.DataFrame([extract_time_series_features(occupancy_train.head())])` will produce the row from the example above.

Hint: You can use either method `flatten` of numpy arrays or `melt` of pandas Data frames.
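To illustrate the `flatten` hint without giving away the exercise, here is a toy sketch on a made-up window. The columns `a` and `b` and the naming scheme are assumptions for this illustration; your function still has to handle the real columns and the label.

```python
import pandas as pd

# Hypothetical window with two measured quantities and three time steps.
toy_window = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# `to_numpy()` gives a (rows, columns) array; `flatten()` reads it row by row,
# so values of one time step stay next to each other.
flat_values = toy_window.to_numpy().flatten()

# Name each value by its column and its offset from the most recent row.
offsets = range(-len(toy_window) + 1, 1)  # -2, -1, 0
names = [f"{column}{offset}" for offset in offsets for column in toy_window.columns]

toy_row = pd.Series(flat_values, index=names)
print(toy_row)
```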

In [None]:
def extract_time_series_features(window):
    # TODO: your code goes here...
    pass

In [None]:
pd.DataFrame([extract_time_series_features(occupancy_train.head())])

---

Now we split the original time series into many smaller ones of desired length (5 in this case) and apply the function `extract_time_series_features` to obtain a single training example.

In [None]:
occupancy_train_flatten = pd.DataFrame(
    [
        extract_time_series_features(occupancy_train.iloc[i : i + 5, :])
        for i in range(len(occupancy_train) - 4)
    ]
)
occupancy_train_flatten.head()

In [None]:
occupancy_test_flatten = pd.DataFrame(
    [
        extract_time_series_features(occupancy_test.iloc[i : i + 5, :])
        for i in range(len(occupancy_test) - 4)
    ]
)

We now define a helper function that will fit, evaluate, and plot a decision tree classifier on train and test data provided to the function.

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report


def fit_and_evaluate_time_series(train_data, test_data):
    X_train, y_train = (
        train_data.drop(columns=["label"]),
        train_data.label,
    )
    X_test, y_test = (
        test_data.drop(columns=["label"]),
        test_data.label,
    )

    dtc = DecisionTreeClassifier(max_depth=5)
    dtc.fit(X_train, y_train)
    y_pred = dtc.predict(X_test)

    print(
        f"Classification report for classifier {dtc}:\n"
        f"{classification_report(y_test, y_pred, target_names=('empty', 'occupied'))}\n"
    )

    plt.figure(dpi=200)
    plot_tree(
        dtc,
        filled=True,
        rounded=True,
        node_ids=True,
        feature_names=X_train.columns,
        class_names=("empty", "occupied"),
        max_depth=1,
    )

In this example, you should see accuracy of 0.95.

In [None]:
fit_and_evaluate_time_series(occupancy_train_flatten, occupancy_test_flatten)

Looking at the plot, we can easily identify the reason for such a good accuracy. The most significant feature is the measurement of light in the most recent time stamp of the example. Since employees work mostly during the day and have lights turned on, it is naturally a very good indication of employees being in the office.

### Statistics

We now try a different approach to making training examples. We will extract statistical features from the "windows" that we flattened before. To make the task more challenging, we will also exclude measurements of light intensity in the office. As before, we want to transform a data frame

|    | date                |   Temperature |   Humidity |   Light |    CO2 |   HumidityRatio |   Occupancy |
|---:|:--------------------|--------------:|-----------:|--------:|-------:|----------------:|------------:|
|  1 | 2015-02-04 17:51:00 |         23.18 |    27.272  |   426   | 721.25 |      0.00479299 |           1 |
|  2 | 2015-02-04 17:51:59 |         23.15 |    27.2675 |   429.5 | 714    |      0.00478344 |           1 |
|  3 | 2015-02-04 17:53:00 |         23.15 |    27.245  |   426   | 713.5  |      0.00477946 |           1 |
|  4 | 2015-02-04 17:54:00 |         23.15 |    27.2    |   426   | 708.25 |      0.00477151 |           1 |
|  5 | 2015-02-04 17:55:00 |         23.1  |    27.2    |   426   | 704.5  |      0.00475699 |           1 |


into a single row with statistics of each column like

|    |   label |   Temperature_mean |   Temperature_std |   Temperature_change |   Humidity_mean |   Humidity_std |   Humidity_change |   CO2_mean |   CO2_std |   CO2_change |   HumidityRatio_mean |   HumidityRatio_std |   HumidityRatio_change |
|---:|--------:|-------------------:|------------------:|---------------------:|----------------:|---------------:|------------------:|-----------:|----------:|-------------:|---------------------:|--------------------:|-----------------------:|
|  0 |       1 |             23.146 |         0.0288097 |                -0.08 |         27.2369 |      0.0352037 |            -0.072 |      712.3 |   6.35757 |       -16.75 |           0.00477688 |          1.3542e-05 |           -3.59952e-05 |

<div class="alert alert-block alert-warning"><h5><b>Exercise 2</b></h5></div>

Write a function `extract_time_series_statistics` that will perform the transformation described above. It will take a single parameter `window` that will be a pandas DataFrame, and the output will be a pandas Series with extracted statistics of each measured quantity (mean value, standard deviation, and difference between the last and the first value). Exclude columns `date`, `Light` and `Occupancy`. The `label` is the value of `Occupancy` in the last row of `window`.

If you did everything correctly, calling `pd.DataFrame([extract_time_series_statistics(occupancy_train.head())])` will produce the row from the example above.
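The three statistics can be tried out on a toy series first. The values below are made up. The example row above is consistent with the sample standard deviation (the pandas default) and with the change computed as last value minus first value, so the sketch follows the same conventions.

```python
import pandas as pd

# Hypothetical readings of a single quantity inside one window.
toy_values = pd.Series([2.0, 4.0, 4.0, 6.0])

toy_stats = pd.Series(
    {
        "mean": toy_values.mean(),
        "std": toy_values.std(),  # pandas uses the sample std (ddof=1) by default
        "change": toy_values.iloc[-1] - toy_values.iloc[0],  # last minus first
    }
)
print(toy_stats)
```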

In [None]:
def extract_time_series_statistics(window):
    # TODO: your code goes here...
    pass

In [None]:
pd.DataFrame([extract_time_series_statistics(occupancy_train.head())])

---

Now we split the original time series into many smaller ones of desired length (32 in this case) and apply the function `extract_time_series_statistics` to obtain a single training example.

In [None]:
occupancy_train_statistics = pd.DataFrame(
    [
        extract_time_series_statistics(occupancy_train.iloc[i : i + 32, :])
        for i in range(len(occupancy_train) - 31)
    ]
)
occupancy_train_statistics.head()

In [None]:
occupancy_test_statistics = pd.DataFrame(
    [
        extract_time_series_statistics(occupancy_test.iloc[i : i + 32, :])
        for i in range(len(occupancy_test) - 31)
    ]
)

Once again, we fit and evaluate the decision tree classifier. In this example, you should achieve accuracy around 0.6.

In [None]:
fit_and_evaluate_time_series(occupancy_train_statistics, occupancy_test_statistics)

The accuracy is lower, as expected, given that we removed the `Light` measurements. From recall and precision, we see that we can be quite confident that the office is empty if the model predicts empty. However, many examples of an empty office are still misclassified as occupied.

### (Fast) Fourier Transformation

The last technique we will look at is the Fourier transformation. It transforms a signal from the time domain into the frequency domain, essentially decomposing it into a sum of sine and cosine functions with different coefficients and periods. You can think of it as decomposing the sound of a chord into the individual notes (sound frequencies) being played. The Fourier transformation is at the heart of signal processing, and it is especially useful for analyzing periodic or seasonal time series.

The mathematical details of the Fourier transformation are quite complex and outside the scope of this course. For our purposes, it suffices to use standard calls of library functions. Nevertheless, it is useful to visualize what the result of the transformation looks like. Suppose we have a signal (green line) composed of two sine signals with periods $2\pi$ (blue) and $\pi$ (orange), the latter with half the amplitude. We sample this signal 32 times on the interval $[0, 2\pi]$.

In [None]:
x = np.linspace(0, 2 * np.pi, 32)
sin1 = np.sin(x)
sin2 = 0.5 * np.sin(2 * x)
combined = sin1 + sin2

plt.plot(x, sin1, x, sin2, x, combined)
plt.show()

After applying the Fourier transformation, we obtain 32 complex numbers, one for each sinusoid that the signal was decomposed into.

In [None]:
fft = np.fft.fft(combined)
fft

We can plot corresponding sine signals to each complex number and also the approximated input signal.

In [None]:
import cmath

freqs = np.fft.fftfreq(len(x), x[1] - x[0])  # bin frequencies for the actual sample count and spacing
recomb = np.zeros((len(x),))
middle = len(x) // 2 + 1
for i in range(middle):
    if i == 0:
        coeff = 2
    else:
        coeff = 1
    sinusoid = (
        1
        / (len(x) * coeff / 2)
        * (abs(fft[i]) * np.cos(freqs[i] * 2 * np.pi * x + cmath.phase(fft[i])))
    )
    recomb += sinusoid
    plt.plot(x, sinusoid)
plt.show()

plt.plot(x, recomb, c="g")
plt.show()

The two sine waves with highest amplitudes are indeed the sine waves we used to create our signal.
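The same observation can be checked numerically. The following is a minimal sketch that assumes the signal is sampled over exactly one period without repeating the end point (`endpoint=False`), so that each sine wave falls into a single frequency bin; the variable names are chosen for this illustration only.

```python
import numpy as np

n_samples = 32
# endpoint=False samples exactly one period, so each sine falls into one bin.
t = np.linspace(0, 2 * np.pi, n_samples, endpoint=False)
signal = np.sin(t) + 0.5 * np.sin(2 * t)

spectrum = np.fft.rfft(signal)  # only non-negative frequencies of a real signal
# Rescale bin magnitudes so that bins between zero and the Nyquist frequency
# give the amplitudes of the corresponding sine waves.
amplitudes = 2 * np.abs(spectrum) / n_samples

strongest = np.argsort(amplitudes)[::-1][:2]  # indices of the two largest bins
print(strongest, amplitudes[strongest])
```

The two strongest bins have indices 1 and 2 (one and two cycles per window) with amplitudes of about 1 and 0.5, matching the two sine waves the signal was built from.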

Now to use it as features for a machine learning model, we have three options.
1. Pass the entire array of values (their absolute values to be precise) from `fft` function.
2. Bin the frequencies into bands and sum values in those bands. This is useful for longer time series where the output of `fft` is also very large (e.g., for high sampling rates of sensors).  
3. Instead of amplitudes, we can use frequencies sorted by their amplitudes. This is useful if the model cannot compare feature values with each other, as is the case for decision trees.

We will try the last approach. To make our lives easier, we will only extract indices of values returned by `fft` function sorted by the absolute value instead of frequencies. The actual frequencies will be just some constant multiple of these indices and the constants are irrelevant for the model.
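For comparison, the second option (summing magnitudes in frequency bands) could be sketched as follows. The random signal, the window length of 64, and the choice of four equally sized bands are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_signal = rng.normal(size=64)  # hypothetical window of 64 readings

magnitudes = np.abs(np.fft.rfft(toy_signal))  # 33 non-negative frequency bins
bands = np.array_split(magnitudes, 4)  # 4 contiguous bands (an assumed choice)
band_features = np.array([band.sum() for band in bands])

print(band_features.shape)
```

Each window is reduced to four features regardless of how many frequency bins the transformation produces.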

As before, we want to transform a data frame (excluding columns `date`, `Light`, and `Occupancy`)

|    | date                |   Temperature |   Humidity |   Light |    CO2 |   HumidityRatio |   Occupancy |
|---:|:--------------------|--------------:|-----------:|--------:|-------:|----------------:|------------:|
|  1 | 2015-02-04 17:51:00 |         23.18 |    27.272  |   426   | 721.25 |      0.00479299 |           1 |
|  2 | 2015-02-04 17:51:59 |         23.15 |    27.2675 |   429.5 | 714    |      0.00478344 |           1 |
|  3 | 2015-02-04 17:53:00 |         23.15 |    27.245  |   426   | 713.5  |      0.00477946 |           1 |
|  4 | 2015-02-04 17:54:00 |         23.15 |    27.2    |   426   | 708.25 |      0.00477151 |           1 |
|  5 | 2015-02-04 17:55:00 |         23.1  |    27.2    |   426   | 704.5  |      0.00475699 |           1 |


into a single row with sorted indices of absolute values in Fourier transformation like

|    |   label |   Temperature_fft0 |   Temperature_fft1 |   Temperature_fft2 |   Temperature_fft3 |   Temperature_fft4 |   Humidity_fft0 |   Humidity_fft1 |   Humidity_fft2 |   Humidity_fft3 |   Humidity_fft4 |   CO2_fft0 |   CO2_fft1 |   CO2_fft2 |   CO2_fft3 |   CO2_fft4 |   HumidityRatio_fft0 |   HumidityRatio_fft1 |   HumidityRatio_fft2 |   HumidityRatio_fft3 |   HumidityRatio_fft4 |
|---:|--------:|-------------------:|-------------------:|-------------------:|-------------------:|-------------------:|----------------:|----------------:|----------------:|----------------:|----------------:|-----------:|-----------:|-----------:|-----------:|-----------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|
|  0 |       1 |                  0 |                  3 |                  2 |                  4 |                  1 |               0 |               4 |               1 |               3 |               2 |          0 |          4 |          1 |          3 |          2 |                    0 |                    4 |                    1 |                    3 |                    2 |

<div class="alert alert-block alert-danger"><h5><b>(Optional) Exercise 3</b></h5></div>

Write a function `extract_time_series_fft_freq` that will perform the transformation described above. It will take a single parameter `window` that will be a pandas DataFrame, and the output will be a pandas Series with indices of the values in the Fourier transformation of each measured quantity, sorted by the absolute value. Exclude columns `date`, `Light` and `Occupancy`. The `label` is the value of `Occupancy` in the last row of `window`.

If you did everything correctly, calling `pd.DataFrame([extract_time_series_fft_freq(occupancy_train.head())])` will produce the row from the example above.

Hint: You might use function `argsort` from numpy. 

In [None]:
def extract_time_series_fft_freq(window):
    # TODO: your code goes here...
    pass

In [None]:
pd.DataFrame([extract_time_series_fft_freq(occupancy_train.head())])

---

Now we split the original time series into many smaller ones of desired length (32 in this case) and apply the function `extract_time_series_fft_freq` to obtain a single training example.

In [None]:
occupancy_train_fourier = pd.DataFrame(
    [
        extract_time_series_fft_freq(occupancy_train.iloc[i : i + 32, :])
        for i in range(len(occupancy_train) - 31)
    ]
)
occupancy_train_fourier.head()

In [None]:
occupancy_test_fourier = pd.DataFrame(
    [
        extract_time_series_fft_freq(occupancy_test.iloc[i : i + 32, :])
        for i in range(len(occupancy_test) - 31)
    ]
)

Once again, we fit and evaluate the decision tree classifier. In this example, you should achieve accuracy around 0.77.

In [None]:
fit_and_evaluate_time_series(occupancy_train_fourier, occupancy_test_fourier)

## Images

An image is a 2D surface made of pixels, i.e., a tensor of shape (width, height, number of channels). A tensor is just a generalization of a matrix to higher dimensions. In the case of three-dimensional tensors, you can think of it as a matrix where each value is a vector. Images are often colored and each color has its own channel. In the case of RGB, there are three channels for red, green, and blue.
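A tiny sketch may make the tensor view concrete. The array below is a made-up 2 by 3 RGB image. Note that array libraries such as NumPy usually order the axes as (rows, columns, channels), i.e., height comes first.

```python
import numpy as np

# Hypothetical 2x3 RGB image: 2 rows, 3 columns, 3 color channels.
toy_image = np.zeros((2, 3, 3), dtype=np.uint8)
toy_image[0, 1] = [255, 0, 0]  # make the pixel in row 0, column 1 pure red

print(toy_image.shape)  # (2, 3, 3)
red_channel = toy_image[:, :, 0]  # a plain 2x3 matrix of red intensities
print(red_channel)
```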

### Examples
Most people carry a decent camera in their pocket nowadays and share hundreds of photos on social media. Examples of images processed by machine learning include:
* photos of people
* sensor data from cameras on self-driving cars and drones
* photos of streets by services like Google Street View
* scans of paper documents
* photos from cameras above manufacturing lines 

### Tasks
* classification – assign one or more predetermined labels to the image (e.g., dog vs cat, accept vs reject in quality assurance, or digit represented in the image)
* regression – either assign numeric value to the image (e.g., product is 95 % passing, or number of cats in the picture) or generate entire new image (e.g., deep fakes)
* object detection – find bounding boxes of recognized things in the image (e.g., face detection, or obstacle detection for self-driving cars)
* optical character recognition (OCR) – transcribe text in the image into a string; this can also be done for handwritten text (offline handwriting recognition)

### Why We Need Special Treatment
* typical machine learning models do not accept tensors but vectors,
* relative position of pixel value in an image is important (i.e., an image of a cat must have cat's head close to cat's body), and
* we require positional invariance from the model (i.e., image of a cat can have the cat in the center but also in the corner or anywhere else)

Therefore, we need to use special methods or special models:
* extracting statistics from the image (mean pixel value, variations in each channel, ...)
* flattening matrix into vector
* using convolutional neural networks (CNN)

As an example, we will use images of handwritten digits (a small MNIST-like dataset of 8 by 8 pixel images) and try to recognize them correctly. This dataset is already bundled with sklearn, so all we need to do is import sklearn and load it.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
digits = datasets.load_digits()

`digits` dataset has two attributes:
* images – numpy array representation of images
* target – numpy vector with digit values

In [None]:
digits.images

In [None]:
digits.target

Each image is represented as a matrix of integer-valued pixel intensities. This is because the images are grayscale and thus have only a single channel. Pixel intensities in these images range from 0 to 16.

In [None]:
digits.images[0]

It is always better to draw images in order to fully understand image data.

In [None]:
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("Training: %i" % label)

We can see that images have quite low resolution. Looking at shape of `images` attribute tells us there are 1797 images each 8 by 8 pixels. 

In [None]:
digits.images.shape

### Statistics

First, let's try extracting some statistical features from the image pixel values.

<div class="alert alert-block alert-warning"><h5><b>Exercise 4</b></h5></div>

Transform images in `digits.images` into a training data matrix where each image is represented by three features: mean pixel value, standard deviation of pixel values, and number of non zero pixels. So for example the first image `digits.images[0]` gets transformed into a vector `[ 4.59375, 5.18326258, 35.]`.

Hint: You can use functionality of numpy to do all transformations easily. 
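As an illustration of the hint, NumPy reductions accept a tuple of axes, which lets you summarize every image in a stack at once. The stack of two 2 by 2 images below is made up, and the sketch is not meant as the reference solution.

```python
import numpy as np

# Hypothetical stack of two 2x2 single-channel images, shape (2, 2, 2).
toy_images = np.array(
    [
        [[0.0, 4.0], [0.0, 8.0]],
        [[1.0, 1.0], [1.0, 1.0]],
    ]
)

# Reducing over axes 1 and 2 collapses each image to a single number.
means = toy_images.mean(axis=(1, 2))
stds = toy_images.std(axis=(1, 2))  # NumPy uses the population std (ddof=0)
nonzero = np.count_nonzero(toy_images, axis=(1, 2))

toy_features = np.column_stack([means, stds, nonzero])
print(toy_features)
```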

In [None]:
# TODO: your code goes here...
# image_data_stats = ...

---

Here we define a helper function that does classifier training and evaluation.

In [None]:
def fit_and_evaluate_decision_tree_classifier(image_data, true_labels):
    X_train, X_test, y_train, y_test = train_test_split(
        image_data, true_labels, test_size=0.2, shuffle=False, random_state=42
    )
    dtc = DecisionTreeClassifier()
    dtc.fit(X_train, y_train)
    y_pred = dtc.predict(X_test)

    print(
        f"Classification report for classifier {dtc}:\n"
        f"{classification_report(y_test, y_pred)}\n"
    )

    return dtc

Now we train and evaluate decision tree classifier on our statistical features. You should get precision around 0.19.

In [None]:
fit_and_evaluate_decision_tree_classifier(image_data_stats, digits.target)

The overall performance is poor. But notice that classes 0 and 1 have much higher accuracy as their shapes are distinct from other digits. 

### Flatten

Let's improve the performance by including as much of the original image as possible.

<div class="alert alert-block alert-warning"><h5><b>Exercise 5</b></h5></div>

Transform images in `digits.images` into a training data matrix where each image is a vector of pixels in the image. So for example the first image `digits.images[0]` gets transformed into a vector `[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10., 15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4., 12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8., 0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5., 10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.]`.

In [None]:
# TODO: Your code goes here...
# image_data_flatten = ...

---

Now we train and evaluate a decision tree classifier on the flattened images. You should get accuracy around 0.8.

In [None]:
dtc_flatten = fit_and_evaluate_decision_tree_classifier(
    image_data_flatten, digits.target
)

This result is quite good but only because we have "nice" training data. Images are "nice" because all digits are perfectly centered. If we move the digit a bit, we get different prediction.

In [None]:
one_original = digits.images[93]

In [None]:
one_shifted = digits.images[93][:, [6, 7, 0, 1, 2, 3, 4, 5]]

In [None]:
_, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 3))
for ax, image in zip(axes, [one_original, one_shifted]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")

Here we have two almost identical pictures of number 1. The first digit is centered and the second digit is moved to the right. For the first digit we obtain correct label.

In [None]:
dtc_flatten.predict([one_original.flatten()])

But for the shifted image we get a different label, 7 in this case.

In [None]:
dtc_flatten.predict([one_shifted.flatten()])

This is because in the training data only sevens have more black pixels in the right part of the image.
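To see why a shift confuses a model trained on flattened pixels, consider this small sketch on a made-up 4 by 4 "image". It uses `np.roll`, which for a shift of two columns produces the same result as the column indexing used above.

```python
import numpy as np

# Hypothetical 4x4 "image" with a single vertical stroke in column 1.
toy_digit = np.zeros((4, 4))
toy_digit[:, 1] = 1.0

# Shift two columns to the right; columns pushed past the edge wrap around.
toy_shifted = np.roll(toy_digit, shift=2, axis=1)

# The stroke is unchanged, but it now occupies different flattened positions.
print(np.flatnonzero(toy_digit.flatten()))    # [ 1  5  9 13]
print(np.flatnonzero(toy_shifted.flatten()))  # [ 3  7 11 15]
```

The two flattened vectors share no active feature, even though a human would consider both images to show the same stroke.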

### Convolutional Neural Networks (CNN)

CNNs are the models typically used for image recognition. They repeatedly extract features (e.g., edges, shapes, or gradients) from small patches of the image through convolution. We will come back to CNNs in a later tutorial on neural networks.

## Text

Text data are sequences of characters of varying lengths, typically represented as strings.

### Examples
* messages (IM, SMS, ...)
* textual part of social media posts
* articles (news, wikipedia, ...) 
* research papers
* books
* reviews (books, movies, games, ...)
* text form answers 

### Tasks
* classification – assign one or more predetermined labels to the text (e.g., sentiment analysis (positive vs negative review))
* clustering – group different texts based on their similarity (e.g., by topic)
* translation – for a given text in one language generate new text in other language with the same meaning
* summarization – for a given text generate new text that keeps only main points of the original text (e.g., abstracts in news article)
* question answering – for a given question generate text that answers the question (e.g., aggregated answers in search engines)
* named entity recognition – find parts of the text that reference some real world entity

### Why We Need Special Treatment
* typical machine learning models do not accept strings but numbers,
* input texts can have different lengths,
* two words may have very different meaning even though they have similar letters in them (vine vs wine), and
* same word can have multiple meanings based on the context (How **can** I help you? vs Open the **can**.)

Therefore, we need to use special methods or special models:
* vectorizing sentences by counting word occurrences from predefined vocabulary (bag of words)
* embeddings (e.g., word2vec, position in a vector space based on word's semantics)
* transformers (models with internal embeddings based on attention)
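The first method in the list above, bag of words, is simple enough to sketch by hand. The vocabulary, the sentences, and the naive whitespace tokenization are assumptions for illustration; real vectorizers tokenize more carefully.

```python
from collections import Counter

vocabulary = ["good", "bad", "movie", "plot"]  # hypothetical fixed vocabulary


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[word] for word in vocabulary]


print(bag_of_words("Good movie with a good plot", vocabulary))  # [2, 0, 1, 1]
print(bag_of_words("Bad movie", vocabulary))  # [0, 1, 1, 0]
```

Texts of different lengths end up as vectors of the same length, one entry per vocabulary word, and the word order is lost.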

We will use movie reviews from IMDB as an example dataset. Our goal will be to correctly classify sentiment of the reviews. Possible labels are positive (1) and negative (0).

In [None]:
text_data = pd.read_csv(
    "https://www.fi.muni.cz/ib031/datasets/imdb_labelled.txt",
    sep="\t",
    names=["review", "sentiment"],
)
pd.set_option("display.max_colwidth", 200)
text_data.head()

### Bag of Words

Let's start with the simplest technique called bag-of-words.

<div class="alert alert-block alert-warning"><h5><b>Exercise 6</b></h5></div>

Transform text reviews using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `scikit-learn`. Take a look at the documentation and configure the vectorizer such that it removes English stop words (i.e., common words not having any meaning like "the"). Save transformed reviews into variable `vectorized_reviews`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# TODO: Your code goes here...
# vectorized_reviews = ...

---

Once again we use a helper function for training and evaluation of decision tree classifier.

In [None]:
def fit_and_evaluate_decision_tree_classifier(text_data, true_labels):
    X_train, X_test, y_train, y_test = train_test_split(
        text_data, true_labels, test_size=0.2, shuffle=False, random_state=42
    )
    dtc = DecisionTreeClassifier()
    dtc.fit(X_train, y_train)
    y_pred = dtc.predict(X_test)

    print(
        f"Classification report for classifier {dtc}:\n"
        f"{classification_report(y_test, y_pred)}\n"
    )

You should get accuracy around 0.67.

In [None]:
fit_and_evaluate_decision_tree_classifier(vectorized_reviews, text_data.sentiment)

### Word Embeddings

Let's see if we can improve the accuracy with word embeddings. There are many possible ways to embed words and you can even train your own. However, often it is best to use some pretrained embedding because these embeddings have been trained on massive corpora that you likely do not have access to. The size of training data is crucial to get useful embedding capturing the word similarity correctly.

In [None]:
import gensim.downloader as api

Here we load a [GloVe](https://nlp.stanford.edu/projects/glove/) word embedding trained on Wikipedia articles and the Gigaword news corpus. <span style="color:red">It can take some time to download the embeddings!</span>

In [None]:
word_vectors = api.load("glove-wiki-gigaword-100")

We can have a look at vector representation for word "hello".

In [None]:
word_vectors.get_vector("hello")

<div class="alert alert-block alert-danger"><h5><b>(Optional) Exercise 7</b></h5></div>

Transform reviews using GloVe word embeddings. Each review will be represented by a mean vector of all word embeddings in that review. Save the result to variable `review2vec`.

Note that not all words are present in the GloVe embeddings, you can ignore such words now.

Hint: Use method `get_vector` to get embeddings and then numpy to find mean. Use method `build_analyzer` of `CountVectorizer` to correctly tokenize the reviews into tokens (words).
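The averaging step itself can be tried on a toy embedding first. The dictionary below with made-up 3-dimensional vectors stands in for the real GloVe embeddings, and returning a zero vector when no word is known is an assumption of this sketch, not a requirement of the exercise.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real GloVe vectors have 100 dimensions.
toy_embeddings = {
    "great": np.array([1.0, 0.0, 2.0]),
    "film": np.array([3.0, 2.0, 0.0]),
}


def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:  # no known word at all: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)


print(mean_embedding(["great", "film", "zzz"], toy_embeddings, dim=3))  # [2. 1. 1.]
```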

In [None]:
# TODO: Your code goes here...
# review2vec = ...

---

If you did everything correctly, you should get accuracy of about 0.6. That is not better than bag-of-words; a more complex representation does not always mean better results.

In [None]:
fit_and_evaluate_decision_tree_classifier(review2vec, text_data.sentiment)

### Transformers

Right now, the state-of-the-art models for many NLP tasks are so-called transformers. They are similar to word2vec in the sense that they build a word embedding internally, but in a clever way. Instead of looking at the whole sentence, they have attention mechanisms that can locate important positions in the sentence. This way they can better capture the true meaning of a sentence. Transformers are a highly advanced topic well outside of this course's scope. If you are interested in transformers, you can look at this [primer on NLP models and transformer](https://www.youtube.com/watch?v=rURRYI66E54).