# Tutorial 01 – Working with Data

In this tutorial we will take a look at how to represent, manipulate, store, and load data. We will use two widely used packages called `NumPy` and `pandas`. The actual representation of data is handled by `NumPy`, which uses n-dimensional array objects to store the data. Part of the code is written in C, C++, and Fortran to make it fast. `pandas` is a high-level abstraction that simplifies data manipulation such as selecting a specific subset of data, grouping, and basic plotting. Mastering just these two packages almost qualifies you as a data scientist.

While neither of these two packages does any machine learning itself, you can go quite a long way with them. `NumPy` alone is perfect for linear algebra, such as matrix multiplication, and it is an essential requirement for other machine learning packages. `pandas` is great for exploring, storing, and loading data.

Remember, nicely prepared data are the key to success in machine learning.

## NumPy

`NumPy` is typically imported with the alias `np` to save some characters.

In [None]:
import numpy as np

You can create a simple array right from a list.

In [None]:
alist = [1, 2, 3, 4]
array = np.array(alist)
array

Lists and arrays behave quite similarly in terms of indexing and slicing. Arithmetic operations on arrays are quite different, though: arrays behave more like vectors (which they are, after all). Another key difference is that an array has a fixed size, so it cannot be easily extended with new elements.

In [None]:
print(array[1])
print(array[1::2])
print(array * 2)
print(array + 10)
print(array + array)
print(array * array)
# array.append(10) this does not work!

Notice that the math operations are done element-wise. Multiplying `array` by itself results in a vector of squared elements. If you would like to do some linear algebra, such as a dot product, you need to call a method.

In [None]:
array.dot(array)
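As a minimal sketch of the difference between element-wise multiplication and the dot product, the example below uses two illustrative vectors `v` and `w` (these names are assumptions, not part of the tutorial data). It also shows the `@` operator and `np.dot`, which compute the same value as the `dot` method for one-dimensional arrays.

```python
import numpy as np

# Illustrative vectors (hypothetical values, chosen only for this sketch).
v = np.array([1, 2, 3, 4])
w = np.array([4, 3, 2, 1])

elementwise = v * w        # multiplies matching elements, result is a vector
inner_method = v.dot(w)    # dot product: sum of the element-wise products
inner_operator = v @ w     # the @ operator computes the same dot product
inner_function = np.dot(v, w)

print(elementwise)
print(inner_method, inner_operator, inner_function)
```

The three dot-product spellings are interchangeable here; which one to use is mostly a matter of readability.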

Now it's time to go to higher dimensions.

In [None]:
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix

Multidimensional arrays can also be indexed along each dimension.

In [None]:
matrix[1:, :2]

You can even use an array of Booleans to select the elements at positions where the value is `True`.

In [None]:
matrix[[True, False, True]]
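In practice the Boolean array is often produced by a comparison rather than written by hand. The following is a small sketch of that pattern; the array `m` and the variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative 3x3 array (hypothetical name m, same values as the matrix above).
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

mask = m > 5                     # Boolean array with the same shape as m
selected = m[mask]               # 1-D array of the elements where mask is True
odd_rows = m[m[:, 0] % 2 == 1]   # rows whose first element is odd

print(mask)
print(selected)
print(odd_rows)
```

A mask with the same shape as the array picks individual elements, while a one-dimensional mask along the first axis picks whole rows.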

Often you will need to create larger arrays with some specific values and you would not want to fill in all the numbers manually. Luckily, there is a handful of functions at your disposal.

In [None]:
print(np.zeros(10))
print(np.full(3, fill_value=7))
print(np.identity(10))
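A few other constructors are worth knowing when the values follow a regular pattern. The sketch below shows `np.arange`, `np.linspace`, and `np.eye`; the variable names are illustrative assumptions.

```python
import numpy as np

steps = np.arange(0, 10, 2)       # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values, both ends included
eye = np.eye(3)                   # 3x3 identity matrix, like np.identity(3)

print(steps)
print(grid)
print(eye)
```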

Every array has a shape, i.e., its dimensions. The shape is always represented as a tuple, and each dimension is called an axis.

In [None]:
print(array.shape)
print(np.identity(10).shape)
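Axes matter mostly when you aggregate: many functions accept an `axis` argument that says which dimension to collapse. A minimal sketch with an illustrative array `m` (a hypothetical name) follows.

```python
import numpy as np

# Illustrative array with 2 rows and 3 columns, i.e., shape (2, 3).
m = np.array([[1, 2, 3], [4, 5, 6]])

total = m.sum()            # no axis given: sum over all elements
col_sums = m.sum(axis=0)   # collapse axis 0 (rows): one value per column
row_sums = m.sum(axis=1)   # collapse axis 1 (columns): one value per row

print(m.shape, m.ndim)
print(total, col_sums, row_sums)
```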

Note that when creating an array you can supply any wild shape you want the new array to have. That's right, you can make a tensor.

In [None]:
np.ones(shape=(3, 5, 7))

Every array can be reshaped into any shape you like as long as the number of elements is equal to the product of the new dimensions.
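The rule can be sketched on a small illustrative vector `v` with six elements (the names below are assumptions for this example only). Passing `-1` for one dimension asks `NumPy` to infer it, and a shape with the wrong number of elements raises a `ValueError`.

```python
import numpy as np

v = np.arange(6)           # six elements: 0, 1, 2, 3, 4, 5

m = v.reshape(2, 3)        # 2 * 3 = 6, so this works
auto = v.reshape(3, -1)    # -1 is inferred as 6 / 3 = 2
flat = m.ravel()           # back to a 1-D array

try:
    v.reshape(4, 2)        # 4 * 2 = 8 does not match 6 elements
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(m.shape, auto.shape, flat.shape, reshape_failed)
```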

<div class="alert alert-block alert-warning"><b>Exercise 1</b></div>

Reshape the vector $\mathbf{a}$ into shape $(3,4,2)$, multiply it with matrix $\mathbf{M}$ and then flatten it back to a vector. The expected result is $[ 11,  53,  95, 137, 179, 221, 263, 305, 347, 389, 431, 473]$.

In [None]:
M = np.arange(10, 12)
a = np.arange(24)

# TODO: your code goes here...

All these operations are incredibly fast. Below is code that takes the matrix and normalizes its values, i.e., divides all values by the maximal value in the matrix.

In [None]:
matrix = np.random.randint(0, 100, (1000, 1000))
%timeit -n 5 matrix / matrix.max()

<div class="alert alert-block alert-warning"><b>Exercise 2</b></div>

Write a function that does the same thing, i.e., divides each element by the maximal value in the matrix, but operates on a list of lists, and compare the times.

In [None]:
def normalize(list_matrix):
    # TODO: your code goes here...
    pass

In [None]:
list_matrix = matrix.tolist()
%timeit -n 5 normalize(list_matrix)

<div class="alert alert-block alert-warning"><b>Exercise 3</b></div>

Later in the course we will use metrics to evaluate machine learning models. Write a function `rmse(x, y)` that computes one such metric, the Root Mean Squared Error (RMSE), of two vectors $x$ and $y$ of length $n$, given by the formula $$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}{(x_i-y_i)^2}}{n}}$$

The expected result for the predefined vectors is 2.16794833886788.

In [None]:
x = np.array([0, 4, 1, 0, 2, 0, 2, 2, 4, 2])
y = np.array([1, 1, 2, 3, 1, 4, 1, 0, 3, 4])


def rmse(x, y):
    # TODO: your code goes here...
    pass


rmse(x, y)

<div class="alert alert-block alert-warning"><b>Exercise 4</b></div>

A typical operation in linear algebra you might encounter is matrix multiplication. Let's say we measure three quantities (e.g., the width, height, and depth of real-life objects). For each observation we have a vector of these measurements $(w, h, d)$. Now suppose we need to transform these measurements for further processing such that each observation is transformed into $(w+h, h-d)$. We want to transform all observations at once using a transformation matrix $\mathbf{M}$.
1. Stack the observations into a matrix of observations $\mathbf{X}$ with each row corresponding to one of the observations, i.e., $\mathbf{X} = \begin{pmatrix}x_1 \\ x_2 \\ x_3 \\ x_4\end{pmatrix}$.
2. Write a transformation matrix $\mathbf{M}$ such that $\mathbf{X}\mathbf{M} = \mathbf{Y}$, where $\mathbf{Y}$ has the transformed observations as rows. In other words, the following equation needs to hold.
$$\begin{pmatrix}1 & 4 & 6 \\ 2 & 4 & 7 \\ 1 & 2 & 8 \\ 1 & 4 & 9\end{pmatrix} \times \mathbf{M} = \begin{pmatrix}5 & -2 \\ 6 & -3 \\ 3 & -6 \\ 5 & -5 \end{pmatrix}$$

In [None]:
x1 = np.array([1, 4, 6])
x2 = np.array([2, 4, 7])
x3 = np.array([1, 2, 8])
x4 = np.array([1, 4, 9])

# TODO: your code goes here...

<div class="alert alert-block alert-danger"><b>Exercise 5</b></div>

Matrix multiplication is also the basis for projections that help to reduce the dimensionality of the data. One such projection is known as Principal Component Analysis (PCA), which we will cover in depth later. For now, let's say we have observations $\mathbf{x_1},\ldots,\mathbf{x_4}$ and we want to project them using PCA.
1. Stack the observation vectors into a matrix $\mathbf{X}$.
2. Center each column of $\mathbf{X}$ by subtracting its mean. We label the centered matrix $\mathbf{X_c}$.
3. Compute $\mathbf{C_X}$, which is the covariance matrix of $\mathbf{X^T}$, i.e., the transposed matrix $\mathbf{X}$.
4. Compute the projected observations $\mathbf{Y}$ given by the formula
$$\mathbf{Y} = \mathbf{X_c}\mathbf{P}$$
where $\mathbf{P}$ is a matrix of eigenvectors of $\mathbf{C_X}$.

Expected result for the given observations is $$Y = \begin{pmatrix}  \phantom{-}0.61750836 & \phantom{-}0.10599488 & -0.02394807 & 0. \\ -0.48025785 & -0.12818168 & -0.03235508 & 0. \\ \phantom{-}0.22710374 & -0.20107790 & \phantom{-}0.03341513 & 0. \\ -0.36435424 & \phantom{-}0.22326470 & \phantom{-}0.02288802 &  0. \end{pmatrix}$$

In [None]:
x1 = np.array([1, 1, 1, 1])
x2 = np.array([1, 0.5, 0.9, 0])
x3 = np.array([1, 1, 1, 0.5])
x4 = np.array([1, 0.3, 1, 0.3])
X = np.stack((x1, x2, x3, x4))

# TODO: your code goes here...

## Pandas

`pandas` is usually imported with the alias `pd` to save some characters.

In [None]:
import pandas as pd

The main feature of `pandas` is a data structure called `DataFrame`. You can think of it as a table with named columns and labeled rows (numbered by default). Let's create some data frames.

In [None]:
df = pd.DataFrame(
    [[10, "hey", 0.3], [20, "hi", 0.1], [30, "hello", 0.16]],
    columns=["number", "greeting", "ratio"],
    index=["first", "second", "third"],
)
df

You can also specify data column-wise. Both styles produce the same result.

In [None]:
pd.DataFrame(
    {
        "number": [10, 20, 30],
        "greeting": ["hey", "hi", "hello"],
        "ratio": [0.3, 0.1, 0.16],
    },
    index=["first", "second", "third"],
)

You can select a sub-table either by the names of columns/rows or by their indices. The data are actually saved as `NumPy` arrays, so you can use the same indexing techniques.

In [None]:
df["number"]

In [None]:
df[["number", "ratio"]]

In [None]:
df.loc["first"]

In [None]:
df.loc[["first", "third"]]

In [None]:
df.loc[["first", "third"], ["greeting"]]

In [None]:
df.iloc[1:, :2]
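The sketch below contrasts label-based and position-based selection of a single cell and shows `to_numpy`, which returns the values as a `NumPy` array. The data frame `frame` is a hypothetical numeric-only variant of the one above, used here so that the example is self-contained.

```python
import pandas as pd

# Hypothetical data frame for illustration (numeric columns only).
frame = pd.DataFrame(
    {"number": [10, 20, 30], "ratio": [0.3, 0.1, 0.16]},
    index=["first", "second", "third"],
)

by_label = frame.loc["second", "number"]   # select by row and column names
by_position = frame.iloc[1, 0]             # select by integer positions
values = frame.to_numpy()                  # values as a 2-D NumPy array

print(by_label, by_position)
print(values.shape)
print(values[1:, :1])                      # plain NumPy slicing on the values
```

Note that `loc` slices include the end label, whereas `iloc` slices follow the usual Python convention and exclude the end position.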

You can also query the data frame and get the rows that satisfy a given condition. This can be done either by supplying a Boolean index vector (the first example) or by explicitly calling the `query` method (the second example).

In [None]:
df[(df.number >= 20) & (df.greeting.str.contains("e"))]

In [None]:
df.query("number > (ratio * 100)")

`pandas` also implements all the basic statistics you might need to know about your data.

In [None]:
df.describe()

In [None]:
df["number"].median()

Apart from manipulating data in memory, `pandas` is also handy for loading and storing data on disk. It supports many standard file formats like `csv`, `hdf`, and `pickle`. Saving or loading a file is as easy as calling a single method.

In [None]:
df.to_csv("example.csv")

In [None]:
pd.read_csv("example.csv")

Notice that saving and loading the same file does not necessarily result in an identical data frame. In this example, the index of the original data frame has become a column. Here we need to specify the column that will be used as the index. This, however, is not a universal solution; you might encounter many other problems.

In [None]:
pd.read_csv("example.csv", index_col=0)
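The round trip can also be checked programmatically. The following sketch writes to an in-memory buffer instead of a file (an assumption made only to keep the example self-contained) and uses a small hypothetical data frame named `original`.

```python
import io

import pandas as pd

# Hypothetical data frame for illustration.
original = pd.DataFrame(
    {"number": [10, 20], "greeting": ["hey", "hi"]},
    index=["first", "second"],
)

# Write the CSV into an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer)
csv_text = buffer.getvalue()

naive = pd.read_csv(io.StringIO(csv_text))                  # index becomes a column
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)  # index is restored

print(naive.columns.tolist())
print(restored.index.tolist())
print(original.equals(restored))  # True only if shape, values, and dtypes match
```

`DataFrame.equals` is a convenient way to verify that what you loaded is what you saved before relying on the file.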

You can even specify a URL of the file you would like to read.

In [None]:
weather = pd.read_csv("https://www.fi.muni.cz/ib031/datasets/weather.csv")
weather

<div class="alert alert-block alert-warning"><b>Exercise 6</b></div>

Let's do some "data science"! Compute the mean temperature of the days when golf was played. The expected result is 73.0.

In [None]:
# TODO: your code goes here...

<div class="alert alert-block alert-warning"><b>Exercise 7</b></div>

Compute the most common weather outlook when golf was not played. The expected result is sunny.

In [None]:
# TODO: your code goes here...

It is really easy to modify and calculate with whole columns. You can also save the result as a new column. Notice that you can access columns either like values in a dictionary or as attributes.

In [None]:
weather["humidity"] + weather.temperature

In [None]:
weather["over_70"] = weather.temperature > 70
weather

<div class="alert alert-block alert-warning"><b>Exercise 8</b></div>

Compute the correlation between columns `windy` and `play`. The expected result is -0.258199.

In [None]:
# TODO: your code goes here...

Until now, we were simply using the data as they are. Sometimes this is not enough and we need to rearrange the data a bit. There are three common operations: grouping, pivoting, and melting. Grouping is useful if you want to aggregate information based on the values in one or more specified columns. Pivot and melt operations reshape the data frame. You might need them to reshape the data frame for easier plotting.

In [None]:
weather[["outlook", "humidity"]].groupby("outlook").mean()

The function applied to the groups can be arbitrary, e.g., take the last row with the given value.

In [None]:
weather.groupby("outlook").apply(lambda group: group.iloc[-1])

In [None]:
melted_df = weather.melt(id_vars="play", value_vars=["windy", "outlook"])
melted_df

In [None]:
weather[:5].pivot(index="windy", columns="temperature", values=["play"])
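To see how the three operations relate, here is a sketch on a tiny made-up table (the cities, years, and numbers are invented purely for illustration and are not part of the weather dataset). It aggregates with `groupby`, turns the wide table into a long one with `melt`, and then restores the wide layout with `pivot`.

```python
import pandas as pd

# Made-up data for illustration only.
wide = pd.DataFrame(
    {
        "city": ["Brno", "Brno", "Prague", "Prague"],
        "year": [2020, 2021, 2020, 2021],
        "rain": [10, 12, 8, 9],
        "sun": [30, 28, 33, 35],
    }
)

# Grouping: one aggregated value per city.
mean_rain = wide.groupby("city")["rain"].mean()

# Melting: wide -> long, one row per (city, year, variable) combination.
long_df = wide.melt(id_vars=["city", "year"], value_vars=["rain", "sun"])

# Pivoting: long -> wide again, one column per original variable.
back = long_df.pivot(index=["city", "year"], columns="variable", values="value")

print(mean_rain)
print(long_df)
print(back)
```

Melting and pivoting are roughly inverse operations: `melt` stacks the chosen columns into `variable`/`value` pairs, and `pivot` spreads them back out.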

<div class="alert alert-block alert-danger"><b>Exercise 9</b></div>

Download and load the dataset `attempts.csv` from https://github.com/adaptive-learning/adaptive-learning-research/tree/master/data/robomission-2019-12. Read the instructions on the page and briefly read through the data description.

1. Find the id of the student with the most solved problems. The expected result is 16364.
2. Compute the average success rate of each problem. The expected result starts with 0.870189, 0.973117, 0.866077, 0.362319, 0.658576.
3. Add 2 minutes to all time stamps in the column `start`.
4. Create a table with a row for each student and a column for each item. The value in row $i$ and column $j$ says whether student $i$ has ever solved problem $j$.

In [None]:
# TODO: your code goes here...