# PB016: Artificial intelligence I, labs 12 - Deep learning

Today's topic is a quick and dirty introduction to deep learning. We'll focus on:
1. __Dummy deep learning pipeline__
2. __Developing your own deep learning classifier__

---

## 1. Dummy deep learning pipeline

__Basic facts__
- Deep learning consists of designing, training and validating machine learning models based on various [neural architectures](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks) that typically involve multiple (hidden) layers, each consisting of many neural computing units (a simple example of one unit is the [perceptron](https://en.wikipedia.org/wiki/Perceptron), such as the one we implemented in the previous labs).
- A number of libraries that integrate seamlessly with parallel computational architectures are available for developing deep learning models. Some popular examples are:
 - [PyTorch](https://pytorch.org/) - a successor to Torch, a general-purpose ML library written in C (with a Lua interface); now a state-of-the-art deep learning framework with relatively easy-to-use Python (and C++) abstraction layers.
 - [TensorFlow](https://www.tensorflow.org/) - a general-purpose, highly optimised library for multilinear algebra and statistical learning.
 - [Keras](https://keras.io/) - formerly a separate project, now an abstraction layer for user-friendly development of deep learning models integrated with TensorFlow.
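
To make the notion of layers composed of simple units more concrete, here is a minimal NumPy sketch of a forward pass through one hidden layer and one output unit. The layer sizes, the random weights and all variable names are illustrative assumptions, not part of the lab code; the libraries listed above do essentially this (plus training) far more efficiently.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 4 toy examples with 8 features each (sizes chosen only for illustration)
x = rng.normal(size=(4, 8))

# hidden layer: 5 units, each with 8 weights and 1 bias
W1, b1 = rng.normal(size=(8, 5)), np.zeros(5)
# output layer: 1 unit reading the 5 hidden activations
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

hidden = sigmoid(x @ W1 + b1)       # shape (4, 5)
output = sigmoid(hidden @ W2 + b2)  # shape (4, 1)

print(hidden.shape, output.shape)
```

Each column of `W1` plays the role of one perceptron-like unit; stacking more such layers between input and output is what makes a network "deep".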

### The example task - predicting onset of diabetes using Keras
- This is based on the widely used Pima Indians dataset, a classic machine learning sandbox dataset described in detail for instance [here](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe).
- The task is to use that dataset to train a classifier for predicting whether or not a person develops diabetes.
- This is based on a number of characteristics (i.e., features) like blood pressure or body mass index.

### Loading the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)

In [None]:
# importing the library for handy CSV file processing
import pandas as pd

# loading the data, in CSV format, from the web
dataframe = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/example/diabetes.csv')

# checking the first few rows of the CSV
dataframe.head()

### Creating the features and labels data structures

In [None]:
# getting just the Outcome column as the vector of labels
# - note that the column contains 0, 1 values that correspond to negative
# (no diabetes developed) and positive (diabetes developed) example labels,
# respectively
df_labels = dataframe.Outcome.values.astype(float)
# the features are the data minus the label vector
# - this contains the remaining features present in the data
df_features = dataframe.drop('Outcome',axis=1).values

### Splitting the data into train and test sets using [scikit-learn](https://sklearn.org/)

In [None]:
# importing a convenience data splitting function from scikit-learn
from sklearn.model_selection import train_test_split

# computing a random 80-20 split (80% training data, the remaining 20% held
# out as "unseen" data for testing the model trained on the 80%)
x_train, x_test, y_train, y_test = train_test_split(
    df_features, df_labels, test_size=0.2, random_state=42)
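
For intuition, a shuffled hold-out split can be sketched in a few lines of NumPy. This is only a rough illustration of the idea, not scikit-learn's actual implementation; the function name `holdout_split` and the toy arrays are assumptions made for this example.

```python
import numpy as np

def holdout_split(features, labels, test_fraction=0.2, seed=42):
    # shuffle the row indices reproducibly, then cut off the test part
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(np.ceil(test_fraction * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features[train_idx], features[test_idx],
            labels[train_idx], labels[test_idx])

# toy data: 10 examples with 2 features each, alternating 0/1 labels
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.array([0, 1] * 5)

xs_train, xs_test, ys_train, ys_test = holdout_split(toy_features, toy_labels)
print(xs_train.shape, xs_test.shape)
```

The key property is that no example ends up in both parts, so the test score reflects performance on data the model has not seen during training.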

### Creating a [Keras](https://keras.io/) model

In [None]:
# importing the basics from Keras

from keras.models import Sequential
from keras.layers import Dense

# designing the model

# a model for simple sequential stacking of layers
model = Sequential()
# the first hidden layer, linked to the input layer matching the feature 
# vectors of size 8 (number of features in the diabetes data)
model.add(Dense(100,input_dim=8,activation='sigmoid')) 
# the second hidden layer
model.add(Dense(100,activation='sigmoid'))
# the output layer, for classification into 2 classes (does/doesn't develop
# diabetes)
model.add(Dense(2,activation='softmax'))

# compiling the model
model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])
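
To get a feel for the size of the model defined above, the number of trainable parameters can be worked out by hand: a fully connected layer with `n` inputs and `m` units has `n * m` weights plus `m` biases. The helper name `dense_params` is an illustrative assumption; the layer sizes are taken from the cell above.

```python
def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair, plus one bias per unit
    return n_inputs * n_units + n_units

# input size followed by the sizes of the three Dense layers
layer_sizes = [8, 100, 100, 2]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [900, 10100, 202] 11202
```

Calling `model.summary()` on the Keras model should report the same per-layer and total counts.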

### Training the created model

In [None]:
# simply calling the fit function, training on the training data and validating
# on the test data after each epoch
model.fit(x_train, y_train, epochs=15,
          validation_data=(x_test, y_test))
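
One detail worth checking in the call above is the shape of the labels: `y_train` is a 1-D vector of 0/1 values, whereas the model's output layer has two softmax units and therefore produces two numbers per example. Assuming the two-unit output is kept, one common way to make the shapes agree is to one-hot encode the labels, sketched here in plain NumPy on made-up labels (the variable names are illustrative).

```python
import numpy as np

# toy label vector in the same 0/1 float format as df_labels
labels = np.array([0., 1., 1., 0.])

# row i of the identity matrix is the one-hot code of class i
one_hot = np.eye(2)[labels.astype(int)]

print(labels.shape, one_hot.shape)  # (4,) (4, 2)
print(one_hot)
```

Keras offers `keras.utils.to_categorical` for the same purpose. An alternative design, under the same caveat that this is a suggestion rather than the lab's prescribed solution, is a single sigmoid output unit with the original 0/1 labels and a binary cross-entropy loss.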

### Interpreting the results
- Not too great:
 - The loss is barely being optimised.
 - The accuracy is worse than a random baseline (0.5) in most runs.
- The reasons:
 - More or less boilerplate/default settings of the model.
 - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for instance the [blog post](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) referenced before, where a detailed exploratory analysis and input data transformation are carried out).
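
As one example of the kind of preprocessing meant here, the features can be standardised so that each column has zero mean and unit variance. The sketch below uses synthetic data standing in for two features on very different scales; the numbers and variable names are assumptions, and whether this alone fixes the diabetes model is not something this snippet demonstrates.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for two features with different scales
toy_train = rng.normal(loc=[120.0, 30.0], scale=[30.0, 7.0], size=(200, 2))
toy_test = rng.normal(loc=[120.0, 30.0], scale=[30.0, 7.0], size=(50, 2))

# statistics are computed on the training part only ...
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

# ... and then applied to both parts, so no test information leaks in
train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std

print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
```

scikit-learn's `StandardScaler` wraps the same fit-on-train, transform-both pattern.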

---

## 2. Developing your own deep learning classifier
- Your task is to predict survivors of the Titanic disaster, as described in the [Kaggle](https://www.kaggle.com) challenge on [Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview).

![titanic](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/titanic.jpg)

- Split into groups (for instance, one row of seats makes one group).
- Each group will choose a coordinator. That person will be responsible for
 - Outlining the overall solution and distributing specific sub-tasks among group members.
 - Integrating the results of the group's work in their shared notebook.
 - Subsequent presentation of the results to the rest of the class.
- Each group will register at [Kaggle](https://www.kaggle.com/) so that you can officially participate in the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) competition (one account per group is enough).
- Then, use [Keras](https://keras.io/) to solve the Titanic survivors' prediction problem as follows:
 - Get the challenge [data](https://www.kaggle.com/c/titanic/data) via the URLs in the notebook below.
 - Design a simple neural model for classification of (non)survivors using Keras.
 - Train the model on the `train.csv` dataset (after possibly preprocessing the data).
 - Use the trained model to predict the labels of the set `test.csv` (i.e., the values of the column _"Survived"_; for more details, see the competition documentation itself).
 - [Upload results](https://www.kaggle.com/c/titanic/submit) on Kaggle.
 - Brag about your model and score to the rest of the class!


### Loading the train and test data

In [None]:
# importing pandas, just in case it wasn't imported before
import pandas as pd

# loading the train and test data using pandas

df_train = pd.read_csv(
    'https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/train.csv',
    index_col='PassengerId')
df_test = pd.read_csv(
    'https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/test.csv',
    index_col='PassengerId')

### Checking out the train and test data contents

In [None]:
df_train.head()

In [None]:
df_test.head()
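
Before a neural model can be trained, text columns have to be turned into numbers and missing values filled in. The following is a minimal sketch of such a step on a tiny made-up frame. It assumes the column names `Pclass`, `Sex`, `Age`, `Fare` and `Embarked` from the Kaggle data description; the function name `preprocess` and the particular choices (median and most-frequent-value imputation) are illustrative, not a recommended solution.

```python
import pandas as pd

# a tiny made-up frame mimicking a few Titanic columns (assumed names)
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 26.0],
    'Fare': [7.25, 71.28, 7.92],
    'Embarked': ['S', 'C', None],
})

def preprocess(df):
    out = pd.DataFrame(index=df.index)
    out['Pclass'] = df['Pclass']
    # binary text column -> 0/1 indicator
    out['IsFemale'] = (df['Sex'] == 'female').astype(float)
    # missing numeric values -> column median
    out['Age'] = df['Age'].fillna(df['Age'].median())
    out['Fare'] = df['Fare'].fillna(df['Fare'].median())
    # missing port -> most frequent port, then one indicator column per port
    port = df['Embarked'].fillna(df['Embarked'].mode()[0])
    out = out.join(pd.get_dummies(port, prefix='Embarked'))
    return out.astype(float)

features = preprocess(toy)
print(features)
```

In a real solution, the imputation values would normally be computed on `train.csv` and reused for `test.csv`, and both frames need to end up with the same set of columns in the same order.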

### Developing the model itself

In [None]:
# TODO - YOUR MODEL GOES HERE

### Notes on the solution
- Feel free to get inspired on the web, but make sure you understand what you're doing when using someone else's code.
- A practical note on getting the submission file to be uploaded to Kaggle, if you're working in Google Colab:
 - You can create the CSV in your virtual environment, for instance using the `submission.to_csv('submission.csv', index=False)` command, assuming the `submission` variable is a _pandas_ data frame object.
 - Then you can download it to your local machine: import the `files` module with the `from google.colab import files` line, and then call `files.download('submission.csv')`.
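
A `submission` data frame of the kind mentioned above can be assembled roughly as follows. This sketch assumes that the competition expects the two columns `PassengerId` and `Survived`, and that the model outputs one survival probability per passenger; the toy frame, the probabilities and the 0.5 threshold are made up for illustration. Because the test data is loaded with `index_col='PassengerId'`, the IDs are taken from the index.

```python
import io

import numpy as np
import pandas as pd

# stand-in for df_test: PassengerId lives in the index
toy_test = pd.DataFrame({'Pclass': [3, 1]},
                        index=pd.Index([892, 893], name='PassengerId'))

# stand-in for model predictions (hypothetical probabilities of survival)
probabilities = np.array([0.2, 0.9])

submission = pd.DataFrame({
    'PassengerId': toy_test.index,
    'Survived': (probabilities > 0.5).astype(int),  # threshold into 0/1
})

# writing to an in-memory buffer here; pass a file name to create a file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Replacing `buffer` with `'submission.csv'` produces the file used in the download step described above.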