# PB016: Artificial intelligence I, labs 12 - Deep learning

Today's topic is a quick and dirty introduction to deep learning. We'll focus on:
1. __Dummy deep learning pipeline__
2. __Developing your own deep learning classifier__

---

## 1. Dummy deep learning pipeline

__Basic facts__
- Deep learning consists of designing, training and validating machine learning models based on various [neural architectures](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks) that typically involve multiple (hidden) layers consisting of many neural computing units (a simple example of one unit is the [perceptron](https://en.wikipedia.org/wiki/Perceptron), such as the one we implemented in the previous labs).
- A number of libraries that integrate seamlessly with parallel computational architectures are available for developing deep learning models. Some popular examples are:
 - [PyTorch](https://pytorch.org/) - a successor to Torch (a general-purpose ML library written in C with a Lua interface), now a state-of-the-art deep learning framework with relatively easy-to-use Python (and C++) abstraction layers.
 - [TensorFlow](https://www.tensorflow.org/) - a general-purpose, highly optimised library for multilinear algebra and statistical learning.
 - [Keras](https://keras.io/) - formerly a separate project, now an abstraction layer for user-friendly development of deep learning models integrated with TensorFlow.
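
To make the notion of layers built from many simple units more concrete, here is a minimal NumPy sketch of a forward pass through two fully connected layers. The layer sizes, the random weights and the `dense` helper are made up for illustration; the libraries above add training, batching and hardware acceleration on top of this basic computation.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, weights, bias, activation):
    # one fully connected layer: every unit computes a weighted sum of all
    # inputs, adds its bias and applies the activation function
    return activation(x @ weights + bias)

rng = np.random.default_rng(0)

# hypothetical sizes: 3 input features, 4 hidden units, 1 output unit
w_hidden, b_hidden = rng.normal(size=(3, 4)), np.zeros(4)
w_out, b_out = rng.normal(size=(4, 1)), np.zeros(1)

x = np.array([[0.5, -1.2, 3.0]])   # a batch with a single example
hidden = dense(x, w_hidden, b_hidden, sigmoid)
output = dense(hidden, w_out, b_out, sigmoid)

print(hidden.shape, output.shape)
```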

### The example task - predicting onset of diabetes using Keras
- This is based on the widely used PIMA Indians dataset - a classic machine learning sandbox dataset described in detail for instance [here](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe).
- The task is to use that dataset to train a classifier for predicting whether or not a person develops diabetes.
- This is based on a number of characteristics (i.e., features) like blood pressure or body mass index.

#### Loading the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)

In [1]:
# importing the library for handy CSV file processing
import pandas as pd

# loading the data, in CSV format, from the web
dataframe = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/example/diabetes.csv')

# checking the first few rows of the CSV
dataframe.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Creating the features and labels data structures

In [3]:
# getting just the Outcome column as the vector of labels
# - note that the column contains 0, 1 values that correspond to negative
#   (no diabetes developed) and positive (diabetes developed) example labels,
#   respectively
df_labels = dataframe.Outcome.values.astype(float)
# the features are the data minus the label vector
# - this contains the remaining features present in the data
df_features = dataframe.drop('Outcome',axis=1).values

### Splitting the data into train and test sets using [scikit-learn](https://sklearn.org/)

In [5]:
# importing a convenience data splitting function from scikit-learn

from sklearn.model_selection import train_test_split

# computing a random 80-20 split (80% training data, the remaining 20% held
# out as "unseen" data for testing the model trained on the 80%)

x_train, x_test, y_train, y_test = train_test_split(df_features,df_labels,\
                                                    test_size=0.2,\
                                                    random_state=42)
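
A small sketch of what the split produces, using made-up stand-in data instead of the diabetes set (the array contents and variable names here are purely illustrative). It also shows that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-in data: 10 examples with 2 features each, 0/1 labels
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# 8 examples end up in the training part, 2 in the test part
print(x_tr.shape, x_te.shape)

# the same random_state reproduces the same split
x_tr2, _, _, _ = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
print(np.array_equal(x_tr, x_tr2))
```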

### Creating a [Keras](https://keras.io/) model

In [6]:
# importing the basics from Keras

from keras.models import Sequential
from keras.layers import Dense

# designing the model

# a model for simple sequential stacking of layers
model = Sequential()
# the first hidden layer, linked to the input layer matching the feature 
# vectors of size 8 (number of features in the diabetes data)
model.add(Dense(100,input_dim=8,activation='sigmoid')) 
# the second hidden layer
model.add(Dense(100,activation='sigmoid'))
# the output layer, for classification into 2 classes (does/doesn't develop
# diabetes)
model.add(Dense(2,activation='softmax'))

# compiling the model
model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])
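
To get a feel for the size of this model, the number of trainable parameters can be worked out by hand: each unit of a dense layer has one weight per input plus one bias. The helper `dense_params` below is a made-up name for this sketch; calling `model.summary()` on the model above should report the same counts.

```python
def dense_params(n_inputs, n_units):
    # each unit has one weight per input plus one bias term
    return n_inputs * n_units + n_units

# layer sizes taken from the model above: 8 inputs -> 100 -> 100 -> 2
layer_sizes = [8, 100, 100, 2]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)        # [900, 10100, 202]
print(sum(per_layer))   # 11202
```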

### Training the created model

In [None]:
# simply calling the fit function, training on the training data and validating
# on the test data after each epoch
model.fit(x_train,y_train,epochs=15,\
          validation_data=(x_test, y_test))

### Interpreting the results
- Not too great:
 - The loss is barely being optimised.
 - The accuracy is worse than a random baseline (0.5) in most runs.
- The reasons:
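 - More or less boilerplate/default settings of the model (note also that a two-unit softmax output with a mean squared error loss is an awkward match for a single column of 0/1 labels).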
 - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for instance the [blog post](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) referenced before, where a detailed exploratory analysis and input data transformation is carried out).
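
One aspect of the model setup can be checked by hand. The sketch below assumes that Keras broadcasts the single 0/1 label against both softmax outputs when computing the mean squared error; this is an assumption about how the shapes in the model above combine, not something verified in this notebook. Under that assumption the loss comes out the same for either label.

```python
import numpy as np

def mse_with_broadcast_label(label, probs):
    # label: a single 0/1 value; probs: the two softmax outputs (sum to 1)
    # the scalar label is compared against both outputs
    return np.mean((label - probs) ** 2)

for p0 in (0.1, 0.5, 0.9):
    probs = np.array([p0, 1.0 - p0])
    loss_neg = mse_with_broadcast_label(0.0, probs)
    loss_pos = mse_with_broadcast_label(1.0, probs)
    print(p0, loss_neg, loss_pos)
```

If the assumption holds, the labels carry no signal for the optimiser and the loss is smallest (0.25) when both outputs equal 0.5, which would be consistent with a loss that barely moves. A single sigmoid output trained with binary cross-entropy, as in the Titanic solution later in this notebook, avoids the mismatch.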

---

## 2. Developing your own deep learning classifier
- Your task is to predict survivors of the Titanic disaster, as described in the [Kaggle](https://www.kaggle.com) challenge on [Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview).

![titanic](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/titanic.jpg)

- Split into groups (for instance, one row of seats makes one group).
- Each group will choose a coordinator. That person will be responsible for
 - Outlining the overall solution and distributing specific sub-tasks among group members.
 - Integrating the results of the group's work in their shared notebook.
 - Subsequent presentation of the results to the rest of the class.
- Each group will register at [Kaggle](https://www.kaggle.com/) so that you can officially participate in the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) competition (one account per group is enough).
- Then, use [Keras](https://keras.io/) to solve the Titanic survivors' prediction problem as follows:
 - Get the challenge [data](https://www.kaggle.com/c/titanic/data) via the URLs in the notebook below.
 - Design a simple neural model for classification of (non)survivors using Keras.
 - Train the model on the `train.csv` dataset (after possibly preprocessing the data).
 - Use the trained model to predict the labels of the set `test.csv` (i.e., the values of the column `Survived`; for more details, see the competition documentation itself).
 - [Upload results](https://www.kaggle.com/c/titanic/submit) on Kaggle.
 - Brag about your model and score to the rest of the class!


### Loading the train and test data

In [8]:
# importing pandas, just in case it wasn't imported before
import pandas as pd

# loading the train and test data using pandas

df_train = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/train.csv',\
                       index_col='PassengerId')
df_test = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/test.csv', \
                      index_col='PassengerId')

### Checking out the train and test data contents

In [9]:
df_train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
df_test.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
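
A quick way to see which columns need preprocessing is to count the missing values per column. The sketch below uses a tiny hand-made frame with a few columns of the first three training rows shown above, so it runs without downloading anything; on the real data the same calls would be applied to `df_train` and `df_test`.

```python
import numpy as np
import pandas as pd

# a small stand-in for df_train, values copied from the rows shown above
sample = pd.DataFrame(
    {
        'Survived': [0, 1, 1],
        'Sex': ['male', 'female', 'female'],
        'Age': [22.0, 38.0, 26.0],
        'Cabin': [np.nan, 'C85', np.nan],
        'Embarked': ['S', 'C', 'S'],
    },
    index=pd.Index([1, 2, 3], name='PassengerId'),
)

# number of missing values in each column
missing = sample.isna().sum()
print(missing)

# column types hint at which features are textual and which are numeric
print(sample.dtypes)
```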


### Developing the model itself

In [None]:
# TODO - YOUR MODEL GOES HERE

### Notes on the solution
- Feel free to look for inspiration on the web, but make sure you understand what you're doing when using someone else's code.
- A practical note on getting the submission file to be uploaded to Kaggle, if you're working in Google Colab:
 - You can create the CSV in your virtual environment, for instance using the `submission.to_csv('submission.csv', index=False)` command, assuming the `submission` variable is a _pandas_ data frame object.
 - Then download it to your local machine: import the `files` module with `from google.colab import files` and call `files.download('submission.csv')`.
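
Put together, the two steps might look like the sketch below. The `submission` frame here holds made-up predictions for illustration only, and the `try`/`except` fallback is an addition so that the same cell also runs outside Colab, where the `google.colab` module is not available.

```python
import pandas as pd

# hypothetical predictions for three test passengers
submission = pd.DataFrame({'PassengerId': [892, 893, 894],
                           'Survived': [0, 1, 0]})
submission.to_csv('submission.csv', index=False)

try:
    # only available inside Google Colab
    from google.colab import files
    files.download('submission.csv')
except ImportError:
    # outside Colab the file is simply left in the working directory
    print('Not running in Colab; submission.csv saved locally.')
```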

### __Possible solution of the task__
- Note: The following code is largely based on the simpler solution alternative presented [here](https://www.kaggle.com/stefanbergstein/keras-deep-learning-on-titanic-data).

#### Preprocessing the training data

In [12]:
# the function dealing with all preprocessing steps

def preprocess_data(df):
    # drop unwanted features (stuff that likely is just noise with no
    # discriminative power w.r.t. surviving the sinking)
    df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
    
    # impute missing data: Age and Fare with the mean, Embarked with the most 
    # frequent value
    df[['Age']] = df[['Age']].fillna(value=df[['Age']].mean())
    df[['Fare']] = df[['Fare']].fillna(value=df[['Fare']].mean())
    df[['Embarked']] = df[['Embarked']].fillna(value=\
                                        df['Embarked'].value_counts().idxmax())
    
    # convert the categorical feature Sex into a binary/numeric one
    df['Sex'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
      
    # convert Embarked to one-hot (i.e., N binary yes/no features for one 
    # feature with N possible categorical values)
    embarked_one_hot = pd.get_dummies(df['Embarked'], prefix='Embarked')
    df = df.drop('Embarked', axis=1)
    df = df.join(embarked_one_hot)

    return df
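
The one-hot step is easiest to see on a tiny example. The frame below is made up for illustration. Two details are worth knowing: newer pandas versions return boolean columns unless `dtype=int` is passed, and a frame that lacks one of the categories produces fewer columns, so the train and test frames may need aligning (here with `reindex`, one possible approach).

```python
import pandas as pd

# hypothetical mini data set with the three ports of embarkation
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# dtype=int asks for 0/1 integers; newer pandas versions return booleans by
# default
one_hot = pd.get_dummies(df['Embarked'], prefix='Embarked', dtype=int)
print(one_hot)

# a frame missing one category yields fewer columns, so train and test
# frames may need aligning before they are fed to the same model
partial = pd.get_dummies(pd.Series(['S', 'C']), prefix='Embarked', dtype=int)
aligned = partial.reindex(columns=one_hot.columns, fill_value=0)
print(aligned)
```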

In [13]:
# preprocessing the training data using the function defined above

df_train = preprocess_data(df_train)
df_train.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
1,0,3,0,22.0,1,0,7.25,0,0,1
2,1,1,1,38.0,1,0,71.2833,1,0,0
3,1,3,1,26.0,0,0,7.925,0,0,1
4,1,1,1,35.0,1,0,53.1,0,0,1
5,0,3,0,35.0,0,0,8.05,0,0,1


#### Creating the feature and label data structures, standardising the features using scikit-learn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [14]:
# importing a standard scaler from scikit-learn to standardise the features
from sklearn.preprocessing import StandardScaler

# dropping the labels from the data frame, type conversion of feature vectors
X = df_train.drop(['Survived'], axis=1).values.astype(float)

# scaling each feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)

# labels are simply the Survived values
Y = df_train['Survived'].values
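
What the scaler does with its default settings can be reproduced in plain NumPy: subtract each column's mean and divide by its (population) standard deviation. The small matrix below is illustrative, loosely modelled on the Age and Fare columns.

```python
import numpy as np

# hypothetical feature matrix: two columns on very different scales
X_demo = np.array([[22.0, 7.25],
                   [38.0, 71.28],
                   [26.0, 7.93],
                   [35.0, 53.10]])

# standardisation: subtract the column mean, divide by the column standard
# deviation (population version, ddof=0)
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
X_scaled = (X_demo - mean) / std

# after scaling, each column has mean close to 0 and standard deviation
# close to 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

New data should be transformed with the mean and standard deviation computed on the training data, which is why the test features are later passed through `scaler.transform` rather than `fit_transform`.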

#### Creating a Keras model template

In [16]:
from keras.models import Sequential
from keras.layers import Dense

def create_model(optimizer='rmsprop', init='glorot_uniform'):
    # create model
    model = Sequential()
    # the first hidden layer, linked with input matching the size of the 
    # feature vectors (i.e., the number of columns in X - X.shape[1])
    model.add(Dense(16, input_dim=X.shape[1], kernel_initializer=init, \
                    activation='relu'))
    model.add(Dense(8, kernel_initializer=init, activation='relu'))
    model.add(Dense(4, kernel_initializer=init, activation='relu'))
    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, \
                  metrics=['accuracy'])
    return model

#### Training a specific model created by the above function

In [None]:
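# importing a utility classifier wrapper for Keras models
# (note: recent Keras/TensorFlow releases no longer ship this module; the
# separate SciKeras package offers a similar KerasClassifier)
from keras.wrappers.scikit_learn import KerasClassifier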
import tensorflow as tf

# early stopping callback (for stopping when the loss doesn't improve for a 
# while, even if we haven't gone through all epochs yet)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

# creating the classifier based on the above model definition
model = KerasClassifier(build_fn=create_model, \
                        epochs=200, \
                        batch_size=5)

# fitting the classifier
model.fit(X, Y, callbacks=[callback])
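
The idea behind the early stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the patience logic only, not the actual Keras implementation (which has further options such as `min_delta` and `restore_best_weights`); the function name and the loss values are made up.

```python
def stopping_epoch(losses, patience=3):
    # returns the (0-based) epoch after which training would stop, or None
    # if the run finishes without triggering the rule
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# made-up loss values: improvement stalls after the third epoch
losses = [0.70, 0.62, 0.58, 0.59, 0.58, 0.60, 0.57]
print(stopping_epoch(losses))   # 5
```

With `patience=3`, the run stops at epoch 5 and never reaches the slightly better loss at epoch 6, which is the usual trade-off between training time and squeezing out the last bit of improvement.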

#### Using the trained model on the test data

In [19]:
# preprocessing the test data (it needs the same columns and the same
# transformations as the training data)
df_test = preprocess_data(df_test)
# creating the X_test feature matrix
X_test = df_test.values.astype(float)
# scaling X_test with the scaler fitted on the training data
X_test = scaler.transform(X_test)

# predict the 'Survived' values
predictions = model.predict(X_test)
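
The classifier wrapper used above is expected to return class labels directly. If you work with a plain Keras model instead, `predict` on a sigmoid output gives probabilities, which need to be turned into 0/1 labels before submission. A minimal sketch with made-up probabilities and the common 0.5 threshold:

```python
import numpy as np

# hypothetical sigmoid outputs for four passengers, shape (4, 1)
probabilities = np.array([[0.91], [0.12], [0.50], [0.49]])

# survival is predicted when the probability exceeds 0.5
labels = (probabilities[:, 0] > 0.5).astype(int)
print(labels)   # [1 0 0 0]
```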

#### Preparing the Kaggle submission file
- Note: After submitting this one to Kaggle, the accuracy should be slightly over 76% - not too bad, but not too great either (more sophisticated data preprocessing and model hyper-parameter optimisation would take us way further).

In [None]:
# the pandas data frame with the results
submission = pd.DataFrame({
    'PassengerId': df_test.index,
    'Survived': predictions[:,0],
})

# storing the submissions as CSV
submission.sort_values('PassengerId', inplace=True)    
submission.to_csv('submission-naive.csv', index=False)

# downloading the created CSV file locally
from google.colab import files
files.download('submission-naive.csv')

---

#### _Final note_ - the materials used in this notebook are original works credited and licensed as follows:
- Image of Titanic:
 - Retrieved from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:St%C3%B6wer_Titanic.jpg)
 - Author: Willy Stöwer (image reproduction)
 - License: none (or [Public Domain](https://en.wikipedia.org/wiki/public_domain))