{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.1"},"colab":{"name":"PB016_lab12_G01.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"A8Cj_X1VnmC0"},"source":["# PB016: Artificial intelligence I, labs 12 - Deep learning\n","\n","Today's topic is a quick and dirty introduction to deep learning. We'll focus on:\n","1. __Dummy deep learning pipeline__\n","2. __Developing your own deep learning classifier__"]},{"cell_type":"markdown","metadata":{"id":"MjLOiihWnmC1"},"source":["---\n","\n","## 1. Dummy deep learning pipeline\n","\n","__Basic facts__\n","- Deep learning consists of designing, training and validating machine learning models based on various [neural architectures](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks) that typically involve multiple (hidden) layers consisting of many neural computing units (a simple example of one unit is the [perceptron](https://en.wikipedia.org/wiki/Perceptron), such as the one we implemented in the previous labs).\n","- A number of libraries that integrate seamlessly with parallel computational architectures are available for developing deep learning models. 
Some of the popular examples are:\n"," - [PyTorch](https://pytorch.org/) - originally based on Torch, a general-purpose ML library written in C and Lua, now a state-of-the-art deep learning framework with relatively easy-to-use Python (and C++) abstraction layers.\n"," - [TensorFlow](https://www.tensorflow.org/) - a general-purpose, highly optimised library for multilinear algebra and statistical learning.\n"," - [Keras](https://keras.io/) - formerly a separate project, now an abstraction layer for user-friendly development of deep learning models integrated with TensorFlow."]},{"cell_type":"markdown","metadata":{"id":"OtRA1s8qdz2e"},"source":["### The example task - predicting onset of diabetes using Keras\n","- This is based on the widely used PIMA Indians dataset - a classic machine learning sandbox dataset described in detail, for instance, [here](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe).\n","- The task is to use that dataset to train a classifier for predicting whether or not a person develops diabetes.\n","- The prediction is based on a number of characteristics (i.e., features) like blood pressure or body mass index."]},{"cell_type":"markdown","metadata":{"id":"WtWS3vDyd9pI"},"source":["#### Loading the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)"]},{"cell_type":"code","metadata":{"id":"DvmXGN9OWh3O","colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"status":"ok","timestamp":1638116593835,"user_tz":-60,"elapsed":1029,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"479e16d7-ca83-485c-8d51-4e34d749d29c"},"source":["# importing the library for handy CSV file processing\n","import pandas as pd\n","\n","# loading the data, in CSV format, from the web\n","dataframe = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/example/diabetes.csv')\n","\n","# checking the first few rows of 
the CSV\n","dataframe.head()"],"execution_count":1,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>Pregnancies</th>\n","      <th>Glucose</th>\n","      <th>BloodPressure</th>\n","      <th>SkinThickness</th>\n","      <th>Insulin</th>\n","      <th>BMI</th>\n","      <th>DiabetesPedigreeFunction</th>\n","      <th>Age</th>\n","      <th>Outcome</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>6</td>\n","      <td>148</td>\n","      <td>72</td>\n","      <td>35</td>\n","      <td>0</td>\n","      <td>33.6</td>\n","      <td>0.627</td>\n","      <td>50</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>1</td>\n","      <td>85</td>\n","      <td>66</td>\n","      <td>29</td>\n","      <td>0</td>\n","      <td>26.6</td>\n","      <td>0.351</td>\n","      <td>31</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>8</td>\n","      <td>183</td>\n","      <td>64</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>23.3</td>\n","      <td>0.672</td>\n","      <td>32</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>1</td>\n","      <td>89</td>\n","      <td>66</td>\n","      <td>23</td>\n","      <td>94</td>\n","      <td>28.1</td>\n","      <td>0.167</td>\n","      <td>21</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>0</td>\n","      <td>137</td>\n","      <td>40</td>\n","      <td>35</td>\n","      <td>168</td>\n","      
<td>43.1</td>\n","      <td>2.288</td>\n","      <td>33</td>\n","      <td>1</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["   Pregnancies  Glucose  BloodPressure  ...  DiabetesPedigreeFunction  Age  Outcome\n","0            6      148             72  ...                     0.627   50        1\n","1            1       85             66  ...                     0.351   31        0\n","2            8      183             64  ...                     0.672   32        1\n","3            1       89             66  ...                     0.167   21        0\n","4            0      137             40  ...                     2.288   33        1\n","\n","[5 rows x 9 columns]"]},"metadata":{},"execution_count":1}]},{"cell_type":"markdown","metadata":{"id":"XzQWHmkmg3BZ"},"source":["### Creating the features and labels data structures"]},{"cell_type":"code","metadata":{"id":"dYwVJCQng8gJ","executionInfo":{"status":"ok","timestamp":1638116782996,"user_tz":-60,"elapsed":300,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# getting just the Outcome column as the vector of labels\n","# - note that the column contains 0, 1 values that correspond to negative \n","#   (no diabetes developed) and positive (diabetes developed) example labels,\n","#   respectively\n","df_labels = dataframe.Outcome.values.astype(float)\n","# the features are the data minus the label vector\n","# - this contains the remaining features present in the data\n","df_features = dataframe.drop('Outcome',axis=1).values"],"execution_count":3,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"diBlgxl_hTCS"},"source":["### Splitting the data into train and test sets using [scikit-learn](https://sklearn.org/)"]},{"cell_type":"code","metadata":{"id":"kL8i6lBXhWfi","executionInfo":{"status":"ok","timestamp":1638116845656,"user_tz":-60,"elapsed":342,"user":{"displayName":"Vít 
Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing a convenience data splitting function from scikit-learn\n","\n","from sklearn.model_selection import train_test_split\n","\n","# computing a random 80-20 split (80% training data, 20% of remaining\n","# \"unseen\" data for testing the model trained on the 80%)\n","\n","x_train, x_test, y_train, y_test = train_test_split(df_features,df_labels,\\\n","                                                    test_size=0.2,\\\n","                                                    random_state=42)"],"execution_count":5,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"paAJQMT6iAX2"},"source":["### Creating a [Keras](https://keras.io/) model"]},{"cell_type":"code","metadata":{"id":"x3oWiAHwiC2y","executionInfo":{"status":"ok","timestamp":1638117108753,"user_tz":-60,"elapsed":2981,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing the basics from Keras\n","\n","from keras.models import Sequential\n","from keras.layers import Dense\n","\n","# designing the model\n","\n","# a model for simple sequential stacking of layers\n","model = Sequential()\n","# the first hidden layer, linked to the input layer matching the feature \n","# vectors of size 8 (number of features in the diabetes data)\n","model.add(Dense(100,input_dim=8,activation='sigmoid')) \n","# the second hidden layer\n","model.add(Dense(100,activation='sigmoid'))\n","# the output layer, for classification into 2 classes (does/doesn't develop\n","# diabetes)\n","model.add(Dense(2,activation='softmax'))\n","\n","# compiling the model\n","model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])"],"execution_count":6,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"nXoiiGpCjaD0"},"source":["### Training the created 
model"]},{"cell_type":"code","metadata":{"id":"R0k0oJzyjb22"},"source":["# simply calling the fit function, training on the training data and validating\n","# on the test data after each epoch\n","model.fit(x_train,y_train,epochs=15,\\\n","          validation_data=(x_test, y_test))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"X0pTXap1nBq5"},"source":["### Interpreting the results\n","- Not too great:\n"," - The loss is barely being optimised.\n"," - The accuracy is worse than a random baseline (0.5) in most runs.\n","- The reasons:\n"," - More or less boilerplate/default settings of the model.\n"," - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for instance the [blog post](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) referenced before, where a detailed exploratory analysis and input data transformation are carried out)."]},{"cell_type":"markdown","metadata":{"id":"lr1fmV_UnmDZ"},"source":["---\n","\n","## 2. Developing your own deep learning classifier\n","- Your task is to predict survivors of the Titanic disaster, as described in the [Kaggle](https://www.kaggle.com) challenge on [Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview).\n","\n","![titanic](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/titanic.jpg)\n","\n","- Split into groups (for instance, one row of seats makes one group).\n","- Each group will choose a coordinator. 
That person will be responsible for:\n"," - Outlining the overall solution and distributing specific sub-tasks among group members.\n"," - Integrating the results of the group's work in their shared notebook.\n"," - Presenting the results to the rest of the class.\n","- Each group will register at [Kaggle](https://www.kaggle.com/) so that you can officially participate in the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) competition (one account per group is enough).\n","- Then, use [Keras](https://keras.io/) to solve the Titanic survivors' prediction problem as follows:\n"," - Get the challenge [data](https://www.kaggle.com/c/titanic/data) via the URLs in the notebook below.\n"," - Design a simple neural model for classification of (non)survivors using Keras.\n"," - Train the model on the `train.csv` dataset (after possibly preprocessing the data).\n"," - Use the trained model to predict the labels of the set `test.csv` (i.e., the values of the column _\"Survived\"_; for more details, see the competition documentation itself).\n"," - [Upload results](https://www.kaggle.com/c/titanic/submit) to Kaggle.\n"," - Brag about your model and score to the rest of the class!\n"]},{"cell_type":"markdown","metadata":{"id":"lWIcu-WFrB-N"},"source":["### Loading the train and test data"]},{"cell_type":"code","metadata":{"id":"kzC3yqpHniPM","executionInfo":{"status":"ok","timestamp":1638117506631,"user_tz":-60,"elapsed":3020,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing pandas, just in case it wasn't imported before\n","import pandas as pd\n","\n","# loading the train and test data using pandas\n","\n","df_train = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/train.csv',\\\n","                       index_col='PassengerId')\n","df_test = 
pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/test.csv', \\\n","                      index_col='PassengerId')"],"execution_count":8,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"IA4hfwGR26oP"},"source":["### Checking out the train and test data contents"]},{"cell_type":"code","metadata":{"id":"agY80lGO2_Sa","colab":{"base_uri":"https://localhost:8080/","height":290},"executionInfo":{"status":"ok","timestamp":1638117528228,"user_tz":-60,"elapsed":369,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"8db27e87-9332-49c6-f8a0-e877f2c728c3"},"source":["df_train.head()"],"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>Survived</th>\n","      <th>Pclass</th>\n","      <th>Name</th>\n","      <th>Sex</th>\n","      <th>Age</th>\n","      <th>SibSp</th>\n","      <th>Parch</th>\n","      <th>Ticket</th>\n","      <th>Fare</th>\n","      <th>Cabin</th>\n","      <th>Embarked</th>\n","    </tr>\n","    <tr>\n","      <th>PassengerId</th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>1</th>\n","      <td>0</td>\n","      <td>3</td>\n","      <td>Braund, Mr. 
Owen Harris</td>\n","      <td>male</td>\n","      <td>22.0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>A/5 21171</td>\n","      <td>7.2500</td>\n","      <td>NaN</td>\n","      <td>S</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n","      <td>female</td>\n","      <td>38.0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>PC 17599</td>\n","      <td>71.2833</td>\n","      <td>C85</td>\n","      <td>C</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>1</td>\n","      <td>3</td>\n","      <td>Heikkinen, Miss. Laina</td>\n","      <td>female</td>\n","      <td>26.0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>STON/O2. 3101282</td>\n","      <td>7.9250</td>\n","      <td>NaN</td>\n","      <td>S</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n","      <td>female</td>\n","      <td>35.0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>113803</td>\n","      <td>53.1000</td>\n","      <td>C123</td>\n","      <td>S</td>\n","    </tr>\n","    <tr>\n","      <th>5</th>\n","      <td>0</td>\n","      <td>3</td>\n","      <td>Allen, Mr. William Henry</td>\n","      <td>male</td>\n","      <td>35.0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>373450</td>\n","      <td>8.0500</td>\n","      <td>NaN</td>\n","      <td>S</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["             Survived  Pclass  ... Cabin Embarked\n","PassengerId                    ...               \n","1                   0       3  ...   NaN        S\n","2                   1       1  ...   C85        C\n","3                   1       3  ...   NaN        S\n","4                   1       1  ...  C123        S\n","5                   0       3  ...   
NaN        S\n","\n","[5 rows x 11 columns]"]},"metadata":{},"execution_count":9}]},{"cell_type":"code","metadata":{"id":"3RAwtMw53LMa","colab":{"base_uri":"https://localhost:8080/","height":238},"executionInfo":{"status":"ok","timestamp":1638117531606,"user_tz":-60,"elapsed":323,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"cc9932f5-76a6-47f2-ec9c-2f1abbdc2d7b"},"source":["df_test.head()"],"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>Pclass</th>\n","      <th>Name</th>\n","      <th>Sex</th>\n","      <th>Age</th>\n","      <th>SibSp</th>\n","      <th>Parch</th>\n","      <th>Ticket</th>\n","      <th>Fare</th>\n","      <th>Cabin</th>\n","      <th>Embarked</th>\n","    </tr>\n","    <tr>\n","      <th>PassengerId</th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>892</th>\n","      <td>3</td>\n","      <td>Kelly, Mr. James</td>\n","      <td>male</td>\n","      <td>34.5</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>330911</td>\n","      <td>7.8292</td>\n","      <td>NaN</td>\n","      <td>Q</td>\n","    </tr>\n","    <tr>\n","      <th>893</th>\n","      <td>3</td>\n","      <td>Wilkes, Mrs. 
James (Ellen Needs)</td>\n","      <td>female</td>\n","      <td>47.0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>363272</td>\n","      <td>7.0000</td>\n","      <td>NaN</td>\n","      <td>S</td>\n","    </tr>\n","    <tr>\n","      <th>894</th>\n","      <td>2</td>\n","      <td>Myles, Mr. Thomas Francis</td>\n","      <td>male</td>\n","      <td>62.0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>240276</td>\n","      <td>9.6875</td>\n","      <td>NaN</td>\n","      <td>Q</td>\n","    </tr>\n","    <tr>\n","      <th>895</th>\n","      <td>3</td>\n","      <td>Wirz, Mr. Albert</td>\n","      <td>male</td>\n","      <td>27.0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>315154</td>\n","      <td>8.6625</td>\n","      <td>NaN</td>\n","      <td>S</td>\n","    </tr>\n","    <tr>\n","      <th>896</th>\n","      <td>3</td>\n","      <td>Hirvonen, Mrs. Alexander (Helga E Lindqvist)</td>\n","      <td>female</td>\n","      <td>22.0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>3101298</td>\n","      <td>12.2875</td>\n","      <td>NaN</td>\n","      <td>S</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["             Pclass  ... Embarked\n","PassengerId          ...         \n","892               3  ...        Q\n","893               3  ...        S\n","894               2  ...        Q\n","895               3  ...        S\n","896               3  ...        
S\n","\n","[5 rows x 10 columns]"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","metadata":{"id":"z7jqJu5SrGec"},"source":["### Developing the model itself"]},{"cell_type":"code","metadata":{"id":"jESxfpu-rIzx"},"source":["# TODO - YOUR MODEL GOES HERE"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"diP58FYa-TYD"},"source":["### Notes on the solution\n","- Feel free to look for inspiration on the web, but make sure you understand what you're doing when using someone else's code.\n","- A practical note on getting the submission file to be uploaded to Kaggle, if you're working in Google Colab:\n"," - You can create the CSV in your virtual environment, for instance using the `submission.to_csv('submission.csv', index=False)` command, assuming the `submission` variable is a _pandas_ data frame object.\n"," - Then you can simply download it by first importing the `files` module with the `from google.colab import files` line, and then calling `files.download('submission.csv')` to store the data on your local machine."]},{"cell_type":"markdown","metadata":{"id":"Xe_zOGCS0Pet"},"source":["### __Possible solution to the task__\n","- Note: The following code is largely based on the simpler solution alternative presented [here](https://www.kaggle.com/stefanbergstein/keras-deep-learning-on-titanic-data)."]},{"cell_type":"markdown","metadata":{"id":"Yz0SCO7T0UWy"},"source":["#### Preprocessing the training data"]},{"cell_type":"code","metadata":{"id":"8FO5HA0G0YEO","executionInfo":{"status":"ok","timestamp":1638117878555,"user_tz":-60,"elapsed":398,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# the function dealing with all preprocessing steps\n","\n","def preprocess_data(df):\n","    # drop unwanted features (stuff that likely is just noise with no\n","    # discriminative power w.r.t. 
surviving the sinking)\n","    df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n","    \n","    # impute missing data: Age and Fare with the mean, Embarked with the most \n","    # frequent value\n","    df[['Age']] = df[['Age']].fillna(value=df[['Age']].mean())\n","    df[['Fare']] = df[['Fare']].fillna(value=df[['Fare']].mean())\n","    df[['Embarked']] = df[['Embarked']].fillna(value=\\\n","                                        df['Embarked'].value_counts().idxmax())\n","    \n","    # convert the categorical feature Sex into a binary/numeric one\n","    df['Sex'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)\n","      \n","    # convert Embarked to one-hot (i.e., N binary yes/no features for one \n","    # feature with N possible categorical values)\n","    embarked_one_hot = pd.get_dummies(df['Embarked'], prefix='Embarked')\n","    df = df.drop('Embarked', axis=1)\n","    df = df.join(embarked_one_hot)\n","\n","    return df"],"execution_count":12,"outputs":[]},{"cell_type":"code","metadata":{"id":"DTlhJJyr1JKB","colab":{"base_uri":"https://localhost:8080/","height":238},"executionInfo":{"status":"ok","timestamp":1638117891251,"user_tz":-60,"elapsed":416,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"c2c346e3-d95e-4cb7-abfa-9358b316369c"},"source":["# preprocessing the training data using the function defined above\n","\n","df_train = preprocess_data(df_train)\n","df_train.head()"],"execution_count":13,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: 
right;\">\n","      <th></th>\n","      <th>Survived</th>\n","      <th>Pclass</th>\n","      <th>Sex</th>\n","      <th>Age</th>\n","      <th>SibSp</th>\n","      <th>Parch</th>\n","      <th>Fare</th>\n","      <th>Embarked_C</th>\n","      <th>Embarked_Q</th>\n","      <th>Embarked_S</th>\n","    </tr>\n","    <tr>\n","      <th>PassengerId</th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>1</th>\n","      <td>0</td>\n","      <td>3</td>\n","      <td>0</td>\n","      <td>22.0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>7.2500</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>38.0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>71.2833</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>1</td>\n","      <td>3</td>\n","      <td>1</td>\n","      <td>26.0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>7.9250</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>35.0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>53.1000</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>5</th>\n","      <td>0</td>\n","      <td>3</td>\n","      <td>0</td>\n","      <td>35.0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>8.0500</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["             
Survived  Pclass  Sex  ...  Embarked_C  Embarked_Q  Embarked_S\n","PassengerId                         ...                                    \n","1                   0       3    0  ...           0           0           1\n","2                   1       1    1  ...           1           0           0\n","3                   1       3    1  ...           0           0           1\n","4                   1       1    1  ...           0           0           1\n","5                   0       3    0  ...           0           0           1\n","\n","[5 rows x 10 columns]"]},"metadata":{},"execution_count":13}]},{"cell_type":"markdown","metadata":{"id":"HA1KNvG11zpB"},"source":["#### Creating the feature and label data structures, standardising the features using scikit-learn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)"]},{"cell_type":"code","metadata":{"id":"2jstye7y15Hx","executionInfo":{"status":"ok","timestamp":1638118019474,"user_tz":-60,"elapsed":273,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing a standard scaler from scikit-learn to standardise the features\n","from sklearn.preprocessing import StandardScaler\n","\n","# dropping the labels from the data frame, type conversion of feature vectors\n","X = df_train.drop(['Survived'], axis=1).values.astype(float)\n","\n","# scaling each feature to zero mean and unit variance so that all features\n","# are on a comparable scale\n","scaler = StandardScaler()\n","X = scaler.fit_transform(X)\n","\n","# labels are simply the Survived values\n","Y = df_train['Survived'].values"],"execution_count":14,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Nio4fWP23zDo"},"source":["#### Creating a Keras model template"]},{"cell_type":"code","metadata":{"id":"1uNv531131_A","executionInfo":{"status":"ok","timestamp":1638118207135,"user_tz":-60,"elapsed":321,"user":{"displayName":"Vít 
Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["from keras.models import Sequential\n","from keras.layers import Dense\n","\n","def create_model(optimizer='rmsprop', init='glorot_uniform'):\n","    # create model\n","    model = Sequential()\n","    # the first hidden layer, linked with input matching the size of the \n","    # feature vectors (i.e., the number of columns in X - X.shape[1])\n","    model.add(Dense(16, input_dim=X.shape[1], kernel_initializer=init, \\\n","                    activation='relu'))\n","    model.add(Dense(8, kernel_initializer=init, activation='relu'))\n","    model.add(Dense(4, kernel_initializer=init, activation='relu'))\n","    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))\n","    # compile model\n","    model.compile(loss='binary_crossentropy', optimizer=optimizer, \\\n","                  metrics=['accuracy'])\n","    return model"],"execution_count":16,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"L1WOnLiH42Y1"},"source":["#### Training a specific model created by the above function"]},{"cell_type":"code","metadata":{"id":"hvNTq9rc5Fsp"},"source":["# importing a utility classifier wrapper for Keras models\n","from keras.wrappers.scikit_learn import KerasClassifier\n","import tensorflow as tf\n","\n","# early stopping callback (for stopping when the loss doesn't improve for a \n","# while, even if we haven't gone through all epochs yet)\n","callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)\n","\n","# creating the classifier based on the above model definition\n","model = KerasClassifier(build_fn=create_model, \\\n","                        epochs=200, \\\n","                        batch_size=5)\n","\n","# fitting the classifier\n","model.fit(X, Y, callbacks=[callback])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"cUF9WXHL7t9a"},"source":["#### Using the trained model on 
the test data"]},{"cell_type":"code","metadata":{"id":"H-r5xICx7wiS","executionInfo":{"status":"ok","timestamp":1638118378883,"user_tz":-60,"elapsed":20823,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# preprocessing the test data (needs to have the same shape, distribution,\n","# etc., as the training data)\n","df_test = preprocess_data(df_test)\n","# creating the X_test feature matrix\n","X_test = df_test.values.astype(float)\n","# scaling X_test with the scaler fitted on the training data\n","X_test = scaler.transform(X_test)\n","\n","# predict the 'Survived' values\n","predictions = model.predict(X_test)"],"execution_count":19,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"brEAxGaK8LfW"},"source":["#### Preparing the Kaggle submission file\n","- Note: After submitting this one to Kaggle, the accuracy should be slightly over 76% - not too bad, but not too great either (more sophisticated data preprocessing and model hyper-parameter optimisation would take us much further)."]},{"cell_type":"code","metadata":{"id":"UM4oXPXI8O0l"},"source":["# the pandas data frame with the results\n","submission = pd.DataFrame({\n","    'PassengerId': df_test.index,\n","    'Survived': predictions[:,0],\n","})\n","\n","# storing the submission as CSV\n","submission.sort_values('PassengerId', inplace=True)\n","submission.to_csv('submission-naive.csv', index=False)\n","\n","# downloading the created CSV file locally\n","from google.colab import files\n","files.download('submission-naive.csv')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"collapsed":true,"id":"yLjsBHRLnmDa"},"source":["---\n","\n","#### _Final note_ - the materials used in this notebook are original works credited and licensed as follows:\n","- Image of the Titanic:\n"," - Retrieved from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:St%C3%B6wer_Titanic.jpg)\n"," - 
Author: Willy Stöwer (image reproduction)\n"," - License: none (or [Public Domain](https://en.wikipedia.org/wiki/public_domain))"]}]}