DiabetesPedigreeFunction Age Outcome DiabetesPedigreeFunction Age Outcome\n","0 6 148 72 ... 0.627 50 1\n","1 1 85 66 ... 0.351 31 0\n","2 8 183 64 ... 0.672 32 1\n","3 1 89 66 ... 0.167 21 0\n","4 0 137 40 ... 2.288 33 1\n","\n","[5 rows x 9 columns]"]},"metadata":{},"execution_count":1}]},{"cell_type":"markdown","metadata":{"id":"XzQWHmkmg3BZ"},"source":["### Creating the features and labels data structures"]},{"cell_type":"code","metadata":{"id":"dYwVJCQng8gJ","executionInfo":{"status":"ok","timestamp":1638116782996,"user_tz":-60,"elapsed":300,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# getting just the Outcome column as the vector of labels\n","# - note that the column contains 0, 1 values that correspond to negative \n","# (no diabetes developed) and positive (diabetes developed) example labels,\n","# respectively\n","df_labels = dataframe.Outcome.values.astype(float)\n","# the features are the data minus the label vector\n","# - this contains the remaining features present in the data\n","df_features = dataframe.drop('Outcome',axis=1).values"],"execution_count":3,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"diBlgxl_hTCS"},"source":["### Splitting the data into train and test sets using [scikit-learn](https://sklearn.org/)"]},{"cell_type":"code","metadata":{"id":"kL8i6lBXhWfi","executionInfo":{"status":"ok","timestamp":1638116845656,"user_tz":-60,"elapsed":342,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing a convenience data splitting function from scikit-learn\n","\n","from sklearn.model_selection import train_test_split\n","\n","# computing a random 80-20 split (80% training data, 20% of remaining\n","# \"unseen\" data for testing the model trained on the 80%)\n","\n","x_train, x_test, y_train, y_test = 
train_test_split(df_features,df_labels,\\\n"," test_size=0.2,\\\n"," random_state=42)"],"execution_count":5,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"paAJQMT6iAX2"},"source":["### Creating a [Keras](https://keras.io/) model"]},{"cell_type":"code","metadata":{"id":"x3oWiAHwiC2y","executionInfo":{"status":"ok","timestamp":1638117108753,"user_tz":-60,"elapsed":2981,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing the basics from Keras\n","\n","from keras.models import Sequential\n","from keras.layers import Dense\n","\n","# designing the model\n","\n","# a model for simple sequential stacking of layers\n","model = Sequential()\n","# the first hidden layer, linked to the input layer matching the feature \n","# vectors of size 8 (number of features in the diabetes data)\n","model.add(Dense(100,input_dim=8,activation='sigmoid')) \n","# the second hidden layer\n","model.add(Dense(100,activation='sigmoid'))\n","# the output layer, for classification into 2 classes (does/doesn't develop\n","# diabetes)\n","model.add(Dense(2,activation='softmax'))\n","\n","# compiling the model\n","model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])"],"execution_count":6,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"nXoiiGpCjaD0"},"source":["### Training the created model"]},{"cell_type":"code","metadata":{"id":"R0k0oJzyjb22"},"source":["# simply calling the fit function, training on the training data and validating\n","# on the test data after each epoch\n","model.fit(x_train,y_train,epochs=15,\\\n"," validation_data=(x_test, y_test))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"X0pTXap1nBq5"},"source":["### Interpreting the results\n","- Not too great:\n"," - The loss is barely being optimised.\n"," - The accuracy is worse than a random baseline (0.5) in most runs.\n","- The reasons:\n"," - More or 
less boilerplate/default settings of the model.\n"," - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for instance the [blog post](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) referenced before, where a detailed exploratory analysis and input data transformation is carried out)."]},{"cell_type":"markdown","metadata":{"id":"lr1fmV_UnmDZ"},"source":["---\n","\n","## 2. Developing your own deep learning classifier\n","- Your task is to predict survivors of the Titanic disaster, as described in the [Kaggle](https://www.kaggle.com) challenge on [Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview).\n","\n","![titanic](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/titanic.jpg)\n","\n","- Split into groups (for instance, one row of seats makes one group).\n","- Each group will choose a coordinator. That person will be responsible for\n"," - Outlining the overall solution and distributing specific sub-tasks among group members.\n"," - Integrating the results of the group's work in their shared notebook.\n"," - Subsequent presentation of the results to the rest of the class.\n","- Each group will register at [Kaggle](https://www.kaggle.com/) so that you can officially participate in the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) competition (one account per group is enough).\n","- Then, use [Keras](https://keras.io/) to solve the Titanic survivors' prediction problem as follows:\n"," - Get the challenge [data](https://www.kaggle.com/c/titanic/data) via the URLs in the notebook below.\n"," - Design a simple neural model for classification of (non)survivors using Keras.\n"," - Train the model on the `train.csv` dataset (after possibly preprocessing the data).\n"," - Use the trained model to predict the labels of the set `test.csv` (i.e., the values of the column _\"survived\"_; for more details, see the competition 
documentation itself).\n"," - [Upload results](https://www.kaggle.com/c/titanic/submit) on Kaggle.\n"," - Brag about your model and score to the rest of the class!\n"]},{"cell_type":"markdown","metadata":{"id":"lWIcu-WFrB-N"},"source":["### Loading the train and test data"]},{"cell_type":"code","metadata":{"id":"kzC3yqpHniPM","executionInfo":{"status":"ok","timestamp":1638117506631,"user_tz":-60,"elapsed":3020,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing pandas, just in case it wasn't imported before\n","import pandas as pd\n","\n","# loading the train and test data using pandas\n","\n","df_train = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/train.csv',\\\n"," index_col='PassengerId')\n","df_test = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/test.csv', \\\n"," index_col='PassengerId')"],"execution_count":8,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"IA4hfwGR26oP"},"source":["### Checking out the train and test data contents"]},{"cell_type":"code","metadata":{"id":"agY80lGO2_Sa","colab":{"base_uri":"https://localhost:8080/","height":290},"executionInfo":{"status":"ok","timestamp":1638117528228,"user_tz":-60,"elapsed":369,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"8db27e87-9332-49c6-f8a0-e877f2c728c3"},"source":["df_train.head()"],"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/html":["\n","\n","
             Survived  Pclass  ...  Cabin  Embarked
PassengerId                    ...
1                   0       3  ...    NaN         S
2                   1       1  ...    C85         C
3                   1       3  ...    NaN         S
4                   1       1  ...   C123         S
5                   0       3  ...    NaN         S

[5 rows x 11 columns]
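
The `NaN` entries in the `Cabin` column above suggest that some columns are incomplete. Below is a minimal sketch of how such gaps could be counted before deciding on any preprocessing. The tiny inline frame is made-up stand-in data (not the real `train.csv`), and `missing_per_column` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# made-up stand-in rows mimicking a few train.csv columns (assumption: the
# real data would already be loaded into df_train as in the cell above)
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0],
        "Age": [22.0, 38.0, 26.0, np.nan, 35.0],
        "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)


def missing_per_column(df):
    """Hypothetical helper: number of missing values in each column."""
    return df.isna().sum()


missing = missing_per_column(toy)
print(missing)
```

Calling the same helper on `df_train` and `df_test` would show which columns need imputation or could be dropped.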
             Pclass  ...  Embarked
PassengerId          ...
892               3  ...         Q
893               3  ...         S
894               2  ...         Q
895               3  ...         S
896               3  ...         S

[5 rows x 10 columns]
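
Both frames still contain text columns and missing values, so they cannot be passed to a Keras model as they are. The following is only a sketch of one possible cleaning step, not the notebook's own preprocessing: the name `preprocess_sketch`, the choice of dropped columns, the median and most-frequent-value imputation, and the toy rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def preprocess_sketch(df):
    """Hypothetical cleaning step: returns a numeric frame without gaps."""
    df = df.copy()
    # drop free-text columns that are hard to use directly (assumption)
    df = df.drop(columns=["Name", "Ticket", "Cabin"], errors="ignore")
    # encode the sex as 0 (male) / 1 (female)
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    # fill missing numeric values with the column median
    for col in ["Age", "Fare"]:
        df[col] = df[col].fillna(df[col].median())
    # fill missing ports with the most frequent one
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # one-hot encode with a fixed category list, so that train and test
    # frames end up with identical columns even if a port is absent
    ports = pd.CategoricalDtype(categories=["C", "Q", "S"])
    dummies = pd.get_dummies(df["Embarked"].astype(ports), prefix="Embarked")
    return pd.concat([df.drop(columns=["Embarked"]), dummies.astype(int)], axis=1)


# made-up rows standing in for the real Titanic data
toy = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1],
        "Name": ["A", "B", "C", "D"],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.1],
        "Cabin": [np.nan, "C85", np.nan, "C123"],
        "Embarked": ["S", "C", np.nan, "S"],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

clean = preprocess_sketch(toy)
print(clean)
```

Fixing the category list is a deliberate choice here: without it, a frame lacking one of the ports would produce fewer one-hot columns, and the feature matrices for training and prediction would no longer line up. Note that this sketch imputes each frame from its own statistics, which is a simplification.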
Embarked_C Embarked_Q Embarked_S Embarked_C Embarked_Q Embarked_S\n","PassengerId ... \n","1 0 3 0 ... 0 0 1\n","2 1 1 1 ... 1 0 0\n","3 1 3 1 ... 0 0 1\n","4 1 1 1 ... 0 0 1\n","5 0 3 0 ... 0 0 1\n","\n","[5 rows x 10 columns]"]},"metadata":{},"execution_count":13}]},{"cell_type":"markdown","metadata":{"id":"HA1KNvG11zpB"},"source":["#### Creating the feature and label data structures, smoothing the features using scikit-learn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)"]},{"cell_type":"code","metadata":{"id":"2jstye7y15Hx","executionInfo":{"status":"ok","timestamp":1638118019474,"user_tz":-60,"elapsed":273,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing a standard scaler from scikit-learn to smooth out the features\n","from sklearn.preprocessing import StandardScaler\n","\n","# dropping the labels from the data frame, type conversion of feature vectors\n","X = df_train.drop(['Survived'], axis=1).values.astype(float)\n","\n","# scaling the values of the features to make them more uniform\n","scaler = StandardScaler()\n","X = scaler.fit_transform(X)\n","\n","# labels are simply the Survived values\n","Y = df_train['Survived'].values"],"execution_count":14,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Nio4fWP23zDo"},"source":["#### Creating a Keras model template"]},{"cell_type":"code","metadata":{"id":"1uNv531131_A","executionInfo":{"status":"ok","timestamp":1638118207135,"user_tz":-60,"elapsed":321,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["from keras.models import Sequential\n","from keras.layers import Dense\n","\n","def create_model(optimizer='rmsprop', init='glorot_uniform'):\n"," # create model\n"," model = Sequential()\n"," # the first hidden layer, linked with input 
matching the size of the \n"," # feature vectors (i.e., the number of columns in X - X.shape[1])\n"," model.add(Dense(16, input_dim=X.shape[1], kernel_initializer=init, \\\n"," activation='relu'))\n"," model.add(Dense(8, kernel_initializer=init, activation='relu'))\n"," model.add(Dense(4, kernel_initializer=init, activation='relu'))\n"," model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))\n"," # compile model\n"," model.compile(loss='binary_crossentropy', optimizer=optimizer, \\\n"," metrics=['accuracy'])\n"," return model"],"execution_count":16,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"L1WOnLiH42Y1"},"source":["#### Training a specific model created by the above function"]},{"cell_type":"code","metadata":{"id":"hvNTq9rc5Fsp"},"source":["# importing a utility classifier wrapper for Keras models\n","from keras.wrappers.scikit_learn import KerasClassifier\n","import tensorflow as tf\n","\n","# early stopping callback (for stopping when the loss doesn't improve for a \n","# while, even if we haven't gone through all epochs yet)\n","callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)\n","\n","# creating the classifier based on the above model definition\n","model = KerasClassifier(build_fn=create_model, \\\n"," epochs=200, \\\n"," batch_size=5)\n","\n","# fitting the classifier\n","model.fit(X, Y, callbacks=[callback])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"cUF9WXHL7t9a"},"source":["#### Using the trained model on the test data"]},{"cell_type":"code","metadata":{"id":"H-r5xICx7wiS","executionInfo":{"status":"ok","timestamp":1638118378883,"user_tz":-60,"elapsed":20823,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# preprocessing the test data (needs to have the same shape, distribution, etc.,\n","# as the training data)\n","df_test = preprocess_data(df_test)\n","# creating the X_test 
feature matrix\n","X_test = df_test.values.astype(float)\n","# scaling X_test with the scaler fitted on the training data\n","X_test = scaler.transform(X_test)\n","\n","# predict the 'Survived' values\n","predictions = model.predict(X_test)"],"execution_count":19,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"brEAxGaK8LfW"},"source":["#### Preparing the Kaggle submission file\n","- Note: After submitting this one to Kaggle, the accuracy should be slightly over 76% - not too bad, but not too great either (more sophisticated data preprocessing and model hyper-parameter optimisation would take us much further)."]},{"cell_type":"code","metadata":{"id":"UM4oXPXI8O0l"},"source":["# the pandas data frame with the results\n","submission = pd.DataFrame({\n"," 'PassengerId': df_test.index,\n"," 'Survived': predictions[:,0],\n","})\n","\n","# storing the submissions as CSV\n","submission.sort_values('PassengerId', inplace=True) \n","submission.to_csv('submission-naive.csv', index=False)\n","\n","# downloading the created CSV file locally\n","from google.colab import files\n","files.download('submission-naive.csv')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"collapsed":true,"id":"yLjsBHRLnmDa"},"source":["---\n","\n","#### _Final note_ - the materials used in this notebook are original works credited and licensed as follows:\n","- Image of Titanic:\n"," - Retrieved from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:St%C3%B6wer_Titanic.jpg)\n"," - Author: Willy Stöwer (image reproduction)\n"," - License: none (or [Public Domain](https://en.wikipedia.org/wiki/public_domain))"]}]}