{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.1"},"colab":{"name":"PB016_lab12_G01.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"A8Cj_X1VnmC0"},"source":["# PB016: Artificial intelligence I, labs 12 - Deep learning\n","\n","Today's topic is a quick and dirty introduction to deep learning. We'll focus on two things:\n","1. __Dummy deep learning pipeline__\n","2. __Developing your own deep learning classifier__"]},{"cell_type":"markdown","metadata":{"id":"MjLOiihWnmC1"},"source":["---\n","\n","## 1. Dummy deep learning pipeline\n","\n","__Basic facts__\n","- Deep learning consists of designing, training and validating machine learning models based on various [neural architectures](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks) that typically involve multiple (hidden) layers consisting of many neural computing units (a simple example of such a unit is the [perceptron](https://en.wikipedia.org/wiki/Perceptron), like the one we implemented in the previous labs).\n","- A number of libraries that integrate seamlessly with parallel computing architectures are available for developing deep learning models. 
Some of the popular examples are:\n"," - [PyTorch](https://pytorch.org/) - a successor to Torch (a general-purpose ML library written in C and Lua), now a state-of-the-art deep learning framework with relatively easy-to-use Python (and C++) abstraction layers.\n"," - [TensorFlow](https://www.tensorflow.org/) - a general-purpose, highly optimised library for multilinear algebra and statistical learning.\n"," - [Keras](https://keras.io/) - formerly a separate project, now an abstraction layer for user-friendly development of deep learning models integrated with TensorFlow."]},{"cell_type":"markdown","metadata":{"id":"OtRA1s8qdz2e"},"source":["### The example task - predicting the onset of diabetes using Keras\n","- The example is based on the widely used PIMA Indians dataset - a classic machine learning sandbox dataset, described in detail for instance [here](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe).\n","- The task is to use that dataset to train a classifier for predicting whether or not a person develops diabetes.\n","- The prediction is based on a number of characteristics (i.e., features) like blood pressure or body mass index."]},{"cell_type":"markdown","metadata":{"id":"WtWS3vDyd9pI"},"source":["#### Loading the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)"]},{"cell_type":"code","metadata":{"id":"DvmXGN9OWh3O","colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"status":"ok","timestamp":1638116593835,"user_tz":-60,"elapsed":1029,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"479e16d7-ca83-485c-8d51-4e34d749d29c"},"source":["# importing the library for handy CSV file processing\n","import pandas as pd\n","\n","# loading the data, in CSV format, from the web\n","dataframe = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/example/diabetes.csv')\n","\n","# checking the first few rows of 
the CSV\n","dataframe.head()"],"execution_count":1,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<table>\n",
"<thead>\n",
"<tr><th></th><th>Pregnancies</th><th>Glucose</th><th>BloodPressure</th><th>SkinThickness</th><th>Insulin</th><th>BMI</th><th>DiabetesPedigreeFunction</th><th>Age</th><th>Outcome</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>\n",
"<tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>\n",
"<tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>\n",
"<tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>\n",
"<tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>
"],"text/plain":[" Pregnancies Glucose BloodPressure ... DiabetesPedigreeFunction Age Outcome\n","0 6 148 72 ... 0.627 50 1\n","1 1 85 66 ... 0.351 31 0\n","2 8 183 64 ... 0.672 32 1\n","3 1 89 66 ... 0.167 21 0\n","4 0 137 40 ... 2.288 33 1\n","\n","[5 rows x 9 columns]"]},"metadata":{},"execution_count":1}]},{"cell_type":"markdown","metadata":{"id":"XzQWHmkmg3BZ"},"source":["### Creating the features and labels data structures"]},{"cell_type":"code","metadata":{"id":"dYwVJCQng8gJ","executionInfo":{"status":"ok","timestamp":1638116782996,"user_tz":-60,"elapsed":300,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# getting just the Outcome column as the vector of labels\n","# - note that the column contains 0, 1 values that correspond to negaive \n","# (no diabetes developed) and positive (diabetes developed) example labels,\n","# respectively\n","df_labels = dataframe.Outcome.values.astype(float)\n","# the features are the data minus the label vector\n","# - this contains the remaining features present in the data\n","df_features = dataframe.drop('Outcome',axis=1).values"],"execution_count":3,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"diBlgxl_hTCS"},"source":["### Splitting the data into train and test sets using [scikit-learn](https://sklearn.org/)"]},{"cell_type":"code","metadata":{"id":"kL8i6lBXhWfi","executionInfo":{"status":"ok","timestamp":1638116845656,"user_tz":-60,"elapsed":342,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing a convenience data splitting function from scikit-learn\n","\n","from sklearn.model_selection import train_test_split\n","\n","# computing a random 80-20 split (80% training data, 20% of remaining\n","# \"unseen\" data for testing the model trained on the 80%)\n","\n","x_train, x_test, y_train, y_test = 
train_test_split(df_features,df_labels,\\\n"," test_size=0.2,\\\n"," random_state=42)"],"execution_count":5,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"paAJQMT6iAX2"},"source":["### Creating a [Keras](https://keras.io/) model"]},{"cell_type":"code","metadata":{"id":"x3oWiAHwiC2y","executionInfo":{"status":"ok","timestamp":1638117108753,"user_tz":-60,"elapsed":2981,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing the basics from Keras\n","\n","from keras.models import Sequential\n","from keras.layers import Dense\n","\n","# designing the model\n","\n","# a model for simple sequential stacking of layers\n","model = Sequential()\n","# the first hidden layer, linked to the input layer matching the feature \n","# vectors of size 8 (number of features in the diabetes data)\n","model.add(Dense(100,input_dim=8,activation='sigmoid')) \n","# the second hidden layer\n","model.add(Dense(100,activation='sigmoid'))\n","# the output layer, for classification into 2 classes (does/doesn't develop\n","# diabetes)\n","model.add(Dense(2,activation='softmax'))\n","\n","# compiling the model\n","model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])"],"execution_count":6,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"nXoiiGpCjaD0"},"source":["### Training the created model"]},{"cell_type":"code","metadata":{"id":"R0k0oJzyjb22"},"source":["# simply calling the fit function, training on the training data and validating\n","# on the test data after each epoch\n","model.fit(x_train,y_train,epochs=15,\\\n"," validation_data=(x_test, y_test))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"X0pTXap1nBq5"},"source":["### Interpreting the results\n","- Not too great:\n"," - The loss is barely being optimised.\n"," - The accuracy is worse than a random baseline (0.5) in most runs.\n","- The reasons:\n"," - More or 
less boilerplate/default settings of the model.\n"," - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for instance the [blog post](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) referenced before, where a detailed exploratory analysis and input data transformation are carried out)."]},{"cell_type":"markdown","metadata":{"id":"lr1fmV_UnmDZ"},"source":["---\n","\n","## 2. Developing your own deep learning classifier\n","- Your task is to predict survivors of the Titanic disaster, as described in the [Kaggle](https://www.kaggle.com) challenge on [Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview).\n","\n","![titanic](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/titanic.jpg)\n","\n","- Split into groups (for instance, one row of seats makes one group).\n","- Each group will choose a coordinator. That person will be responsible for:\n"," - Outlining the overall solution and distributing specific sub-tasks among group members.\n"," - Integrating the results of the group's work in their shared notebook.\n"," - Presenting the results to the rest of the class.\n","- Each group will register at [Kaggle](https://www.kaggle.com/) so that it can officially participate in the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) competition (one account per group is enough).\n","- Then, use [Keras](https://keras.io/) to solve the Titanic survivors' prediction problem as follows:\n"," - Get the challenge [data](https://www.kaggle.com/c/titanic/data) via the URLs in the notebook below.\n"," - Design a simple neural model for classification of (non)survivors using Keras.\n"," - Train the model on the `train.csv` dataset (after possibly preprocessing the data).\n"," - Use the trained model to predict the labels for the `test.csv` set (i.e., the values of the _\"Survived\"_ column; for more details, see the competition 
documentation itself).\n"," - [Upload the results](https://www.kaggle.com/c/titanic/submit) to Kaggle.\n"," - Brag about your model and score to the rest of the class!\n"]},{"cell_type":"markdown","metadata":{"id":"lWIcu-WFrB-N"},"source":["### Loading the train and test data"]},{"cell_type":"code","metadata":{"id":"kzC3yqpHniPM","executionInfo":{"status":"ok","timestamp":1638117506631,"user_tz":-60,"elapsed":3020,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing pandas, just in case it wasn't imported before\n","import pandas as pd\n","\n","# loading the train and test data using pandas\n","\n","df_train = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/train.csv',\\\n"," index_col='PassengerId')\n","df_test = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/test.csv', \\\n"," index_col='PassengerId')"],"execution_count":8,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"IA4hfwGR26oP"},"source":["### Checking out the train and test data contents"]},{"cell_type":"code","metadata":{"id":"agY80lGO2_Sa","colab":{"base_uri":"https://localhost:8080/","height":290},"executionInfo":{"status":"ok","timestamp":1638117528228,"user_tz":-60,"elapsed":369,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"8db27e87-9332-49c6-f8a0-e877f2c728c3"},"source":["df_train.head()"],"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/html":["\n","\n","
<table>\n",
"<thead>\n",
"<tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"<tr><th>PassengerId</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1</th><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>2</th><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"<tr><th>3</th><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>4</th><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"<tr><th>5</th><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"</tbody>\n",
"</table>
"],"text/plain":[" Survived Pclass ... Cabin Embarked\n","PassengerId ... \n","1 0 3 ... NaN S\n","2 1 1 ... C85 C\n","3 1 3 ... NaN S\n","4 1 1 ... C123 S\n","5 0 3 ... NaN S\n","\n","[5 rows x 11 columns]"]},"metadata":{},"execution_count":9}]},{"cell_type":"code","metadata":{"id":"3RAwtMw53LMa","colab":{"base_uri":"https://localhost:8080/","height":238},"executionInfo":{"status":"ok","timestamp":1638117531606,"user_tz":-60,"elapsed":323,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"cc9932f5-76a6-47f2-ec9c-2f1abbdc2d7b"},"source":["df_test.head()"],"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/html":["\n","\n","
<table>\n",
"<thead>\n",
"<tr><th></th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"<tr><th>PassengerId</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>892</th><td>3</td><td>Kelly, Mr. James</td><td>male</td><td>34.5</td><td>0</td><td>0</td><td>330911</td><td>7.8292</td><td>NaN</td><td>Q</td></tr>\n",
"<tr><th>893</th><td>3</td><td>Wilkes, Mrs. James (Ellen Needs)</td><td>female</td><td>47.0</td><td>1</td><td>0</td><td>363272</td><td>7.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>894</th><td>2</td><td>Myles, Mr. Thomas Francis</td><td>male</td><td>62.0</td><td>0</td><td>0</td><td>240276</td><td>9.6875</td><td>NaN</td><td>Q</td></tr>\n",
"<tr><th>895</th><td>3</td><td>Wirz, Mr. Albert</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>315154</td><td>8.6625</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>896</th><td>3</td><td>Hirvonen, Mrs. Alexander (Helga E Lindqvist)</td><td>female</td><td>22.0</td><td>1</td><td>1</td><td>3101298</td><td>12.2875</td><td>NaN</td><td>S</td></tr>\n",
"</tbody>\n",
"</table>
"],"text/plain":[" Pclass ... Embarked\n","PassengerId ... \n","892 3 ... Q\n","893 3 ... S\n","894 2 ... Q\n","895 3 ... S\n","896 3 ... S\n","\n","[5 rows x 10 columns]"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","metadata":{"id":"z7jqJu5SrGec"},"source":["### Developing the model itself"]},{"cell_type":"code","metadata":{"id":"jESxfpu-rIzx"},"source":["# TODO - YOUR MODEL GOES HERE"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"diP58FYa-TYD"},"source":["### Notes on the solution\n","- Feel free to get inspired on the web, but make sure you understand what you're doing when using someone else's code.\n","- A practical note on getting the submission file to be uploaded to Kaggle, if you're working in Google Colab:\n"," - You can create the CSV in your virtual environment, for instance using the `submission.to_csv('submission.csv', index=False)` command, assuming the `submission` variable is a _pandas_ data frame object.\n"," - Then you can simply download it by first importing the `files` module by the `from google.colab import files` line, and then using the module with the `files.download('submission.csv')` line to store the data on your local machine."]},{"cell_type":"markdown","metadata":{"id":"Xe_zOGCS0Pet"},"source":["### __Possible solution of the task__\n","- Note: The following code is largely based on the simpler solution alternative presented [here](https://www.kaggle.com/stefanbergstein/keras-deep-learning-on-titanic-data)."]},{"cell_type":"markdown","metadata":{"id":"Yz0SCO7T0UWy"},"source":["#### Preprocessing the training data"]},{"cell_type":"code","metadata":{"id":"8FO5HA0G0YEO","executionInfo":{"status":"ok","timestamp":1638117878555,"user_tz":-60,"elapsed":398,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# the function dealing with all preprocessing steps\n","\n","def 
preprocess_data(df):\n","    # drop unwanted features (those that are likely just noise with no\n","    # discriminative power w.r.t. surviving the sinking)\n","    df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n","\n","    # impute missing data: Age and Fare with the mean, Embarked with the most\n","    # frequent value\n","    df[['Age']] = df[['Age']].fillna(value=df[['Age']].mean())\n","    df[['Fare']] = df[['Fare']].fillna(value=df[['Fare']].mean())\n","    df[['Embarked']] = df[['Embarked']].fillna(value=\\\n","        df['Embarked'].value_counts().idxmax())\n","\n","    # convert the categorical feature Sex into a binary/numeric one\n","    df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)\n","\n","    # convert Embarked to one-hot (i.e., N binary yes/no features for one\n","    # feature with N possible categorical values)\n","    embarked_one_hot = pd.get_dummies(df['Embarked'], prefix='Embarked')\n","    df = df.drop('Embarked', axis=1)\n","    df = df.join(embarked_one_hot)\n","\n","    return df"],"execution_count":12,"outputs":[]},{"cell_type":"code","metadata":{"id":"DTlhJJyr1JKB","colab":{"base_uri":"https://localhost:8080/","height":238},"executionInfo":{"status":"ok","timestamp":1638117891251,"user_tz":-60,"elapsed":416,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}},"outputId":"c2c346e3-d95e-4cb7-abfa-9358b316369c"},"source":["# preprocessing the training data using the function defined above\n","\n","df_train = preprocess_data(df_train)\n","df_train.head()"],"execution_count":13,"outputs":[{"output_type":"execute_result","data":{"text/html":["\n","\n","
<table>\n",
"<thead>\n",
"<tr><th></th><th>Survived</th><th>Pclass</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th><th>Embarked_C</th><th>Embarked_Q</th><th>Embarked_S</th></tr>\n",
"<tr><th>PassengerId</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1</th><td>0</td><td>3</td><td>0</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>2</th><td>1</td><td>1</td><td>1</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><th>3</th><td>1</td><td>3</td><td>1</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>4</th><td>1</td><td>1</td><td>1</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>5</th><td>0</td><td>3</td><td>0</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>0</td><td>0</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>
"],"text/plain":[" Survived Pclass Sex ... Embarked_C Embarked_Q Embarked_S\n","PassengerId ... \n","1 0 3 0 ... 0 0 1\n","2 1 1 1 ... 1 0 0\n","3 1 3 1 ... 0 0 1\n","4 1 1 1 ... 0 0 1\n","5 0 3 0 ... 0 0 1\n","\n","[5 rows x 10 columns]"]},"metadata":{},"execution_count":13}]},{"cell_type":"markdown","metadata":{"id":"HA1KNvG11zpB"},"source":["#### Creating the feature and label data structures, smoothing the features using scikit-learn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)"]},{"cell_type":"code","metadata":{"id":"2jstye7y15Hx","executionInfo":{"status":"ok","timestamp":1638118019474,"user_tz":-60,"elapsed":273,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# importing a standard scaler from scikit-learn to smooth out the features\n","from sklearn.preprocessing import StandardScaler\n","\n","# dropping the labels from the data frame, type conversion of feature vectors\n","X = df_train.drop(['Survived'], axis=1).values.astype(float)\n","\n","# scaling the values of the features to make them more uniform\n","scaler = StandardScaler()\n","X = scaler.fit_transform(X)\n","\n","# labels are simply the Survived values\n","Y = df_train['Survived'].values"],"execution_count":14,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Nio4fWP23zDo"},"source":["#### Creating a Keras model template"]},{"cell_type":"code","metadata":{"id":"1uNv531131_A","executionInfo":{"status":"ok","timestamp":1638118207135,"user_tz":-60,"elapsed":321,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["from keras.models import Sequential\n","from keras.layers import Dense\n","\n","def create_model(optimizer='rmsprop', init='glorot_uniform'):\n"," # create model\n"," model = Sequential()\n"," # the first hidden layer, linked 
with input matching the size of the \n"," # feature vectors (i.e., the number of columns in X - X.shape[1])\n"," model.add(Dense(16, input_dim=X.shape[1], kernel_initializer=init, \\\n"," activation='relu'))\n"," model.add(Dense(8, kernel_initializer=init, activation='relu'))\n"," model.add(Dense(4, kernel_initializer=init, activation='relu'))\n"," model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))\n"," # compile model\n"," model.compile(loss='binary_crossentropy', optimizer=optimizer, \\\n"," metrics=['accuracy'])\n"," return model"],"execution_count":16,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"L1WOnLiH42Y1"},"source":["#### Training a specific model created by the above function"]},{"cell_type":"code","metadata":{"id":"hvNTq9rc5Fsp"},"source":["# importing a utility classifier wrapper for Keras models\n","# (note: newer Keras releases deprecate this wrapper in favour of SciKeras)\n","from keras.wrappers.scikit_learn import KerasClassifier\n","import tensorflow as tf\n","\n","# early stopping callback (for stopping when the loss doesn't improve for a \n","# while, even if we haven't gone through all epochs yet)\n","callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)\n","\n","# creating the classifier based on the above model definition\n","model = KerasClassifier(build_fn=create_model, \\\n"," epochs=200, \\\n"," batch_size=5)\n","\n","# fitting the classifier\n","model.fit(X, Y, callbacks=[callback])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"cUF9WXHL7t9a"},"source":["#### Using the trained model on the test data"]},{"cell_type":"code","metadata":{"id":"H-r5xICx7wiS","executionInfo":{"status":"ok","timestamp":1638118378883,"user_tz":-60,"elapsed":20823,"user":{"displayName":"Vít Nováček","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"05686630748873346736"}}},"source":["# preprocessing the test data (it needs to have the same shape, distribution,\n","# etc. as the training data)\n","df_test = preprocess_data(df_test)\n","# creating the 
X_test feature matrix\n","X_test = df_test.values.astype(float)\n","# scaling X_test with the scaler fitted on the training data\n","X_test = scaler.transform(X_test)\n","\n","# predict the 'Survived' values\n","predictions = model.predict(X_test)"],"execution_count":19,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"brEAxGaK8LfW"},"source":["#### Preparing the Kaggle submission file\n","- Note: After submitting this one to Kaggle, the accuracy should be slightly over 76% - not too bad, but not too great either (more sophisticated data preprocessing and model hyper-parameter optimisation would take us much further)."]},{"cell_type":"code","metadata":{"id":"UM4oXPXI8O0l"},"source":["# the pandas data frame with the results\n","submission = pd.DataFrame({\n"," 'PassengerId': df_test.index,\n"," 'Survived': predictions[:,0],\n","})\n","\n","# storing the submission as CSV\n","submission.sort_values('PassengerId', inplace=True) \n","submission.to_csv('submission-naive.csv', index=False)\n","\n","# downloading the created CSV file locally\n","from google.colab import files\n","files.download('submission-naive.csv')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"collapsed":true,"id":"yLjsBHRLnmDa"},"source":["---\n","\n","#### _Final note_ - the materials used in this notebook are original works credited and licensed as follows:\n","- Image of Titanic:\n"," - Retrieved from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:St%C3%B6wer_Titanic.jpg)\n"," - Author: Willy Stöwer (image reproduction)\n"," - License: none (or [Public Domain](https://en.wikipedia.org/wiki/public_domain))"]}]}