{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" }, "colab": { "name": "PB016_lab12.ipynb", "provenance": [], "collapsed_sections": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "A8Cj_X1VnmC0" }, "source": [ "# PB016: Artificial intelligence I, labs 12 - Deep learning\n", "\n", "Today's topic is a quick and dirty introduction to deep learning. We'll focus on:\n", "1. __Dummy deep learning pipeline__\n", "2. __Developing your own deep learning classifier__" ] }, { "cell_type": "markdown", "metadata": { "id": "MjLOiihWnmC1" }, "source": [ "---\n", "\n", "## 1. Dummy deep learning pipeline\n", "\n", "__Basic facts__\n", "- Deep learning consists of designing, training and validating machine learning models based on various [neural architectures](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks) that typically involve multiple (hidden) layers, each consisting of many neural computing units (a simple example of one unit is the [perceptron](https://en.wikipedia.org/wiki/Perceptron), such as the one we implemented in the previous labs).\n", "- A number of libraries that integrate seamlessly with parallel computational architectures are available for developing deep learning models. 
Some of the popular examples are:\n", " - [PyTorch](https://pytorch.org/) - a successor to Torch (originally a general-purpose ML library written in C and Lua), now a state-of-the-art deep learning framework with relatively easy-to-use Python (and C++) abstraction layers.\n", " - [TensorFlow](https://www.tensorflow.org/) - a general-purpose, highly optimised library for multilinear algebra and statistical learning.\n", " - [Keras](https://keras.io/) - formerly a separate project, now an abstraction layer for user-friendly development of deep learning models integrated with TensorFlow." ] }, { "cell_type": "markdown", "metadata": { "id": "OtRA1s8qdz2e" }, "source": [ "### The example task - predicting onset of diabetes using Keras\n", "- The example is based on the widely used Pima Indians Diabetes dataset - a classic machine learning sandbox dataset described in detail for instance [here](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe).\n", "- The task is to use that dataset to train a classifier for predicting whether or not a person develops diabetes.\n", "- The prediction is based on a number of characteristics (i.e., features) like blood pressure or body mass index." 
] }, { "cell_type": "markdown", "metadata": { "id": "WtWS3vDyd9pI" }, "source": [ "### Loading the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)" ] }, { "cell_type": "code", "metadata": { "id": "DvmXGN9OWh3O" }, "source": [ "# importing the library for handy CSV file processing\n", "import pandas as pd\n", "\n", "# loading the data, in CSV format, from the web\n", "dataframe = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/example/diabetes.csv')\n", "\n", "# checking the first few rows of the CSV\n", "dataframe.head()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "XzQWHmkmg3BZ" }, "source": [ "### Creating the features and labels data structures" ] }, { "cell_type": "code", "metadata": { "id": "dYwVJCQng8gJ" }, "source": [ "# getting just the Outcome column as the vector of labels\n", "# - note that the column contains 0, 1 values that correspond to negative\n", "# (no diabetes developed) and positive (diabetes developed) example labels,\n", "# respectively\n", "df_labels = dataframe.Outcome.values.astype(float)\n", "# the features are the data minus the label vector\n", "# - this contains the remaining features present in the data\n", "df_features = dataframe.drop('Outcome',axis=1).values" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "diBlgxl_hTCS" }, "source": [ "### Splitting the data into train and test sets using [scikit-learn](https://sklearn.org/)" ] }, { "cell_type": "code", "metadata": { "id": "kL8i6lBXhWfi" }, "source": [ "# importing a convenience data splitting function from scikit-learn\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "# computing a random 80-20 split (80% training data, the remaining 20% of\n", "# \"unseen\" data for testing the model trained on the 80%)\n", "\n", "x_train, x_test, y_train, y_test = train_test_split(df_features,df_labels,\\\n", " 
test_size=0.2,\\\n", " random_state=42)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "paAJQMT6iAX2" }, "source": [ "### Creating a [Keras](https://keras.io/) model" ] }, { "cell_type": "code", "metadata": { "id": "x3oWiAHwiC2y" }, "source": [ "# importing the basics from Keras\n", "\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "\n", "# designing the model\n", "\n", "# a model for simple sequential stacking of layers\n", "model = Sequential()\n", "# the first hidden layer, linked to the input layer matching the feature\n", "# vectors of size 8 (number of features in the diabetes data)\n", "model.add(Dense(100,input_dim=8,activation='sigmoid'))\n", "# the second hidden layer\n", "model.add(Dense(100,activation='sigmoid'))\n", "# the output layer, for classification into 2 classes (does/doesn't develop\n", "# diabetes)\n", "model.add(Dense(2,activation='softmax'))\n", "\n", "# compiling the model\n", "model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "nXoiiGpCjaD0" }, "source": [ "### Training the created model" ] }, { "cell_type": "code", "metadata": { "id": "R0k0oJzyjb22" }, "source": [ "# simply calling the fit function, training on the training data and validating\n", "# on the test data after each epoch\n", "model.fit(x_train,y_train,epochs=15,\\\n", " validation_data=(x_test, y_test))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "X0pTXap1nBq5" }, "source": [ "### Interpreting the results\n", "- Not too great:\n", " - The loss is barely being optimised.\n", " - The accuracy is worse than a random baseline (0.5) in most runs.\n", "- The reasons:\n", " - More or less boilerplate/default settings of the model (two sigmoid hidden layers of 100 units, a mean squared error loss, 15 epochs).\n", " - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for 
instance the [blog post](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) referenced before, where a detailed exploratory analysis and input data transformation are carried out)." ] }, { "cell_type": "markdown", "metadata": { "id": "lr1fmV_UnmDZ" }, "source": [ "---\n", "\n", "## 2. Developing your own deep learning classifier\n", "- Your task is to predict survivors of the Titanic disaster, as described in the [Kaggle](https://www.kaggle.com) challenge on [Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview).\n", "\n", "![titanic](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/titanic.jpg)\n", "\n", "- Split into groups (for instance, one row of seats makes one group).\n", "- Each group will choose a coordinator. That person will be responsible for\n", " - Outlining the overall solution and distributing specific sub-tasks among group members.\n", " - Integrating the results of the group's work in their shared notebook.\n", " - Presenting the results to the rest of the class afterwards.\n", "- Each group will register at [Kaggle](https://www.kaggle.com/) so that you can officially participate in the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) competition (one account per group is enough).\n", "- Then, use [Keras](https://keras.io/) to solve the Titanic survivor prediction problem as follows:\n", " - Get the challenge [data](https://www.kaggle.com/c/titanic/data) via the URLs in the notebook below.\n", " - Design a simple neural model for classification of (non)survivors using Keras.\n", " - Train the model on the `train.csv` dataset (after possibly preprocessing the data).\n", " - Use the trained model to predict the labels of the set `test.csv` (i.e., the values of the column _\"Survived\"_; for more details, see the competition documentation itself).\n", " - [Upload results](https://www.kaggle.com/c/titanic/submit) on Kaggle.\n", " - Brag about your model and 
score to the rest of the class!\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lWIcu-WFrB-N" }, "source": [ "### Loading the train and test data" ] }, { "cell_type": "code", "metadata": { "id": "kzC3yqpHniPM" }, "source": [ "# importing pandas, just in case it wasn't imported before\n", "import pandas as pd\n", "\n", "# loading the train and test data using pandas\n", "\n", "df_train = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/train.csv',\\\n", " index_col='PassengerId')\n", "df_test = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/test.csv',\\\n", " index_col='PassengerId')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "IA4hfwGR26oP" }, "source": [ "### Checking out the train and test data contents" ] }, { "cell_type": "code", "metadata": { "id": "agY80lGO2_Sa" }, "source": [ "df_train.head()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "3RAwtMw53LMa" }, "source": [ "df_test.head()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "z7jqJu5SrGec" }, "source": [ "### Developing the model itself" ] }, { "cell_type": "code", "metadata": { "id": "jESxfpu-rIzx" }, "source": [ "# TODO - YOUR MODEL GOES HERE" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "diP58FYa-TYD" }, "source": [ "### Notes on the solution\n", "- Feel free to look for inspiration on the web, but make sure you understand what you're doing when using someone else's code.\n", "- A practical note on getting the submission file to upload to Kaggle, if you're working in Google Colab:\n", " - You can create the CSV in your virtual environment, for instance using the `submission.to_csv('submission.csv', index=False)` command, assuming the `submission` variable is a _pandas_ data frame object.\n", " - Then you can simply download it by first importing the 
`files` module with `from google.colab import files`, and then calling `files.download('submission.csv')` to store the file on your local machine." ] } ] }