{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.1"
    },
    "colab": {
      "name": "PB016_lab12.ipynb",
      "provenance": [],
      "collapsed_sections": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "A8Cj_X1VnmC0"
      },
      "source": [
        "# PB016: Artificial intelligence I, labs 12 - Deep learning\n",
        "\n",
        "Today's topic is a quick and dirty introduction into deep learning. We'll focus namely on:\n",
        "1. __Dummy deep learning pipeline__\n",
        "2. __Developing your own deep learning classifier__"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MjLOiihWnmC1"
      },
      "source": [
        "---\n",
        "\n",
        "## 1. Dummy deep learning pipeline\n",
        "\n",
        "__Basic facts__\n",
        "- Deep learning consists of designing, training and validating machine learning models based on various [neural architectures](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks) that typically involve multiple (hidden) layers consisting of many neural computing units (a simple example of one unit is [perceptron](https://en.wikipedia.org/wiki/Perceptron), such as the one we implemented in the previous labs).\n",
        "- A number of libraries seamlessly integrating with parallel computational architectures is available for developing deep learning models. Some of the popular examples are:\n",
        " - [PyTorch](https://pytorch.org/) - originally a C general-purpose ML library, now a state-of-the-art deep learning framework with relatively easy-to-use Python (and C++) abstraction layers.\n",
        " - [TensorFlow](https://www.tensorflow.org/) - a general-purpose, highly optimised library for multilinear algebra and statistical learning.\n",
        " - [Keras](https://keras.io/) - formerly a separate project, now an abstraction layer for user-friendly development of deep learning models integrated with TensorFlow."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OtRA1s8qdz2e"
      },
      "source": [
        "### The example task - predicting onset of diabetes using Keras\n",
        "- This is based on a widely used PIMA Indians dataset - a classic machine learning sandbox data described in detail for instance [here](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe).\n",
        "- The task is to use that dataset to train a classifier for predicting whether or not a person develops diabetes.\n",
        "- This is based on a number of characteristics (i.e., features) like blood pressure or body mass index."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WtWS3vDyd9pI"
      },
      "source": [
        "#### Loading the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "DvmXGN9OWh3O"
      },
      "source": [
        "# importing the library for handy CSV file processing\n",
        "import pandas as pd\n",
        "\n",
        "# loading the data, in CSV format, from the web\n",
        "dataframe = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/example/diabetes.csv')\n",
        "\n",
        "# checking the first few rows of the CSV\n",
        "dataframe.head()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XzQWHmkmg3BZ"
      },
      "source": [
        "### Creating the features and labels data structures"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dYwVJCQng8gJ"
      },
      "source": [
        "# getting just the Outcome column as the vector of labels\n",
        "# - note that the column contains 0, 1 values that correspond to negaive \n",
        "#   (no diabetes developed) and positive (diabetes developed) example labels,\n",
        "#   respectively\n",
        "df_labels = dataframe.Outcome.values.astype(float)\n",
        "# the features are the data minus the label vector\n",
        "# - this contains the remaining features present in the data\n",
        "df_features = dataframe.drop('Outcome',axis=1).values"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "diBlgxl_hTCS"
      },
      "source": [
        "### Splitting the data into train and test sets using [scikit-learn](https://sklearn.org/)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kL8i6lBXhWfi"
      },
      "source": [
        "# importing a convenience data splitting function from scikit-learn\n",
        "\n",
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "# computing a random 80-20 split (80% training data, 20% of remaining\n",
        "# \"unseen\" data for testing the model trained on the 80%)\n",
        "\n",
        "x_train, x_test, y_train, y_test = train_test_split(df_features,df_labels,\\\n",
        "                                                    test_size=0.2,\\\n",
        "                                                    random_state=42)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "paAJQMT6iAX2"
      },
      "source": [
        "### Creating a [Keras](https://keras.io/) model"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "x3oWiAHwiC2y"
      },
      "source": [
        "# importing the basics from Keras\n",
        "\n",
        "from keras.models import Sequential\n",
        "from keras.layers import Dense\n",
        "\n",
        "# designing the model\n",
        "\n",
        "# a model for simple sequential stacking of layers\n",
        "model = Sequential()\n",
        "# the first hidden layer, linked to the input layer matching the feature \n",
        "# vectors of size 8 (number of features in the diabetes data)\n",
        "model.add(Dense(100,input_dim=8,activation='sigmoid')) \n",
        "# the second hidden layer\n",
        "model.add(Dense(100,activation='sigmoid'))\n",
        "# the output layer, for classification into 2 classes (does/doesn't develop\n",
        "# diabetes)\n",
        "model.add(Dense(2,activation='softmax'))\n",
        "\n",
        "# compiling the model\n",
        "model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])"
      ],
      "execution_count": null,
      "outputs": []
    },
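    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough sanity check of the architecture above, the following sketch counts the trainable parameters of the three `Dense` layers by hand. It assumes that each fully connected layer has one weight per input-unit pair plus one bias per unit; the helper name `dense_params` is made up for this illustration and is not part of Keras.\n",
        "\n",
        "```python\n",
        "# hypothetical helper, not part of Keras: parameters of one dense layer\n",
        "def dense_params(n_inputs, n_units):\n",
        "    # one weight per (input, unit) pair plus one bias per unit\n",
        "    return n_inputs * n_units + n_units\n",
        "\n",
        "# layer sizes of the model above: 8 inputs -> 100 -> 100 -> 2 outputs\n",
        "layer_sizes = [8, 100, 100, 2]\n",
        "per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]\n",
        "total = sum(per_layer)\n",
        "\n",
        "print(per_layer, total)\n",
        "```\n",
        "\n",
        "Once the model has been built, calling `model.summary()` should list the same per-layer counts."
      ]
    },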
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nXoiiGpCjaD0"
      },
      "source": [
        "### Training the created model"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "R0k0oJzyjb22"
      },
      "source": [
        "# simply calling the fit function, training on the training data and validating\n",
        "# on the test data after each epoch\n",
        "model.fit(x_train,y_train,epochs=15,\\\n",
        "          validation_data=(x_test, y_test))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X0pTXap1nBq5"
      },
      "source": [
        "### Interpreting the results\n",
        "- Not too great:\n",
        " - The loss is barely being optimised.\n",
        " - The accuracy worse then a random baseline (0.5) in most runs.\n",
        "- The reasons:\n",
        " - More or less boilerplate/default settings of the model.\n",
        " - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for instance the [blog post](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) referenced before, where a detailed exploratory analysis and input data transformation is carried out)."
      ]
    },
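    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One further factor that may contribute to the flat loss: the labels are a 1-D vector of 0/1 values, while the model outputs two softmax probabilities per example. The sketch below uses plain NumPy to mimic the mean squared error when a single label is broadcast against both outputs. This is an assumption about how Keras handles the shape mismatch, not something verified here, and the function name `broadcast_mse` is made up. Under that assumption, the loss does not depend on the label at all.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "# a few possible softmax outputs (each row sums to 1)\n",
        "probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])\n",
        "\n",
        "def broadcast_mse(label, p):\n",
        "    # the scalar label is compared against both output units\n",
        "    return np.mean((p - label) ** 2, axis=-1)\n",
        "\n",
        "loss_if_negative = broadcast_mse(0.0, probs)\n",
        "loss_if_positive = broadcast_mse(1.0, probs)\n",
        "\n",
        "# the two losses coincide, so the label carries no training signal\n",
        "print(loss_if_negative)\n",
        "print(loss_if_positive)\n",
        "print(np.allclose(loss_if_negative, loss_if_positive))\n",
        "```\n",
        "\n",
        "Possible remedies worth trying are a single sigmoid output unit with the `binary_crossentropy` loss, or keeping the two-unit softmax and switching to `sparse_categorical_crossentropy`, which accepts integer class labels."
      ]
    },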
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lr1fmV_UnmDZ"
      },
      "source": [
        "---\n",
        "\n",
        "## 2. Developing your own deep learning classifier\n",
        "- Your task is to predict survivors of the Titanic disaster, as described in the [Kaggle](https://www.kaggle.com) challenge on [Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview).\n",
        "\n",
        "![titanic](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/titanic.jpg)\n",
        "\n",
        "- Split into groups (for instance, one row of seats makes one group).\n",
        "- Each group will choose a coordinator. That person will be responsible for\n",
        " - Outlining the overall solution and distributing specific sub-tasks among group members.\n",
        " - Integrating the results of the group's work in their shared notebook.\n",
        " - Subsequent presentation of the results to the rest of the class.\n",
        "- Each group will register at [Kaggle](https://www.kaggle.com/) so that you can officially participate in the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) competition  (one account per group is enough).\n",
        "- Then, use [Keras](https://keras.io/) to solve the Titanic survivors' prediction problem as follows:\n",
        " - Get the challenge [data](https://www.kaggle.com/c/titanic/data) via the URLs in the notebook below.\n",
        " - Design a simple neural model for classification of (non)survivors using Keras.\n",
        " - Train the model on the `train.csv` dataset (after possibly preprocessing the data).\n",
        " - Use the trained model to predict the labels of the set `test.csv` (i.e., the values ​​of the column _\"survived\"_; for more details, see the competition documentation itself).\n",
        " - [Upload results](https://www.kaggle.com/c/titanic/submit) on Kaggle.\n",
        " - Brag with your model and score to the rest of the class!\n"
      ]
    },
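    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A minimal sketch of what the optional preprocessing step might look like, on a tiny made-up data frame that only borrows the `Sex`, `Age` and `Fare` column names from the Kaggle data. The encoding, imputation and scaling choices are illustrative assumptions, not a recommended recipe.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "# toy data with Titanic-like columns (all values are invented)\n",
        "toy = pd.DataFrame({\n",
        "    'Sex': ['male', 'female', 'female', 'male'],\n",
        "    'Age': [22.0, None, 30.0, 40.0],\n",
        "    'Fare': [7.25, 71.28, 8.05, 53.10],\n",
        "})\n",
        "\n",
        "features = pd.DataFrame(index=toy.index)\n",
        "# encode the categorical column as 0/1\n",
        "features['Sex'] = (toy['Sex'] == 'female').astype(float)\n",
        "# fill missing ages with the median of the known ones\n",
        "features['Age'] = toy['Age'].fillna(toy['Age'].median())\n",
        "features['Fare'] = toy['Fare']\n",
        "# standardise the numeric columns to zero mean and unit sample variance\n",
        "for col in ['Age', 'Fare']:\n",
        "    features[col] = (features[col] - features[col].mean()) / features[col].std()\n",
        "\n",
        "# NumPy array of the kind that can be passed to a Keras model\n",
        "x = features.values\n",
        "print(x.shape)\n",
        "```\n",
        "\n",
        "On the real data, the median, mean and standard deviation would normally be computed on `train.csv` only and then reused for `test.csv`."
      ]
    },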
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lWIcu-WFrB-N"
      },
      "source": [
        "### Loading the train and test data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kzC3yqpHniPM"
      },
      "source": [
        "# importing pandas, just in case it wasn't imported before\n",
        "import pandas as pd\n",
        "\n",
        "# loading the train and test data using pandas\n",
        "\n",
        "df_train = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/train.csv',\\\n",
        "                       index_col='PassengerId')\n",
        "df_test = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/test.csv', \\\n",
        "                      index_col='PassengerId')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IA4hfwGR26oP"
      },
      "source": [
        "### Checking out the train and test data contents"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "agY80lGO2_Sa"
      },
      "source": [
        "df_train.head()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3RAwtMw53LMa"
      },
      "source": [
        "df_test.head()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z7jqJu5SrGec"
      },
      "source": [
        "### Developing the model itself"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jESxfpu-rIzx"
      },
      "source": [
        "# TODO - YOUR MODEL GOES HERE"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "diP58FYa-TYD"
      },
      "source": [
        "### Notes on the solution\n",
        "- Feel free to get inspired on the web, but make sure you understand what you're doing when using someone else's code.\n",
        "- A practical note on getting the submission file to be uploaded to Kaggle, if you're working in Google Colab:\n",
        " - You can create the CSV in your virtual environment, for instance using the `submission.to_csv('submission.csv', index=False)` command, assuming the `submission` variable is a _pandas_ data frame object.\n",
        " - Then you can simply download it by first importing the `files` module by the `from google.colab import files` line, and then using the module with the `files.download('submission.csv')` line to store the data on your local machine."
      ]
    }
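    ,
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A minimal sketch of assembling the `submission` data frame mentioned above. It assumes the competition expects a `PassengerId` and a `Survived` column; the ids and probabilities below are invented stand-ins for `df_test.index` and the output of your model.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "# invented stand-ins for the test set ids and predicted survival probabilities\n",
        "passenger_ids = [892, 893, 894]\n",
        "predicted_probs = np.array([0.13, 0.72, 0.50])\n",
        "\n",
        "# threshold the probabilities at 0.5 to get 0/1 labels\n",
        "submission = pd.DataFrame({\n",
        "    'PassengerId': passenger_ids,\n",
        "    'Survived': (predicted_probs > 0.5).astype(int),\n",
        "})\n",
        "\n",
        "# to_csv without a path returns the CSV as a string, handy for a quick check\n",
        "csv_text = submission.to_csv(index=False)\n",
        "print(csv_text)\n",
        "```\n",
        "\n",
        "Checking the header line and the first few rows this way before uploading can save a rejected submission."
      ]
    }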
  ]
}