{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" }, "colab": { "provenance": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "A8Cj_X1VnmC0" }, "source": [ "# PB016: Artificial intelligence I, labs 11 - Machine learning\n", "\n", "Today we'll dig into machine learning, or rather two specific examples of \"classic\" machine learning models. We'll focus namely on:\n", "1. __Representation and preparation of input data__\n", "2. __Simple algorithm for training a perceptron__\n", "3. __Simple algorithm for training a decision tree__\n", "4. __The exercises in a few lines of code__\n", "\n", "An __important note__: from these labs on, all your work on the assignments will be collaborative - similarly to the last part of labs 10, split into groups (min 2, max 4 people) and work together." ] }, { "cell_type": "markdown", "metadata": { "id": "MjLOiihWnmC1" }, "source": [ "---\n", "\n", "## 1. Representation and preparation of input data\n", "\n", "__Basic facts__\n", "- Machine learning typically works with a training set of machine-readable input data in the form of examples, which may (or may not) be labeled by an identifier (or identifiers) of the class (or classes) to which they belong.\n", "- Each example is usually assigned a unique identifier and a set of specific property (or also feature) values that depend on the specific domain of the problem.\n", "- A suitable representation of such data is a matrix mapping the example identifiers to vectors of their specific feature values, and (for supervised learning) also a vector mapping the example identifiers to the identifiers of the classes to which they belong." 
] }, { "cell_type": "markdown", "metadata": { "id": "Tzl_o5pXnmC2" }, "source": [ "### Sample problem - classification of [black metal](https://en.wikipedia.org/wiki/Black_metal) fans\n", "\n", "![black_metal](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/blackmetal.png)\n", "\n", "- The problem may not be as simple as it seems (for example, not all black metal fans are Satanists with striking body paint).\n", "- Example of a simple dataset:\n", "\n", "| __ID__ || __HL__ | __FHT__ | __LN__ | __LS__ || __FBM__ |\n", "|--------||--------|---------|--------|--------||---------|\n", "| Petr || l | br | 9/10 | 1/10 || y |\n", "| Marie || s | $-$ | 8/10 | 9/10 || y |\n", "| Azazel || l | ms | 1/10 | 10/10 || n |\n", "| Pavel || b | br | 7/10 | 2/10 || y |\n", "\n", "- Meaning of columns and individual feature values (with the \"$ - $\" symbol for \"missing\"):\n", " - __ID__ - example identifier (names of people we classify into \"black metal fan\" and \"others\" classes)\n", " - __HL__ - hair length (one of $ \\{b, s, m, l \\} $ for bald, short, medium and long hair)\n", " - __FHT__ - beard type (one of $ \\{no, gt, ms, sb, br \\} $ for no beard, goatie, moustache, sideburns and beard)\n", " - __LN__ - relationship to nature (on a scale from 1 to 10, the higher the more devout)\n", " - __LS__ - relationship to Satan (on a scale from 1 to 10, the higher the more devout)\n", " - __FBM__ - belonging to the class of black metal fans (one of $ \\{y, n \\} $ for yes and no)" ] }, { "cell_type": "markdown", "metadata": { "id": "Ein7Ntv3nmC3" }, "source": [ "### Simple structures for sample data representation" ] }, { "cell_type": "code", "metadata": { "id": "n5K3FEn8nmC3" }, "source": [ "# vectors (in the form of dictionaries) mapping the symbolic identifiers of the\n", "# examples and features to unique numbers\n", "\n", "example_id_map = {\n", " 'Petr' : 0,\n", " 'Marie' : 1,\n", " 'Azazel' : 2,\n", " 'Pavel' : 3,\n", "}\n", "\n", "feature_id_map = {\n", " 'HL' : 0,\n", " 'FHT' : 1,\n", " 'LN' : 2,\n", " 'LS' : 3,\n", "}\n", "\n", "# inverse vectors to the previous two (may come handy for user-friendly\n", "# statements)\n", "\n", "example_id_map_inv = dict([(y,x) for x,y in example_id_map.items()])\n", "feature_id_map_inv = dict([(y,x) for x,y in feature_id_map.items()])\n", "\n", "# vector mapping example identifiers to identifiers of their classes\n", "# (0 corresponds to 'n', 1 corresponds to 'y')\n", "\n", "labels = {0:1, 1:1, 2:0, 3:1}\n", "\n", "# matrix (in the form of a \"two-dimensional\" dictionary) mapping example\n", "# identifiers to the values of their properties (values are currently\n", "# initialized to 0)\n", "\n", "features = {}\n", "for i in range(4):\n", " for j in range(4):\n", " features[(i,j)] = 0" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "o7P4XLo3nmC8" }, "source": [ "### __Numerical representation of property values__\n", "\n", "- Many machine learning algorithms work better with \"nice and clean\" numerical data.\n", "- Let's assign such values to the individual fields of the `features` matrix based on the information in the problem definition above.\n", "- Word of warning - it's not as trivial as it might seem:\n", " - Numeric values should ideally preserve the meaning of the original ones.\n", " - In our example, this concerns primarily the facial hair type - while zero may represent no facial hair and higher numerical values may correspond to increasingly \"bushy\" facial hair (as it's done below), in 
practice, one would typically consider such features [categorical](https://medium.com/@nikhiloswal/categorical-features-in-machine-learning-14ed910c0812) and choose for instance [one-hot encoding](https://en.wikipedia.org/wiki/One-hot#Machine_learning_and_statistics) for their representation in the feature matrix.\n", "  - Missing data items should typically be meaningfully filled in (\"imputed\").\n", "  - For many machine learning algorithms, the data should also be normalized so that the ranges of the individual columns are more or less balanced." ] }, { "cell_type": "markdown", "metadata": { "id": "qivN-ufYnmC9" }, "source": [ "#### __Possible numerical representation of property values__" ] }, { "cell_type": "code", "metadata": { "scrolled": false, "id": "nQa17X6InmC-" }, "source": [ "# feature value ranges and their corresponding numerical representations:\n", "# HL:  {b,s,m,l} ~ {0,1,2,3}\n", "# FHT: {no,gt,ms,sb,br} ~ {0,1,2,3,4}\n", "# LN:  1..10 ~ 1..10\n", "# LS:  1..10 ~ 1..10\n", "\n", "# numerical representation of property values - training data set\n", "# (note: imputation uses domain knowledge - women usually do not have much\n", "# facial hair)\n", "\n", "features = {\n", "    (0,0) : 3, (0,1) : 4, (0,2) : 9, (0,3) : 1,  # ~ Petr: l, br, 9/10, 1/10\n", "    (1,0) : 1, (1,1) : 0, (1,2) : 8, (1,3) : 9,  # ~ Marie: s, -, 8/10, 9/10\n", "    (2,0) : 3, (2,1) : 2, (2,2) : 1, (2,3) : 10, # ~ Azazel: l, ms, 1/10, 10/10\n", "    (3,0) : 0, (3,1) : 4, (3,2) : 7, (3,3) : 2,  # ~ Pavel: b, br, 7/10, 2/10\n", "}\n", "\n", "# normalization of property values (by the maximum of each column)\n", "\n", "# initial maximum value\n", "max_dict = {0:0, 1:0, 2:0, 3:0}\n", "\n", "# determining the actual value of the maximum by scanning the matrix of\n", "# property values\n", "# - NOTE: this implementation finds the maximum from the data, which may not\n", "#   be completely correct for the relationship to nature, where the \"real\"\n", "#   maximum (10) is not in the data (the maximum there is 9)\n", "for i,j in features:\n", "    if features[(i,j)] > max_dict[j]:\n", "        max_dict[j] = features[(i,j)]\n", "\n", "# normalization of values to the 0-1 interval by dividing them by the maximum\n", "for i,j in features:\n", "    features[(i,j)] /= max_dict[j]\n", "\n", "# listing the final matrix\n", "rows = []\n", "for i in range(4):\n", "    columns = []\n", "    for j in range(4):\n", "        columns.append(round(features[(i,j)],1))\n", "    rows.append(' '.join([str(x) for x in columns]))\n", "print('Normalized matrix of numeric property values: \\n')\n", "print('\\n'.join(rows))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tihcJlMWnmDC" }, "source": [ "---\n", "\n", "## 2. 
A simple algorithm for training a perceptron\n", "\n", "__Basic facts__\n", "- [Perceptron](https://en.wikipedia.org/wiki/Perceptron) is one of the oldest concepts in the theory (and practice) of machine learning (proposed in 1958).\n", "- It is a [linear](https://en.wikipedia.org/wiki/Linear_classifier) [binary](https://en.wikipedia.org/wiki/Binary_classification) classifier for supervised learning over data represented by number vectors.\n", "- It is based on the so-called threshold function $ f $ mapping the input vector $ \\mathbf{x} $ to a binary scalar $ f(\\mathbf{x}) $ as follows:\n", "  - $ f(\\mathbf{x}) = 1 $ if $ \\mathbf{w} \\cdot \\mathbf{x} + b \\geq 0 $\n", "  - otherwise $ f(\\mathbf{x}) = 0 $\n", "- This function has two parameters - a scalar threshold $ b $ (which can be equal to zero) and a weight vector $ \\mathbf{w} $, which is optimized to best discriminate the data based on their membership in individual classes (positive vs. negative examples).\n", "- The perceptron has become the basis of modern [deep learning](https://en.wikipedia.org/wiki/Deep_learning), which (putting it very simply) uses perceptron-like units in many complex interconnected \"hidden\" layers. In such networks, the parameters (i.e., the weight vectors) are usually optimized by [backpropagation](https://en.wikipedia.org/wiki/Backpropagation).\n", "\n", "__General perceptron for illustration__\n", "\n", "![perceptron](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/perceptron.png)" ] }, { "cell_type": "markdown", "metadata": { "id": "yLeTeD2tnmDD" }, "source": [ "### Rules for incrementally updating the weights when learning the perceptron\n", "- For simplicity, assume that the parameter $ b = 0 $.\n", "- At the beginning, we initialize the vector $ \\mathbf{w} $ to random values.\n", "- In each individual learning step (\"epoch\") we randomly select one example $ \\mathbf{x} $ from the training dataset.\n", "  - If the example $ \\mathbf{x} $ is positive and $ \\mathbf{w} \\cdot \\mathbf{x} < 0 $, we assign $ \\mathbf{w} $ the value $ \\mathbf{w} + \\mathbf{x} $.\n", "  - If the example $ \\mathbf{x} $ is negative and $ \\mathbf{w} \\cdot \\mathbf{x} > 0 $, we assign $ \\mathbf{w} $ the value $ \\mathbf{w} - \\mathbf{x} $.\n", "- We repeat this until an approximate convergence is achieved (e.g., until all examples are classified correctly or until a predefined number of epochs is reached)."
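] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the update rule concrete, here is a minimal sketch of the above loop on a made-up toy dataset (the data and variable names are illustrative only, with $ b = 0 $ as assumed above):" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import random\n", "\n", "# a minimal sketch of the perceptron update rule on a toy 2-D dataset\n", "# (illustrative values; the exercise below uses the black metal dataset)\n", "\n", "toy_data = [\n", "    ({0 : 1.0, 1 : 0.1}, 1), # a positive example\n", "    ({0 : 0.1, 1 : 1.0}, 0), # a negative example\n", "]\n", "\n", "# random weight init\n", "w = {0 : random.random(), 1 : random.random()}\n", "\n", "for epoch in range(10):\n", "    x, label = random.choice(toy_data) # pick a random training example\n", "    dp = sum(w[i]*x[i] for i in w)     # the dot product w . x\n", "    if label == 1 and dp < 0:          # misclassified positive example\n", "        for i in w:\n", "            w[i] += x[i]               # w <- w + x\n", "    elif label == 0 and dp > 0:        # misclassified negative example\n", "        for i in w:\n", "            w[i] -= x[i]               # w <- w - x\n", "\n", "print('Toy weight vector after 10 epochs:', w)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The exercise below implements the same idea for the black metal dataset, using predefined helper functions for the vector operations."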
] }, { "cell_type": "markdown", "source": [ "### Auxiliary functions for vector operations to be used in the following assignments" ], "metadata": { "id": "rc9Q5Wq_Ral9" } }, { "cell_type": "code", "source": [ "# auxiliary vector operation functions\n", "\n", "def dot_product(u,v):\n", " \"\"\"Computing the dot product of the two input vectors.\n", "\n", " Parameters\n", " ----------\n", " u, v : dict\n", " The vectors for which the dot product is computed (represented as\n", " dictionaries mapping indices of the vector elements to the element\n", " values).\n", "\n", " Returns\n", " -------\n", " float\n", " The value of the dot product of the two input vectors.\n", " \"\"\"\n", "\n", " if len(u) != len(v):\n", " raise ValueError\n", " dp = 0\n", " for i in range(len(u)):\n", " dp += u[i]*v[i]\n", " return dp\n", "\n", "def plus(u,v):\n", " \"\"\"Adding up the two input vectors.\n", "\n", " Parameters\n", " ----------\n", " u, v : dict\n", " The vectors which are to be added up (represented as dictionaries mapping\n", " indices of the vector elements to the element values).\n", "\n", " Returns\n", " -------\n", " dict\n", " The vector resulting from the addition of the input vectors.\n", " \"\"\"\n", "\n", " if len(u) != len(v):\n", " raise ValueError\n", " w = u\n", " for i in range(len(w)):\n", " w[i] += v[i]\n", " return w\n", "\n", "def minus(u,v):\n", " \"\"\"Subtracting the second input vector from the first.\n", "\n", " Parameters\n", " ----------\n", " u, v : dict\n", " The vectors which are to be subtracted (represented as dictionaries\n", " mapping indices of the vector elements to the element values).\n", "\n", " Returns\n", " -------\n", " dict\n", " The vector resulting from the subtraction of the second input vector from\n", " the first.\n", " \"\"\"\n", "\n", " if len(u) != len(v):\n", " raise ValueError\n", " w = u\n", " for i in range(len(w)):\n", " w[i] -= v[i]\n", " return w" ], "metadata": { "id": "8tCQSGEqRSwU" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "kh28CaMxnmDD" }, "source": [ "### __Exercise 2.1: Training a simple perceptron__\n", "\n", "- Implement the above algorithm, training on and classifying the black metal data set we worked with in the previous section.\n", "- Experiment with convergence as you wish.\n", "- A note on the solution - use the predefined auxiliary functions for operations with vectors represented by dictionaries in Python (in the spirit of the previous section)." 
] }, { "cell_type": "code", "metadata": { "scrolled": false, "id": "s-3lUnfXnmDE" }, "source": [ "import random\n", "\n", "# learning the weight vector (main part of the algorithm; \"convergence\"\n", "# according to a fixed number of epochs)\n", "\n", "def optimise_weights(n_epochs=20):\n", " \"\"\"Optimising a pre-defined weight vector of length four corresponding to the\n", " method demonstration (initialised to random values).\n", "\n", " Parameters\n", " ----------\n", " n_epochs : int\n", " The number of epochs for which the weight vector is to be optimised.\n", "\n", " Returns\n", " -------\n", " dict\n", " The optimised weight vector.\n", " \"\"\"\n", "\n", " # random weight init - TODO - COMPLETE YOURSELF\n", "\n", " w = dict([(i,0) for i in range(4)])\n", "\n", " # iterative updates of the weights\n", "\n", " for epoch in range(n_epochs):\n", "\n", " # TODO - COMPLETE YOURSELF\n", "\n", " pass # TODO - REMOVE IN THE ACTUAL IMPLEMENTATION\n", "\n", " return w\n", "\n", "# function for verifying the classification of training examples based on the\n", "# weight vector\n", "\n", "def classify(x_id,w):\n", " \"\"\"Classifying an example based on the weight vector.\n", "\n", " Parameters\n", " ----------\n", " x_id : int\n", " The ID of the example to be classified.\n", " w : dict\n", " The weight vector.\n", "\n", " Returns\n", " -------\n", " int\n", " 0 for negative examples and 1 for positive examples.\n", " \"\"\"\n", "\n", " # creating the example vector\n", " x = {}\n", " for j in range(4):\n", " x[j] = features[(x_id,j)]\n", "\n", " # multiplying the vector by the weights\n", " if dot_product(x,w) > 0:\n", " return 1\n", " return 0" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# testing the implementation\n", "w = optimise_weights()\n", "print('Learned weight vector:', ' '.join([str(w[i]) for i in range(4)]))\n", "print('Classifying the training examples...')\n", "for i in range(4):\n", " class_i = classify(i,w)\n", " print('Example:', example_id_map_inv[i])\n", " print(' actual class :', labels[i])\n", " print(' predicted class:', class_i)" ], "metadata": { "id": "dhErABxPBEpE" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "VJVY_OIsnmDH" }, "source": [ "### __Food for thought:__\n", "\n", "- Is the chosen data representation suitable for perceptron learning?\n", "- What could possibly be done differently?" ] }, { "cell_type": "markdown", "metadata": { "id": "svZsiCRTnmDI" }, "source": [ "---\n", "\n", "## 3. 
A simple algorithm for training a decision tree\n", "\n", "__Basic facts__\n", "- [Decision tree learning](https://en.wikipedia.org/wiki/Decision_tree_learning) is one of the most widely used types of algorithms in data analysis.\n", "- It is based on recursively dividing the input set of examples into subsets that are as homogeneous as possible (in terms of belonging to individual classes according to the labeling of training examples).\n", "- The \"parent\" set is divided on the feature that maximizes the homogeneity of the \"offspring\" sets.\n", "- This recursive division according to the values of individual features corresponds to the decision (i.e., non-leaf) nodes of the tree.\n", "- The leaves, on the other hand, typically contain data that most likely belong to one particular class (note: in this exercise, for simplicity, we only deal with classification trees, not regression trees that associate the individual examples with numerical values on a continuous scale).\n", "- The resulting structure is then used to classify data for which we do not know the individual classes.\n", "\n", "__A sample decision tree:__\n", "\n", "![decision_tree](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/decision_tree.jpg)" ] }, { "cell_type": "markdown", "metadata": { "id": "KWEmKZ53nmDJ" }, "source": [ "### Rules for dividing the datasets when creating a decision tree\n", "- There are a number of metrics for this (see for instance [here](https://en.wikipedia.org/wiki/Decision_tree_learning#Metrics)).\n", "- We will deal with a specific one, the so-called [information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees):\n", "  - Based on [entropy](https://en.wikipedia.org/wiki/Entropy_%28information_theory%29).\n", "  - To divide the \"parent set\", we use the feature that maximizes the difference between the entropy of the parent and the weighted average entropy of the individual \"offspring\" sets (the weights correspond to the relative number of examples in each offspring) - see the generic sketch of this computation right below."
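] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal generic sketch of the information gain computation (our own helper names, working directly with class counts rather than the data structures of this notebook):" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# a minimal generic sketch of the information gain computation from class\n", "# counts, mirroring the formulas above (not tied to the lab dataset)\n", "from math import log2\n", "\n", "def set_entropy(n_pos,n_neg):\n", "    \"\"\"Entropy of a set with n_pos positive and n_neg negative examples.\"\"\"\n", "    n_tot = n_pos + n_neg\n", "    if n_pos == 0 or n_neg == 0:\n", "        return 0.0 # a homogeneous set has zero entropy by definition\n", "    p, n = n_pos/n_tot, n_neg/n_tot\n", "    return -p*log2(p) - n*log2(n)\n", "\n", "def information_gain(parent,children):\n", "    \"\"\"Parent entropy minus the weighted mean entropy of the children.\n", "\n", "    parent   -- a (n_pos,n_neg) pair of counts for the parent set\n", "    children -- a list of (n_pos,n_neg) pairs, one per offspring set\n", "    \"\"\"\n", "    n_parent = sum(parent)\n", "    wmean = sum(set_entropy(*c)*sum(c)/n_parent for c in children)\n", "    return set_entropy(*parent) - wmean\n", "\n", "# the numbers from the worked example below: parent [3,1], offspring [2,0], [1,1]\n", "print(round(information_gain((3,1),[(2,0),(1,1)]),3)) # ~0.311" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The practical example below goes through the same computation step by step for the root node of the black metal dataset."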
] }, { "cell_type": "markdown", "metadata": { "id": "2IEVtzJdnmDJ" }, "source": [ "#### Practical example - division of the root node (i.e., the whole data set) of fans and non-fans of black metal\n", "\n", "Original dataset:\n", "\n", "| __ID__ || __HL__ | __FHT__ | __LN__ | __LS__ || __FBM__ |\n", "|--------||--------|---------|--------|--------||---------|\n", "| Petr || l | br | 9/10 | 1/10 || y |\n", "| Marie || s | $-$ | 8/10 | 9/10 || y |\n", "| Azazel || l | ms | 1/10 | 10/10 || n |\n", "| Pavel || b | br | 7/10 | 2/10 || y |\n", "\n", "Data set after transformation:\n", "\n", "| __ID__ || 0 | 1 | 2 | 3 || y |\n", "|--------||-----|-----|-----|-----||---|\n", "| 0 || 1 | 1 | 1 | 0.1 || 1 |\n", "| 1 || 0.3 | 0 | 0.9 | 0.9 || 1 |\n", "| 2 || 1 | 0.5 | 0.1 | 1 || 0 |\n", "| 3 || 0 | 1 | 0.8 | 0.2 || 1 |\n", "\n", "Parent set entropy ($P$):\n", "- $I_E(P) = I_E([3,1]) = -\\frac{3}{4}\\log_2\\frac{3}{4} - \\frac{1}{4}\\log_2\\frac{1}{4} \\doteq 0.811$\n", "\n", "Information gain of the feature 0 (hair length):\n", "- The values of the features are normalized, so we can simplify the problem to a binary division to \"halves,\" i.e., divide the examples into subsets according to the values of the attribute $ 0 $ less than $ 0.5 $, or greater than or equal to $ 0.5 $.\n", "- Entropy of offspring:\n", " - $I_E(P_{v(0) < 0.5}) = I_E([2,0]) = 0$ (both examples with a normalized hair length value less than 0.5, i.e., Marie and Pavel, are positive, so there is no degree of uncertainty in the set and the entropy is thus zero by definition)\n", " - $I_E(P_{v(0) \\geq 0.5}) = I_E([1,1]) = -\\frac{1}{2}\\log_2\\frac{1}{2} - \\frac{1}{2}\\log_2\\frac{1}{2} = 1$ (Petr is a positive example, Azazel is negative, it's \"one on one\", so the entropy is maximal)\n", "- Weighted mean entropy of offspring:\n", " - $wmean(I_E(P_{v(0) < 0.5}), I_E(P_{v(0) \\geq 0.5})) = 0\\cdot\\frac{2}{4} + 1\\cdot\\frac{2}{4} = 0.5$\n", "- Information gain:\n", " - $I_E(P) - wmean(I_E(P_{v(0) < 0.5}), I_E(P_{v(0) \\geq 0.5})) \\doteq 0.311$\n", "\n", "Similarly for features $ 1,2,3 $ (note: does anyone already know who \"wins\"?) ..." ] }, { "cell_type": "markdown", "metadata": { "id": "rq78sSnGnmDK" }, "source": [ "### Entropy calculation for training a decision tree\n", "\n", "- Let's implement a function for calculating the entropy of a set of examples based on their individual class labels.\n", "- For demonstration purposes, we will only use the data structures defined above for the \"black metal\" dataset." 
] }, { "cell_type": "code", "metadata": { "scrolled": false, "id": "ygR-ZLV2nmDM" }, "source": [ "from math import log\n", "\n", "# function returning the entropy of the proportions of positive and negative\n", "# examples in a given data subset\n", "\n", "def entropy(example_ids):\n", " \"\"\"Computing the entropy of the proportions of positive and negative\n", " examples in a given data subset (represented by the example identifiers,\n", " i.e., the indices of the rows in the feature matrix and label vectors defined\n", " earlier in this notebook).\n", "\n", " Parameters\n", " ----------\n", " example_ids : list\n", " The identifiers of the examples for which the entropy is to be computed.\n", "\n", " Returns\n", " -------\n", " float\n", " The entropy of the proportions of positive and negative examples in the\n", " given data subset.\n", " \"\"\"\n", "\n", " # determining the number of positive and negative examples in a set\n", "\n", " n_pos, n_neg = 0, 0\n", " for example_id in example_ids:\n", " if labels[example_id] == 1:\n", " n_pos += 1\n", " else:\n", " n_neg += 1\n", "\n", " # total number of examples\n", "\n", " n_tot = n_pos+n_neg\n", "\n", " # computing the entropy\n", "\n", " if n_pos == 0 or n_neg == 0:\n", " return 0\n", " return -(n_pos/n_tot)*log(n_pos/n_tot,2) - (n_neg/n_tot)*log(n_neg/n_tot,2)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# random implementation test - Petr (0) and Marie (1) belong to the same class,\n", "# but Marie and Azazel (2) do not\n", "assert entropy([0,1]) == 0\n", "assert entropy([1,2]) == 1" ], "metadata": { "id": "MJ8zGWikHGNB" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "BwVbFHxynmDQ" }, "source": [ "### __Exercise 3.1: Computing all subsets for determining the optimal dataset distribution__\n", "\n", "- Use the entropy calculation function to implement a function that computes the information gain of possible distributions of the input data set.\n", "- When implementing, use the procedure from the example above (including the simplifications, such as splitting the set to \"halves\" according to the values of the given property).\n", "\n", "An __important note__ - bear in mind what are the sets that are being split. It might help to recall how the data looks like, and what does it correspond to in the code.\n", "\n", "This is how the `features` data structure looks like:\n", "\n", "| __ID__ || 0 | 1 | 2 | 3 |\n", "|--------||-----|-----|-----|-----|\n", "| 0 || 1 | 1 | 1 | 0.1 |\n", "| 1 || 0.3 | 0 | 0.9 | 0.9 |\n", "| 2 || 1 | 0.5 | 0.1 | 1 |\n", "| 3 || 0 | 1 | 0.8 | 0.2 |\n", "\n", "The row IDs correspond to Petr, Marie, Azazel, Pavel, respectively. The column IDs correspond to the hair length, facial hair type, love of nature, love of Satan features, respectively. The parent set from the information gain example above corresponds to a list `[0,1,2,3]` (all row IDs, i.e., all training examples). The feature we are splitting the data set on is `0` (column ID corresponding to the hair length). The resulting splits are `[[1,3], [0,2]]` (the example IDs for which the value of the feature `0` is below or above 0.5, respectively) - this is what `splits[0]['splits']` would be set to in the code below. The corresponding `splits[0]['ig']` value would be ca. 
0.311.\n", "\n", "Finally, this is what the `labels` data structure looks like:\n", "\n", "| __ID__ || y |\n", "|--------||---|\n", "| 0 || 1 |\n", "| 1 || 1 |\n", "| 2 || 0 |\n", "| 3 || 1 |\n", "\n", "Once again, the row IDs correspond to Petr, Marie, Azazel, Pavel, respectively. This is what is being used to compute the entropy of a set of training example IDs - the pre-defined `entropy` function looks up the label of each ID in the input list and computes the probabilities of picking positive and negative examples from the set, from which the overall set entropy is computed." ] }, { "cell_type": "code", "metadata": { "scrolled": false, "id": "RtpKdbJktzd7" }, "source": [ "# computing subsets for the optimal distribution of a set according to the\n", "# information gain of individual properties\n", "\n", "def get_splits(example_ids):\n", "    \"\"\"Computing the splits of the input set of examples (represented by the\n", "    example identifiers, i.e., the indices of the rows in the feature matrix\n", "    and label vectors defined earlier in this notebook). A split per each of\n", "    the four possible features is performed.\n", "\n", "    Parameters\n", "    ----------\n", "    example_ids : list\n", "        The identifiers of the examples for which the splits are to be\n", "        computed.\n", "\n", "    Returns\n", "    -------\n", "    dict\n", "        The splits of the input set of examples, along with their information\n", "        gain. The structure of the output dictionary is as follows:\n", "        - the keys are the feature identifiers (i.e., column identifiers in\n", "          the feature matrix defined earlier in this notebook)\n", "        - the values are nested dictionaries that store the following:\n", "          - a list of two lists (indexed by `'splits'`) - two non-overlapping\n", "            subsets of the input IDs resulting from the input split based on\n", "            the value of the corresponding feature being lower than 0.5 or\n", "            not, respectively\n", "          - a float number (indexed by `'ig'`) reflecting the information\n", "            gain of this particular split\n", "    \"\"\"\n", "\n", "    # distributions and their information gain - initialization\n", "\n", "    splits = {\n", "        0 : {'splits' : [], 'ig' : 0}, # splits by feature 0 (hair length)\n", "        1 : {'splits' : [], 'ig' : 0}, # splits by feature 1 (facial hair)\n", "        2 : {'splits' : [], 'ig' : 0}, # splits by feature 2 (love of nature)\n", "        3 : {'splits' : [], 'ig' : 0}  # splits by feature 3 (love of Satan)\n", "    }\n", "\n", "    # TODO - COMPLETE THE REST YOURSELF\n", "\n", "    return splits" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# listing the distribution of the entire dataset (tree root node); sorted in a\n", "# descending order by information gain\n", "splits = get_splits([0,1,2,3])\n", "for feature_id in sorted(splits,key=lambda x: splits[x]['ig'],reverse=True):\n", "    print('Splitting by feature %d with information gain=%f:' % \\\n", "          (feature_id,splits[feature_id]['ig']))\n", "    print('  ', splits[feature_id]['splits'])" ], "metadata": { "id": "pzJCZon2Hq_s" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3ksfVo02nmDU" }, "source": [ "### __Exercise 3.2: Learning the actual decision tree__\n", "\n", "- Implement a decision tree learning function that:\n", "  - Recursively calls the function for computing the information gain of possible distributions of the input data set.\n", "  - Selects the optimal distribution according to the most discriminating features.\n", "  - Stores information about the parents of individual nodes as a 
representation of the resulting tree.\n", "- Notes on the solution:\n", "  - Feel free to implement the tree simply as a dictionary mapping nodes to their parents.\n", "  - The non-leaf nodes can be strings representing which feature the split is being made on at that level.\n", "  - Similarly, the leaves can simply be strings saying whether they're positive or negative example nodes, and listing the corresponding example IDs.\n", "  - For instance, a tree with a root `split_on_feature_X_at_level_0` and two immediate leaves `pos : [0,1], neg : [2,3]` as its children can be represented as a Python dictionary `{'split_on_feature_X_at_level_0' : None, 'pos : [0,1]' : 'split_on_feature_X_at_level_0', 'neg : [2,3]' : 'split_on_feature_X_at_level_0'}`" ] }, { "cell_type": "code", "metadata": { "scrolled": false, "id": "KBa9vlyOuTxm" }, "source": [ "# checking the homogeneity of a data set (important for stopping the recursion)\n", "\n", "def is_pure(example_ids):\n", "    \"\"\"Checking the homogeneity of the input example set (represented by the\n", "    example identifiers, i.e., the indices of the rows in the feature matrix\n", "    and label vectors defined earlier in this notebook).\n", "\n", "    Parameters\n", "    ----------\n", "    example_ids : list\n", "        The identifiers of the examples for which the homogeneity is to be\n", "        checked.\n", "\n", "    Returns\n", "    -------\n", "    bool\n", "        `True` for a set that contains examples labeled with the same class,\n", "        `False` for a set that contains examples labeled with different\n", "        classes.\n", "    \"\"\"\n", "\n", "    # TODO - COMPLETE YOURSELF\n", "\n", "    return False # TODO - POSSIBLE CHANGE IN THE ACTUAL IMPLEMENTATION\n", "\n", "# recursive splitting of the data set\n", "\n", "def split_set(example_ids,parent_feature,parent_map,level):\n", "    \"\"\"Recursively splitting the set of example IDs (represented by the\n", "    example identifiers, i.e., the indices of the rows in the feature matrix\n", "    and label vectors defined earlier in this notebook). Updates the mapping\n", "    of node descriptions to their parent descriptions (represented as a\n", "    dictionary).\n", "\n", "    Parameters\n", "    ----------\n", "    example_ids : list\n", "        The identifiers of the examples for which the split is to be\n", "        performed.\n", "    parent_feature : str or NoneType\n", "        The description of the parent node. `None` for the root node. 
String\n", "        reflecting the name of the feature on which the split was performed\n", "        at the preceding level of the growing tree and the depth of that\n", "        level (just in case the same feature is being used for splits at\n", "        different tree levels).\n", "    parent_map : dict\n", "        The current mapping of node descriptions to their parent descriptions.\n", "    level : int\n", "        The depth of the current level of the growing tree.\n", "\n", "    Returns\n", "    -------\n", "    dict\n", "        The mapping of node descriptions to their parent descriptions.\n", "    \"\"\"\n", "\n", "    # TODO - COMPLETE YOURSELF (RETURNS A DICTIONARY MAPPING NODES TO THEIR\n", "    # PARENTS)\n", "\n", "    return parent_map" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# learning the tree from the complete input data\n", "decision_tree = split_set([0,1,2,3],None,{},0)\n", "# printing the learned tree\n", "print('Learned decision tree:')\n", "print(decision_tree)" ], "metadata": { "id": "Nvq-dXK8KjT_" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hPu6CP-0nmDZ" }, "source": [ "### __Food for thought:__\n", "\n", "- Is the chosen data representation suitable for learning decision trees?\n", "- What might be done differently?\n", "- How does the resulting representation of the classifier conceptually differ from the perceptron? For example, is there something it can do much better?" ] }, { "cell_type": "markdown", "metadata": { "id": "lr1fmV_UnmDZ" }, "source": [ "---\n", "## 4. The exercises in a few lines of code\n", "\n", "- There are a number of libraries and really powerful machine learning tools available in Python.\n", "- One example is [scikit-learn](https://scikit-learn.org/) - a general library implementing practically all \"classical\" machine learning algorithms.\n", "- This library can be used for solving the above exercises in a much more systematic manner and with much less code, as shown in the following example.\n", "\n" ] }, { "cell_type": "markdown", "source": [ "### __*Exercise 4.1: Using scikit-learn to solve the lab assignments (optional)__\n", "\n", "- Train a perceptron model and a decision tree model on the data in this notebook using scikit-learn.\n", "- Use the trained classifiers to classify unknown examples designed by you."
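], "metadata": {} }, { "cell_type": "markdown", "metadata": {}, "source": [ "A possible starting point - a minimal sketch, assuming scikit-learn is installed and reusing the `features` and `labels` structures defined earlier in this notebook (not a complete solution of the exercise):" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# a minimal sketch of the scikit-learn approach (assumes scikit-learn is\n", "# installed and the features/labels structures from this notebook exist)\n", "from sklearn.linear_model import Perceptron\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# the normalized feature matrix and the label vector as plain lists\n", "X = [[features[(i,j)] for j in range(4)] for i in range(4)]\n", "y = [labels[i] for i in range(4)]\n", "\n", "# a perceptron classifier\n", "perceptron = Perceptron()\n", "perceptron.fit(X,y)\n", "print('Perceptron predictions   :', perceptron.predict(X))\n", "\n", "# a decision tree classifier (the entropy criterion ~ information gain)\n", "tree = DecisionTreeClassifier(criterion='entropy')\n", "tree.fit(X,y)\n", "print('Decision tree predictions:', tree.predict(X))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Try extending the sketch with new, made-up examples and see how the two trained classifiers label them."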
], "metadata": { "id": "G2n5ZsZnOoT_" } }, { "cell_type": "code", "source": [ "# TODO - complete yourselves" ], "metadata": { "id": "962Dp3Q5PZCq" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "id": "yLjsBHRLnmDa" }, "source": [ "---\n", "\n", "#### _Final note_ - the materials used in this notebook are original works credited and licensed as follows:\n", "- Sources for the picture illustrating the two faces of black metal:\n", " - Retrieved from Wikimedia Commons [here](https://commons.wikimedia.org/wiki/File:Gorgoroth_I.jpg) and [here](https://en.wikipedia.org/wiki/File:Celestial_Lineage_2011.jpg)\n", " - Authors: [Alina Sofia](https://www.flickr.com/people/43529122@N08) and [Alison Scarpulla](https://commons.wikimedia.org/w/index.php?title=User:Kingofcupsqueenofswords&action=edit&redlink=1)\n", " - License: [Creative Commons Attribution-Share Alike 2.0 (CC BY-SA 2.0)](https://creativecommons.org/licenses/by-sa/2.0/deed.en) and [Creative Commons Attribution-Share Alike 4.0 International ](https://creativecommons.org/licenses/by-sa/4.0/deed.en)\n", "- Picture of the general perceptron:\n", " - Retrieved from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Perceptron.svg)\n", " - Author: [Mat the w](https://en.wikipedia.org/wiki/User:Mat_the_w)\n", " - License: [Creative Commons Attribution-Share Alike 3.0 Unported](https://creativecommons.org/licenses/by-sa/3.0/deed.en)\n", "- Picture of the decision tree:\n", " - Retrieved from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Decision_Tree.jpg)\n", " - Author: [Gilgold](https://commons.wikimedia.org/w/index.php?title=User:Gilgoldm&action=edit&redlink=1)\n", " - License: [Creative Commons Attribution-Share Alike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/deed.en)" ] } ] }