{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classification trees"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__To do:__\n",
    "- read the documentation [here](http://scikit-learn.org/stable/modules/tree.html)\n",
    "- follow the documentation and try the example on the IRIS dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A more challenging dataset: predict the onset of diabetes in female patients of Pima Indian heritage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_mldata\n",
    "from sklearn import tree\n",
    "from sklearn.externals.six import StringIO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following is for testing various parameters for training a tree classifier, not for proper estimation!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pima = fetch_mldata('diabetes_scale')\n",
    "X = pima.data   # 768 x 8 matrix\n",
    "y = pima.target # =1 for diabetes, =-1 otherwise\n",
    "\n",
    "clf = tree.DecisionTreeClassifier()\n",
    "clf.fit(X, y)\n",
    "\n",
    "yp = clf.predict(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__To do:__\n",
    "- use _yp_ and _y_ to estimate performance for various values of the parameters "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "For this you need Graphviz to be installed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open(\"pima_tree_1.dot\", 'w') as f:\n",
    "    f = tree.export_graphviz(clf, out_file=f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At command prompt (in a terminal) run the command:\n",
    "\n",
    "    dot -Tpdf pima_tree_1.dot -o pima_tree_1.pdf\n",
    "\n",
    "to produce a PDF file with the classification tree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__To do:__\n",
    "Now try to change the depth (`max_depth` parameter), the number of points in a node for a split (`min_samples_split`), the number of points in a leaf (`min_samples_leaf`), the maximum number of leaves (`max_leaf_nodes`) and the impurity criterion (`criterion`).\n",
    "\n",
    "For example,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3)\n",
    "clf.fit(X, y)\n",
    "\n",
    "with open(\"pima_tree_2.dot\", 'w') as f:\n",
    "    f = tree.export_graphviz(clf, out_file=f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Combining classifiers\n",
    "\n",
    "__To do:__\n",
    "- train an LDA classifier on PIMA dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,\n",
       "              solver='svd', store_covariance=False, tol=0.0001)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n",
    "clf1 = LDA()\n",
    "clf1.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- train a kNN classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "ename": "SyntaxError",
     "evalue": "invalid syntax (<ipython-input-8-f35a0a0f4b19>, line 2)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;36m  File \u001b[0;32m\"<ipython-input-8-f35a0a0f4b19>\"\u001b[0;36m, line \u001b[0;32m2\u001b[0m\n\u001b[0;31m    clf2 = KNeighborsClassifier(...)\u001b[0m\n\u001b[0m                                ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
     ]
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "clf2 = KNeighborsClassifier(...)\n",
    "clf2.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- train a classification tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "clf3 = ...\n",
    "clf3.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, write a function that combines the predictions of the classifiers and apply it to those above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def combiner(clfs, X, method):\n",
    "    # clfs - a list of trained classifiers, like [clf1, clf2, clf3]\n",
    "    # X - a data matrix: each rows is a data vector for which we will\n",
    "    # predict the label\n",
    "    # method - a string describing the method to be used\n",
    "    import numpy as np\n",
    "\n",
    "    nc = len(clfs)\n",
    "    method = method.tolower()\n",
    "    y = np.zeros(X.shape[0])     # final labels\n",
    "\n",
    "    # simple code, not necessarily the most \"Pythonic\":\n",
    "    if method == 'majority':\n",
    "        labels = [clf.predict(X) for clf in clfs] \n",
    "        DP = np.array(labels)    # this will result in a matrix with\n",
    "                                 # one row for each classifier and\n",
    "                                 # columns for predictions for each point in X\n",
    "        # here you should put the code for majoity voting\n",
    "        y = ...\n",
    "\n",
    "    elif method == 'average':\n",
    "    # for continuous outputs\n",
    "        pass   # replace with your code\n",
    "    elif method == 'maximum':\n",
    "    # for continuous outputs\n",
    "        pass   # replace with your code\n",
    "    else:\n",
    "        raise ValueError('Unknown method')\n",
    "\n",
    "    return y\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}