{ "cells": [ { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__To do:__\n", "- read the documentation [here](http://scikit-learn.org/stable/modules/tree.html)\n", "- follow the documentation and try the example on the IRIS dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A more challenging dataset: predict the onset of diabetes in female patients of Pima Indian heritage." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.datasets import fetch_mldata\n", "from sklearn import tree\n", "from sklearn.externals.six import StringIO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is for testing various parameters for training a tree classifier, not for proper estimation!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pima = fetch_mldata('diabetes_scale')\n", "X = pima.data # 768 x 8 matrix\n", "y = pima.target # =1 for diabetes, =-1 otherwise\n", "\n", "clf = tree.DecisionTreeClassifier()\n", "clf.fit(X, y)\n", "\n", "yp = clf.predict(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__To do:__\n", "- use _yp_ and _y_ to estimate performance for various values of the parameters " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "For this you need Graphviz to be installed:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open(\"pima_tree_1.dot\", 'w') as f:\n", " f = tree.export_graphviz(clf, out_file=f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At command prompt (in a terminal) run the command:\n", "\n", " dot -Tpdf pima_tree_1.dot -o pima_tree_1.pdf\n", "\n", "to produce a PDF file with the classification tree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__To do:__\n", "Now try to change the depth (`max_depth` parameter), the number of points in a node for a split (`min_samples_split`), the number of points in a leaf (`min_samples_leaf`), the maximum number of leaves (`max_leaf_nodes`) and the impurity criterion (`criterion`).\n", "\n", "For example," ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3)\n", "clf.fit(X, y)\n", "\n", "with open(\"pima_tree_2.dot\", 'w') as f:\n", " f = tree.export_graphviz(clf, out_file=f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Combining classifiers\n", "\n", "__To do:__\n", "- train an LDA classifier on PIMA dataset" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,\n", " solver='svd', store_covariance=False, tol=0.0001)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n", "clf1 = LDA()\n", "clf1.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- train a kNN classifier" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (, line 2)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m2\u001b[0m\n\u001b[0;31m clf2 = KNeighborsClassifier(...)\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "clf2 = KNeighborsClassifier(...)\n", "clf2.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- train a classification tree" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clf3 = ...\n", "clf3.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, write a function that combines the predictions of the classifiers and apply it to those above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def combiner(clfs, X, method):\n", " # clfs - a list of trained classifiers, like [clf1, clf2, clf3]\n", " # X - a data matrix: each rows is a data vector for which we will\n", " # predict the label\n", " # method - a string describing the method to be used\n", " import numpy as np\n", "\n", " nc = len(clfs)\n", " method = method.tolower()\n", " y = np.zeros(X.shape[0]) # final labels\n", "\n", " # simple code, not necessarily the most \"Pythonic\":\n", " if method == 'majority':\n", " labels = [clf.predict(X) for clf in clfs] \n", " DP = np.array(labels) # this will result in a matrix with\n", " # one row for each classifier and\n", " # columns for predictions for each point in X\n", " # here you should put the code for majoity voting\n", " y = ...\n", "\n", " elif method == 'average':\n", " # for continuous outputs\n", " pass # replace with your code\n", " elif method == 'maximum':\n", " # for continuous outputs\n", " pass # replace with your code\n", " else:\n", " raise ValueError('Unknown method')\n", "\n", " return y\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }