{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Prepare some real-world data: download the data file mda.zip from IS (sources/mda.zip). The data come from an oncology experiment (breast cancer) in which tens of thousands of genes were profiled and a biomarker for \"pathologic complete response\" was sought. More details at: https://doi.org/10.1186/bcr2468" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "Xtr = np.load('X-train.npy') # 22283 variables, 130 observations\n", "Ytr = np.load('Y-train.npy') # Ytr[:,0] - ER positive; Ytr[:,1] - pCR\n", "Ytr = Ytr.astype('int32') # make sure the labels are integers\n", "\n", "Xts = np.load('X-test.npy') # 22283 variables, 100 observations\n", "Yts = np.load('Y-test.npy') # Yts[:,0] - ER positive; Yts[:,1] - pCR\n", "Yts = Yts.astype('int32') # make sure the labels are integers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# AdaBoost and Random Forest classifiers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AdaBoost\n", "Read the docs: http://scikit-learn.org/stable/modules/ensemble.html and have a look at the examples:\n", " * http://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_hastie_10_2.html\n", " * http://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import AdaBoostClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.metrics import zero_one_loss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classical AdaBoost: discrete AdaBoost algorithm. 
Weak learner: decision stumps." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "T=200\n", "bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=T,\n", " algorithm='SAMME')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AdaBoostClassifier(algorithm='SAMME',\n", " base_estimator=DecisionTreeClassifier(class_weight=None,\n", " criterion='gini',\n", " max_depth=1,\n", " max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort=False,\n", " random_state=None,\n", " splitter='best'),\n", " learning_rate=1.0, n_estimators=200, random_state=None)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fit the model (as usual)\n", "bdt.fit(Xtr, Ytr[:,0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of AdaBoost with decision stumps can be analyzed to find the most important variables from the data set. 
The following command returns the indices of the variables whose importance score exceeds a threshold (0.01):" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([ 1, 1262, 1952, 2320, 2584, 5830, 5921, 6915, 7681,\n", " 7746, 8076, 9810, 10643, 12014, 12172, 12820, 12893, 13199,\n", " 14522, 16159, 19896, 22097]),)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.where(bdt.feature_importances_ > 0.01)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the training and test errors at each boosting step:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# - train error\n", "err_tr = np.zeros((T,)) \n", "for i, yp in enumerate(bdt.staged_predict(Xtr)):\n", " err_tr[i] = zero_one_loss(yp, Ytr[:,0])\n", "\n", "# - test error\n", "err_ts = np.zeros((T,)) \n", "for i, yp in enumerate(bdt.staged_predict(Xts)):\n", " err_ts[i] = zero_one_loss(yp, Yts[:,0])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig = plt.figure()\n", "ax = plt.subplot(111)\n", "ax.set_ylim(-0.01, 0.5)\n", "ax.set_xlim(0, T+2)\n", "ax.plot(np.arange(T)+1, err_tr, color='blue')\n", "ax.plot(np.arange(T)+1, err_ts, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TO DO:\n", "Try other base learners: e.g. a decision tree with 2 levels, in the above example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RealAdaboost" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "## Weak learner: decision stumps\n", "\n", "T=200\n", "bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=T,\n", " algorithm='SAMME.R')\n", "\n", "# fit the model (as usual)\n", "bdt.fit(Xtr, Ytr[:,0])\n", "\n", "# The result of AdaBoost with decision stumps can be analyzed\n", "# to find the most important variables from the data set:\n", "\n", "np.where(bdt.feature_importances_ > 0.01)\n", "\n", "# gives the indexes of variables with importance score \n", "# higher than a threshold (0.01)\n", "\n", "# Get the errors, per step:\n", "# - train error\n", "err_tr = np.zeros((T,)) \n", "for i, yp in enumerate(bdt.staged_predict(Xtr)):\n", " err_tr[i] = zero_one_loss(yp, Ytr[:,0])\n", "\n", "# - test error\n", "err_ts = np.zeros((T,)) \n", "for i, yp in enumerate(bdt.staged_predict(Xts)):\n", " err_ts[i] = zero_one_loss(yp, Yts[:,0])\n", "\n", "fig = plt.figure()\n", "ax = plt.subplot(111)\n", "ax.set_ylim(-0.01, 0.5)\n", "ax.set_xlim(0, T+2)\n", "ax.plot(np.arange(T)+1, err_tr, color='blue')\n", "ax.plot(np.arange(T)+1, err_ts, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RadomForest" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10,\n", " n_jobs=None, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "clf = RandomForestClassifier(n_estimators=10)\n", "clf.fit(Xtr, Ytr[:,0])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zero_one_loss(clf.predict(Xtr), Ytr[:,0]) # train error" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.10999999999999999" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zero_one_loss(clf.predict(Xts), Yts[:,0]) # test error" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=2, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10,\n", " n_jobs=None, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Other parameters\n", "clf = RandomForestClassifier(n_estimators=10, max_depth=2)\n", "clf.fit(Xtr, Ytr[:,0])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train error 0.02308\tTest error: 0.14000\n" ] } ], "source": [ "print(\"Train error {:1.5f}\\tTest error: {:1.5f}\".format(zero_one_loss(clf.predict(Xtr), Ytr[:,0]), zero_one_loss(clf.predict(Xts), Yts[:,0])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TO DO:\n", "* What can you say about error rate on the test set in the 2nd example (with respect to 1st example)?\n", "* Try other parameter combinations..." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 1 }