{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import print_function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Support Vector Machines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As always, start with the [documentation](http://scikit-learn.org/stable/modules/svm.html). Of particular interest\n", "is the [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) formulation of SVMs.\n", "\n", "The goal is to *train* and *tune* an SVM." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.svm import SVC" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.datasets import fetch_mldata\n", "leuk = fetch_mldata('leukemia', data_home='.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, fit a basic classifier to get a rough idea of its performance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clf = SVC(***...put your favourite parameters here...***)\n", "clf.fit(leuk.data, leuk.target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to find the best parameters for our classifier. Let's try *grid search*: for each combination of parameters\n", "in a predefined grid, we estimate the performance by cross-validation and keep the best combination." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.metrics import classification_report" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(leuk.data, leuk.target, test_size=0.25, random_state=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],\n", "                     'C': [1, 10, 100, 1000]},\n", "                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]\n", "scores = ['accuracy', 'roc_auc']" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# Tuning hyper-parameters for accuracy\n", "\n", "Best parameters set found on development set:\n", "\n", "SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,\n", "  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',\n", "  max_iter=-1, 
probability=False, random_state=None, shrinking=True,\n", "  tol=0.001, verbose=False)\n", "\n", "Grid scores on development set:\n", "\n", "0.722 (+/-0.000) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.001}\n", "0.759 (+/-0.013) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.0001}\n", "0.722 (+/-0.000) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.001}\n", "0.870 (+/-0.047) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}\n", "0.722 (+/-0.000) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.001}\n", "0.870 (+/-0.047) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.0001}\n", "0.722 (+/-0.000) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001}\n", "0.870 (+/-0.047) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001}\n", "0.889 (+/-0.045) for {'kernel': 'linear', 'C': 1}\n", "0.889 (+/-0.045) for {'kernel': 'linear', 'C': 10}\n", "0.889 (+/-0.045) for {'kernel': 'linear', 'C': 100}\n", "0.889 (+/-0.045) for {'kernel': 'linear', 'C': 1000}\n", "\n", "Detailed classification report:\n", "\n", "The model is trained on the full development set.\n", "The scores are computed on the full evaluation set.\n", "\n", "             precision    recall  f1-score   support\n", "\n", "         -1       1.00      0.90      0.95        10\n", "          1       0.89      1.00      0.94         8\n", "\n", "avg / total       0.95      0.94      0.94        18\n", "\n", "\n", "# Tuning hyper-parameters for roc_auc\n", "\n", "Best parameters set found on development set:\n", "\n", "SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,\n", "  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',\n", "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n", "  tol=0.001, verbose=False)\n", "\n", "Grid scores on development set:\n", "\n", "0.933 (+/-0.024) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.001}\n", "0.964 (+/-0.020) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.0001}\n", "0.933 (+/-0.024) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.001}\n", "0.969 (+/-0.022) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}\n", "0.933 (+/-0.024) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.001}\n", "0.969 
(+/-0.022) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.0001}\n", "0.933 (+/-0.024) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001}\n", "0.969 (+/-0.022) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001}\n", "0.974 (+/-0.018) for {'kernel': 'linear', 'C': 1}\n", "0.974 (+/-0.018) for {'kernel': 'linear', 'C': 10}\n", "0.974 (+/-0.018) for {'kernel': 'linear', 'C': 100}\n", "0.974 (+/-0.018) for {'kernel': 'linear', 'C': 1000}\n", "\n", "Detailed classification report:\n", "\n", "The model is trained on the full development set.\n", "The scores are computed on the full evaluation set.\n", "\n", "             precision    recall  f1-score   support\n", "\n", "         -1       1.00      0.90      0.95        10\n", "          1       0.89      1.00      0.94         8\n", "\n", "avg / total       0.95      0.94      0.94        18\n", "\n", "\n" ] } ], "source": [ "for score in scores:\n", "    print(\"# Tuning hyper-parameters for %s\" % score)\n", "    print()\n", "\n", "    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=3, scoring=score)\n", "    clf.fit(X_train, y_train)\n", "\n", "    print(\"Best parameters set found on development set:\")\n", "    print()\n", "    print(clf.best_estimator_)\n", "    print()\n", "    print(\"Grid scores on development set:\")\n", "    print()\n", "    # cv_scores holds the per-fold scores; the name avoids shadowing the outer `scores` list\n", "    for params, mean_score, cv_scores in clf.grid_scores_:\n", "        print(\"%0.3f (+/-%0.03f) for %r\" % (mean_score, cv_scores.std() / 2, params))\n", "    print()\n", "\n", "    print(\"Detailed classification report:\")\n", "    print()\n", "    print(\"The model is trained on the full development set.\")\n", "    print(\"The scores are computed on the full evaluation set.\")\n", "    print()\n", "    y_true, y_pred = y_test, clf.predict(X_test)\n", "    print(classification_report(y_true, y_pred))\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**To do:**\n", "- Check the random search method (`RandomizedSearchCV`, section 3.2.2 of http://scikit-learn.org/stable/modules/grid_search.html) and use it!" 
] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }