{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Feature selection - filtering methods\n", "Read the documentation at http://scikit-learn.org/stable/modules/feature_selection.html, sections 1.13.1 and 1.13.2." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply the methods described there to gene expression data. Download the data file mda.zip from IS (sources/mda.zip).\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "Xtr = np.load('X-train.npy') # 22283 variables, 130 observations\n", "Ytr = np.load('Y-train.npy') # Ytr[:,0] - ER positive; Ytr[:,1] - pCR\n", "Ytr = Ytr.astype('int32') # make sure the labels are INTs\n", "\n", "Xts = np.load('X-test.npy') # 22283 variables, 100 observations\n", "Yts = np.load('Y-test.npy') # Yts[:,0] - ER positive; Yts[:,1] - pCR\n", "Yts = Yts.astype('int32') # make sure the labels are INTs" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_selection import SelectKBest, SelectFwe\n", "from sklearn.feature_selection import chi2" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# variables related to ER status prediction:\n", "# (chi2 requires non-negative values, so X is shifted by its minimum)\n", "X_tr_1 = SelectKBest(chi2, k=10).fit_transform(Xtr-Xtr.min()+0.1, Ytr[:,0])\n", "X_tr_2 = SelectFwe(chi2, alpha=0.1).fit_transform(Xtr-Xtr.min()+0.1, Ytr[:,0])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(130, 22)\n" ] } ], "source": [ 
"print(X_tr_2.shape) # how many variables were selected?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, True, False, ..., False, False, False])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## If you want to apply the transformation to a validation set:\n", "## - first fit the selector on the training data:\n", "s2 = SelectFwe(chi2, alpha=0.1).fit(Xtr-Xtr.min()+0.1, Ytr[:,0])\n", "## - then get the set of selected variables as a boolean mask vector:\n", "s2.get_support() # True marks a selected variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. RFE\n", "\n", "See section 1.13.3 of the documentation linked above and the example at \n", "http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py\n", "\n", "**TO DO:** Implement a cross-validation approach (similar to the example above) for the feature filtering from part 1 with the chi2 statistic, where the number of features has to be optimized. Apply both RFE and the implemented procedure to the gene expression data (predict ER status, Y[:,0])." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }