PA196: Pattern Recognition
09. Feature selection and extraction
Dr. Vlad Popovici, popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Outline
1 Feature selection: Introduction; Filtering methods; Wrapper methods; Feature selection via regularization
2 Feature extraction: Principal component analysis; Multidimensional scaling

What?
- the vectors to be classified are elements of some d-dimensional space
- the problem is to identify those features that do not contribute to the classification task and to eliminate them from classifier training
- thus, we seek only the d1 < d features that contribute to classification

Why?
- improve classification performance: redundant features lead to unstable classifiers, poor fit, etc.
- sparser models have better generalization properties
- improve numerical stability
- reduce the required sample size
- reduce training time and classification time
- improve interpretability of the models

How?
- use some optimality criterion to select the optimal subset of features → search strategy
- the optimality criterion can be single-variate or multi-variate
- classifier-agnostic: filtering methods
- using the classifier performance: wrapper methods
- can be implicit in some classifiers: classification trees, AdaBoost with decision stumps, penalized methods (L1 penalty), etc.
- hybrid: e.g. use a classification tree to select the features and another learner for the final classifier

WARNING: Feature selection must be included inside the cross-validation loop (or any other equivalent procedure) used for performance estimation; see the sketch below.
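A minimal sketch of this point, assuming scikit-learn is available (the dataset, the number of kept features and the classifier are placeholders): by putting the selection step inside a pipeline, it is re-fitted on the training part of each fold only, so the cross-validation estimate is not biased by "peeking" at the test folds.

```python
# Illustrative only: univariate filtering + logistic regression,
# with the selection step re-fitted inside every cross-validation fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),  # filtering step
    ("clf", LogisticRegression(max_iter=1000)),           # classifier
])

# The selector sees only the training part of each fold.
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```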
Filtering methods
Idea: replace the classifier-based criterion with some other criterion that is simpler (faster) to compute, and find a subset of features satisfying this objective.
- single-feature (single-variable) methods vs. feature-set methods
- they do not guarantee that the selected features are relevant for a particular choice of classifier
- they usually introduce a new meta-parameter of the modeling process, the number of features to keep → this needs to be optimized!

Single-feature methods
- evaluate each variable independently
- may or may not take the class label into account
- a suitable approximation for high-dimensional cases
- do not avoid selecting correlated features
- fast and easily integrated into the modeling pipeline
- the simplest form: discard features that are constant or have low variability

Using hypothesis testing for feature selection
- use a statistical test (e.g. the t-test) to test
  H0: there is no difference between the classes
  H1: there is a difference between the classes
- if H0 is rejected, we infer that the variable/feature is important (it bears information about the difference between the classes)
- the process involves computing a statistic and comparing it with its expected values under H0; if the value is too "unusual", H0 is rejected
- in practice, this results in a "p-value" that is compared with a predefined cut-off α (called the α-level, traditionally 0.05 or 0.01)
- p-value: the probability, under H0, of obtaining a statistic at least as extreme as the one computed from the data at hand
- α-level: the probability of falsely rejecting H0

Strategies for selecting the features
- select the top d1 features (rank the features by statistic or p-value); d1 needs to be chosen somehow (e.g. by cross-validation)
- select all features with a p-value below a chosen α-level; this requires adjustment of the p-values for multiple testing
- the easiest adjustment controls the family-wise error rate, e.g. Bonferroni: multiply each p-value by the number of tests (variables), or equivalently divide the α-level by the number of tests
- for independent variables, the adjustment does not change the ordering of the variables (except for ties), only the number of variables with significant p-values
- another popular approach: control the false discovery rate (FDR)
- more complex adjustments are possible, e.g. taking the correlation structure into account

Mutual information-based feature selection
Let X and Y be two random variables; their mutual information (MI) is defined as
$ I(X;Y) = \int_Y \int_X p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy $
- MI provides a measure of dependency between two variables
- by considering X as a feature and Y the class to be predicted, I(X;Y) gives the relevance of X in predicting Y
- problem: estimation of MI? sample size?
- it can be extended to evaluate the relevance of a set of features: $ I(\{X_1, \dots, X_k\}; Y) $
- it can be extended to multi-class problems
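As an illustration of univariate filtering by mutual information, a small sketch assuming scikit-learn (the data and the value of d1 are made up); it ranks the features by an estimate of I(X_j; Y):

```python
# Illustrative only: rank features by estimated mutual information with the class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# Nearest-neighbour based MI estimate for each feature taken in isolation.
mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]   # features ordered from most to least relevant
d1 = 10                          # number of features to keep (a meta-parameter)
selected = ranking[:d1]
print("selected features:", selected)
```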
Other similar measures
- the Kullback-Leibler divergence,
$ KL(p \| q) = \int p(x) \ln \frac{p(x)}{q(x)} \, dx, $
usually used in a symmetrized version
- entropy
- the Gini index (as in classification trees)

Issues with single-variable filtering
- it does not account for relationships between variables
- even if two variables are not useful for classification when considered in isolation, their combination might be useful

Subset feature selection
- needs a criterion to evaluate a set of features (e.g. MI)
- needs a strategy to generate all or some of the possible sets of features

Criteria
Let J be the criterion function, which is to be maximized by the best set of features.
- the previous probabilistic measures can be extended to sets of features (MI, KL, etc.)
- some allow an immediate extension to the multiclass case (e.g. MI)
- others require a pair-wise approach: J(g_i, g_j) is computed for all pairs of classes g_i, g_j
- popular criteria are based on variance estimates

Examples:
- $ J = \mathrm{Tr}(S_W^{-1} S_B) $, where Tr(·) is the trace of a matrix and S_B and S_W are the between-class and within-class scatter matrices, respectively
- $ J = |\hat{\Sigma}| / |S_W| $, where $\hat{\Sigma}$ is the estimated pooled covariance matrix
- $ J = \mathrm{Tr}(S_B) / \mathrm{Tr}(S_W) $
- the Mahalanobis criterion (for 2 classes): $ J = (\mu_1 - \mu_2)^T \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\mu_1 - \mu_2) $
- in general, the criterion is chosen to allow a recursive decomposition (to compute J for sets of k+1 features, one reuses the result for sets of k features)
- and to be monotone: $ X \subseteq Y \Rightarrow J(X) \le J(Y) $

Search strategies
Exhaustive search:
- affordable only if the total number of features is small
- involves enumerating all possible subsets
- can be implemented in a breadth-first or depth-first fashion
- can start either from the full set of features (top-down) or from an empty set of features (bottom-up)
- if J is monotone, a branch-and-bound procedure can avoid enumerating all sets and still produce a global optimum

Suboptimal search strategies (a sketch of SFS in code follows below):
- sequential forward selection (SFS): sequentially add features to the current set such that at each step the added feature produces the maximum increase in J
- generalized SFS (GSFS): instead of adding a single feature, add k features at a time
- sequential backward selection (SBS): start with the full set and remove at each step the feature that leads to the minimum decrease in J
- generalized SBS (GSBS): remove several features at a time
- "plus l minus r": add the best l and remove the worst r features, using sequential selection
- generalized "plus l minus r"
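A compact sketch of sequential forward selection with a generic criterion J; the callable `criterion` is a made-up placeholder for any of the set criteria above, so this only illustrates the greedy search, not a particular choice of J.

```python
# Illustrative SFS sketch.
# `criterion(features)` is assumed to return the value of J for a tuple of
# feature indices (e.g. Tr(S_W^{-1} S_B) computed on those columns).
def sfs(n_features, d1, criterion):
    selected = []                                # current feature set
    remaining = set(range(n_features))
    while len(selected) < d1 and remaining:
        # pick the feature whose addition increases J the most
        best = max(remaining,
                   key=lambda j: criterion(tuple(selected + [j])))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage with an additive per-feature score as the criterion:
scores = [0.1, 0.9, 0.3, 0.7, 0.2]
print(sfs(5, 3, lambda feats: sum(scores[j] for j in feats)))  # -> [1, 3, 2]
```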
"plus l minus r": add best l and remove worse r features, using sequential selection generalized "plus l minus r" Vlad PA196: Pattern Recognition Feature selection Feature extraction Introduction Filtering methods Wrapper methods Feature selection via regularization Floating search methods: sequential forward/backward floating selection a generalization of "plus l minus r" selection in which l and r can vary for example, (sketch of) SFFS: 1 start with empty set of features 2 use SFS to obtain a set of 2 features 3 add the feature that increases J the most and remove the one that decreases it the least; if it is the same feature, consider another feature not yet used 4 continue removing features until J decreases or only 2 features are left; then go to step 3 Vlad PA196: Pattern Recognition Feature selection Feature extraction Introduction Filtering methods Wrapper methods Feature selection via regularization Outline 1 Feature selection Introduction Filtering methods Wrapper methods Feature selection via regularization 2 Feature extraction Principal component analysis Multidimensional scaling Vlad PA196: Pattern Recognition Feature selection Feature extraction Introduction Filtering methods Wrapper methods Feature selection via regularization Wrapper methods the optimality criterion J is now the performance of the classifier to be trained any of the search strategies discussed before can be used in this case usually is it more computationally expensive than filtering methods produces features that are adapted to the intended classifier performance estimation is based on some re-sampling technique (e.g. cross-validation) Vlad PA196: Pattern Recognition Feature selection Feature extraction Introduction Filtering methods Wrapper methods Feature selection via regularization Recursive feature elimination (RFE) Idea: use the coefficients from the fitted model (classifier) to rank the features and eliminate those with low impact. Repeat the procedure until a convergence criterion is met. in LDA-like and (logistic) regression methods, the classifier has the form h(x) = i wixi and |wi| can be used to judge the importance of i−th feature in the case of regression, one may compute p-values associated with the variables which, again, can be used for ranking them (smaller p-values correspond to more important variables) Vlad PA196: Pattern Recognition Feature selection Feature extraction Introduction Filtering methods Wrapper methods Feature selection via regularization RFE for linear SVM use sensitivity analysis of a some cost function: what is the influence of adding/removing a feature? the decision function is h(x) = i∈SV yiαi x, xi = x, i∈SV yiαixi then, ∇xh(x) = i∈SV yiαixi = w and define the cost function as J(w) = w the smallest change is in the direction wj = arg min j (|w|) i.e. the smallest amplitude coefficient corresponds to the least significant feature RFE proceeds by iteratively eliminating the least significant feature(s) and retraining the classifiers on the updated data Vlad PA196: Pattern Recognition Feature selection Feature extraction Introduction Filtering methods Wrapper methods Feature selection via regularization RFE for non-linear SVM let H = [yiyjK(xi, xj)]ij define the cost function J = αT Hα − α, e , be = [1, . . . 
RFE for non-linear SVM
- let $ H = [y_i y_j K(x_i, x_j)]_{ij} $ and define the cost function $ J = \alpha^T H \alpha - \langle \alpha, e \rangle $, where $ e = [1, \dots, 1] $ is the all-ones vector (one entry per training example)
- trick: to compute the change in J after removing feature k, assume that α does not change when retraining on the reduced feature set (if the feature is not important, retraining would not change α much)
- with this, $ H^{(-k)}_{ij} = y_i y_j K(x_i^{(-k)}, x_j^{(-k)}) $
- the change in the cost function is $ \Delta J(k) = J - J^{(-k)} = \alpha^T H \alpha - \alpha^T H^{(-k)} \alpha $

Algorithm (to obtain a ranking of the features):
1 let X_∅ be the original data set and let S = ∅ be the initial ordered list of features (from least to most significant); X_S denotes the data set with the features in S removed
2 repeat steps 3-6 until no more features are left:
3 train the SVM on X_S and let α be the solution
4 compute ΔJ(i) for all features not yet removed (i ∉ S)
5 select k = arg min_i ΔJ(i)
6 update S = S ∪ {k}

FS via regularization
Idea: introduce constraints on the model such that the number of effective features is decreased.
- regularization does not necessarily have feature selection as its ultimate goal
- usually, regularization is imposed by penalizing some complexity measure of the model or the loss function; for classification: Loss + λ · Penalty
- if w is the vector of coefficients, the classical penalties are $ \|w\|_2^2 $ and $ \|w\|_1 $
- what about λ?
- similar approaches exist for various classifiers; there is also a regularized version of classification trees
- the best choice of λ is usually obtained by cross-validation (see the sketch below)
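An illustrative sketch, assuming scikit-learn, of feature selection via an L1 penalty with the regularization strength chosen by cross-validation; note that scikit-learn parameterizes the penalty by C, roughly 1/λ, and the data and the C grid below are placeholders.

```python
# Illustrative only: L1-penalized logistic regression as embedded feature selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=100, n_informative=8,
                           random_state=0)

base = LogisticRegression(penalty="l1", solver="liblinear")        # L1 => sparse w
search = GridSearchCV(base, {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)   # C ~ 1/lambda
search.fit(X, y)

w = search.best_estimator_.coef_.ravel()
print("best C:", search.best_params_["C"])
print("number of non-zero coefficients:", int(np.sum(w != 0)))
```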
Feature extraction
Idea: transform the original feature space into a lower-dimensional space in which some important property of the data (e.g. separability) is preserved or enhanced.
- the methods can be based on linear or non-linear combinations of the original variables
- some methods are variable-directed: they are primarily concerned with the relations between variables
- others are more individual-directed: the distances between data points are of main interest

Principal component analysis - PCA
- the goal is to find a linear transformation of the original variables such that the resulting variables are decorrelated
- the hope is to find a smaller number of variables that explain the data
- it is a basic data exploration technique

Find an orthogonal transformation A of the input data such that the new variables ξ = Ax satisfy one of the following (equivalent) constraints:
- the variance of ξ_j is stationary
- the variables ξ_j are decorrelated
- the new axes are the best (least-squares) fit of the data and are orthogonal
[Figure: original axes x1, x2 and principal axes ξ1, ξ2]

The solution is given by the eigenvectors of the covariance matrix of the data vectors {x_i}.
- let λ_i be the eigenvalues: their ranking gives the order of the principal axes
- $ \sum_{i=1}^d \lambda_i = \sum_{i=1}^d \mathrm{Var}(\xi_i) $
- dimensionality reduction: keep only the first k principal components (e.g. enough to account for 80% of the total variance); see the sketch below
- data approximation: reconstruct the original space from only k PCs
[Figure: scree plot of the eigenvalues λ_i versus i]

PCA in pattern recognition
- lots of applications, either alone or in combination with some classifier
- common combinations: PCA + LDA, PCA + SVM
- kernel PCA
- classical application in computer vision: Turk and Pentland's eigenfaces (1991)
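A small numpy sketch of PCA by eigendecomposition of the sample covariance matrix, keeping enough components to explain 80% of the total variance; the data matrix here is random, purely for illustration.

```python
# Illustrative PCA sketch: eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # toy data, n x d

Xc = X - X.mean(axis=0)                  # center the data
C = np.cov(Xc, rowvar=False)             # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # ascending order for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / np.sum(eigvals)
k = int(np.searchsorted(explained, 0.80) + 1)    # smallest k explaining >= 80%
A = eigvecs[:, :k]                        # columns = first k principal axes
scores = Xc @ A                           # coordinates of the samples (n x k)
X_approx = scores @ A.T + X.mean(axis=0)  # rank-k reconstruction of the data
print("k =", k, "explained variance =", round(float(explained[k - 1]), 3))
```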
Face recognition procedure (sketch):
- collect a number of images, say 4 images, per identity
- construct the eigenfaces and keep the top k
- represent each face in terms of its coordinates in eigenface space, Ω = [ω_1, ..., ω_k]
- for each identity, average the ω-vectors to obtain a characteristic profile
- for a candidate face Ω*, compute the Euclidean distance to each of the Ω-vectors representing the identities
- if the distance is too large, the candidate face does not represent the claimed identity

- the algorithm can be used for face detection as well
- it generated a lot of follow-up methods
- the method is easy to implement and can easily be adapted to low-bandwidth environments

Other methods:
- PCA is a special case of the Karhunen-Loève transform
- generalizations of PCA: common principal components, non-linear PCA
- independent component analysis: no longer an orthogonal transform
- factor analysis, etc.

Multidimensional scaling (MDS)
- tries to find a set of points that would explain the given dissimilarities (distances)
- it is used as an exploratory technique and for data visualization
- the representation follows the idea that similar objects are represented close together and dissimilar ones farther apart
- can work both with metric and non-metric data
- in the classical approach, only one (dis)similarity matrix is required; other versions may use several

Classical MDS (see the sketch below):
- for two objects i and j, the similarity is represented in terms of the Euclidean distance d_ij between the corresponding points
- the d_ij are related to the given (dis)similarities through a transformation f: δ_ij = f(d_ij)
- the problem is to find f such that an error measure is minimized

MDS is just an instance of a larger class of methods, "manifold learning":
- depending on the problem, other transformations could be more useful for visualization or classification
- Isomap, locally linear embedding, etc.
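A numpy sketch of classical (Torgerson) MDS, given here as one common concrete instance of the above: starting from a matrix of pairwise Euclidean distances, double-centering and an eigendecomposition recover a low-dimensional configuration. The data are synthetic and only for illustration.

```python
# Illustrative classical MDS sketch: recover 2-D coordinates from a distance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                                 # hidden configuration
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
B = -0.5 * J @ (D ** 2) @ J                          # double-centered Gram matrix
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]                    # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                                # target dimension
coords = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0.0))
print("embedded coordinates shape:", coords.shape)   # (30, 2)
```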