PA196: Pattern Recognition
09. Feature selection and extraction
Dr. Vlad Popovici, popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1 Feature selection
  • Introduction
  • Filtering methods
  • Wrapper methods
  • Feature selection via regularization
2 Feature extraction
  • Principal component analysis
  • Multidimensional scaling

What?
• the vectors to be classified are elements of some d-dimensional space
• the problem is to identify the features that do not contribute to the classification task and eliminate them from classifier training
• thus, we seek only the d1 < d features that contribute to classification

Why?
• improve classification performance:
  • redundant features → unstable classifier, poor fit, etc.
  • sparser models have better generalization properties
• improve numerical stability
• reduce the required sample size
• reduce training time and classification time
• improve interpretability of the models

How?
• use some optimality criterion to select the optimal subset of features → search strategy
• optimality criterion:
  • single-variate or multi-variate
  • classifier-agnostic: filtering methods
  • using the classifier performance: wrapper methods
• can be implicit in some classifiers:
  • classification trees
  • AdaBoost with decision stumps
  • penalized methods (L1 penalty)
  • etc.
• hybrid: e.g. use a classification tree to select the features and another learner for the final classifier

WARNING: feature selection must be included inside the cross-validation loop (or any other equivalent procedure) used for performance estimation.

Filtering methods
Idea: replace the classifier as criterion with some other criterion that is simpler (faster) to compute, and find a subset of features satisfying this objective.
• single-feature/variable methods
• feature-set methods
• do not guarantee that the selected features are relevant for a particular choice of classifier
• usually introduce a new meta-parameter of the modeling process: the number of features to keep → this needs to be optimized!

Single-feature methods
• evaluate each variable independently
• may or may not take into account the class label
• a suitable approximation for high-dimensional cases
• do not avoid selecting correlated features
• fast and easily integrated in a modeling pipeline
• the simplest form: discard features that are constant or have low variability

Using hypothesis testing for feature selection:
• use a statistical test (e.g. the t-test) to test
  H0: there is no difference between classes
  H1: there is a difference between classes
• if H0 is rejected, we infer that the variable/feature is of some importance (it bears information about the difference between classes)
• the process involves computing a statistic and comparing it with the expected values under H0; if the value is too "unusual", then H0 is rejected
• in practice, this results in a "p-value" that is compared with a predefined cut-off α (called the α-level, traditionally 0.05 or 0.01)
• p-value: the probability, under H0, of obtaining a statistic at least as extreme as the one from the data at hand
• α-level: the probability of falsely rejecting H0
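To make the univariate filtering idea concrete, here is a minimal sketch (not from the original slides) that ranks the features of a binary-class data set by two-sample t-test p-values; the function name and the α cut-off are illustrative, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import ttest_ind

def t_test_filter(X, y, alpha=0.05):
    """Rank the features of X (n samples x d features) by two-sample t-test
    p-values for a binary label y; return the ranking and a naive selection."""
    X0, X1 = X[y == 0], X[y == 1]
    # one p-value per feature (Welch's t-test, unequal variances)
    _, pvals = ttest_ind(X0, X1, axis=0, equal_var=False)
    order = np.argsort(pvals)                 # most "significant" features first
    selected = np.where(pvals < alpha)[0]     # no multiplicity adjustment yet
    return order, pvals, selected
```

How to turn the p-values into an actual feature set (top d1, α cut-off, multiple-testing adjustment) is discussed next.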
Strategies for selecting the features:
• select the top d1 features (rank the features by statistic or p-value): d1 needs to be chosen somehow (e.g. by cross-validation)
• select all features with a p-value below a chosen α-level
  • this requires adjustment of the p-values for multiple testing
  • the easiest is to control the family-wise error rate, e.g. multiply each p-value by the number of tests (variables)
  • for independent variables, the adjustment does not change the ordering of the variables (except for ties), only the number of variables with significant p-values
  • a popular approach: control the false discovery rate (FDR)
  • more complex adjustments are possible, e.g. taking into account the correlation structure

Mutual information-based feature selection
Let X and Y be two random variables; their mutual information (MI) is defined as
  I(X; Y) = ∫_Y ∫_X p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy
• provides a measure of dependency between 2 variables
• by considering X as a feature and Y the class to be predicted, I(X; Y) gives the relevance of X in predicting Y
• problem: estimation of MI? sample size?
• it can be extended to evaluate the relevance of a set of features: I({X1, . . . , Xk}; Y)
• can be extended to multi-class problems

Other similar measures:
• Kullback-Leibler divergence,
  KL(p||q) = ∫ p(x) ln [ p(x) / q(x) ] dx,
  which is usually used in a symmetrized version
• entropy
• Gini index (as in classification trees)

Issues with single-variable filtering:
• does not account for relationships between variables
• if two variables are not useful for classification when considered in isolation, their combination might be useful

Subset feature selection
• need a criterion to evaluate a set of features (e.g. MI)
• need a strategy to generate all or some of the possible sets of features

Criteria
Let J be the criterion function, which is to be maximized by the best set of features.
• the previous probabilistic measures can be extended to sets of features (MI, KL, etc.)
• some allow an immediate extension to multiclass problems (e.g. MI); others require a pair-wise approach: J(gi, gj) is computed for all pairs of classes gi, gj
• popular choices are based on variance estimates, for example:
  • J = Tr(S_W^{-1} S_B), where Tr(·) is the trace of a matrix, and S_B and S_W are the between-class and within-class scatter matrices, respectively
  • J = |Σ̂| / |S_W|, where Σ̂ is the estimated pooled covariance matrix
  • J = Tr(S_B) / Tr(S_W)
  • Mahalanobis criterion (for 2 classes): J = (µ1 − µ2)^T [ (Σ1 + Σ2) / 2 ]^{-1} (µ1 − µ2)
• in general, the criterion is chosen to allow a recursive decomposition (to compute J for sets of k + 1 features, one uses the result for sets of k features)
• it should also be monotone: X ⊆ Y ⇒ J(X) ≤ J(Y)
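As an illustration of the first criterion above, a small sketch (my own, not from the slides) that evaluates J = Tr(S_W^{-1} S_B) for a candidate subset of features; the class-size weighting of S_B and the use of a pseudo-inverse (to guard against a singular S_W for small subsets) are my choices.

```python
import numpy as np

def scatter_criterion(X, y, subset):
    """Evaluate J = Tr(S_W^{-1} S_B) on the columns of X listed in `subset`."""
    Xs = X[:, subset]
    overall_mean = Xs.mean(axis=0)
    d = Xs.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = Xs[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)            # within-class scatter
        diff = (mc - overall_mean).reshape(-1, 1)
        S_B += Xc.shape[0] * (diff @ diff.T)      # between-class scatter
    return np.trace(np.linalg.pinv(S_W) @ S_B)
```

Any of the search strategies described next can use such a J(·) as its objective.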
Search strategies
• Exhaustive search:
  • affordable only if the total number of features is small
  • involves enumerating all possible subsets
  • can be implemented in a breadth-first or depth-first fashion
  • can start either from the full set of features (top-down) or from an empty set of features (bottom-up)
  • if J is monotone, one can use a branch-and-bound procedure to avoid enumerating all sets; it will still produce a global optimum

Suboptimal search strategies:
• sequential forward selection (SFS): add features sequentially to the current set such that at each step the added feature produces the maximum increase in J
• generalized SFS (GSFS): instead of adding a single feature, add k features
• sequential backward selection (SBS): start with the full set and remove at each step the feature that leads to the minimum decrease in J
• generalized SBS (GSBS): ...
• "plus l minus r": add the best l and remove the worst r features, using sequential selection
• ...

Floating search methods:
• sequential forward/backward floating selection
• a generalization of the "plus l minus r" selection in which l and r can vary
• for example, a sketch of SFFS:
  1 start with an empty set of features
  2 use SFS to obtain a set of 2 features
  3 add the feature that increases J the most and remove the one that decreases it the least; if it is the same feature, consider another feature not yet used
  4 continue removing features until J decreases or only 2 features are left; then go to step 3

Wrapper methods
• the optimality criterion J is now the performance of the classifier to be trained
• any of the search strategies discussed before can be used in this case
• usually more computationally expensive than filtering methods
• produces features that are adapted to the intended classifier
• performance estimation is based on some re-sampling technique (e.g. cross-validation)

Recursive feature elimination (RFE)
Idea: use the coefficients of the fitted model (classifier) to rank the features and eliminate those with low impact. Repeat the procedure until a convergence criterion is met.
• in LDA-like and (logistic) regression methods, the classifier has the form h(x) = Σ_i w_i x_i, and |w_i| can be used to judge the importance of the i-th feature
• in the case of regression, one may compute the p-values associated with the variables which, again, can be used for ranking them (smaller p-values correspond to more important variables)

RFE for linear SVM
• use a sensitivity analysis of some cost function: what is the influence of adding/removing a feature?
• the decision function is
  h(x) = Σ_{i∈SV} y_i α_i ⟨x, x_i⟩ = ⟨x, Σ_{i∈SV} y_i α_i x_i⟩
• then ∇_x h(x) = Σ_{i∈SV} y_i α_i x_i = w, and define the cost function as J(w) = ‖w‖
• the smallest change in J is obtained for j* = arg min_j |w_j|, i.e. the smallest-amplitude coefficient corresponds to the least significant feature
• RFE proceeds by iteratively eliminating the least significant feature(s) and retraining the classifier on the updated data set
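A simplified sketch of this procedure, assuming a binary problem, scikit-learn's SVC with a linear kernel (the slides do not prescribe a library), and one feature eliminated per iteration; the function name and parameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def rfe_linear_svm(X, y, n_keep=10, C=1.0):
    """RFE with a linear SVM: repeatedly train, then drop the feature
    with the smallest |w_j| of the linear decision function."""
    remaining = list(range(X.shape[1]))
    ranking = []                      # eliminated features, least important first
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = clf.coef_.ravel()         # binary case: coef_ has a single row
        drop = int(np.argmin(np.abs(w)))
        ranking.append(remaining.pop(drop))
    return remaining, ranking         # kept features and the elimination order
```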
RFE for non-linear SVM
• let H = [y_i y_j K(x_i, x_j)]_{ij}
• define the cost function J = α^T H α − ⟨α, e⟩, where e = [1, . . . , 1]^T is the all-ones vector (of the same length as α)
• trick: to compute the change in J after removing feature k, assume no change in α when retraining on the reduced-feature set (if the feature is not important, removing it would not change α (much))
• with this, H^(−k)_ij = y_i y_j K(x_i^(−k), x_j^(−k)), where x^(−k) denotes the vector x with feature k removed
• the change in the cost function is ∆J(k) = J − J^(−k) = α^T H α − α^T H^(−k) α

Algorithm: to obtain a ranking of the features
1 let X_∅ be the original data set and let S = ∅ be the initial ordered list of features (from least to most significant); X(S) will denote the data set with the features in S removed
2 repeat 3-6 until no more features are left:
3 train the SVM on X(S) and let α be the solution
4 compute ∆J(i) for all features not yet removed (i ∉ S)
5 select k = arg min_i ∆J(i)
6 update S = S ∪ {k}

FS via regularization
Idea: introduce some constraints on the model such that the number of effective features is decreased.
• feature selection is not necessarily the ultimate goal of regularization
• usually, regularization is imposed by penalizing some complexity measure of the model or the loss function for classification: Loss + λ · Penalty
• if w is the vector of coefficients, the classical penalties are ‖w‖_2^2 and ‖w‖_1
• what about λ?

Example: penalized logistic regression. The new loss function (negative log-likelihood plus an L1 penalty) is
  L(θ = {w, w_0}) = − Σ_{i=1}^n [ y_i(⟨w, x_i⟩ + w_0) − log(1 + exp(⟨w, x_i⟩ + w_0)) ] + λ Σ_{j=1}^d |w_j|

(figure from ESL)
• similar approaches exist for various classifiers
• there is a regularized version of classification trees
• the best choice for λ is usually obtained by cross-validation

Feature extraction
Idea: transform the original feature space into a lower-dimensional space where some important property (e.g. separability) of the data is preserved or enhanced.
• the methods can be based on linear or non-linear combinations of the original variables
• some methods are variable-directed: they are primarily concerned with the relations between variables
• others are more individual-directed: the distances between data points are of main interest

Principal component analysis - PCA
• the goal is to find a linear transformation of the original variables such that the resulting variables are decorrelated
• the hope is to find a smaller number of variables that explain the data
• it is a basic data exploration technique

Find an orthogonal transformation A of the input data such that the new variables ξ = Ax satisfy one of the following (equivalent) constraints:
• the variance of ξ_j is stationary
• the variables ξ_j are decorrelated
• the new axes are the best (least-squares) fit to the data and are orthogonal

(figure: the original axes x1, x2 and the rotated principal axes ξ1, ξ2)

The solution is given by the eigenvectors of the covariance matrix of the data vectors {x_i}.
• let λ_i be the eigenvalues: their ranking gives the order of the principal axes
• Σ_{i=1}^d λ_i = Σ_{i=1}^d Var(ξ_i)
• dimensionality reduction: keep only the first k principal components (e.g. to account for 80% of the total variance)
• data approximation: reconstructing the original space from only k PCs

(figure: the eigenvalues λ_i plotted against their index i, i.e. a scree plot)
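A minimal sketch of PCA by eigendecomposition of the sample covariance matrix, keeping enough components to explain a chosen fraction of the total variance (the 80% default mirrors the example above; the function name is my own choice).

```python
import numpy as np

def pca(X, var_explained=0.80):
    """PCA via eigendecomposition of the covariance matrix.
    Returns the projected data, the kept eigenvectors and all eigenvalues."""
    Xc = X - X.mean(axis=0)                        # center the data
    C = np.cov(Xc, rowvar=False)                   # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # smallest k such that the first k eigenvalues reach var_explained
    frac = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(frac, var_explained) + 1)
    xi = Xc @ eigvecs[:, :k]                       # coordinates on the principal axes
    return xi, eigvecs[:, :k], eigvals
```

The same quantities give the data approximation mentioned above: X ≈ xi @ eigvecs[:, :k].T + the mean, i.e. the reconstruction of the original space from only k PCs.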
PCA in pattern recognition
• lots of applications, either alone or in combination with some classifier
• common combinations: PCA + LDA, PCA + SVM
• kernel PCA
• a classical application in computer vision: Turk and Pentland's eigenfaces (1991)

Eigenfaces
Problem: face recognition. (Turk, Pentland: Eigenfaces for recognition. 1991)
Face recognition procedure (sketch):
• collect a number of images, say 4 images, per identity
• construct the eigenfaces and keep the top k
• represent each face in terms of its coordinates in the eigenface space, Ω = [ω_1, . . . , ω_k]
• for each identity, average the Ω-vectors to obtain a characteristic profile
• for a candidate face Ω*, compute the Euclidean distance to each of the Ω-vectors representing the identities
• if the distance is too large, the candidate face does not represent the claimed identity
• the algorithm can be used for face detection as well
• it generated a lot of follow-up methods
• the method is easy to implement and can easily be adapted to low-bandwidth environments

Other methods:
• PCA is a special case of the Karhunen-Loève transform
• generalizations of PCA: common principal components, non-linear PCA
• independent component analysis: no longer an orthogonal transform
• factor analysis, etc.

Multidimensional scaling (MDS)
• tries to find a set of points that would explain the given dissimilarities (distances)
• it is used as an exploratory technique and for data visualization
• the representation follows the idea: similar objects are represented close together, dissimilar ones farther apart
• can work both with metric and non-metric data
• in the classical approach, only one (dis)similarity matrix is required; other versions may use several

Classical MDS:
• for two objects i and j, the dissimilarity is represented in terms of the Euclidean distance d_ij between their points
• the distances d_ij are related to the given dissimilarities δ_ij through a transformation f: δ_ij = f(d_ij)
• the problem is to find f such that an error measure is minimized
• a small numerical sketch of classical MDS closes this section

• MDS is just an instance of a larger class of methods: "manifold learning"
• depending on the problem, other transformations could be more useful for visualization or classification
• isomap, locally linear embedding, etc.

Example: Roweis and Saul: Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science 2000.
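To make the classical (metric) MDS recipe above concrete, a small numpy sketch via double centering of the squared distance matrix and eigendecomposition; the function name and the random usage data are illustrative only.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) MDS: embed n points in k dimensions from an
    n x n matrix D of pairwise dissimilarities (treated as distances)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]      # keep the k largest eigenvalues
    L = np.sqrt(np.maximum(eigvals[order], 0)) # guard against tiny negative values
    return eigvecs[:, order] * L               # n x k configuration of points

# usage sketch: distances computed from some data X (random here, for illustration)
X = np.random.default_rng(0).normal(size=(50, 10))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)                      # 2-D coordinates for visualization
```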