Modern regression and classification techniques in computational biology
Michael G. Schimek
Medical University of Graz, Institute for Medical Informatics, Statistics and Documentation, 8036 Graz, Austria, Europe
Lectures 2009, IBA, Masaryk University
© 2009 M. G. Schimek

What are typical tasks in computational biology and the health sciences?
The two most important ones we focus on in this series of lectures are:
- Regression of quantitative (dependent) measurements on quantitative (explanatory) measurements in a designed study (prior information)
- Classification of quantitative measurements with respect to a qualitative response in a designed study (prior information)
Both tasks can be interpreted in the context of supervised statistical learning in the spirit of Hastie, Tibshirani, and Friedman (2009) [HTF'09]. However, both tasks are associated with self-contained statistical methodologies.

Statistical methods in computational biology and the health sciences 1
In terms of regression we have
- the classical linear model under Normal errors
- the parametric class of generalized linear models under Exponential Family errors
- the nonparametric class of generalized additive models under Exponential Family errors
- several extensions of the above, such as semiparametric models, mixed models, etc.
- adaptations of the above methods to control violations of assumptions (e.g. overdispersion, multicollinearity, high dimensionality)

Statistical methods in computational biology and the health sciences 2
In terms of classification we have
- Fisher's linear discriminant analysis
- Linear and quadratic discriminant analysis
- Extensions of linear discriminant analysis (e.g. flexible discriminant analysis, shrinkage methods)
- Nearest-neighbor classifiers
- Support vector machines
Fundamental statistical learning goal: useful information reduction with respect to responses (regression) or classifiers (classification).
Advanced statistical learning goal: response prediction (regression) or class prediction (classification) for new measurements, objects, or cases.

Data sources and measurement characteristics
Data sources:
- Prospective studies and experiments
- Retrospective studies (e.g. from databases)
- Ex-post-facto analysis
- Meta-analysis (combining study outcomes)
- Genetic analysis
- Image analysis
Measurement characteristics:
- Quantitative (metric scale, integer or real valued)
- Qualitative (ordered or unordered units)
- Resolution or units on the measurement scale
- Distributional features (data sparseness)
- Proportion and distribution of missing observations
- Error sources (systematic or stochastic)
- Signal-to-noise ratio
Data structures
- Sample data (representative of the population)
- Reasonable sample size (required for all classical biostatistics methods)
- Very small sample size (unsuitable for most classical biostatistics methods)
- Huge sample size (unsuitable for most classical biostatistics methods)
- N > p, where N is the sample size and p the number of parameters (required for all classical biostatistics methods)
- $N \ll p$, leading to ill-posed estimation problems (classical biostatistics methods fail)
- Reasonable number of dimensions (parameters; classical biostatistics methods apply)
- Large number of dimensions (parameters; classical biostatistics methods fail)

Conventions and goals
We assume
- outcome measurements y (also called dependent variable, response, target)
- a vector of p predictor measurements x (also called independent variables, inputs, covariates, features)
- in the regression problem, y is quantitative
- in the classification problem, y takes values in a finite, unordered set
- empirical (training) data $(x_1, y_1), \ldots, (x_N, y_N)$
Goals are
- decision support (based on inference) to understand which inputs affect the output
- model fitting (identification, estimation and verification)
- prediction (of future cases)
- assessment of the quality of inference and prediction

Statistical decision theory basics
- Assume a quantitative output Y
- Assume a random input vector $X \in \mathbb{R}^p$
- Assume an arbitrary function f (e.g. separating two classes)
- Let us have a loss function $L(Y, f(X))$ for penalizing errors in prediction
- Most common and convenient is squared error loss $L(Y, f(X)) = (Y - f(X))^2$
- We can choose f according to the expected squared prediction error $EPE(f) = E(Y - f(X))^2$
- Thus the solution is the conditional expectation $f(x) = E(Y \mid X = x)$, also called the regression function
- The elementary statistical method is regression
- There are many ways to estimate the regression function from (training) data

Case of categorical response variable 1
- Let us have a prediction rule C(X), where Y and C(X) take values in $\mathcal{C} = \{1, 2, \ldots, K\}$
- The loss function for penalizing prediction errors is L(k, l), i.e. the price to be paid for classifying an observation belonging to class k as l
- For a zero-one loss function all misclassifications are charged one unit
- The expected prediction error is $EPE = E[L(Y, C(X))]$
- The solution is $\hat{C}(x) = \arg\min_{c \in \mathcal{C}} \sum_{k=1}^{K} L(k, c)\, P(Y = k \mid X = x)$
- Assuming a zero-one loss function we simply get $\hat{C}(x) = k$ if $P(k \mid X = x) = \max_{c \in \mathcal{C}} P(c \mid X = x)$, the so-called Bayes classifier (NB: we should decide for the class with maximum probability at input x)

Case of categorical response variable 2
- The task is to construct a classifier C(X) by estimating the probabilities $P(c \mid X)$ from the data, or by estimating the class densities $P(X \mid c)$ in combination with Bayes' rule
- There are many different approaches (for quantitative as well as categorical response variables)
- It is important to understand the ideas behind these approaches; otherwise one cannot decide when to use them
- Further, it is important to assess their performance
- In contrast to machine learning, statistical approaches (algorithms) suitable for specific responses (outputs) are not compared with respect to performance only
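A minimal numerical sketch of the two classification rules above (numpy only; the posterior probabilities $P(Y = k \mid X = x)$ and the loss matrix are hypothetical): under zero-one loss the expected-loss rule reduces to the Bayes classifier, i.e. picking the class with maximal posterior probability.

import numpy as np

# Hypothetical posterior probabilities P(Y = k | X = x) for K = 3 classes
# at four input points x (rows sum to one).
post = np.array([[0.2, 0.5, 0.3],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6],
                 [0.4, 0.4, 0.2]])

# Bayes classifier under zero-one loss: class with maximal posterior at x
bayes_class = post.argmax(axis=1)

# General rule with a loss matrix L[k, c] = price of classifying class k as c
L = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
expected_loss = post @ L                 # sum_k L(k, c) P(Y = k | X = x)
loss_class = expected_loss.argmin(axis=1)

print(bayes_class)                       # zero-one loss decision
print(loss_class)                        # may differ when misclassification costs are asymmetric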
Linear regression as elementary method
- Regression function $\mu(x) = E(Y \mid X = x)$, where X is the design or data matrix
- A linear form $\mu(x_i) = \beta_0 + \sum_{j=1}^{p-1} x_{ij}\beta_j$ is assumed, which is an approximation to the truth
- In matrix notation $\mu = X\beta$, where $\mu$ is the N-dimensional vector of predicted values, X an $N \times p$ matrix with ones in the first column, and $\beta$ a p-dimensional vector of parameters
- Fitting (estimation) by least squares: $\hat\beta = \arg\min_\beta \sum_i \big(y_i - \beta_0 - \sum_{j=1}^{p-1} x_{ij}\beta_j\big)^2 = \arg\min_\beta (y - X\beta)^T (y - X\beta)$
- The solution is $\hat\beta = (X^T X)^{-1} X^T y$ and $\hat{y} = X\hat\beta$
- The variance of the estimator is $\mathrm{var}(\hat\beta) = (X^T X)^{-1}\sigma^2$

Figure: Linear regression of a binary response [HTF'09]

Nearest-neighbor methods
- No linear decision boundary as in least-squares regression
- No stringent assumptions about the data; the methods are data-driven (adaptive)
- Those observations in the training data that are closest in input space to x are used to form $\hat{Y}$
- The k-nearest-neighbor (k-NN) fit for $\hat{Y}$ is defined as $\hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$, where $N_k(x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample
- Metric for closeness: e.g. Euclidean distance
- The effective number of parameters in k-NN is not k but N/k (it decreases with increasing k due to overlapping neighborhoods)
- A sum-of-squared-errors criterion does not work for choosing k (it would always pick k = 1!)

Figure: 15-NN classifier of a binary response [HTF'09]

Figure: 1-NN classifier of a binary response [HTF'09]

Simple extensions of least squares and k-NN methods
- Kernel methods use weights that decrease smoothly to zero with distance from the target point (in contrast to the 0/1 weights in k-NN)
- k-NN methods are the simplest form of smoothing methods
- Kernel methods also belong to the class of smoothing methods (which also comprise smoothing splines, yet those are not motivated by weighting schemes)
- For high-dimensional problems kernel methods can be modified: some variables obtain higher weights than others
- Local regression, another smoothing method, fits linear models by locally weighted least squares
- Linear models fit to a basis expansion of the original inputs can cope with rather complex data structures

Some more decision theory 1
How can we best predict Y at any point X = x?
- Least squares regression: minimizing the expected prediction error EPE has the solution $f(x) = E(Y \mid X = x)$, thus the best predictor is the conditional mean (in terms of average squared error)
- k-nearest-neighbor method: implements the above idea directly using the training data, $\hat{f}(x) = \mathrm{average}(y_i \mid x_i \in N_k(x))$, where $N_k(x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample

Figure: Misclassification curves for simulated data: linear regression vs. k-NN [HTF'09]
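A minimal numpy sketch (simulated data; all names are hypothetical) contrasting the two elementary estimators just discussed: the least squares solution $\hat\beta = (X^T X)^{-1} X^T y$ and the k-NN average over the Euclidean neighborhood $N_k(x)$.

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 2
X = rng.normal(size=(N, p))                       # predictor measurements
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=N)

# Least squares: column of ones for the intercept, then the normal equations
X1 = np.column_stack([np.ones(N), X])
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)   # (X^T X)^{-1} X^T y
print(np.round(beta_hat, 2))                      # estimated intercept and slopes

# k-NN regression fit at a target point x0: average y over the k nearest x_i
def knn_fit(x0, X, y, k=15):
    d = np.sqrt(((X - x0) ** 2).sum(axis=1))      # Euclidean distances to x0
    nbh = np.argsort(d)[:k]                       # indices of N_k(x0)
    return y[nbh].mean()

x0 = np.array([0.5, -0.5])
print(round(knn_fit(x0, X, y, k=15), 2))          # data-driven local average
print(1.0 + x0 @ np.array([2.0, -1.0]))           # true conditional mean at x0, for comparison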
Some more decision theory 2
There are two approximations:
- Expectation is approximated by averaging over the sample (training) data
- Conditioning at a point is relaxed to conditioning on some region "close" to the target point
For large N the points in the neighborhood are likely to be close to x, and as k increases the average will stabilize. Under mild regularity conditions on the joint probability distribution Pr(X, Y) one can show that as $N, k \to \infty$ such that $k/N \to 0$, $\hat{f}(x) \to E(Y \mid X = x)$.
- In practice this does not help, as we do not have very large N
- A further problem: in high dimensions the metric size of the k-nearest neighborhood gets disproportionately large
- The convergence still holds, but the rate of convergence decreases as the dimension increases (curse of dimensionality)

Measures for quality of an estimator 1
- A valuable quality measure for an estimator $\hat\theta(x)$ is the mean squared error
- Let $\theta_0(x)$ be the true value at the point x
- Then $MSE(\hat\theta(x)) = E(\hat\theta(x) - \theta_0(x))^2$
- This can be written as variance plus squared bias: $MSE(\hat\theta(x)) = \mathrm{var}(\hat\theta(x)) + (E\hat\theta(x) - \theta_0(x))^2$
- We have the following relationship: low bias comes with high variance, and vice versa
Consequences:
- Selecting an estimator involves a tradeoff between bias and variance
- Even for the simplest data structure or model this is true
- For ill-posed problems (e.g. data with $N \ll p$), where regularization is required, the selection problem becomes even more delicate

Measures for quality of an estimator 2
- Predictive ability of the k-NN regression fit $\hat{f}_k(x_0)$
- Assume data from $Y = f(X) + \varepsilon$ with $E(\varepsilon) = 0$ and $\mathrm{var}(\varepsilon) = \sigma^2$
- Let the $x_i$ be fixed (non-random)
- The expected prediction error at $x_0$, also called generalization (test) error, is
  $EPE_k(x_0) = E[(Y - \hat{f}_k(x_0))^2 \mid X = x_0] = \sigma^2 + [\mathrm{bias}^2(\hat{f}_k(x_0)) + \mathrm{var}(\hat{f}_k(x_0))] = \sigma^2 + \big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\big]^2 + \frac{\sigma^2}{k}$,
  where $(l)$ indexes the nearest neighbors of $x_0$ (k represents the model complexity)
- $\sigma^2$, the irreducible error (variance of a new test case), cannot be controlled; the mean squared error of $\hat{f}_k(x_0)$ (a bias and a variance component) can be controlled

Figure: Bias-variance tradeoff

Tradeoff between bias and variance
- How about a linear regression model? If it is correct for a given data structure, then the least squares predictor $\hat\beta$ is unbiased and has the lowest variance among all estimators that are linear functions of y
- There can be biased estimators with smaller MSE
- When an estimator is regularized (penalized, shrunken), its variance will be reduced
- Examples of regularization: subset selection (forward, backward, all subsets), ridge regression, the lasso
- A further limitation: empirical models are never correct, which leads to an additional model bias between the closest member of the (e.g. linear) model class and the (unknown) truth

Figure: Sources of bias and estimation variance [HTF'09]
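A small Monte Carlo sketch (simulated data with a fixed design; the test function, noise level and all names are hypothetical) of the decomposition $EPE_k(x_0) = \sigma^2 + \mathrm{bias}^2 + \sigma^2/k$ for the k-NN fit: the variance tracks $\sigma^2/k$, while the squared bias grows with k as the neighborhood widens.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)          # "true" regression function
sigma = 0.3                                  # noise standard deviation
x = np.sort(rng.uniform(size=50))            # fixed (non-random) design points
x0 = 0.25                                    # target point (at a peak of f)

for k in (1, 5, 20):
    nbh = np.argsort(np.abs(x - x0))[:k]     # indices of the k nearest neighbors of x0
    fits = []
    for _ in range(2000):                    # new noise, same design, in each replication
        y = f(x) + rng.normal(scale=sigma, size=x.size)
        fits.append(y[nbh].mean())           # k-NN fit at x0
    fits = np.array(fits)
    bias2 = (fits.mean() - f(x0)) ** 2       # squared bias of the k-NN fit at x0
    var = fits.var()                         # variance of the k-NN fit at x0
    print(k, round(bias2, 4), round(var, 4), round(sigma ** 2 / k, 4))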
Subset selection 1
- Subset selection (of predictive variables or features) is a form of shrinkage
- Motivations are
  - prediction accuracy: least squares estimators can have low bias but large variance (subset selection can reduce the variance)
  - interpretation: easier with a small, well-predicting subset
- Hence only a subset of the parameters (e.g. regression coefficients) is retained and the remaining ones are set to zero
- There are several selection strategies
- All-subsets regression: finds for each $s \in \{0, 1, 2, \ldots, p\}$ the subset of size s that gives the smallest residual sum of squares
- This involves the tradeoff between bias and variance (e.g. one can use cross-validation)

Subset selection 2
A search strategy can improve the performance:
- Forward stepwise selection: starting with the intercept, variables that most improve the fit are sequentially added to the model; a typical criterion is
  $F_{1, N-s-2} = \frac{RSS(\hat\beta_{\mathrm{old}}) - RSS(\hat\beta_{\mathrm{new}})}{RSS(\hat\beta_{\mathrm{new}})/(N - s - 2)}$
- Backward stepwise selection: a full least squares model is fitted, then variables are sequentially eliminated based on the F-statistic (requires N > p)
- Hybrid stepwise selection: sequentially, for each (best) variable added, the least important variable is deleted

Subset selection and shrinkage
- The subset selection procedures need one or more tuning parameters:
  - subset size
  - position along the stepwise path
- Shrinkage methods, related to subset selection, have a penalty or ridge parameter (this also holds for certain smoothers, e.g. smoothing splines)
- Subset selection does not automatically reduce the prediction error (the discrete selection process can retain a high variance)
- Shrinkage methods are a continuous alternative
- Ridge regression shrinks the coefficients by imposing a penalty on their size

Ridge regression 1
- The ridge coefficients minimize a penalized residual sum of squares
  $\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \Big\{ \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}$,
  where $\lambda \geq 0$ is the ridge (complexity) parameter weighting the penalty; the coefficients are shrunk towards 0
- Equivalently we can write
  $\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \leq s$,
  a constraint on the parameters

Ridge regression 2
- Ridge solutions are not equivariant under scaling of the inputs (the inputs have to be standardized)
- Further, there is no intercept $\beta_0$ (X has p rather than p + 1 columns)
- In matrix notation we have $RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda\beta^T\beta$
- The solution is $\hat\beta^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y$, where I is the $p \times p$ identity matrix
- Because of the quadratic $L_2$ ridge penalty $\beta^T\beta$ the solution is a linear function of y
- One might choose other penalties too

The lasso 1
- The lasso is another shrinkage method, with important differences
- The estimate is $\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \leq s$
- The solution for $\beta_0$ is $\bar{y}$ (no intercept needed)
- The lasso penalty is $L_1$, leading to a non-linear estimation problem
- A quadratic programming algorithm is required

The lasso 2
- The standardized tuning parameter s is $s = t / \sum_{j=1}^{p} |\hat\beta_j|$, where t is the penalty parameter
- Relationship between the lasso and the least squares (ls) coefficients: if t is chosen larger than $t_0 = \sum_{j=1}^{p} |\hat\beta_j^{\mathrm{ls}}|$, then $\hat\beta_j^{\mathrm{lasso}} = \hat\beta_j^{\mathrm{ls}}$
- For instance, for $t = t_0/2$ the ls estimates are shrunken by around 50% on average
- For $s \to 0$ the lasso estimates shrink to zero
- The lasso translates each ls coefficient by a constant factor, truncating at zero
- Ridge, however, does a proportional shrinkage on the ls coefficients
- t can be chosen by cross-validation
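A minimal numpy sketch of the closed-form ridge solution $(X^T X + \lambda I)^{-1} X^T y$ on standardized inputs (simulated data; the names and the values of $\lambda$ are hypothetical). The intercept is handled by centering y; the lasso has no such closed form in general and would need quadratic programming or coordinate descent, as noted above.

import numpy as np

rng = np.random.default_rng(2)
N, p = 80, 5
X = rng.normal(size=(N, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = 5.0 + X @ beta_true + rng.normal(scale=1.0, size=N)

# Standardize the columns of X and center y, so no intercept column is needed
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

for lam in (0.0, 1.0, 10.0, 100.0):
    # ridge solution (X^T X + lambda I)^{-1} X^T y
    beta_ridge = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    print(lam, np.round(beta_ridge, 2))
# The coefficients are shrunk towards zero as lambda grows, but (unlike the
# lasso) they are never set exactly to zero.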
Figure: Transformation of least squares coefficients in ridge regression and lasso

The curse of dimensionality 1
- Multidimensional smoothers work well for moderate numbers of predictors, but the curse of dimensionality hinders them in higher dimensions
- Problems are:
  - local neighborhoods are empty
  - nearest neighborhoods are not local
  - all points are close to the boundary
  - the sample size needs to grow exponentially
- Moreover, without some structure in the models, high-dimensional functions are hard to represent and interpret
- Illustration: uniformly distributed data in the p-dimensional unit cube; construct a subcube from the origin to capture a fraction f of the data. What distance do we have to reach out on each axis?

Figure: The curse of dimensionality 2 [HTF'09]
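For the unit-cube illustration above, the required edge length is $e_p(f) = f^{1/p}$, since a subcube with edge e captures an expected fraction $e^p$ of the data. A small Monte Carlo check (numpy; names hypothetical): in p = 10 dimensions, capturing 10% of the data already requires covering about 80% of the range of every single input.

import numpy as np

rng = np.random.default_rng(3)
f = 0.10                                        # fraction of data to capture

for p in (1, 2, 3, 10):
    edge = f ** (1.0 / p)                       # distance to reach out on each axis
    X = rng.uniform(size=(100000, p))           # uniform data in the p-dimensional unit cube
    captured = np.mean((X < edge).all(axis=1))  # share of points inside the subcube
    print(p, round(edge, 3), round(captured, 3))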
Generalized linear models 1
- Generalized linear models (GLM) extend linear (regression) models to accommodate both non-normal responses and transformations to linearity
- Characteristics of a GLM:
  - The response y is observed independently at fixed values of the explanatory variables $x_1, x_2, \ldots, x_p$
  - The explanatory variables $x_j$ may only influence the distribution of y through a single linear function, called the linear predictor $\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$
  - The probability density function of y is assumed to follow the form of the exponential family of distributions
  - The mean is a smooth invertible function of the linear predictor: $\mu = m(\eta)$, $\eta = m^{-1}(\mu) = G(\mu)$, where the inverse function G is called the link function

Generalized linear models 2
- Probability density function of the exponential family of distributions:
  $f(y_i; \theta_i, \varphi) = \exp\{ A_i[y_i\theta_i - \gamma(\theta_i)]/\varphi + \tau(y_i, \varphi/A_i) \}$,
  where $\varphi$ is a scale parameter (possibly known), $A_i$ a known prior weight, and $\theta_i$ depends upon the linear predictor
- $\theta$ is an invertible function of $\mu$, $\theta = (\gamma')^{-1}(\mu)$
- For known $\varphi$ the distribution of y is a one-parameter canonical exponential family
The Gaussian case:
- For the Normal distribution $\varphi = \sigma^2$ and
  $\log f(y) = \frac{1}{\varphi}\big[y\mu - \frac{\mu^2}{2}\big] - \frac{1}{2}\big[\frac{y^2}{\varphi} + \log(2\pi\varphi)\big]$,
  so $\theta = \mu$ and $\gamma(\theta) = \frac{\theta^2}{2}$

Generalized linear models 3
The Poisson case:
- For a Poisson distribution with mean $\mu$ we have $\log f(y) = y\log\mu - \mu - \log(y!)$, so $\theta = \log\mu$, $\varphi = 1$ and $\gamma(\theta) = \mu = \exp\{\theta\}$
- Each response (error) distribution allows one or more link functions to connect the mean to the linear predictor

Table: Canonical default links and variance functions
Family            | Canonical link       | Name      | Variance
Binomial          | $\log(\mu/(1-\mu))$  | logit     | $\mu(1-\mu)$
Gamma             | $-1/\mu$             | inverse   | $\mu^2$
Gaussian          | $\mu$                | identity  | $1$
Inverse Gaussian  | $-1/(2\mu^2)$        | $1/\mu^2$ | $\mu^3$
Poisson           | $\log\mu$            | log       | $\mu$

Generalized linear models 4
- The log-likelihood of a GLM is $l(\theta, \varphi; y) = \sum_i \{ A_i[y_i\theta_i - \gamma(\theta_i)]/\varphi + \tau(y_i, \varphi/A_i) \}$
- The score function for $\theta_i$ is $U(\theta_i) = A_i[y_i - \gamma'(\theta_i)]/\varphi$
- From this it follows that $E(y_i) = \mu_i = \gamma'(\theta_i)$ and $\mathrm{Var}(y_i) = \varphi\,\gamma''(\theta_i)/A_i$

Generalized linear models 5
- Further it follows that $E\big(\partial^2 l(\theta, \varphi; y)/\partial\theta\,\partial\varphi\big) = 0$
- Because of that, $\theta$ and $\varphi$, or more generally $\beta$ and $\varphi$, are orthogonal parameters
- The function $V(\mu) = \gamma''(\theta(\mu))$ is called the variance function
- For each response (error) distribution the link function $L = (\gamma')^{-1}$, for which $\theta \equiv \eta$, is called the canonical link
- For $\eta = X\beta$, known $\varphi$, $A = \mathrm{diag}(A_i)$, and the canonical link it can be seen that $X^T A y$ is a minimal sufficient statistic for $\beta$
- Finally the score equations for $\beta$ reduce to $X^T A y = X^T A \hat\mu$
- Estimation technique: iterative weighted least squares
- The Hessian matrix is replaced by its expectation (Fisher scoring)

Smoothing 1
What is a 'smoother'?
- A statistical tool for summarizing a response measurement Y as a function of one or more predictor measurements $X_1, \ldots, X_p$
- It produces an estimate ('average', 'trend') of Y that is less variable (wiggly) than Y itself
Principal applications:
- As an exploratory tool for the visualization of scatterplots
- As a method for the estimation of the dependence of the mean of Y on the predictors
- As a building block in various modelling approaches
What are the primary statistical applications?
1. Density estimation
2. Regression analysis

Smoothing 2
What characterizes a smoother?
- For estimation at a point x, weighted averaging across a centered neighborhood of x takes place
- The simplest example is k-NN (no weighting) regression (smoothing)
- For kernel-based smoothing (and related procedures) a neighborhood (i.e. a bandwidth) needs to be specified
- For spline-based smoothing a choice of basis functions and of the number and placement of knots, or of the tuning (smoothing) parameter, is required
- Both approaches are nonparametric strategies, in clear contrast to rigid parametric approaches
- Here the dependence of Y on $X_1, \ldots, X_p$ is highly flexible

Kernel smoothing 1
- A popular example is the Nadaraya-Watson kernel estimator (Nadaraya, 1964; Watson, 1964)
  $\hat{m}_{NW}(x) = \frac{\sum_{i=1}^{n} Y_i K((x - X_i)/h)}{\sum_{i=1}^{n} K((x - X_i)/h)}$,
  where n is the sample size and h > 0 the bandwidth
- The bandwidth controls the smoothness of the function estimate
- Other kernel estimators are Gasser-Müller and Parzen-Rosenblatt (Parzen, 1962; Rosenblatt, 1956)
- Such kernels produce a continuous function
- K is a bounded and integrable real-valued function such that $\lim_{|x| \to \infty} |x| K(x) = 0$

Kernel smoothing 2
- Usually K is taken to be a compactly supported symmetric probability density function
- Examples:
  $K_1(x) = I(x \in [-0.5, +0.5])$,
  $K_2(x) = \frac{3}{4}(1 - x^2)\, I(x \in [-1, +1])$,
  $K_3(x) = \frac{15}{32}(3 - 10x^2 + 7x^4)\, I(x \in [-1, +1])$,
  $K_4(x) = \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$
- Explanation: (a) uniform kernel $K_1$; (b) Epanechnikov kernel $K_2$; (c) 4th-order kernel $K_3$; (d) normal kernel $K_4$

Kernel smoothing 3
Figure: (a) Uniform kernel $K_1$; (b) Epanechnikov kernel $K_2$; (c) 4th-order kernel $K_3$; (d) Normal kernel $K_4$. [S'00]

Kernel smoothing 4
Figure: Curve estimates for bandwidths (a) h = 0.05, (b) h = 0.2, (c) h = 0.5. [S'00]
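A minimal numpy sketch of the Nadaraya-Watson estimator with the Epanechnikov kernel $K_2$ from above (simulated data; the regression function, noise level and names are hypothetical), illustrating how the bandwidth h controls the smoothness of the estimate, as in the figure.

import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.sort(rng.uniform(-1, 1, size=n))
Y = np.sin(3 * X) + rng.normal(scale=0.3, size=n)

def K2(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)    # Epanechnikov kernel

def m_nw(x, X, Y, h):
    w = K2((x - X) / h)                              # kernel weights around x
    return np.sum(w * Y) / np.sum(w)                 # locally weighted average

grid = np.linspace(-0.9, 0.9, 7)
for h in (0.05, 0.2, 0.5):
    fit = np.array([m_nw(x, X, Y, h) for x in grid])
    print(h, np.round(fit, 2))
# Small h: wiggly, nearly interpolating estimate; large h: smooth but biased.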
Motivation for smoothing splines
- Suppose responses $y_1, \ldots, y_n$ are observed at (nonstochastic) ordered design points $t_1 < \cdots < t_n$
- Regression model: $y_i = f(t_i) + \varepsilon_i$, $i = 1, \ldots, n$, where $f(\cdot)$ is an unknown regression function and $\varepsilon_1, \ldots, \varepsilon_n$ are zero-mean, uncorrelated random errors
- Task: estimation of f from the observed data
- Classic approach: linear regression $\hat{f}(t) = \hat{a} + \hat{b}t$, with $\hat{a}$ and $\hat{b}$ the least squares intercept and slope estimators obtained by minimizing the residual sum of squares $RSS(g) = \sum_{i=1}^{n} (y_i - g(t_i))^2$ over all functions of the form $g(t) = a + bt$

Smoothing splines 1
- The linear parametric model only has acceptable model bias if f is approximately linear
- Ultimate slope flexibility: every two $y_i$'s are connected by lines with their own individual slopes (for this estimator RSS(g) = 0), resulting in a summary that is not useful
- Motivation behind spline smoothing: the need to compromise between fits with constant and completely flexible slopes
- This can be achieved by penalizing functions whose slopes vary too rapidly
- The rate of change in the slope of a function g is given by $g''$, and hence an overall measure of the change in slope of a potential fitted function is $J(g) = \int_{t_1}^{t_n} g''(t)^2\, dt$

Smoothing splines 2
- The modified version of the original estimation criterion comprises a penalty function, $RSS(g) + \lambda J(g)$, $\lambda \geq 0$, and can be minimized over, e.g., all functions with two continuous derivatives
- $\lambda$ is the smoothing (tuning) parameter and can be viewed as a measure of the importance we place on the flexibility of the slope of a fit
- Behavior of $\lambda$: $\lambda = \infty$ produces a constant slope (i.e. the straight line); $\lambda$ tending to zero produces completely flexible slopes (i.e. interpolation)
- The solution of the above optimization problem is a smoothing spline function (Thompson and Tapia, 1978, were the first to study such penalized problems)
- The idea of penalization goes back to Whittaker (1923)

Local polynomial smoothing
- Local polynomials have another motivation than splines
- They were introduced by Stone (1977) and Cleveland (1979)
- The idea is that a smooth function can be well approximated by a low-order polynomial in the neighborhood of an arbitrary point x
- Very popular is the local linear approximation $\mu(x_i) \approx a_0 + a_1(x_i - x)$ for $x - h \leq x_i \leq x + h$ (h is the bandwidth)
- The fit can be obtained from locally weighted least squares
- The weights are defined via kernels K (as in kernel smoothing)
- The coefficients $a_0$ and $a_1$ are obtained by minimizing $\sum_{i=1}^{n} K((x - x_i)/h)\,\big(y_i - (a_0 + a_1(x_i - x))\big)^2$

Linear smoothers
- Smoothers that can be represented as $\hat{y} = Sy$ are called linear smoothers
- Here $y = (y_1, y_2, \ldots, y_n)$, $\hat{y} = (\hat{f}(x_1), \hat{f}(x_2), \ldots, \hat{f}(x_n))$, and S is an $n \times n$ smoother matrix depending on $x = (x_1, x_2, \ldots, x_n)$ and the smoothing (tuning) parameter $\lambda$, but not on y
- S has a nearly banded shape and is often called the hat matrix (compare with the projection matrix in linear models)
- Examples of linear smoothers: kernels, local polynomials, smoothing splines
- An example of a non-linear smoother: adaptive smoothers
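A minimal numpy sketch (simulated data; all names hypothetical) that writes the local linear smoother as a linear smoother $\hat{y} = Sy$: each row of S is the first row of $(Z^T W Z)^{-1} Z^T W$ from the locally weighted least squares fit at the corresponding target point. The trace of S anticipates the equivalent degrees of freedom discussed next.

import numpy as np

rng = np.random.default_rng(5)
n, h = 100, 0.15
x = np.sort(rng.uniform(size=n))
y = np.cos(4 * np.pi * x) + rng.normal(scale=0.2, size=n)

def K2(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)    # Epanechnikov kernel

S = np.zeros((n, n))
for j in range(n):
    w = K2((x[j] - x) / h)                           # kernel weights around x_j
    Z = np.column_stack([np.ones(n), x - x[j]])      # local linear basis at x_j
    ZtW = Z.T * w                                    # Z^T W
    S[j] = np.linalg.solve(ZtW @ Z, ZtW)[0]          # row j of the smoother matrix

y_hat = S @ y                                        # smoothed values, y_hat = S y
print(round(np.trace(S), 2))                         # equivalent degrees of freedom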
Degrees of freedom and variance of linear smoothers
- The (equivalent) degrees of freedom (equivalent number of parameters; effective dimension) of a linear smoother is $df(S) = \mathrm{trace}(S)$, the dimension of the space of fits
- Compare with the number of regressors in a linear model
- If $\varepsilon \sim N(0, \sigma^2 I)$, then $\mathrm{var}(\hat{y}) = S\,\sigma^2 I\,S^T = \sigma^2 S S^T$ and the diagonals give the pointwise variances
- Cross-validation for the selection of $\lambda$ estimates the mean predictive squared error $PSE = MSE + \sigma^2$, where $MSE(\hat{f}) = E_X\{\mathrm{var}(\hat{f}(x)) + (f(x) - E_Y\hat{f}(x))^2\}$

Eigen-analysis of S
- If S is symmetric we can write $S e_i = \rho_i e_i$ for $i = 1, 2, \ldots, n$ (the $e_i$ can be seen as response vectors y)
- The eigenvalues take values $0 \leq \rho_i \leq 1$
- Smoothers with $0 < \rho_i < 1$ are called shrinking smoothers
- If all $\rho_i$ are 0 or 1, the smoother is called a regression smoother
- For the popular case of cubic smoothing splines the $e_i$ are approximately orthogonal polynomials of increasing order, and $\rho_i = 1/(1 + \lambda d_i)$ with $d_1 \leq d_2 \leq \cdots \leq d_n$ the eigenvalues of the penalty matrix
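A minimal numpy sketch of such an eigen-analysis. As a simple symmetric stand-in for a cubic smoothing spline it uses a discrete (Whittaker-type) penalized smoother $S = (I + \lambda D^T D)^{-1}$, with D the second-difference matrix; this choice is an assumption for illustration, not the exact spline smoother matrix. Its eigenvalues satisfy $\rho_i = 1/(1 + \lambda d_i) \in (0, 1]$ and sum to $\mathrm{trace}(S) = df(S)$.

import numpy as np

n, lam = 30, 5.0
D = np.diff(np.eye(n), n=2, axis=0)          # second-difference operator, (n-2) x n
K = D.T @ D                                  # penalty matrix
S = np.linalg.inv(np.eye(n) + lam * K)       # symmetric shrinking smoother matrix (illustrative stand-in)

rho = np.linalg.eigvalsh(S)                  # eigenvalues of S
d = np.linalg.eigvalsh(K)                    # eigenvalues of the penalty matrix
print(rho.min() > 0, rho.max() <= 1 + 1e-9)  # shrinking smoother: 0 < rho_i <= 1
print(np.allclose(np.sort(rho), np.sort(1 / (1 + lam * d))))  # rho_i = 1/(1 + lam d_i)
print(round(np.trace(S), 2), round(rho.sum(), 2))             # df(S) = trace(S)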