Modern regression and classification techniques in computational biology
Michael G. Schimek
Medical University of Graz, Institute for Medical Informatics, Statistics and Documentation, 8036 Graz, Austria, Europe
Lectures 2009, IBA, Masaryk University
© 2009 M. G. Schimek

What are typical tasks in computational biology and the health sciences?
The two most important ones we focus on in this series of lectures are:
- Regression of quantitative (dependent) measurements on quantitative (explanatory) measurements in a designed study (prior information)
- Classification of quantitative measurements with respect to a qualitative response in a designed study (prior information)
Both tasks can be interpreted in the context of supervised statistical learning in the spirit of Hastie, Tibshirani, and Friedman (2009) [HTF'09]. However, both tasks are associated with self-contained statistical methodologies.

Statistical methods in computational biology and the health sciences 1
In terms of regression we have
- the classical linear model under Normal errors
- the parametric class of generalized linear models under Exponential Family errors
- the nonparametric class of generalized additive models under Exponential Family errors
- several extensions of the above, such as semiparametric models, mixed models, etc.
- adaptations of the above methods to control violations of assumptions (e.g. overdispersion, multicollinearity, high dimensionality)

Statistical methods in computational biology and the health sciences 2
In terms of classification we have
- Fisher's linear discriminant analysis
- Linear and quadratic discriminant analysis
- Extensions of linear discriminant analysis (e.g. flexible discriminant analysis, shrinkage methods)
- Nearest-neighbor classifiers
- Support vector machines
Fundamental statistical learning goal: useful information reduction with respect to responses (regression) or classifiers (classification).
Advanced statistical learning goal: response prediction (regression) or class prediction (classification) for new measurements, objects, or cases.

Data sources and measurement characteristics
Data sources:
- Prospective studies and experiments
- Retrospective studies (e.g. from databases)
- Ex-post-facto analysis
- Meta-analysis (combining study outcomes)
- Genetic analysis
- Image analysis
Measurement characteristics:
- Quantitative (metric scale, integer or real valued)
- Qualitative (ordered or unordered units)
- Resolution or units on the measurement scale
- Distributional features (data sparseness)
- Proportion and distribution of missing observations
- Error sources (systematic or stochastic)
- Signal-to-noise ratio
Data structures
- Sample data (representative of the population)
- Reasonable sample size (required for all classical biostatistics methods)
- Very small sample size (unsuitable for most classical biostatistics methods)
- Huge sample size (unsuitable for most classical biostatistics methods)
- N > p, where N is the sample size and p the number of parameters (required for all classical biostatistics methods)
- $N \ll p$, leading to ill-posed estimation problems (classical biostatistics methods fail)
- Reasonable number of dimensions (parameters; classical biostatistics methods apply)
- Large number of dimensions (parameters; classical biostatistics methods fail)

Conventions and goals
We assume
- outcome measurements y (also called dependent variable, response, target)
- a vector of p predictor measurements x (also called independent variables, inputs, covariates, features)
- in the regression problem, y is quantitative
- in the classification problem, y takes values in a finite, unordered set
- empirical (training) data $(x_1, y_1), \ldots, (x_N, y_N)$
Goals are
- decision support (based on inference) to understand which inputs affect the output
- model fitting (identification, estimation and verification)
- prediction (of future cases)
- assessment of the quality of inference and prediction

Statistical decision theory basics
- Assume a quantitative output Y
- Assume a random input vector $X \in \mathbb{R}^p$
- Assume an arbitrary function f (e.g. separating two classes)
- Let us have a loss function $L(Y, f(X))$ for penalizing errors in prediction
- Most common and convenient is squared error loss $L(Y, f(X)) = (Y - f(X))^2$
- We can choose f according to the expected squared prediction error $EPE(f) = E(Y - f(X))^2$
- Thus the solution is the conditional expectation $f(x) = E(Y \mid X = x)$, also called the regression function
- The elementary statistical method is regression
- There are many ways to estimate the regression function from (training) data

Case of categorical response variable 1
- Let us have a prediction rule C(X), where Y and C(X) take values in $\mathcal{C} = \{1, 2, \ldots, K\}$
- The loss function for penalizing prediction errors is L(k, l), i.e. the price to be paid for classifying an observation belonging to class k as l
- For a zero-one loss function all misclassifications are charged one unit
- The expected prediction error is $EPE = E[L(Y, C(X))]$
- The solution is $\hat{C}(x) = \arg\min_{c \in \mathcal{C}} \sum_{k=1}^{K} L(k, c)\, P(Y = k \mid X = x)$
- Assuming a zero-one loss function we simply get $\hat{C}(x) = k$ if $P(k \mid X = x) = \max_{c \in \mathcal{C}} P(c \mid X = x)$, the so-called Bayes classifier (NB: we should decide for the class with maximum probability at input x)

Case of categorical response variable 2
- The task is to construct a classifier C(X) by estimating the probabilities $P(c \mid X)$ from the data, or by estimating the class densities $P(X \mid c)$ in combination with Bayes' rule
- There are many different approaches (for quantitative as well as categorical response variables)
- It is important to understand the ideas behind these approaches; otherwise one cannot decide when to use them
- Further, it is important to assess their performance
- In contrast to machine learning, statistical approaches (algorithms) suitable for specific responses (outputs) are not compared with respect to performance only
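A minimal numerical sketch of the two classification rules above (numpy only; the posterior probabilities $P(Y = k \mid X = x)$ and the loss matrix are hypothetical): under zero-one loss the expected-loss rule reduces to the Bayes classifier, i.e. picking the class with maximal posterior probability.

import numpy as np

# Hypothetical posterior probabilities P(Y = k | X = x) for K = 3 classes
# at four input points x (rows sum to one).
post = np.array([[0.2, 0.5, 0.3],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6],
                 [0.4, 0.4, 0.2]])

# Bayes classifier under zero-one loss: class with maximal posterior at x
bayes_class = post.argmax(axis=1)

# General rule with a loss matrix L[k, c] = price of classifying class k as c
L = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
expected_loss = post @ L                 # sum_k L(k, c) P(Y = k | X = x)
loss_class = expected_loss.argmin(axis=1)

print(bayes_class)                       # zero-one loss decision
print(loss_class)                        # may differ when misclassification costs are asymmetric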
Linear regression as elementary method
- Regression function $\mu(x) = E(Y \mid X = x)$, where X is the design or data matrix
- A linear form $\mu(x_i) = \beta_0 + \sum_{j=1}^{p-1} x_{ij}\beta_j$ is assumed, which is an approximation to the truth
- In matrix notation $\mu = X\beta$, where $\mu$ is the N-dimensional vector of predicted values, X an $N \times p$ matrix with ones in the first column, and $\beta$ a p-dimensional vector of parameters
- Fitting (estimation) by least squares: $\hat\beta = \arg\min_\beta \sum_i \big(y_i - \beta_0 - \sum_{j=1}^{p-1} x_{ij}\beta_j\big)^2 = \arg\min_\beta (y - X\beta)^T (y - X\beta)$
- The solution is $\hat\beta = (X^T X)^{-1} X^T y$ and $\hat{y} = X\hat\beta$
- The variance of the estimator is $\mathrm{var}(\hat\beta) = (X^T X)^{-1}\sigma^2$

Figure: Linear regression of a binary response [HTF'09]

Nearest-neighbor methods
- No linear decision boundary as in least-squares regression
- No stringent assumptions about the data; the methods are data-driven (adaptive)
- Those observations in the training data that are closest in input space to x are used to form $\hat{Y}$
- The k-nearest-neighbor (k-NN) fit for $\hat{Y}$ is defined as $\hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$, where $N_k(x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample
- Metric for closeness: e.g. Euclidean distance
- The effective number of parameters in k-NN is not k but N/k (it decreases with increasing k due to overlapping neighborhoods)
- A sum-of-squared-errors criterion does not work for choosing k (it would always pick k = 1!)

Figure: 15-NN classifier of a binary response [HTF'09]

Figure: 1-NN classifier of a binary response [HTF'09]

Simple extensions of least squares and k-NN methods
- Kernel methods use weights that decrease smoothly to zero with distance from the target point (in contrast to the 0/1 weights in k-NN)
- k-NN methods are the simplest form of smoothing methods
- Kernel methods also belong to the class of smoothing methods (which also comprise smoothing splines, yet those are not motivated by weighting schemes)
- For high-dimensional problems kernel methods can be modified: some variables obtain higher weights than others
- Local regression, another smoothing method, fits linear models by locally weighted least squares
- Linear models fit to a basis expansion of the original inputs can cope with rather complex data structures

Some more decision theory 1
How can we best predict Y at any point X = x?
- Least squares regression: minimizing the expected prediction error EPE has the solution $f(x) = E(Y \mid X = x)$, thus the best predictor is the conditional mean (in terms of average squared error)
- k-nearest-neighbor method: implements the above idea directly using the training data, $\hat{f}(x) = \mathrm{average}(y_i \mid x_i \in N_k(x))$, where $N_k(x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample

Figure: Misclassification curves for simulated data: linear regression vs. k-NN [HTF'09]
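A minimal numpy sketch (simulated data; all names are hypothetical) contrasting the two elementary estimators just discussed: the least squares solution $\hat\beta = (X^T X)^{-1} X^T y$ and the k-NN average over the Euclidean neighborhood $N_k(x)$.

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 2
X = rng.normal(size=(N, p))                       # predictor measurements
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=N)

# Least squares: column of ones for the intercept, then the normal equations
X1 = np.column_stack([np.ones(N), X])
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)   # (X^T X)^{-1} X^T y
print(np.round(beta_hat, 2))                      # estimated intercept and slopes

# k-NN regression fit at a target point x0: average y over the k nearest x_i
def knn_fit(x0, X, y, k=15):
    d = np.sqrt(((X - x0) ** 2).sum(axis=1))      # Euclidean distances to x0
    nbh = np.argsort(d)[:k]                       # indices of N_k(x0)
    return y[nbh].mean()

x0 = np.array([0.5, -0.5])
print(round(knn_fit(x0, X, y, k=15), 2))          # data-driven local average
print(1.0 + x0 @ np.array([2.0, -1.0]))           # true conditional mean at x0, for comparison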
Some more decision theory 2
There are two approximations:
- Expectation is approximated by averaging over the sample (training) data
- Conditioning at a point is relaxed to conditioning on some region "close" to the target point
For large N the points in the neighborhood are likely to be close to x, and as k increases the average will stabilize. Under mild regularity conditions on the joint probability distribution Pr(X, Y) one can show that as $N, k \to \infty$ such that $k/N \to 0$, $\hat{f}(x) \to E(Y \mid X = x)$.
- In practice this does not help, as we do not have very large N
- A further problem: in high dimensions the metric size of the k-nearest neighborhood gets disproportionately large
- The convergence still holds, but the rate of convergence decreases as the dimension increases (curse of dimensionality)

Measures for quality of an estimator 1
- A valuable quality measure for an estimator $\hat\theta(x)$ is the mean squared error
- Let $\theta_0(x)$ be the true value at the point x
- Then $MSE(\hat\theta(x)) = E(\hat\theta(x) - \theta_0(x))^2$
- This can be written as variance plus squared bias: $MSE(\hat\theta(x)) = \mathrm{var}(\hat\theta(x)) + (E\hat\theta(x) - \theta_0(x))^2$
- We have the following relationship: low bias comes with high variance, and vice versa
Consequences:
- Selecting an estimator involves a tradeoff between bias and variance
- Even for the simplest data structure or model this is true
- For ill-posed problems (e.g. data with $N \ll p$), where regularization is required, the selection problem becomes even more delicate

Measures for quality of an estimator 2
- Predictive ability of the k-NN regression fit $\hat{f}_k(x_0)$
- Assume data from $Y = f(X) + \varepsilon$ with $E(\varepsilon) = 0$ and $\mathrm{var}(\varepsilon) = \sigma^2$
- Let the $x_i$ be fixed (non-random)
- The expected prediction error at $x_0$, also called generalization (test) error, is
  $EPE_k(x_0) = E[(Y - \hat{f}_k(x_0))^2 \mid X = x_0] = \sigma^2 + [\mathrm{bias}^2(\hat{f}_k(x_0)) + \mathrm{var}(\hat{f}_k(x_0))] = \sigma^2 + \big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\big]^2 + \frac{\sigma^2}{k}$,
  where $(l)$ indexes the nearest neighbors of $x_0$ (k represents the model complexity)
- $\sigma^2$, the irreducible error (variance of a new test case), cannot be controlled; the mean squared error of $\hat{f}_k(x_0)$ (a bias and a variance component) can be controlled

Figure: Bias-variance tradeoff

Tradeoff between bias and variance
- How about a linear regression model? If it is correct for a given data structure, then the least squares predictor $\hat\beta$ is unbiased and has the lowest variance among all estimators that are linear functions of y
- There can be biased estimators with smaller MSE
- When an estimator is regularized (penalized, shrunken), its variance will be reduced
- Examples of regularization: subset selection (forward, backward, all subsets), ridge regression, the lasso
- A further limitation: empirical models are never correct, which leads to an additional model bias between the closest member of the (e.g. linear) model class and the (unknown) truth

Figure: Sources of bias and estimation variance [HTF'09]
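A small Monte Carlo sketch (simulated data with a fixed design; the test function, noise level and all names are hypothetical) of the decomposition $EPE_k(x_0) = \sigma^2 + \mathrm{bias}^2 + \sigma^2/k$ for the k-NN fit: the variance tracks $\sigma^2/k$, while the squared bias grows with k as the neighborhood widens.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)          # "true" regression function
sigma = 0.3                                  # noise standard deviation
x = np.sort(rng.uniform(size=50))            # fixed (non-random) design points
x0 = 0.25                                    # target point (at a peak of f)

for k in (1, 5, 20):
    nbh = np.argsort(np.abs(x - x0))[:k]     # indices of the k nearest neighbors of x0
    fits = []
    for _ in range(2000):                    # new noise, same design, in each replication
        y = f(x) + rng.normal(scale=sigma, size=x.size)
        fits.append(y[nbh].mean())           # k-NN fit at x0
    fits = np.array(fits)
    bias2 = (fits.mean() - f(x0)) ** 2       # squared bias of the k-NN fit at x0
    var = fits.var()                         # variance of the k-NN fit at x0
    print(k, round(bias2, 4), round(var, 4), round(sigma ** 2 / k, 4))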
Subset selection 1
- Subset selection (of predictive variables or features) is a form of shrinkage
- Motivations are
  - prediction accuracy: least squares estimators can have low bias but large variance (subset selection can reduce the variance)
  - interpretation: easier with a small, well-predicting subset
- Hence only a subset of the parameters (e.g. regression coefficients) is retained and the remaining ones are set to zero
- There are several selection strategies
- All-subsets regression: finds for each $s \in \{0, 1, 2, \ldots, p\}$ the subset of size s that gives the smallest residual sum of squares
- This involves the tradeoff between bias and variance (e.g. one can use cross-validation)

Subset selection 2
A search strategy can improve the performance:
- Forward stepwise selection: starting with the intercept, variables that most improve the fit are sequentially added to the model; a typical criterion is
  $F_{1, N-s-2} = \frac{RSS(\hat\beta_{\mathrm{old}}) - RSS(\hat\beta_{\mathrm{new}})}{RSS(\hat\beta_{\mathrm{new}})/(N - s - 2)}$
- Backward stepwise selection: a full least squares model is fitted, then variables are sequentially eliminated based on the F-statistic (requires N > p)
- Hybrid stepwise selection: sequentially, for each (best) variable added, the least important variable is deleted

Subset selection and shrinkage
- The subset selection procedures need one or more tuning parameters:
  - subset size
  - position along the stepwise path
- Shrinkage methods, related to subset selection, have a penalty or ridge parameter (this also holds for certain smoothers, e.g. smoothing splines)
- Subset selection does not automatically reduce the prediction error (the discrete selection process can retain a high variance)
- Shrinkage methods are a continuous alternative
- Ridge regression shrinks the coefficients by imposing a penalty on their size

Ridge regression 1
- The ridge coefficients minimize a penalized residual sum of squares
  $\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \Big\{ \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}$,
  where $\lambda \geq 0$ is the ridge (complexity) parameter weighting the penalty; the coefficients are shrunk towards 0
- Equivalently we can write
  $\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \leq s$,
  a constraint on the parameters

Ridge regression 2
- Ridge solutions are not equivariant under scaling of the inputs (the inputs have to be standardized)
- Further, there is no intercept $\beta_0$ (X has p rather than p + 1 columns)
- In matrix notation we have $RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda\beta^T\beta$
- The solution is $\hat\beta^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y$, where I is the $p \times p$ identity matrix
- Because of the quadratic $L_2$ ridge penalty $\beta^T\beta$ the solution is a linear function of y
- One might choose other penalties too

The lasso 1
- The lasso is another shrinkage method, with important differences
- The estimate is $\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \leq s$
- The solution for $\beta_0$ is $\bar{y}$ (no intercept needed)
- The lasso penalty is $L_1$, leading to a non-linear estimation problem
- A quadratic programming algorithm is required

The lasso 2
- The standardized tuning parameter s is $s = t / \sum_{j=1}^{p} |\hat\beta_j|$, where t is the penalty parameter
- Relationship between the lasso and the least squares (ls) coefficients: if t is chosen larger than $t_0 = \sum_{j=1}^{p} |\hat\beta_j^{\mathrm{ls}}|$, then $\hat\beta_j^{\mathrm{lasso}} = \hat\beta_j^{\mathrm{ls}}$
- For instance, for $t = t_0/2$ the ls estimates are shrunken by around 50% on average
- For $s \to 0$ the lasso estimates shrink to zero
- The lasso translates each ls coefficient by a constant factor, truncating at zero
- Ridge, however, does a proportional shrinkage on the ls coefficients
- t can be chosen by cross-validation
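A minimal numpy sketch of the closed-form ridge solution $(X^T X + \lambda I)^{-1} X^T y$ on standardized inputs (simulated data; the names and the values of $\lambda$ are hypothetical). The intercept is handled by centering y; the lasso has no such closed form in general and would need quadratic programming or coordinate descent, as noted above.

import numpy as np

rng = np.random.default_rng(2)
N, p = 80, 5
X = rng.normal(size=(N, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = 5.0 + X @ beta_true + rng.normal(scale=1.0, size=N)

# Standardize the columns of X and center y, so no intercept column is needed
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

for lam in (0.0, 1.0, 10.0, 100.0):
    # ridge solution (X^T X + lambda I)^{-1} X^T y
    beta_ridge = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    print(lam, np.round(beta_ridge, 2))
# The coefficients are shrunk towards zero as lambda grows, but (unlike the
# lasso) they are never set exactly to zero.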
Figure: Transformation of least squares coefficients in ridge regression and lasso

The curse of dimensionality 1
- Multidimensional smoothers work well for moderate numbers of predictors, but the curse of dimensionality hinders them in higher dimensions
- Problems are:
  - local neighborhoods are empty
  - nearest neighborhoods are not local
  - all points are close to the boundary
  - the sample size needs to grow exponentially
- Moreover, without some structure in the models, high-dimensional functions are hard to represent and interpret
- Illustration: uniformly distributed data in the p-dimensional unit cube; construct a subcube from the origin to capture a fraction f of the data. What distance do we have to reach out on each axis?

Figure: The curse of dimensionality 2 [HTF'09]
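For the unit-cube illustration above, the required edge length is $e_p(f) = f^{1/p}$, since a subcube with edge e captures an expected fraction $e^p$ of the data. A small Monte Carlo check (numpy; names hypothetical): in p = 10 dimensions, capturing 10% of the data already requires covering about 80% of the range of every single input.

import numpy as np

rng = np.random.default_rng(3)
f = 0.10                                        # fraction of data to capture

for p in (1, 2, 3, 10):
    edge = f ** (1.0 / p)                       # distance to reach out on each axis
    X = rng.uniform(size=(100000, p))           # uniform data in the p-dimensional unit cube
    captured = np.mean((X < edge).all(axis=1))  # share of points inside the subcube
    print(p, round(edge, 3), round(captured, 3))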
Generalized linear models 1
- Generalized linear models (GLM) extend linear (regression) models to accommodate both non-normal responses and transformations to linearity
- Characteristics of a GLM:
  - The response y is observed independently at fixed values of the explanatory variables $x_1, x_2, \ldots, x_p$
  - The explanatory variables $x_j$ may only influence the distribution of y through a single linear function, called the linear predictor $\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$
  - The probability density function of y is assumed to follow the form of the exponential family of distributions
  - The mean is a smooth invertible function of the linear predictor: $\mu = m(\eta)$, $\eta = m^{-1}(\mu) = G(\mu)$, where the inverse function G is called the link function

Generalized linear models 2
- Probability density function of the exponential family of distributions:
  $f(y_i; \theta_i, \varphi) = \exp\{ A_i[y_i\theta_i - \gamma(\theta_i)]/\varphi + \tau(y_i, \varphi/A_i) \}$,
  where $\varphi$ is a scale parameter (possibly known), $A_i$ a known prior weight, and $\theta_i$ depends upon the linear predictor
- $\theta$ is an invertible function of $\mu$, $\theta = (\gamma')^{-1}(\mu)$
- For known $\varphi$ the distribution of y is a one-parameter canonical exponential family
The Gaussian case:
- For the Normal distribution $\varphi = \sigma^2$ and
  $\log f(y) = \frac{1}{\varphi}\big[y\mu - \frac{\mu^2}{2}\big] - \frac{1}{2}\big[\frac{y^2}{\varphi} + \log(2\pi\varphi)\big]$,
  so $\theta = \mu$ and $\gamma(\theta) = \frac{\theta^2}{2}$

Generalized linear models 3
The Poisson case:
- For a Poisson distribution with mean $\mu$ we have $\log f(y) = y\log\mu - \mu - \log(y!)$, so $\theta = \log\mu$, $\varphi = 1$ and $\gamma(\theta) = \mu = \exp\{\theta\}$
- Each response (error) distribution allows one or more link functions to connect the mean to the linear predictor

Table: Canonical default links and variance functions
Family            | Canonical link       | Name      | Variance
Binomial          | $\log(\mu/(1-\mu))$  | logit     | $\mu(1-\mu)$
Gamma             | $-1/\mu$             | inverse   | $\mu^2$
Gaussian          | $\mu$                | identity  | $1$
Inverse Gaussian  | $-1/(2\mu^2)$        | $1/\mu^2$ | $\mu^3$
Poisson           | $\log\mu$            | log       | $\mu$

Generalized linear models 4
- The log-likelihood of a GLM is $l(\theta, \varphi; y) = \sum_i \{ A_i[y_i\theta_i - \gamma(\theta_i)]/\varphi + \tau(y_i, \varphi/A_i) \}$
- The score function for $\theta_i$ is $U(\theta_i) = A_i[y_i - \gamma'(\theta_i)]/\varphi$
- From this it follows that $E(y_i) = \mu_i = \gamma'(\theta_i)$ and $\mathrm{Var}(y_i) = \varphi\,\gamma''(\theta_i)/A_i$

Generalized linear models 5
- Further it follows that $E\big(\partial^2 l(\theta, \varphi; y)/\partial\theta\,\partial\varphi\big) = 0$
- Because of that, $\theta$ and $\varphi$, or more generally $\beta$ and $\varphi$, are orthogonal parameters
- The function $V(\mu) = \gamma''(\theta(\mu))$ is called the variance function
- For each response (error) distribution the link function $L = (\gamma')^{-1}$, for which $\theta \equiv \eta$, is called the canonical link
- For $\eta = X\beta$, known $\varphi$, $A = \mathrm{diag}(A_i)$, and the canonical link it can be seen that $X^T A y$ is a minimal sufficient statistic for $\beta$
- Finally the score equations for $\beta$ reduce to $X^T A y = X^T A \hat\mu$
- Estimation technique: iterative weighted least squares
- The Hessian matrix is replaced by its expectation (Fisher scoring)

Smoothing 1
What is a 'smoother'?
- A statistical tool for summarizing a response measurement Y as a function of one or more predictor measurements $X_1, \ldots, X_p$
- It produces an estimate ('average', 'trend') of Y that is less variable (wiggly) than Y itself
Principal applications:
- As an exploratory tool for the visualization of scatterplots
- As a method for the estimation of the dependence of the mean of Y on the predictors
- As a building block in various modelling approaches
What are the primary statistical applications?
1. Density estimation
2. Regression analysis

Smoothing 2
What characterizes a smoother?
- For estimation at a point x, weighted averaging across a centered neighborhood of x takes place
- The simplest example is k-NN (no weighting) regression (smoothing)
- For kernel-based smoothing (and related procedures) a neighborhood (i.e. a bandwidth) needs to be specified
- For spline-based smoothing a choice of basis functions and of the number and placement of knots, or of the tuning (smoothing) parameter, is required
- Both approaches are nonparametric strategies, in clear contrast to rigid parametric approaches
- Here the dependence of Y on $X_1, \ldots, X_p$ is highly flexible

Kernel smoothing 1
- A popular example is the Nadaraya-Watson kernel estimator (Nadaraya, 1964; Watson, 1964)
  $\hat{m}_{NW}(x) = \frac{\sum_{i=1}^{n} Y_i K((x - X_i)/h)}{\sum_{i=1}^{n} K((x - X_i)/h)}$,
  where n is the sample size and h > 0 the bandwidth
- The bandwidth controls the smoothness of the function estimate
- Other kernel estimators are Gasser-Müller and Parzen-Rosenblatt (Parzen, 1962; Rosenblatt, 1956)
- Such kernels produce a continuous function
- K is a bounded and integrable real-valued function such that $\lim_{|x| \to \infty} |x| K(x) = 0$

Kernel smoothing 2
- Usually K is taken to be a compactly supported symmetric probability density function
- Examples:
  $K_1(x) = I(x \in [-0.5, +0.5])$,
  $K_2(x) = \frac{3}{4}(1 - x^2)\, I(x \in [-1, +1])$,
  $K_3(x) = \frac{15}{32}(3 - 10x^2 + 7x^4)\, I(x \in [-1, +1])$,
  $K_4(x) = \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$
- Explanation: (a) uniform kernel $K_1$; (b) Epanechnikov kernel $K_2$; (c) 4th-order kernel $K_3$; (d) normal kernel $K_4$

Kernel smoothing 3
Figure: (a) Uniform kernel $K_1$; (b) Epanechnikov kernel $K_2$; (c) 4th-order kernel $K_3$; (d) Normal kernel $K_4$. [S'00]

Kernel smoothing 4
Figure: Curve estimates for bandwidths (a) h = 0.05, (b) h = 0.2, (c) h = 0.5. [S'00]
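A minimal numpy sketch of the Nadaraya-Watson estimator with the Epanechnikov kernel $K_2$ from above (simulated data; the regression function, noise level and names are hypothetical), illustrating how the bandwidth h controls the smoothness of the estimate, as in the figure.

import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.sort(rng.uniform(-1, 1, size=n))
Y = np.sin(3 * X) + rng.normal(scale=0.3, size=n)

def K2(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)    # Epanechnikov kernel

def m_nw(x, X, Y, h):
    w = K2((x - X) / h)                              # kernel weights around x
    return np.sum(w * Y) / np.sum(w)                 # locally weighted average

grid = np.linspace(-0.9, 0.9, 7)
for h in (0.05, 0.2, 0.5):
    fit = np.array([m_nw(x, X, Y, h) for x in grid])
    print(h, np.round(fit, 2))
# Small h: wiggly, nearly interpolating estimate; large h: smooth but biased.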
Motivation for smoothing splines
- Suppose responses $y_1, \ldots, y_n$ are observed at (nonstochastic) ordered design points $t_1 < \cdots < t_n$
- Regression model: $y_i = f(t_i) + \varepsilon_i$, $i = 1, \ldots, n$, where $f(\cdot)$ is an unknown regression function and $\varepsilon_1, \ldots, \varepsilon_n$ are zero-mean, uncorrelated random errors
- Task: estimation of f from the observed data
- Classic approach: linear regression $\hat{f}(t) = \hat{a} + \hat{b}t$, with $\hat{a}$ and $\hat{b}$ the least squares intercept and slope estimators obtained by minimizing the residual sum of squares $RSS(g) = \sum_{i=1}^{n} (y_i - g(t_i))^2$ over all functions of the form $g(t) = a + bt$

Smoothing splines 1
- The linear parametric model only has acceptable model bias if f is approximately linear
- Ultimate slope flexibility: every two $y_i$'s are connected by lines with their own individual slopes (for this estimator RSS(g) = 0), resulting in a summary that is not useful
- Motivation behind spline smoothing: the need to compromise between fits with constant and completely flexible slopes
- This can be achieved by penalizing functions whose slopes vary too rapidly
- The rate of change in the slope of a function g is given by $g''$, and hence an overall measure of the change in slope of a potential fitted function is $J(g) = \int_{t_1}^{t_n} g''(t)^2\, dt$

Smoothing splines 2
- The modified version of the original estimation criterion comprises a penalty function, $RSS(g) + \lambda J(g)$, $\lambda \geq 0$, and can be minimized over, e.g., all functions with two continuous derivatives
- $\lambda$ is the smoothing (tuning) parameter and can be viewed as a measure of the importance we place on the flexibility of the slope of a fit
- Behavior of $\lambda$: $\lambda = \infty$ produces a constant slope (i.e. the straight line); $\lambda$ tending to zero produces completely flexible slopes (i.e. interpolation)
- The solution of the above optimization problem is a smoothing spline function (Thompson and Tapia, 1978, were the first to study such penalized problems)
- The idea of penalization goes back to Whittaker (1923)

Local polynomial smoothing
- Local polynomials have another motivation than splines
- They were introduced by Stone (1977) and Cleveland (1979)
- The idea is that a smooth function can be well approximated by a low-order polynomial in the neighborhood of an arbitrary point x
- Very popular is the local linear approximation $\mu(x_i) \approx a_0 + a_1(x_i - x)$ for $x - h \leq x_i \leq x + h$ (h is the bandwidth)
- The fit can be obtained from locally weighted least squares
- The weights are defined via kernels K (as in kernel smoothing)
- The coefficients $a_0$ and $a_1$ are obtained by minimizing $\sum_{i=1}^{n} K((x - x_i)/h)\,\big(y_i - (a_0 + a_1(x_i - x))\big)^2$

Linear smoothers
- Smoothers that can be represented as $\hat{y} = Sy$ are called linear smoothers
- Here $y = (y_1, y_2, \ldots, y_n)$, $\hat{y} = (\hat{f}(x_1), \hat{f}(x_2), \ldots, \hat{f}(x_n))$, and S is an $n \times n$ smoother matrix depending on $x = (x_1, x_2, \ldots, x_n)$ and the smoothing (tuning) parameter $\lambda$, but not on y
- S has a nearly banded shape and is often called the hat matrix (compare with the projection matrix in linear models)
- Examples of linear smoothers: kernels, local polynomials, smoothing splines
- An example of a non-linear smoother: adaptive smoothers
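A minimal numpy sketch (simulated data; all names hypothetical) that writes the local linear smoother as a linear smoother $\hat{y} = Sy$: each row of S is the first row of $(Z^T W Z)^{-1} Z^T W$ from the locally weighted least squares fit at the corresponding target point. The trace of S anticipates the equivalent degrees of freedom discussed next.

import numpy as np

rng = np.random.default_rng(5)
n, h = 100, 0.15
x = np.sort(rng.uniform(size=n))
y = np.cos(4 * np.pi * x) + rng.normal(scale=0.2, size=n)

def K2(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)    # Epanechnikov kernel

S = np.zeros((n, n))
for j in range(n):
    w = K2((x[j] - x) / h)                           # kernel weights around x_j
    Z = np.column_stack([np.ones(n), x - x[j]])      # local linear basis at x_j
    ZtW = Z.T * w                                    # Z^T W
    S[j] = np.linalg.solve(ZtW @ Z, ZtW)[0]          # row j of the smoother matrix

y_hat = S @ y                                        # smoothed values, y_hat = S y
print(round(np.trace(S), 2))                         # equivalent degrees of freedom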
Degrees of freedom and variance of linear smoothers
- The (equivalent) degrees of freedom (equivalent number of parameters; effective dimension) of a linear smoother is $df(S) = \mathrm{trace}(S)$, the dimension of the space of fits
- Compare with the number of regressors in a linear model
- If $\varepsilon \sim N(0, \sigma^2 I)$, then $\mathrm{var}(\hat{y}) = S\,\sigma^2 I\,S^T = \sigma^2 S S^T$ and the diagonals give the pointwise variances
- Cross-validation for the selection of $\lambda$ estimates the mean predictive squared error $PSE = MSE + \sigma^2$, where $MSE(\hat{f}) = E_X\{\mathrm{var}(\hat{f}(x)) + (f(x) - E_Y\hat{f}(x))^2\}$

Eigen-analysis of S
- If S is symmetric we can write $S e_i = \rho_i e_i$ for $i = 1, 2, \ldots, n$ (the $e_i$ can be seen as response vectors y)
- The eigenvalues take values $0 \leq \rho_i \leq 1$
- Smoothers with $0 < \rho_i < 1$ are called shrinking smoothers
- If all $\rho_i$ are 0 or 1, the smoother is called a regression smoother
- For the popular case of cubic smoothing splines the $e_i$ are approximately orthogonal polynomials of increasing order, and $\rho_i = 1/(1 + \lambda d_i)$ with $d_1 \leq d_2 \leq \cdots \leq d_n$ the eigenvalues of the penalty matrix
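A minimal numpy sketch of such an eigen-analysis. As a simple symmetric stand-in for a cubic smoothing spline it uses a discrete (Whittaker-type) penalized smoother $S = (I + \lambda D^T D)^{-1}$, with D the second-difference matrix; this choice is an assumption for illustration, not the exact spline smoother matrix. Its eigenvalues satisfy $\rho_i = 1/(1 + \lambda d_i) \in (0, 1]$ and sum to $\mathrm{trace}(S) = df(S)$.

import numpy as np

n, lam = 30, 5.0
D = np.diff(np.eye(n), n=2, axis=0)          # second-difference operator, (n-2) x n
K = D.T @ D                                  # penalty matrix
S = np.linalg.inv(np.eye(n) + lam * K)       # symmetric shrinking smoother matrix (illustrative stand-in)

rho = np.linalg.eigvalsh(S)                  # eigenvalues of S
d = np.linalg.eigvalsh(K)                    # eigenvalues of the penalty matrix
print(rho.min() > 0, rho.max() <= 1 + 1e-9)  # shrinking smoother: 0 < rho_i <= 1
print(np.allclose(np.sort(rho), np.sort(1 / (1 + lam * d))))  # rho_i = 1/(1 + lam d_i)
print(round(np.trace(S), 2), round(rho.sum(), 2))             # df(S) = trace(S)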