Topics in Functional Data Analysis

Habilitation Thesis

David Kraus

August 2021

Masaryk University
Faculty of Science
Department of Mathematics and Statistics

Contents

1. Introduction and summary
   1.1. Introduction
   1.2. Summary of Paper A
   1.3. Summary of Paper B
   1.4. Summary of Paper C
   1.5. Summary of Paper D
   1.6. Summary of Paper E
References
A. Second-order comparison of Gaussian random functions and the geometry of DNA minicircles
B. Dispersion operators and resistant second-order functional data analysis
C. Components and completion of partially observed functional data
D. Classification of functional fragments by regularized linear classifiers with domain selection
E. Inferential procedures for partially observed functional data

1. Introduction and summary

1.1. Introduction

Functional data analysis is an active area of statistics that deals with data that can be seen as mathematical functions. These could be curves, surfaces, images etc. Due to the development of modern technology, contemporary data sets indeed often consist of data units that are complex objects. A functional data set is a collection of observations of such functions (mathematically regarded as realizations of random processes, i.e., random variables in a function space), whereas more traditional data sets consist of observations of numbers or vectors. For a general background, see, e.g., Bosq (2000), Ramsay and Silverman (2005), Ferraty and Vieu (2006), Ferraty and Romain (2011), Horváth and Kokoszka (2012), Hsing and Eubank (2015) or Kokoszka and Reimherr (2017).

My research concentrates on the development of statistical methodology driven by applications. This text comprises five research articles containing my and my co-authors' contributions to the field of functional data analysis, accompanied by this introductory section, which summarizes the contents of the papers. The presentation is simplified to provide only the basic ideas and results of each paper. Thus, for example, references to preceding and subsequent relevant publications are not included and results are described in a stylized way rather than as rigorous formal statements.

The papers included in the appendix are:

(A) Panaretos, V. M., Kraus, D., and Maddocks, J. H. (2010). Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. Journal of the American Statistical Association, 105(490):670–682.

(B) Kraus, D. and Panaretos, V. M. (2012). Dispersion operators and resistant second-order functional data analysis. Biometrika, 99(4):813–832.

(C) Kraus, D. (2015). Components and completion of partially observed functional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):777–801.

(D) Kraus, D. and Stefanucci, M. (2019). Classification of functional fragments by regularized linear classifiers with domain selection. Biometrika, 106(1):161–180.

(E) Kraus, D. (2019). Inferential procedures for partially observed functional data.
Journal of Multivariate Analysis, 173:583–603.

Four papers (A, B, C, D) have been published in the Journal of the American Statistical Association, Biometrika and the Journal of the Royal Statistical Society: Series B (Statistical Methodology), which are regarded by the scientific community as being among the leading 5–7 journals in the field of methodological statistics. Paper E has been published in the Journal of Multivariate Analysis, which is a standard, respected journal in the field. Two papers (C, E) are single-authored; the other three are collaborative with equal contributions of the co-authors. The papers have been published with peer-reviewed supplements, which are included as well.

1.2. Summary of Paper A

Paper A (Panaretos et al., 2010) studies methods of statistical inference on the covariance structure of random functions. Although its main focus is the development of statistical methodology and related theory, the motivation for this work comes from another field, namely molecular biology. The understanding of the mechanical properties of the DNA molecule constitutes a fundamental biophysical task, as important biological processes can be affected by properties such as stiffness and shape. In addition to holding the genetic code, the DNA base-pair sequence may influence the geometric properties of the molecule. However, empirical detection of this effect on stereological data acquired through the electron microscope had previously been elusive.

The data set of interest consists of closed curves (DNA minicircles obtained from short strands of DNA) in $\mathbb{R}^3$ of two types: both types have identical base-pair sequences, except for a short base-pair window, where two different sequences are present (one of them, a TATA box, is of special interest). Biophysical considerations suggest this will have a significant effect on the geometry of the minicircle, and the goal is to compare these two groups to probe for such an effect.

Motivated by the need for a two-sample comparison of loops, as exemplified in DNA minicircle experiments, this article considers the problem of second-order comparison of two samples of random functions, within a functional data analysis framework. In particular, given realisations of $n_1$ and $n_2$ independent copies of two continuous zero-mean Gaussian processes $X$ and $Y$ on a compact set, we consider the problem of testing the hypothesis that their covariance operators $R_X$, $R_Y$ are equal against the alternative that they are different. Although this problem is now well studied, at the time of writing of this paper it had received relatively little attention. Our paper proposes a test based on the approximation of the Hilbert–Schmidt distance of the empirical covariance operators of the two samples of functions based on the Karhunen–Loève expansion. The asymptotic distribution of the test statistic is determined and its performance is investigated computationally. The application of our methodology to the data set of two groups of minicircles characterized by the presence or absence of a TATA box suggests the potential existence of significant differences between the two groups, which eluded previous analyses, as these focused on the mean (the shape of the minicircle), whereas we detect the differences in the covariance structure (the flexibility/stiffness).

Let us give a more detailed description of the contents of Paper A.
Since this work is data-driven, the paper first explains the scientific background and questions in molecular biology and the source, properties and pre-processing of the available data. To perform a functional data analysis of the minicircles it is required to register the data. Each curve has thus been centered and scaled, so that the center of mass is at zero and the length of the curve is one. Since the data were obtained by electron microscopy of minicircles embedded in a liquid, the reconstructed curves are not aligned (they are subject to a random unobservable orthogonal transformation). We describe a procedure that rigidly aligns curves by their intrinsic characteristics: each curve was individually aligned using the coordinate system induced by its moment of inertia tensor. We thus arrive at a functional data set consisting of smooth curves indexed by arc length and taking values in $\mathbb{R}^3$ (corresponding to the coordinates on the three principal axes of inertia).

We assume that we have two independent collections $X_1,\dots,X_{n_1}$ and $Y_1,\dots,Y_{n_2}$ of iid Gaussian processes on $[0,1]$, considered as random elements of the Hilbert space $L^2[0,1]$ of coordinate-wise square-integrable $\mathbb{R}^3$-valued functions with the inner product $\langle f,g\rangle=\int_0^1 f(t)^{\mathsf T}g(t)\,dt$ (but everything readily extends to more general cases). Assuming, without loss of generality, that the mean functions are zero, the processes are characterized by their respective covariance kernels $R_X(s,t)=\operatorname{cov}\{X_i(s),X_i(t)\}=\mathrm{E}\{X_i(s)X_i(t)^{\mathsf T}\}$ and $R_Y(s,t)$. Associated with the covariance kernel is the covariance operator $R_X\colon L^2[0,1]\to L^2[0,1]$ defined by $R_X(f)(t)=\operatorname{cov}\{\langle X_i,f\rangle,X_i(t)\}=\int_0^1 R_X(t,s)f(s)\,ds$. The Karhunen–Loève theorem allows for a representation of the process by a stochastic Fourier series with respect to the orthonormal eigenfunctions $\{\varphi_X^{(j)}\}_{j=1}^{\infty}$ of the operator $R_X$,
$$X_i(t)=\sum_{j=1}^{\infty}\sqrt{\lambda_X^{(j)}}\,\xi_{ij}\,\varphi_X^{(j)}(t),$$
where $\{\lambda_X^{(j)}\}_{j=1}^{\infty}$ is the nonincreasing sequence of corresponding eigenvalues and $\{\xi_{ij}\}$ is an iid array of standard Gaussian random variables.

The empirical covariance operator may be used to "optimally" reduce infinite-dimensional inferential problems to multivariate ones. Letting $\hat R_X$ stand for the empirical covariance operator, we denote its eigenvalues by $\hat\lambda_X^{(k,n_1)}$ and its eigenfunctions by $\hat\varphi_X^{(k,n_1)}$. The finite-dimensional reduction is then achieved by retaining a finite number of principal components $\langle X_i-\bar X,\hat\varphi_X^{(k,n_1)}\rangle$, $k=1,\dots,K$, in lieu of each $X_i$, and similarly for the second sample.

The dimension reduction afforded by the Karhunen–Loève expansion is the tool we employ to construct our test. We wish to test the null hypothesis $R_X=R_Y$ against the alternative $R_X\ne R_Y$. We propose the use of a test statistic based on the norm of the difference of the two empirical covariance operators. The Hilbert–Schmidt norm of a trace-class operator $R$ is defined as $\|R\|_{\mathrm{HS}}=[\int_0^1\int_0^1\operatorname{trace}\{R(s,t)^{\mathsf T}R(s,t)\}\,ds\,dt]^{1/2}$. A test may be based on the squared Hilbert–Schmidt distance $\|\hat R_X-\hat R_Y\|_{\mathrm{HS}}^2$. The sampling distribution of this quantity will depend on the unknown covariance operators even asymptotically. To be able to "normalize" the test statistic, we employ the property that for any orthonormal system $\{e_i\}$ of $L^2[0,1]$ we have $\|R\|_{\mathrm{HS}}^2=\sum_{i=1}^{\infty}\|Re_i\|^2=\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\langle Re_i,e_j\rangle^2$, applied here with $R=\hat R_X-\hat R_Y$. In practice, we need to truncate the series to obtain a finite-dimensional reduction and choose the contrasts $\{e_i\}$ so that the truncation retains the bulk of the norm.
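The following minimal numerical sketch (illustrative names, not the authors' code) computes such a truncated squared Hilbert–Schmidt distance for curves discretized on a common grid, using as contrasts the leading eigenfunctions of the pooled empirical covariance operator, the choice discussed next; the rescaling of the individual terms by their estimated asymptotic covariances, which yields the chi-squared calibration of the actual test statistic, is omitted.

```python
import numpy as np

def empirical_covariance(curves):
    """Empirical covariance kernel of curves sampled on a common grid;
    `curves` is an (n, T) array with one discretized curve per row."""
    centred = curves - curves.mean(axis=0)
    return centred.T @ centred / curves.shape[0]

def truncated_hs_statistic(x_curves, y_curves, grid_step, K=3):
    """Squared Hilbert-Schmidt distance of the two empirical covariance
    operators, truncated to the span of the first K eigenfunctions of the
    pooled empirical covariance operator."""
    n1, n2 = x_curves.shape[0], y_curves.shape[0]
    Rx = empirical_covariance(x_curves)
    Ry = empirical_covariance(y_curves)
    pooled = (n1 * Rx + n2 * Ry) / (n1 + n2)
    # eigenvectors of the discretized pooled operator, largest eigenvalues first,
    # rescaled to be orthonormal in L2 rather than in the Euclidean sense
    _, evecs = np.linalg.eigh(pooled * grid_step)
    basis = evecs[:, ::-1][:, :K] / np.sqrt(grid_step)
    D = Rx - Ry
    coeffs = grid_step**2 * (basis.T @ D @ basis)   # <(Rx - Ry) e_i, e_j>
    return np.sum(coeffs**2)

# toy usage with simulated zero-mean Gaussian curves on [0, 1]
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 101)
X = np.array([rng.normal(0, 1.0) * np.sin(2 * np.pi * t)
              + rng.normal(0, 0.5) * np.cos(2 * np.pi * t) for _ in range(50)])
Y = np.array([rng.normal(0, 1.5) * np.sin(2 * np.pi * t)
              + rng.normal(0, 0.5) * np.cos(2 * np.pi * t) for _ in range(50)])
print(truncated_hs_statistic(X, Y, grid_step=t[1] - t[0], K=2))
```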
For each of the two empirical operators, the optimal contrasts would coincide with their own eigenfunctions, but we need to use a common basis. We thus choose the eigenfunctions $\hat\varphi_{XY}^{(k,N)}$ of the empirical covariance operator of the pooled sample as a compromise for the common coordinate system. Our proposed test statistic is a linear combination of the terms $\langle(\hat R_X-\hat R_Y)\hat\varphi_{XY}^{(i,N)},\hat\varphi_{XY}^{(j,N)}\rangle^2$, $i,j=1,\dots,K$, with weights corresponding to their asymptotic covariance structure. Theorem 1 in the paper shows that under the null hypothesis and certain assumptions this test statistic is asymptotically chi-squared distributed with $K(K+1)/2$ degrees of freedom, which is the basis of a hypothesis test.

The paper then introduces a modified test statistic that can be useful when one knows a priori that the eigenfunctions of both covariance operators are equal. Then one can focus only on the diagonal terms (those with $i=j$), which leads to a test statistic with an asymptotic chi-squared distribution with $K$ degrees of freedom. Furthermore, we consider variance-stabilized variants of these statistics, where we apply a log transformation to the diagonal terms and Fisher's z-transformation to the off-diagonal terms. We then discuss methods to choose the truncation level.

To assess the behaviour of the proposed tests under the null hypothesis and under various alternatives we carry out a number of simulations. We consider one situation with equal covariance functions and several alternative configurations. The general and diagonal test statistics are considered under various fixed choices of $K$ and with automatically selected $K$. The study provides useful insight into the performance and capabilities of the tests depending on the type of deviation from the null hypothesis.

Next, we present an analysis of the data set of DNA minicircles. First, we show both graphically and numerically that there is no important difference between the means of the two types of curves. Then we focus on the comparison of their second-order properties. The analysis shows a significant difference on the third (most important) principal axis of inertia and also jointly in the plane given by the third and second axes.

A proof of Theorem 1 is provided in the Appendix. Additional plots and tables are available in a supplementary file. In addition, the supplementary file contains a more detailed study of the problem of comparing the complete spectrum.

1.3. Summary of Paper B

Paper B (Kraus and Panaretos, 2012) focuses on the second-order structure of a random function, which is key to understanding the nature of the functional observations that it induces, as it is inextricably linked with the smoothness properties of the stochastic fluctuations of the function. These second-order properties are encapsulated in the covariance operator. The link with the smoothness properties of the random function is then given by the Karhunen–Loève expansion, which provides an optimal Fourier representation of the random function, using a basis comprised of the eigenfunctions of this operator. A natural inference problem is that of comparing the covariance structures of two samples of functional data, in order to decide whether they share the same fluctuation properties. We focus on situations where the data are not Gaussian and indeed may be characterized by the presence of influential observations.
The infinite-dimensional nature of the data means that an observation can be atypical in many ways, deviation from the mean being only one of them; observations close to the mean may contain unusual frequency components. Detection of such observations via exploratory techniques may be non-trivial. Such influential observations might significantly influence the estimation of the covariance and, even more profoundly, the quality of the estimators of its spectrum. The sensitivity of the empirical covariance operator and its spectrum to the presence of influential observations can in turn affect testing procedures for the covariance operator.

To cope with these issues, this paper introduces a class of operators that we term dispersion operators, which are implicitly defined through a variational problem motivated by M-estimators of location applied to the tensor product of the centred functional observations. It is then proposed that these operators be used as proxies for the covariance operator when inferences on the second-order structure are to be drawn for non-Gaussian and potentially contaminated functional samples. The implicit definition of a dispersion operator gives rise to a score equation, as the dispersion operator is a zero of the Fréchet derivative of the variational problem with respect to the operator argument. This functional score equation is then used as a basis to construct a test for the second-order comparison of two functional samples. The test is based on the distance of the functional score equation under the null hypothesis from zero, measured by an appropriately renormalized Hilbert–Schmidt distance. This work is motivated by and illustrated on a data set of DNA strands, which indeed is contaminated by atypical curves.

We now recapitulate the contributions of Paper B in more detail. First, the paper introduces the notion of a dispersion operator as a substitute for the usual covariance operator that is more suitable for contaminated data while still characterizing the second-order structure of the random function. To describe the second-order properties of a random element $X$ in a separable Hilbert space $H$ (without loss of generality $L^2[0,1]$), one typically considers the covariance operator $C=\mathrm{E}\{(X-\mu)\otimes(X-\mu)\}$, where $\otimes$ stands for the tensor product and $\mu=\mathrm{E}(X)$ is the mean. The covariance operator can be seen as the Hilbert–Schmidt operator that solves the variational problem
$$\min_{R\in\mathrm{HS}(H,H)}\mathrm{E}\{\|(X-\mu)\otimes(X-\mu)-R\|^2\}$$
($\mathrm{HS}(H,H)$ denotes the space of Hilbert–Schmidt operators from $H$ to $H$). The empirical covariance operator can be represented as the solution to the above optimization problem with the expectation computed with respect to the empirical distribution of the data. This being essentially a least squares problem, both the empirical covariance operator and methods based on it will be sensitive to the presence of atypical observations in the data set.

We obtain procedures pertaining to the second-order structure of $X$ that are more resistant to departures from normality and to the presence of influential observations by replacing the squared norm in the variational problem defining the covariance by a less sensitive loss function. This gives rise to a new class of second-order characteristics, which we call dispersion operators. Within this class, the most useful new choice of the loss function leads to what we call the spatial dispersion operator.
It is defined via M-estimation of the location of $(X-\mu)\otimes(X-\mu)$ as
$$\operatorname*{arg\,min}_{R\in\mathrm{HS}(H,H)}\mathrm{E}\{\|(X-\mu)\otimes(X-\mu)-R\|-\|(X-\mu)\otimes(X-\mu)\|\},$$
where $\mu$ is a suitable element of $H$ with the interpretation of a location parameter (the spatial median is a natural choice). The empirical spatial dispersion operator minimizes the sample version of the objective. By taking the Fréchet derivative we arrive at an equivalent definition of the dispersion operator as a Z-estimator solving a score equation. Proposition 1 in the paper establishes the existence and uniqueness of the (population) dispersion operator under non-restrictive assumptions on the data-generating distribution. In Corollary 1 we show that the sample dispersion operator exists and is unique under weak assumptions on the observed data, and that it is consistent for the true dispersion operator. We continue our theoretical analysis by showing an interesting link between the spectra of the dispersion and covariance operators. Although the operators are in general different, they both carry useful information on second-order properties. Proposition 2 shows that the dispersion operator has the same set of eigenfunctions as the covariance operator.

Having defined the notion of a dispersion operator, we then construct a two-sample second-order test based upon it. Let there be two independent random samples of functions, whose location parameters are $\mu_1,\mu_2$ and dispersion operators are $R_1,R_2$. The goal is to test the null hypothesis $H_0\colon R_1=R_2$ against the general alternative $H_1\colon R_1\ne R_2$. We propose to employ the general idea of score tests, that is, to base the test on the estimating score for the general model, without assuming $H_0$, evaluated at the null estimate of the parameter. As the centres $\mu_1,\mu_2$ are not restricted under the null hypothesis, they can be estimated separately. On the other hand, the common dispersion under the null is estimated by $\hat R$, which minimizes a combination of the objectives for the two samples under the restriction induced by the null. Equivalently, $\hat R$ solves a score equation under $H_0$. After a reparametrization, we arrive at a score operator whose component corresponding to the difference between the two dispersion operators reflects the validity of $H_0$. When the null hypothesis holds, the score operator is expected to be close to the zero operator; otherwise it should be far from it. To perform the test, we need to measure its distance from the zero operator and assess the significance of the resulting test statistic. We develop one particular way of doing this. It is based on spectral truncation of the score operator, which is an infinite-dimensional object (a Hilbert–Schmidt operator on $H$). We use a projection of this operator on a finite-dimensional subspace, in particular the one spanned by tensor products of the eigenfunctions of the dispersion operator. The test statistic is then obtained by combining the projection coefficients in a quadratic form. Theorem 1 establishes the weak convergence of the score operator to a mean-zero Gaussian random operator under the null hypothesis and provides a consistent estimator of its covariance operator (which is an operator on operators). It then provides the asymptotic null distribution of the score test statistic.
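As an illustration, here is a minimal sketch (illustrative names, not the authors' implementation) of the empirical spatial dispersion operator for discretized curves: it is the spatial median, in the Hilbert–Schmidt (here Frobenius) geometry and computed by a Weiszfeld-type iteration, of the rank-one tensors $(X_i-\hat\mu)\otimes(X_i-\hat\mu)$, with the centre $\hat\mu$ taken to be the spatial median of the curves; the spectral truncation and the score test built on top of it are not shown.

```python
import numpy as np

def geometric_median(points, n_iter=200, tol=1e-8):
    """Weiszfeld iteration for the spatial median of points in a Euclidean
    (here: discretized Hilbert) space; `points` has one element per row."""
    m = points.mean(axis=0)                       # starting value
    for _ in range(n_iter):
        dist = np.linalg.norm(points - m, axis=1)
        dist = np.maximum(dist, tol)              # avoid division by zero
        w = 1.0 / dist
        m_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol * (1 + np.linalg.norm(m)):
            return m_new
        m = m_new
    return m

def spatial_dispersion(curves):
    """Empirical spatial dispersion operator of discretized curves (n, T):
    the spatial median of the rank-one operators (X_i - mu)(X_i - mu)^T,
    with mu the spatial median of the curves themselves."""
    n, T = curves.shape
    mu = geometric_median(curves)
    tensors = np.array([np.outer(x - mu, x - mu).ravel() for x in curves])
    return geometric_median(tensors).reshape(T, T)

# toy usage: a Gaussian sample contaminated by one atypical curve
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)
curves = np.array([rng.normal() * np.sin(2 * np.pi * t) for _ in range(40)])
curves[0] += 20 * np.cos(4 * np.pi * t)
D = spatial_dispersion(curves)
print(D.shape)
```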
Next, the paper presents empirical results. In a simulation study, we investigate the behaviour of the test based on the spatial dispersion and of the non-resistant $L^2$ test under the null hypothesis, with and without contamination, and the impact of contamination on the power of these tests under various alternative and contamination scenarios. We also apply the proposed methodology to the data set of DNA minicircles studied in Paper A. The supplementary document contains proofs of the theoretical results.

1.4. Summary of Paper C

It is standard in the field of functional data analysis to assume that all functions are observed on the same domain. In Paper C (Kraus, 2015), we develop methods of analysis for functional data that are observed incompletely in the sense that each function might be observed only on a subset of the domain, whereas no information about the curve is available on the complement of this subset.

Our work is motivated by an ambulatory blood pressure monitoring data set that is part of the "Swiss Kidney Project on Genes in Hypertension." The data set consists of automatically recorded temporal heart rate profiles of several hundred participants. Due to either the failure of the recording device or the participant's discomfort, some values have not been measured, and the time points corresponding to unobserved values form series (intervals) of non-negligible length. The resulting data set thus consists of partially observed curves (functional fragments). Since there is only a relatively small fraction of complete curves, removing incomplete curves would considerably reduce (and possibly destroy) the accuracy of the statistical analysis. Therefore, this type of functional data necessitates the development of special methodological approaches, which is the subject of this paper. Before the appearance of this paper, relatively little work had been published on missing data in the functional context.

In this paper we introduce a formal framework for analysing incompletely observed functional data and develop basic nonparametric, fully functional (infinite-dimensional) inferential procedures. We first focus on the main building blocks of the analysis of the second-order properties: estimation of the covariance operator and principal component analysis. We propose an estimator of the covariance operator and of its eigenvalues and eigenfunctions for partially observed functions and derive their properties. We deal with the estimation of projections (principal scores) of individual incomplete functions, which is especially challenging. We develop a procedure that makes it possible to predict the value of a principal score of a function when only a fragment of the function is available and direct computation is thus impossible. Next, we propose a method that can recover the unobserved part of a function from the observed part, using information about the distribution of the data that it learns from the sample. We develop an automatic procedure, based on generalised cross-validation adapted to incompletely observed functions, for the selection of the tuning parameter of the method. We quantify the uncertainty of the predictions of unobserved quantities and provide approximate prediction regions (intervals and bands) covering the unobserved random quantity with high probability. Simulations confirm the usefulness and good performance of the proposed methodology.

We now describe the main methodological, theoretical and numerical contributions of Paper C. First, the paper formalizes the framework of partially observed functional data.
Functional data $X_1,\dots,X_n$ are seen as independent identically distributed random variables in the separable Hilbert space of square-integrable functions on a bounded domain. Without loss of generality, we consider the space $L^2([0,1])$ with inner product $\langle f,g\rangle=\int_0^1 f(t)g(t)\,dt$, $f,g\in L^2([0,1])$, and norm $\|f\|=\langle f,f\rangle^{1/2}$. In traditional functional data analysis, it is assumed that the functions $X_1,\dots,X_n$ are observed on the whole interval $[0,1]$. We consider situations where each curve $X_i$ is observed only on a subset of $[0,1]$. Specifically, let the observation periods be $O_i\subset[0,1]$, $i=1,\dots,n$. Then the observed data for the $i$th curve are $X_i(t)$, $t\in O_i$. We collectively denote the observed part of the curve as $X_{iO_i}$, which can be seen as a random element of the space $L^2(O_i)$. The values of $X_i$ on the complement of $O_i$, $M_i=[0,1]\setminus O_i$, are not observed; the missing part of the trajectory is denoted by $X_{iM_i}$. The observation periods $O_i$, $i=1,\dots,n$, are modelled as random subsets of $[0,1]$. We assume that the observation periods are independent of the functions $X_1,\dots,X_n$, that is, the data are missing completely at random.

Next, the paper focuses on the estimation of the main characteristics of the distribution that generates the data, that is, the mean function and the covariance operator. Let the mean function be $\mu=\mathrm{E}X_1$. The covariance operator $R\colon L^2([0,1])\to L^2([0,1])$ is defined as $Rf=\mathrm{E}\{\langle f,X_1-\mu\rangle(X_1-\mu)\}=\int_0^1\rho(\cdot,t)f(t)\,dt$, where $\rho(s,t)=\operatorname{cov}\{X_1(s),X_1(t)\}$ is the covariance kernel of the stochastic process $X_1$. As in the multivariate case, the mean function $\mu$ at a point $t\in[0,1]$ can be estimated by the sample mean of the values observed at this point. The estimator $\hat R$ of the covariance operator $R$ is defined through an estimator of its covariance kernel $\rho$. We estimate $\rho(s,t)$ by the sample covariance computed from all complete pairs of functional values at $s$ and $t$. It is seen that $\hat\mu(t)$ is an unbiased estimator of $\mu(t)$. Similarly, if we subtract 1 in the denominator of $\hat\rho(s,t)$, the estimator becomes unbiased for $\rho(s,t)$. For the estimators $\hat\mu$ and $\hat R$ to be consistent, we need to assume that the observation pattern asymptotically provides enough information. The exact formulation of such assumptions is provided in equations (2) and (3) in the paper. Under these weak assumptions, we obtain a consistency result in Proposition 1 of the paper. In particular, we show that the $L^2$ distance between $\hat\mu$ and $\mu$ and the Hilbert–Schmidt distance between $\hat R$ and $R$ converge to zero in quadratic mean (and hence in probability). Interestingly, the properties of the estimators are unaffected by the fact that the functions are observed only partially: the full (dense) observation regime, albeit only on subsets of the domain, preserves the convergence rates known for complete functional data.

The paper then focuses on principal component analysis, which is probably the most fundamental method for functional data since it provides insight into the complex covariance structure of functional data and can be used to identify the main sources of variability, quantify their importance and reduce the dimension of the data. The theoretical foundation of functional principal component analysis is the Karhunen–Loève theorem stating that there exist random variables $\beta_{ij}$ and nonrandom functions $\varphi_j$ such that the stochastic process $X_i$ admits the decomposition $X_i(t)=\mu(t)+\sum_{j=1}^{\infty}\beta_{ij}\varphi_j(t)$, $t\in[0,1]$, where the series converges in mean square, uniformly in $t$. Here $\varphi_j$, $j=1,2,\dots$,
are the orthonormal eigenfunctions of the operator $R$ and $\beta_{ij}$, $j=1,2,\dots$, are uncorrelated mean-zero variables with variances $\lambda_j$, where $\lambda_1\ge\lambda_2\ge\dots>0$ are the eigenvalues of $R$. Functional principal component analysis is the empirical version of the Karhunen–Loève expansion that aims to estimate the elements involved in the expansion. In the case of completely observed functional data, to estimate the eigenvalues $\lambda_j$ and eigenfunctions $\varphi_j$, one performs the eigendecomposition of the usual sample covariance operator. When the functions are observed partially, one can proceed similarly and define the estimators $\hat\lambda_j$ and $\hat\varphi_j$ as the eigenvalues and eigenfunctions of the operator $\hat R$ given by the kernel $\hat\rho$. The paper shows that the asymptotic properties of the empirical eigenvalues and eigenfunctions remain unchanged by the incompleteness of the observed functions. Proposition 2 in the paper establishes that, first, the empirical eigenvalues are consistent estimators of the true eigenvalues and this consistency is uniform over all indices, and, second, the empirical eigenfunctions are consistent estimators of the true eigenfunctions, up to the usual sign ambiguity. The rates of convergence are parametric due to the full observation regime on subsets.

The paper then moves to the most challenging contributions, which are methods of inference for individual curves based on their incomplete observation. These are prediction rather than estimation problems since they aim to provide information on random targets: the principal scores $\beta_{ij}$ and the missing part of the curve $X_{iM_i}$. In the standard situation of complete functional data, the scores are easily estimated by $\hat\beta_{ij}=\langle X_i-\hat\mu,\hat\varphi_j\rangle$. When the functional observations are incomplete, the direct computation of $\langle X_i-\hat\mu,\hat\varphi_j\rangle$ is impossible because the last term in the expression $\langle X_i-\hat\mu,\hat\varphi_j\rangle=\langle X_{iO_i}-\hat\mu_{O_i},\hat\varphi_{jO_i}\rangle+\langle X_{iM_i}-\hat\mu_{M_i},\hat\varphi_{jM_i}\rangle$ is not available. In Section 3.2 of the paper we develop a procedure to estimate (or rather predict) the missing quantity $\langle X_{iM_i}-\hat\mu_{M_i},\hat\varphi_{jM_i}\rangle$ from the observed data and establish its theoretical properties (Theorem 1, Proposition 3). We skip the description of this part in this summary and instead describe the results on the prediction of the missing part of an incomplete curve.

This task of function reconstruction (completion) is studied in Section 4 of the paper. In the population version of the problem, the best prediction of $X_M$ by a function of $X_O$ in the sense of the mean integrated squared prediction error is the conditional expectation $\mathrm{E}(X_M\mid X_O)$. It is in general a nonlinear operator from $L^2(O)$ to $L^2(M)$ and, similarly to the case of principal scores, we consider its best continuous linear approximation. Assuming for simplicity that the functional variable has mean zero, the minimisation problem to be solved is $\min_{\mathcal{A}\colon\|\mathcal{A}\|_\infty<\infty}\mathrm{E}\|X_M-\mathcal{A}X_O\|^2$, where the solution is sought in the class of continuous (bounded) linear operators from $L^2(O)$ to $L^2(M)$ (by $\|\cdot\|_\infty$ we denote the operator norm). We see (by Fréchet differentiation or direct computation) that solving this minimisation is equivalent to solving the (normal) equation $\mathcal{A}R_{OO}=R_{MO}$. This suggests the solution $\tilde{\mathcal{A}}=R_{MO}R_{OO}^{-1}$ and the best linear prediction of $X_M$ in the form $\tilde X_M=\tilde{\mathcal{A}}X_O$. From now on, we assume the existence of a bounded solution, that is, we assume that $\|R_{MO}R_{OO}^{-1}\|_\infty<\infty$.
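For concreteness, a minimal numerical sketch follows (illustrative names; curves discretized on a common grid with NaN marking unobserved points): the mean is the pointwise average of available values, the covariance kernel is the pairwise-complete empirical covariance, and a missing segment of one curve is filled in with the ridge-regularized operator $\hat R_{MO}(\hat R_{OO}+\alpha I)^{-1}$ introduced in the next paragraph; the grid step is absorbed into $\alpha$, and the generalized cross-validation choice of $\alpha$ is omitted.

```python
import numpy as np

def fragment_mean_cov(curves):
    """Pointwise mean and pairwise-complete covariance kernel of partially
    observed curves; `curves` is an (n, T) array with NaN where unobserved.
    Assumes every pair of grid points is jointly observed by some curve."""
    obs = ~np.isnan(curves)
    mu = np.where(obs, curves, 0.0).sum(axis=0) / obs.sum(axis=0)
    centred = np.where(obs, curves - mu, 0.0)
    pair_counts = obs.T.astype(float) @ obs.astype(float)
    rho = (centred.T @ centred) / pair_counts     # complete pairs at (s, t)
    return mu, rho

def ridge_complete(x, mu, rho, alpha=1e-2):
    """Ridge-regularized completion of one partially observed curve x
    (length T, NaN on its missing part M), given estimates mu and rho."""
    O = ~np.isnan(x)
    M = ~O
    R_OO = rho[np.ix_(O, O)]
    R_MO = rho[np.ix_(M, O)]
    # A = R_MO (R_OO + alpha I)^{-1}, computed via a linear solve
    A = np.linalg.solve(R_OO + alpha * np.eye(O.sum()), R_MO.T).T
    completed = x.copy()
    completed[M] = mu[M] + A @ (x[O] - mu[O])
    return completed

# toy usage: reconstruct the missing right half of the first curve
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 60)
data = np.array([rng.normal() * np.sin(2 * np.pi * t)
                 + rng.normal() * np.cos(2 * np.pi * t) for _ in range(100)])
data[0, 30:] = np.nan
mu_hat, rho_hat = fragment_mean_cov(data)
x_completed = ridge_complete(data[0], mu_hat, rho_hat, alpha=1e-2)
```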
Similarly to the case of principal scores, the inverse problem $\mathcal{A}R_{OO}=R_{MO}$ to be solved is ill-posed, that is, small perturbations of the right-hand side $R_{MO}$ can lead to large perturbations of the solution (recall that $R_{OO}$ is compact, hence its inverse is unbounded); perturbations of the right-hand side indeed need to be considered since $R_{MO}$ will only be estimated from the data in the sample version of the problem. Regularization (i.e., modification of an ill-posed inverse problem into a well-posed one) is necessary for a stable solution. Using ridge regularisation we obtain the solution $\tilde{\mathcal{A}}^{(\alpha)}=R_{MO}(R_{OO}+\alpha I_O)^{-1}$ ($\alpha>0$ is a regularization parameter, $I_O$ is the identity operator on $L^2(O)$). The regularised best linear prediction equals $\tilde X_M^{(\alpha)}=\tilde{\mathcal{A}}^{(\alpha)}X_O$. Practically, when the sample $X_{1O_1},\dots,X_{nO_n}$ is observed on the subsets $O_1,\dots,O_n$, we replace the covariance operator by its estimate and set $\hat{\mathcal{A}}_i^{(\alpha)}=\hat R_{M_iO_i}(\hat R_{O_iO_i}+\alpha I_{O_i})^{-1}$. The mean function needs to be estimated as well. For the $i$th curve, the best linear prediction of $X_{iM_i}$ is estimated by $\hat X_{iM_i}^{(\alpha)}=\hat\mu_{M_i}+\hat{\mathcal{A}}_i^{(\alpha)}(X_{iO_i}-\hat\mu_{O_i})$.

Under the assumption that the optimal reconstruction operator $\tilde{\mathcal{A}}$ is Hilbert–Schmidt, Theorem 1 of the paper proves the consistency of the estimated best linear reconstruction. That is, we show that, as the size of the training sample increases and the amount of regularization decreases, the $L^2$-distance between the theoretical best reconstruction and its regularized estimate converges to zero in quadratic mean, and we provide the rate of this convergence. It was later pointed out in the literature that our results are obtained under unnecessarily strong assumptions. Therefore, in a follow-up paper (Kraus and Stefanucci, 2020, not included here), we generalize the consistency result by relaxing the assumption that the true optimal linear reconstruction operator is Hilbert–Schmidt. It turns out that it is not even necessary to assume that the optimal reconstruction operator is bounded, and the ridge regularization method (whose regularized operator is Hilbert–Schmidt) still performs optimally in the limit. The follow-up paper explains this in the context of reproducing kernel Hilbert space theory.

The paper provides an estimator of the asymptotic covariance operator of the predictive distribution (the error between the prediction and the target random process) and proves its consistency (Proposition 5). This enables the construction of prediction intervals. To address the problem of the selection of the regularization parameter $\alpha$, the paper develops a generalized cross-validation procedure for partially observed data. A simulation study is carried out to address the following goals: to investigate the performance of generalized cross-validation as a selector of the regularization parameter, to verify the validity and accuracy of the prediction intervals and bands, and to explore the effect of the observation pattern. Finally, the performance of the proposed methodology is illustrated on the motivating data set of incomplete heart rate temporal profiles. Proofs of all formal statements are provided in the appendix and in a supplement.

1.5. Summary of Paper D

In Paper D (Kraus and Stefanucci, 2019), we consider the classification of a functional observation into one of two groups. We formulate the theoretical (population) problem of determining the best classifier as a quadratic optimization problem on a function space, or, equivalently, as a linear inverse problem.
These problems are ill-posed but, unlike in most inverse problems, this is not a complication but rather an advantage, in the sense that the more ill-posed the problem is, the better the optimal misclassification probability. We use regularization techniques, such as the method of conjugate gradients with early stopping and ridge regularization, to solve the optimization problem, yielding a class of regularized linear classifiers. The optimal misclassification rate is the limit along the regularization path of solutions, which themselves may not converge. We study the empirical (sample) version of the problem, where the objective function in the constrained minimization must be estimated from finite training data. We show that it is possible to construct an empirical regularization path towards the possibly non-existent unconstrained solution so that the classification error converges to its best value, possibly zero. We do this for conjugate gradient, principal component and ridge classification, in a truly infinite-dimensional manner, in the sense that the convergence takes place along a path with decreasing regularization and holds without restrictions on the mean difference between classes.

All our methodology and theory is developed in the setting of partially observed functional data, where trajectories are observed only on subsets of the domain. The principal difficulty for inference with fragments is that temporal averaging is precluded by the incompleteness of the observed functions. Our formulation as an optimization problem enables us to overcome this issue under certain assumptions because only averaging across individuals in the training data is needed, not averaging along individual curves. We propose a domain selection strategy that looks for the best classifier with the domain ranging from a minimum common domain of the training sample to the entire domain of the function to be classified. Our simulation study confirms that domain selection can considerably reduce the misclassification rate. Further simulations compare the performance of the three types of regularization. Among other findings, this study shows that the principal component and conjugate gradient classifiers often achieve comparable error rates but the latter usually needs a lower dimension of the regularization subspace, in agreement with a theoretical result we provide. Application to a data set on the geometric features of the internal carotid artery in patients with and without aneurysm demonstrates the utility of the proposed methodology.

A more detailed overview of the results of Paper D follows. We consider the classification of a Gaussian random function, $X$, into one of two groups of Gaussian random functions. Group 0 has mean $\mu_0$, group 1 has mean $\mu_1$, and both groups have covariance operator $R$. We first assume that $\mu_0$, $\mu_1$ and $R$ are known, which corresponds to the asymptotic situation with an infinite training sample. We consider the class of centroid classifiers that are based on one-dimensional projections of the form $\langle X,\psi\rangle$, where $\psi$ is a function in $L^2(I)$. Given $\psi$, the optimal classifier based on $\langle X,\psi\rangle$ assigns $X$ to the class $C_\psi(X)$ given by $C_\psi(X)=\mathbb{1}_{\{T_\psi(X)>0\}}$, where $T_\psi(X)=\langle X-\bar\mu,\psi\rangle\langle\mu,\psi\rangle$ with $\bar\mu=(\mu_0+\mu_1)/2$ and $\mu=\mu_1-\mu_0$. The misclassification probability of this classifier is
$$1-\Phi\left(\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\right).$$
The task of finding the best function $\psi\in L^2(I)$ leads to the maximization of the argument of $\Phi$ above.
We discuss when this problem can be solved within $L^2$ (i.e., there is an $L^2$ function $\psi$ that achieves the best error rate), when it cannot be solved within $L^2$ (i.e., the best error rate is achieved by a linear functional that is unbounded, hence not of the form $\langle X,\psi\rangle$) and what value the optimal error rate can take (remarkably, it may be zero, corresponding to perfect classification). This discussion connects the Hájek–Feldman dichotomy between Gaussian measures, the theory of reproducing kernel Hilbert spaces and constrained convex optimization. The optimization to be solved corresponds to the task of maximizing $\langle\mu,\psi\rangle$ subject to $\langle\psi,R\psi\rangle=1$, which translates into the unconstrained quadratic optimization problem of minimizing $\langle\psi,R\psi\rangle/2-\langle\mu,\psi\rangle$, i.e., into the linear inverse problem $R\psi=\mu$.

This formulation is the starting point for the definition of regularized classifiers. Regardless of whether there is a solution (i.e., whether $\psi=R^{-1}\mu$ exists in $L^2(I)$), one can consider an approximating, regularized problem that can be solved. Regularization is typically used to solve, in a stable way, ill-posed inverse problems whose solution exists. There, the path of regularized solutions converges to the solution of the problem of interest. Here no solution may exist, but paths of regularized solutions towards the possibly non-existent solution still turn out to be useful, since the misclassification probability converges to the optimal value along these paths. We consider three regularization methods: the principal component method (which solves the optimization in a subspace spanned by leading principal components), the conjugate gradient method (which uses the numerical method of conjugate gradients with early stopping) and the ridge method (which solves the optimization in a ball). In Propositions 1 and 3 in the paper we provide an asymptotic analysis of these methods which shows that, as the amount of regularization decreases, the misclassification rate along the regularization path converges to the optimal value. This is true even when there is no bounded solution to the problem (i.e., $R^{-1}\mu\notin L^2(I)$) and also in the "even more ill-posed" case of perfect classification (i.e., $R^{-1/2}\mu\notin L^2(I)$). Proposition 2 compares the two methods that use a subspace for regularization, i.e., principal components and conjugate gradients, and shows that the error rate of the former is always higher than or equal to that of the latter when the same dimension is used.

We then present the empirical version with a finite training data set. Motivated by a medical data set, we do so in the case of incomplete curves. Incompleteness can occur in the training data, with each curve possibly observed on a different domain, and in the new curve we wish to classify. A simple approach would be to consider all curves on the intersection of their observation domains, if it is non-empty, or to discard incomplete curves. However, such restrictions may be too severe and can be avoided. For group $j$ let there be a training sample consisting of $n_j$ independent curves $X_{j1},\dots,X_{jn_j}$ that may be observed incompletely, with values known only on a subset $O_{ji}$ of the domain. Then, similarly to Paper C, the mean $\mu_j$ of group $j$ can be estimated by the cross-sectional average and the covariance kernel $\rho(s,t)$ can be estimated by the empirical covariance using pairwise complete observations of groupwise centred curves. Let the new, independent curve to be classified, $X_{\mathrm{new}}$, be observed on the domain $O_{\mathrm{new}}$.
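Before the formal definition in the next paragraph, here is a minimal sketch of the ridge variant under these conventions (illustrative names; curves discretized on a common grid with NaN on unobserved points): group means and a pooled pairwise-complete covariance kernel are estimated from the possibly incomplete training curves, the projection direction $\hat\psi=(\hat R+\alpha I)^{-1}\hat\mu$ is computed on the observation domain of the new curve, and the curve is assigned according to the sign of $T_{\hat\psi}$; cross-validation of $\alpha$, the other two regularization schemes and the domain-selection step are omitted.

```python
import numpy as np

def group_estimates(group0, group1):
    """Group means and pooled pairwise-complete covariance kernel from two
    (n_j, T) arrays of partially observed curves (NaN = unobserved)."""
    means, centred, masks = [], [], []
    for g in (group0, group1):
        obs = ~np.isnan(g)
        mu = np.where(obs, g, 0.0).sum(axis=0) / obs.sum(axis=0)
        means.append(mu)
        centred.append(np.where(obs, g - mu, 0.0))
        masks.append(obs.astype(float))
    Xc, Mk = np.vstack(centred), np.vstack(masks)
    rho = (Xc.T @ Xc) / (Mk.T @ Mk)        # pooled pairwise-complete covariance
    return means[0], means[1], rho

def ridge_centroid_classify(x_new, mu0, mu1, rho, alpha=1e-2):
    """Assign x_new (NaN on its unobserved part) to group 0 or 1 with the
    ridge-regularized centroid classifier restricted to its observation domain;
    the grid step cancels from the sign of the decision statistic."""
    O = ~np.isnan(x_new)
    mu = (mu1 - mu0)[O]                              # mean difference on O
    psi = np.linalg.solve(rho[np.ix_(O, O)] + alpha * np.eye(O.sum()), mu)
    T = np.dot(x_new[O] - (mu0 + mu1)[O] / 2, psi) * np.dot(mu, psi)
    return int(T > 0)
```

In practice $\alpha$ would be chosen by cross-validation of the training misclassification rate, and the same functions can be rerun on candidate subdomains to mimic the domain-selection step discussed below.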
The empirical classifier $\hat C_{\hat\psi}$ trained on partially observed curves is defined like the theoretical one but with the unknown quantities replaced by their estimators. The projection direction $\hat\psi$ is constructed by conjugate gradient, principal component or ridge regularization applied to the estimates $\hat\mu$ and $\hat R$ (defined through the estimated kernel $\hat\rho(s,t)$), restricted to the domain of the new curve to be classified (or, possibly, a subset of that domain). In the theoretical analysis, we study the behaviour of the classifiers for incomplete training samples of increasing size with a decreasing amount of regularization. We study the conjugate gradient method with an increasing number of steps, the principal component method with an increasing number of eigenfunctions and the ridge method with a decreasing ridge parameter in Theorems 1, 2 and 3, respectively. The theorems show that under specific regularity conditions they all asymptotically achieve the optimal (Bayes) misclassification probability along the empirical regularization path, as if there were infinite training data. This holds regardless of whether the theoretical best projection classifier exists as a bounded linear functional and whether the best error rate is positive or zero. Similarly to the problem of function reconstruction in Paper C and the follow-up paper Kraus and Stefanucci (2020), classification is also a prediction rather than an estimation task, and we observe a similar interesting phenomenon involving a possibly non-convergent regularization path along which the predictive performance converges to its optimum.

Further, we propose a domain selection procedure that aims to find the best domain on which the classification is performed. The method searches for the best domain between two extremes, the common domain of all training curves and the domain of the curve to be classified, in order to capture the region of the domain where the two classes are best discriminated.

The numerical part of the paper presents a simulation study, which compares the behaviour of the different regularization methods, investigates the performance of cross-validation for the selection of the regularization parameters, studies the impact of partial observation and demonstrates the usefulness of the domain selection procedure. In a data example, we analyze a set of curves describing blood vessel morphology in persons with and without aneurysm. The analysis shows an improvement of classification accuracy in comparison with existing methods due to the use of incomplete data and domain selection. Further generalizations and numerical results are contained in the supplementary document.

1.6. Summary of Paper E

Inspired by the data set of heart rate profiles, Paper E (Kraus, 2019) deals with another aspect of partially observed functional data. Although some advanced procedures, such as goodness-of-fit tests, regression, classification and reconstruction methods, had been developed for functional fragments, basic methods of inference about the fundamental characteristics of functional variables were still missing at the time of writing. In particular, the asymptotic distribution of estimators of the mean function and covariance operator, $K$-sample tests of equal means or covariances, and confidence intervals for eigenvalues and eigenfunctions had not yet been studied in the setting of incomplete functions.
Users wishing to perform these basic tasks had only one option: to omit the partially observed functions and apply existing procedures to the complete data only. This approach is not only clearly sub-optimal due to a possibly large loss of information and the resulting loss of power and accuracy, but also hardly or not at all applicable in situations where the data contain few or no complete curves. In this paper, we address this deficiency of existing methodology and develop essential methods of inference about the mean and covariance structure of incomplete functional data. We find appropriate assumptions on the observation pattern that enable us to establish the asymptotic distribution of the estimators of $\mu$ and $R$. We develop tests for comparing the mean functions in $K$ populations of functional data based on samples of fragments. Next, we propose several tests of equal covariance operators in $K$ samples. We also construct confidence intervals for the eigenvalues and eigenfunctions estimated from incomplete data.

The practical implementation of methods for functional fragments is more complicated than for complete curves. The main difficulty is that temporal averaging (e.g., in inner products for dimension reduction) is impossible due to missing values. This leads to asymptotic distributions whose parameters follow rather complicated formulas. More importantly, since dimension reduction is not possible, the asymptotic distributions are, upon discretization, characterized by large objects (matrices or arrays) that are difficult or even impossible to store and manipulate in computer memory. The bootstrap turns out to be a solution to this problem. We provide specific algorithms for resampling functional fragments for mean and covariance testing and for confidence intervals for eigenelements. Our simulation study shows that the proposed methods are superior to the only previously available approach, based on omitting incomplete curves.

Let us now describe the contributions of Paper E more specifically. First, we focus on inference about the mean of functional data. We consider estimation of the mean function $\mu$ by the cross-sectional average of the available observations, as before. In Kraus (2015, Proposition 1) (Paper C) it was shown that under non-restrictive assumptions on the observation pattern such an estimator, $\hat\mu$, is consistent. Paper E goes further and provides the asymptotic distribution of the estimator, which is essential in the derivation of the limiting distributions of test statistics. The paper introduces sets of conditions on the observation pattern. Then it is shown in Theorem 1 that the estimator $\hat\mu$ is asymptotically distributed as a Gaussian process, and a consistent estimator of the limiting covariance operator is provided.

Next, we consider $K$ independent samples of incompletely observed functional data. Our aim is to test the null hypothesis that all $K$ mean functions are equal against the general alternative that the null does not hold. In the literature on complete functional samples there exist two main approaches to comparing mean functions: one is based on the $L^2$ distance between the means and one uses projections on finite-dimensional subspaces. We explore both approaches in the fragmentary setting. Test statistics are constructed and their null asymptotic distributions are obtained under appropriate assumptions.

Next, we develop methods of second-order inference for functional fragments.
The covariance function $\rho(s,t)$ can be estimated by the empirical covariance using pairwise complete observations. We previously showed that under certain assumptions on the observation pattern the operator $\hat R$ with kernel $\hat\rho(s,t)$ consistently estimates $R$. Paper E provides a deeper asymptotic study. We determine conditions on the pattern of missingness that guarantee the weak convergence of the properly normalized difference between $\hat R$ and $R$ to a Gaussian random operator (Theorem 3). These conditions in particular do not require the existence of any completely observed curves in the data. An estimator of the limiting covariance structure is provided. Then we study the estimators $\hat\lambda_m$ and $\hat\varphi_m$ of the eigenvalues and eigenfunctions of $R$. The estimators are obtained by the eigendecomposition of $\hat R$. Theorem 4 establishes their asymptotic distributions with the help of perturbation theory. The theorem generalizes the classic results for completely observed functions. Next, we study tests for the equality of covariance operators of several populations. Tests of this null hypothesis can be based on the differences between the estimators $\hat R_j$ and a null estimator $\hat R$. We propose two types of tests measuring the importance of these contrasts: one approach is based on the Hilbert–Schmidt norm of the contrasts and one is based on their projections on a subspace. We give the asymptotic distributions of the Hilbert–Schmidt and projection statistics in Theorem 5. As an alternative, we explore an approach (previously proposed by other authors for complete curves) that takes into account the fact that covariance operators do not form a linear subspace of the Hilbert space of Hilbert–Schmidt operators and uses the square-root distance instead of the difference of covariances.

Section 4 of the paper deals with practical issues that arise due to partial observation. Functional data procedures are implemented in practice by discretization. Functions then correspond to vectors (possibly with missing values), operators on the function space correspond to matrices and operators on operators correspond to four-way arrays. The direct implementation of the confidence sets and tests using the asymptotic distributions may be excessively demanding in terms of computer memory, especially in the case of covariance inference. Projection covariance tests for complete functions can avoid the computation, storage and manipulation of large arrays by computing the principal scores of each function with respect to the required low number $d$ of eigenfunctions (Paper A, for example, takes this approach). This dimension reduction approach is not applicable in the case of incomplete functions because the principal scores cannot be computed (temporal averaging is precluded by the incompleteness of the curves). Similar problems arise with Hilbert–Schmidt norm tests, which involve a large eigenproblem that cannot be reduced due to missingness. To overcome these difficulties we use the bootstrap. We propose algorithms for mean and covariance testing and for the construction of confidence intervals that are based on the resampling of functional fragments.
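To indicate what such resampling may look like, here is a minimal sketch (illustrative names, not the paper's algorithm verbatim) of a two-sample bootstrap test of equal mean functions for fragments: curves are resampled together with their observation sets after within-group centring, and the $L^2$ distance between the fragment-based group means serves as the test statistic.

```python
import numpy as np

def fragment_mean(curves):
    """Pointwise mean of partially observed curves ((n, T) array, NaN = unobserved)."""
    obs = ~np.isnan(curves)
    return np.where(obs, curves, 0.0).sum(axis=0) / obs.sum(axis=0)

def l2_mean_distance(a, b, grid_step):
    # grid points unobserved in a resample are skipped via nansum
    return grid_step * np.nansum((fragment_mean(a) - fragment_mean(b)) ** 2)

def bootstrap_mean_test(a, b, grid_step, n_boot=999, seed=0):
    """Bootstrap p-value for equality of the two mean functions; each curve is
    resampled together with its observation set, after within-group centring,
    so that resampling approximates the null distribution of the statistic."""
    rng = np.random.default_rng(seed)
    stat = l2_mean_distance(a, b, grid_step)
    a0 = a - fragment_mean(a)          # centred fragments keep their NaN pattern
    b0 = b - fragment_mean(b)
    count = 0
    for _ in range(n_boot):
        a_star = a0[rng.integers(0, a0.shape[0], a0.shape[0])]
        b_star = b0[rng.integers(0, b0.shape[0], b0.shape[0])]
        if l2_mean_distance(a_star, b_star, grid_step) >= stat:
            count += 1
    return (count + 1) / (n_boot + 1)
```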
In the numerical part of the paper, we perform a simulation study whose main goal is to investigate the impact of partial observation on the performance of the different mean and covariance tests and to compare the proposed tests using complete and incomplete curves with the simple approach using complete curves only. We also analyze the data set of incomplete heart rate curves. All technical proofs are collected in the appendix. A supplement provides further numerical results.

References

Bosq, D. (2000). Linear Processes in Function Spaces. Springer, New York.

Ferraty, F. and Romain, Y., editors (2011). The Oxford Handbook of Functional Data Analysis. Oxford University Press, Oxford.

Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis. Springer, New York.

Horváth, L. and Kokoszka, P. (2012). Inference for Functional Data with Applications. Springer, New York.

Hsing, T. and Eubank, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley.

Kokoszka, P. and Reimherr, M. (2017). Introduction to Functional Data Analysis. CRC Press.

Kraus, D. (2015). Components and completion of partially observed functional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):777–801.

Kraus, D. (2019). Inferential procedures for partially observed functional data. Journal of Multivariate Analysis, 173:583–603.

Kraus, D. and Panaretos, V. M. (2012). Dispersion operators and resistant second-order functional data analysis. Biometrika, 99(4):813–832.

Kraus, D. and Stefanucci, M. (2019). Classification of functional fragments by regularized linear classifiers with domain selection. Biometrika, 106(1):161–180.

Kraus, D. and Stefanucci, M. (2020). Ridge reconstruction of partially observed functional data is asymptotically optimal. Statistics & Probability Letters, 165:108813.

Panaretos, V. M., Kraus, D., and Maddocks, J. H. (2010). Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. Journal of the American Statistical Association, 105(490):670–682.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer, New York.

A. Second-order comparison of Gaussian random functions and the geometry of DNA minicircles

By Victor M. Panaretos, David Kraus, and John H. Maddocks
Journal of the American Statistical Association, 105(490):670–682, 2010
DOI: 10.1198/jasa.2010.tm09239

Second-Order Comparison of Gaussian Random Functions and the Geometry of DNA Minicircles

Victor M. PANARETOS, David KRAUS, and John H. MADDOCKS

Given two samples of continuous zero-mean iid Gaussian processes on [0,1], we consider the problem of testing whether they share the same covariance structure. Our study is motivated by the problem of determining whether the mechanical properties of short strands of DNA are significantly affected by their base-pair sequence; though expected to be true, this had so far not been observed in three-dimensional electron microscopy data. The testing problem is seen to involve aspects of ill-posed inverse problems and a test based on a Karhunen–Loève approximation of the Hilbert–Schmidt distance of the empirical covariance operators is proposed and investigated. When applied to a dataset of DNA minicircles obtained through the electron microscope, our test seems to suggest potential sequence effects on DNA shape. Supplemental material available online.

KEY WORDS: Covariance operator; DNA shape; Functional data analysis; Hilbert–Schmidt norm; Karhunen–Loève expansion; Regularization; Spectral truncation; Two-sample testing.
1. INTRODUCTION

The understanding of the mechanical properties of the DNA molecule constitutes a fundamental biophysical task, as important biological processes, such as the packing of DNA in the nucleus or the regulation of genes, can be affected by properties such as stiffness and shape (Vilar and Leibler 2003; Tolstorukov et al. 2005). The study of these properties can focus on different scales, and accordingly involves a variety of mathematical models and techniques. At a coarse-grained level, the behavior of short (of the order of 150 base pairs) strands of DNA is likened to that of a continuous elastic rod. By means of a reaction called cyclization, two ends of this elastic rod bend and twist and bind together to form a loop called a DNA minicircle. These three-dimensional cyclic structures are an excellent specimen for examining the elastic properties of DNA since a minicircle is in a naturally stressed state without the application of external forces. Furthermore, the short length of these strands will amplify the dependence of the mechanistic behavior on intrinsic factors such as the specific base pair sequence.

Such sequence-dependent shape characteristics are of special interest as they potentially reveal a dual purpose of the DNA base-pair sequence: in addition to holding the genetic code, the sequence may influence the geometric properties of the molecule. While in principle certain particular subsequences are expected to have a strong effect on the mechanical properties of DNA, empirical detection of this effect on stereological data acquired through the electron microscope has been elusive (Hagerman 1988; Amzallag et al. 2006). A specific example is that of a subsequence called the TATA box, which promotes gene transcription. It is thought that the mechanical properties of this subsequence are intimately related with its function, and that its presence in a DNA minicircle will enhance its flexibility. Nevertheless, exploratory comparisons between reconstructed minicircles from microscope images containing TATA boxes with reconstructed minicircles with no TATA box did not reveal any effects due to the presence of the sequence (Amzallag et al. 2006).

[Author note: Victor M. Panaretos is Assistant Professor (E-mail: victor.panaretos@epfl.ch), David Kraus is Postdoctoral Researcher, and John H. Maddocks is Professor, Section de Mathématiques, Ecole Polytechnique Fédérale de Lausanne, Lausanne 1015, Switzerland. The authors thank the editor, associate editor, and two referees for providing detailed and constructive comments, and for their fruitful suggestions. The last author wishes to acknowledge support from FN grant 205320-112178.]

Motivated by the need of two-sample comparison of loops, as exemplified in DNA minicircle experiments, this article considers the problem of second-order comparison of two samples of random functions, within a functional data analysis framework. In particular, given realisations of $n_1$ and $n_2$ independent copies of two continuous zero mean Gaussian processes $X$ and $Y$ on a compact set, we consider the problem of testing the hypothesis $H_0\colon R_X=R_Y$ against the alternative $H_A\colon R_X\ne R_Y$, where the covariance operators $R_X$, $R_Y$ are not necessarily stationary.

The literature on hypothesis testing for functional data is mostly concentrated on tests pertaining to the mean function (Fan and Lin 1998), as encountered, for instance, in functional linear models (Cardot et al. 2003; Cuevas, Febrero, and Fraiman 2004; Shen and Faraway 2004) or functional change detection (Berkes et al.
2009). Hall and Van Keilegom (2007) studied the important issue of the effect that the data smoothing step may have on two-sample testing. Second-order tests for functional data analysis pertaining to serial correlation were also investigated (e.g., Gabrys and Kokoszka 2007; Horváth, Hušková, and Kokoszka 2010). Although the seeds of functional two-sample covariance tests can be found in Grenander (1981), the problem of second-order comparison of functional data has—interestingly—so far received relatively little attention. A related recent article by Benko, Härdle, and Kneip (2009) proposed two-sample bootstrap tests for specific aspects of the spectrum of functional data, such as the equality of a subset of the eigenfunctions, or—assuming that the eigenfunctions are shared—equality of a subset of eigenvalues. In this article, we consider the difficulties associated with this testing problem, and it is seen that the extension of finite-dimensional procedures can lead to complications, as the infinite-dimensional version of the problem constitutes an ill-posed inverse problem. As an alternative solution, we propose a test based on the approximation of the Hilbert–Schmidt distance of the empirical covariance operators of the two samples of functions based on the Karhunen–Loève expansion. The asymptotic distribution of the test statistic is determined and its performance is investigated computationally. The application of our methodology to an electron microscope dataset of two groups of minicircles characterized by the presence or absence of a TATA box suggests the potential existence of significant differences in the two groups, which eluded previous analyses as these focused on the mean (the shape of the minicircle), whereas we detect the differences in the covariance structure (the flexibility/stiffness).

The article is organized as follows. The next section describes the three-dimensional functional dataset of DNA minicircles, from acquisition to registration, and includes a preliminary exploratory analysis. The first part of the third section then provides some functional data analysis background. Section 3.2 introduces our spectral test statistic and develops its asymptotic distribution, while Section 3.3 treats the problem of tuning the amount of regularization. In Section 4 the power and level of the test under various scenarios are investigated by means of simulation. Section 5 presents the results of a two-sample analysis of the DNA minicircles through the spectral test statistics, and the article concludes with a short discussion.

2. DNA MINICIRCLE DATA

The dataset of interest was reconstructed from electron micrographs imaged by Jan Bednar at the Laboratory of Ultrastructural Analysis of the University of Lausanne, Switzerland. A total of 99 DNA minicircles of 158 base-pair length were vitrified and imaged under two different angles, yielding two projected images of the same specimen, which were then used to reconstruct three-dimensional structural models (Jacob et al. 2006).
The reconstructed data consist of 99 closed curves (DNA minicircles) in R3 of two types: both types have identical base pair sequences, except for a 14 base-pair window where 65 curves contain the TATA sequence, while the remaining 34 contain a different sequence, called a CAP sequence. Biophysical considerations suggest that the presence of a TATA box will have a significant effect on the geometry of the minicircle, and the goal is to compare these two groups to probe for such an effect. In its reconstructed form, each curve is represented as a combination of periodic B-spline basis functions taking values in R3. To perform a functional data analysis of the minicircles it is required to register the data. Each curve has thus been centered and scaled, so that the center of mass is at zero and the length of the curve is one. The nature of the experimental setup in single-particle electron microscopy requires that the minicircles be imbedded unconstrained in the aqueous solution, so that the reconstructed curves are not aligned: the original (x,y,z)coordinates for the different curves are not directly comparable as each curve was subjected to a random unobservable orthogonal transformation. It is thus necessary to align the curves. Landmark alignment methods (e.g., Gasser and Kneip 1995) are not applicable as the exact DNA sequence is not detectable from an electron micrograph. On the other hand, more flexible methods such as warping (e.g., Gervini and Gasser 2004; Tang and Müller 2008) are inappropriate since nonrigid alignment will alter the second-order properties that are of principal interest. As an alternative, we rigidly align curves by their intrinsic characteristics: each curve was individually aligned using the coordinate system induced by its moments of inertia tensor (e.g., Arnold 1989), which is described as follows. Consider an object in three dimensions described by a mass distribution μ—for example, for a DNA minicircle, μ will be the uniform measure supported on the curve. Suppose that the object is rotating around an axis, which without loss of generality, is given by span(u) := {λu:λ ∈ R} for some u ∈ S2. Let r(u,x) := (I − uu )x denote the distance of a point x from the subspace span(u). The moment of inertia of the object around the axis u is given by J (u) := R3 r2 (u,x)μ(dx) = R3 (I − uu )x 2 μ(dx). Given a coordinate system defined by an orthonormal basis, say the canonical basis (e1,e2,e3), we can use only these basis vectors to compactly represent the moment of inertia with respect to any other axis passing by the origin. Define the inertia matrix as J := R3 x (ei ejI − eiej )xμ(dx) i,j . Notice that the diagonal elements of the above matrix are the moments of inertia with respect to the axes of the coordinate system. The moment of inertia around any unit vector u can now be recovered as J (u) = u Ju. Since the tensor is symmetric, it possesses real eigenvalues and orthonormal eigenvectors forming a basis, which admit the following interpretation: the first eigenvector, say w1, determines the axis (first principal axis of inertia, PAI1) around which the curve is most difficult to rotate, in the sense that the corresponding angular moment is maximized: w1 Jw1 ≥ u Ju for any other u ∈ S2. The projection on the plane orthogonal to w1 is “most spread” in this sense. The second eigenvector determines the axis within the first principal plane around which the projected curve is most difficult to rotate. 
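As a concrete illustration of this alignment step, the following minimal sketch (Python with NumPy; not the authors' code, and the function names inertia_matrix and align_curve are hypothetical) computes the inertia matrix of a discretised curve and expresses the curve in the coordinate system of its principal axes of inertia; the choice of starting point and orientation described below is omitted.

```python
import numpy as np

def inertia_matrix(points):
    """Inertia matrix J of a discretised curve in R^3.

    `points` is an (m, 3) array of curve samples, assumed centred and
    carrying equal mass (a uniform measure on the curve), so the integral
    is approximated by an average over the sampled points.
    """
    sq = np.sum(points**2, axis=1)                      # ||x||^2 at each point
    return (np.eye(3) * sq.sum() - points.T @ points) / len(points)

def align_curve(points):
    """Centre and scale a closed curve, then express it in the coordinate
    system of its principal axes of inertia (columns: PAI1, PAI2, PAI3)."""
    x = points - points.mean(axis=0)                    # centre of mass at zero
    seg = np.linalg.norm(np.diff(np.vstack([x, x[:1]]), axis=0), axis=1)
    x = x / seg.sum()                                   # total curve length one
    eigval, eigvec = np.linalg.eigh(inertia_matrix(x))  # ascending eigenvalues
    W = eigvec[:, ::-1]                                 # PAI1 = largest moment of inertia
    return x @ W                                        # coordinates on PAI1, PAI2, PAI3

# toy example: a noisy planar ellipse subjected to a random orthogonal transform
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
ellipse = np.c_[2 * np.cos(t), np.sin(t), 0.1 * rng.standard_normal(t.size)]
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))        # unobservable transform
aligned = align_curve(ellipse @ Q.T)
```

For the toy ellipse, the first aligned coordinate (PAI1) essentially carries only the small out-of-plane noise, while the in-plane structure appears in the PAI2 and PAI3 coordinates, matching the interpretation of the principal plane of inertia given in the text.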
That is, within the first principal plane, the projection on the line orthogonal to PAI2 is most spread. Hence, PAI3 carries the most spatial information, whereas PAI1 contains the smallest amount of information. Then, for each curve, the starting point was determined as the point where the projection on the first principal plane intersects the horizontal (PAI2) positive semi-axis and the orientation was chosen as counterclockwise in this plane (i.e., at the beginning the PAI3 coordinate increases from zero and PAI2 is positive). The projections onto the principal axes of the minicircle curves are depicted in Figures 1 and 2. The data appear to be well aligned, and seem to be elliptical on average within the principal plane of inertia. Deviations from this principal plane, on the other hand, seem to be lacking systematic structure. The effectiveness of this alignment method is of crucial importance, as we will not be able to otherwise proceed with the testing problem (procrustean alignment of the curves will require us to optimize a sum of squares criterion with respect to 99 orthogonal transformations). A visual inspection reveals five curves (plotted with dashed lines) that appear to be “standing out” of the rest—outliers in a broad sense. Judging whether or not a curve (an infinite dimensional object) is an outlier or not can be far trickier than in the vector case. In particular, it can be that there are further “outlying curves” that do not appear to stick out of the crowd, but are nevertheless intrinsically different from the rest. For this reason, we pursue a robust analysis for the mean curve 672 Journal of the American Statistical Association, June 2010 Figure 1. Projection of DNA curves on the first principal plane. Five removed outlying observations plotted in dashed lines. The mean curves (in white) are computed without outlying observations. using a functional median introduced in Gervini (2008). The idea is simple: an iterative robust procedure will assign weights to each curve, and we can then detect outlying curves by looking at small weights. The method confirms our visual intuition, and reveals no further outliers. The outlying observations are removed, and after this preprocessing stage we are left with 94 aligned smooth curves. 3. METHODS 3.1 Background: FDA and Karhunen–Loève Expansions We adopt a functional data analysis perspective (Ramsay and Silverman 2005; Ferraty and Vieu 2006) and model each curve as the realization of a stochastic process indexed by the closed interval [0,1] and taking values in R3 (but everyFigure 2. Coordinates of DNA curves on the principal axes of inertia. Five removed outlying observations plotted with dashed lines. Mean curves (in white) are computed without outlying observations. Panaretos, Kraus, and Maddocks: Second-Order Functional Comparisons and DNA Geometry 673 thing readily extends to the case of Rd). In particular, we assume that we have two independent collections X1,...,Xn1 and Y1,...,Yn2 , of iid Gaussian processes on [0,1], considered as random elements of the Hilbert space L2[0,1] of coordinate-wise square-integrable R3-valued functions with the inner product f,g = 1 0 f(t) g(t)dt. Here, f(t) represents the transpose of the vector-valued function f(t) ∈ R3. Assuming, without loss of generality, that the mean functions are zero, the processes are characterized by their respective covariance kernels RX(s,t) = cov(Xi(s),Xi(t)) = E{Xi(s)Xi (t)}, and RY(s,t), respectively. 
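To connect this setup with computation, the following sketch (Python with NumPy; purely illustrative, not the authors' code, with hypothetical names) forms the empirical covariance of R3-valued curves sampled on a regular grid and extracts its eigenvalues, eigenfunctions and scores, anticipating the Karhunen–Loève machinery recalled next; integrals are approximated by Riemann sums with spacing dt.

```python
import numpy as np

def fpca(curves, dt):
    """Empirical covariance operator and its eigendecomposition for curves
    sampled on a regular grid.

    `curves` has shape (n, m, 3): n curves, m grid points, values in R^3.
    The inner product <f, g> = int f(t)' g(t) dt is approximated by
    dt * sum_t f(t)' g(t).
    """
    n, m, d = curves.shape
    flat = curves.reshape(n, m * d)              # each curve as one long vector
    centred = flat - flat.mean(axis=0)
    cov = centred.T @ centred / n                # empirical covariance kernel on the grid
    eigval, eigvec = np.linalg.eigh(cov * dt)    # discretised operator includes the dt factor
    order = np.argsort(eigval)[::-1]             # nonincreasing eigenvalues
    eigval, eigvec = eigval[order], eigvec[:, order]
    eigfun = eigvec / np.sqrt(dt)                # normalised so that int ||phi_k||^2 dt = 1
    scores = centred @ eigfun * dt               # Fourier coefficients <X_i - Xbar, phi_k>
    return eigval, eigfun.reshape(m, d, -1), scores   # eigenfunctions as (grid, 3, component)

# toy usage with synthetic curves (n = 30 curves on a grid of 100 points)
rng = np.random.default_rng(1)
m, dt = 100, 1.0 / 100
t = np.arange(m) * dt
basis = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), np.sin(4 * np.pi * t)])
X = (rng.standard_normal((30, 3)) * [1.0, 0.5, 0.2]) @ basis
curves = np.stack([X, 0.5 * X, 0.1 * rng.standard_normal((30, m))], axis=2)
eigval, eigfun, scores = fpca(curves, dt)
print(eigval[:5])                                # scree of the leading eigenvalues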
Associated with the covariance kernel is the covariance operator RX :L2[0,1] → L2[0,1] defined as RX(f)(t) = cov( Xi,f ,Xi(t)) = 1 0 RX(t,s)f(s)ds. Throughout the article, we will be assuming RX to be continuous, so that RX is bounded and the X process is continuous (resp. the Y process). Inference for iid collections of infinite-dimensional random elements is often carried out in practice by an “optimal” reduction to a finite-dimensional setting, using finitely many appropriately chosen contrasts in a functional principal component analysis (e.g., Ramsay and Silverman 2002, 2005; Hall and Hosseini-Nasab 2006; also see Dauxois, Pousse, and Romain 1982 for distributional asymptotics). This procedure exploits the Karhunen–Loève theorem (e.g., Adler 1990), which allows for a representation of the process by a stochastic Fourier series with respect to the orthonormal eigenfunctions {ϕ (j) X }∞ j=1 of the operator RX, Xi(t) = ∞ j=1 λ (j) X ξijϕ (j) X (t), where {λ (j) X }∞ j=1 is the nonincreasing sequence of corresponding eigenvalues and {ξij} is an iid array of standard Gaussian random variables. Convergence of the series is in mean square, uniformly in t ∈ [0,1]. Thus, in a practical setting, the empirical covariance kernel may be used to “optimally” reduce infinite-dimensional inferential problems to multivariate ones. Letting RX stand for the empirical covariance kernel, RX(s,t) := 1 n1 n1 i=1(Xi(s) − X(s))(Xi(t) − X(t)) , we denote its eigenvalues (or principal scores) by {λk,n1 X }n1 k=1 and its eigenfunctions (or principal components) by {ϕk,n1 X }n1 k=1. The finite-dimensional reduction is then achieved by retaining a finite number of principal components { Xi − X,ϕk,n1 X }K k=1 in lieu of each Xi. These are zero mean and uncorrelated random variables, with corresponding sample variances λk,n1 X . Similarly, for the second sample, the analogous quantities are RY , RY , λ (j) Y , ϕ (j) Y (and their empirical “hat” counterparts). The dimension reduction afforded by the Karhunen– Loève expansion is the tool we will next employ to construct our test. 3.2 Second-Order Comparison of Gaussian Processes Let {Xi}n1 i=1 and {Yi}n2 i=1 constitute two iid random samples of Gaussian processes indexed by the interval [0,1] and taking values in R3 (or indeed Rd). As mentioned in the previous section, these are regarded as random elements of the Hilbert space L2[0,1] of square-integrable R3-valued functions (where integration is to be understood coordinate-wise). Assuming that the covariance operators RX and RY associated with the processes are continuous, we wish to test the hypothesis pair H0 : RX = RY, HA : RX = RY. (1) A natural first approach to developing a test for the hypothesis pair in Equation (1) is to attempt to extend tests developed for the finite-dimensional version of the problem, which was extensively studied. The majority of test statistics for the equality of covariance matrices of Gaussian vectors are based on the determinant, trace, or maximum/minimum eigenvalues of matrices such as: S1S2S−1 , S1S−1 2 , S2(S1 + S2)−1 (Roy 1953; Pillai 1955; Kiefer and Schwartz 1965; Giri 1968); here, S1 and S2 are the empirical covariance matrices corresponding to each sample, and S is the pooled empirical covariance matrix. Evidently, such tests cannot immediately be carried over to the case of Gaussian processes: inversion of an empirical covariance operator will be required, which transforms the construction of the test statistic into an ill-posed inverse problem. The operator Rn1 X (resp. 
Rn2 Y ) will be of rank at most n1 (resp. n2) as its image is the subspace spanned by {Xi}n1 i=1 (resp. {Yi}n2 i=1). Therefore, we cannot talk of its inverse, except if we restrict the operator on span{Xi}n1 i=1 (resp. span{Yi}n1 i=1), but the two spans will not coincide in general and the two empirical operators will not be diagonalized by the same basis. Furthermore, since the processes are assumed to be second order, the operators RX and RY are necessarily bounded (in fact compact), and it must be the case that λ (k) X ,λ (k) Y k→∞ −→ 0, the rate of convergence depending on the degree of smoothness of the Gaussian processes (the smoother the process, the faster the rate). Thus, for any finite n1 and n2, however large, a test statistic employing an “inverse” of RX composed with RY will be unstable to perturbations of the Y-data. In the infinite-dimesional case, we propose the use of a test statistic based on the norm of the difference of the two empirical covariance operators. Recall that for trace-class operators, one may define the Hilbert–Schmidt norm. Consider an integral operator R :f → 1 0 R(·,s)f(s)ds such that 1 0 1 0 trace{R(s,t) R(s,t)}dsdt < ∞. The Hilbert–Schmidt norm of the operator R is defined as R HS := 1 0 1 0 trace{R(s,t) R(s,t)}dsdt. Assuming that the covariance operators in question are Hilbert– Schmidt, a test may be based on the squared Hilbert–Schmidt distance RN X − RN Y 2 HS of their empirical counterparts. Of course, the sampling distribution of this latter quantity will depend on the unknown covariance operators even asymptotically. To be able to “normalize” the test statistic, we employ a very useful property of the Hilbert–Schmidt norm: for any orthonormal system {ei}∞ i=1 of L2[0,1], we have R 2 HS = ∞ i=1 Rei 2 L2 . (2) Therefore, we may use a basis to obtain a countable expression for RN X − RN Y 2 HS. In practice, one will need to truncate a series such as the above to obtain an “optimal” finite-dimensional 674 Journal of the American Statistical Association, June 2010 reduction, that is, the choice of contrasts {ei} should be such that the truncated version of Equation (2) retains the bulk of the norm. For each of the two empirical operators, the optimal contrasts will coincide with their eigenfunctions, as dictated by the Karhunen–Loève expansion, but to use the relation in Equation (2) we need to use a common basis. As a compromise, we thus choose the eigenfunctions {ϕk,N XY } corresponding to the empirical covariance operator of the pooled sample of N = n1 +n2 curves and base our test on K k=1 (RN X − RN Y )ϕk,N XY 2 L2 , which by Parseval’s theorem, may be further approximated by K i=1 K j=1 (RN X − RN Y )ϕi,N XY ,ϕ j,N XY 2 . (3) With this quantity in mind, the following theorem, whose proof may be found in the Appendix, provides the basis for our test: Theorem 1. Let {Xn}n1 n=1 and {Yn}n2 n=1 be two collections of zero mean iid continuous Gaussian random functions indexed by the interval [0,1] and taking values in Rd, possessing covariance operators RX and RY with distinct eigenvalues. Let Rn1 X and Rn2 Y denote the empirical covariance operators based on {Xn}n1 n=1 and {Yn}n2 n=1. For N = n1 + n2, let RN XY denote the empirical covariance operator of the pooled collection, and {ϕk,N XY }N k=1 the corresponding eigenfunctions. Finally, let λk,n1 X,XY , λk,n2 Y,XY denote the empirical variance of the kth Fourier coefficient of {Xn}n1 n=1 and {Yn}n2 n=1, respectively, with respect to the eigenfunctions {ϕn,K XY }N n=1. 
Assuming that E[ X1 4 L2 ] < ∞, E[ Y1 4 L2 ] < ∞, and n1/N → θ ∈ (0,1) as N = n1 + n2 → ∞, it follows that, under the hypothesis H0 :RX = RY , TN(K) := n1n2 2N K i=1 K j=1 (Rn1 X − Rn2 Y )ϕi,N XY ,ϕ j,N XY 2 n1 N λi,n1 X,XY + n2 N λi,n2 Y,XY × n1 N λ j,n1 X,XY + n2 N λ j,n2 Y,XY w −→ χ2 K(K+1)/2 as N → ∞, for any finite K ≤ rank(RX) = rank(RY) ≤ ∞. Under the alternative hypothesis, the test statistic will converge to a sum of K(K + 1)/2 dependent shifted chi square random variables. Our proposed test procedure is thus to reject the hypothesis H0 :RX = RY at level α, whenever the test statistic exceeds the corresponding critical value, TN(K) ≥ χ2 K(K+1)/2,1−α. Of course, conducting the test requires the selection of a spectral truncation level, K. This choice must be made judiciously, as it has a direct bearing on the power of the test: 1. Conservative choices of K [i.e., choosing K rank(RX) ∧ rank(RY)] may result in Type II error due to differences in the higher frequency covariance structure, especially in situations where the two covariances share the same eigenfunctions, but have different eigenvalues at higher frequencies. 2. Greedy choices of K [choosing K > rank(RX) ∧ rank(RY)] will inflate the variance of the test statistic since an element of ill-posedness will enter when dividing with the empirical eigenvalues of higher order terms. In the latter sense, the test can also be thought of as an L2regularized test. These aspects are further considered quantitatively in Section 4. It should be noted that the problem of choosing K is directly analogous to the choice of a cutoff point in principal component analysis and the choice of a bandwidth in a nonparametric problem; thus we deal with it using empirical eigenvalue scree-plots as well as penalized goodness-of-fit criteria (see Sections 3.3 and 5.1). A more user-friendly expression for the test statistic T can be given if we introduce some additional notation. Let λ ij,N X,XY := Rn1 X ϕi,N XY ,ϕ j,N XY = n−1 1 i Xi − X,ϕi,N XY Xi − X,ϕ j,N XY be the empirical covariance of the ith and jth Fourier coefficients of the X-curves, with respect to the basis {ϕk,N XY }k≥1 (resp. λ ij,N Y,XY ). For simplicity, we also write λ jj,N X,XY ≡ λ j,N X,XY (resp. λ jj,N Y,XY ). Then we may re-express the test statistic as TN(K) := n1n2 2N K i=1 K j=1 ((λ ij,N X,XY − λ ij,N Y,XY)2 ) n1 N λi,n1 X,XY + n2 N λi,n2 Y,XY × n1 N λ j,n1 X,XY + n2 N λ j,n2 Y,XY . If for some reason, we a priori know the eigenfunctions of RX and RY to be equal, then the following test statistic may be used instead of T: T1 = K k=1 n1n2 N (λk,N X,XY − λk,N Y,XY)2 2((n1/N)λk,N X + (n2/N)λk,N Y )2 . The motivation for this statistic is that when the eigenfunctions coincide, then K k=1 (Rn1 X − Rn2 Y )ϕk,N XY 2 L2 ≈ K k=1 (λk,N X,XY − λk,N Y,XY)2 . It follows as an immediate corollary to Theorem 1 that, under H0, the statistic T1 is asymptotically chi-square distributed with K degrees of freedom [assuming n1/N → θ ∈ (0,1)]. One may also wish to consider modified versions of the test statistics T and T1, obtained via suitable variance-stabilizing transformations. 
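Before turning to the variance-stabilised versions discussed next, the following minimal sketch (Python with NumPy and SciPy; not the authors' implementation, with hypothetical names and one reasonable choice of centring and pooling conventions) indicates how the truncated statistic TN(K) and its chi-square reference distribution might be computed for scalar curves observed on a common regular grid.

```python
import numpy as np
from scipy import stats

def covariance_test(X, Y, K, dt):
    """Spectrally truncated two-sample test for equality of covariance operators.

    X, Y: arrays of shape (n1, m) and (n2, m) of curves on a common grid with
    spacing dt; K: truncation level. Returns T_N(K), the degrees of freedom
    K(K+1)/2 and the corresponding chi-square p-value.
    """
    n1, n2 = X.shape[0], Y.shape[0]
    N = n1 + n2
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    pooled = np.vstack([Xc, Yc])                 # pooled sample (group-wise centring here)
    _, eigvec = np.linalg.eigh(pooled.T @ pooled / N * dt)
    phi = eigvec[:, ::-1][:, :K] / np.sqrt(dt)   # leading K pooled eigenfunctions, unit L2 norm
    a = Xc @ phi * dt                            # Fourier coefficients of the X curves
    b = Yc @ phi * dt                            # Fourier coefficients of the Y curves
    LX = a.T @ a / n1                            # empirical covariances of the coefficients
    LY = b.T @ b / n2
    lam = n1 / N * np.diag(LX) + n2 / N * np.diag(LY)
    T = n1 * n2 / (2 * N) * np.sum((LX - LY) ** 2 / np.outer(lam, lam))
    df = K * (K + 1) // 2
    return T, df, stats.chi2.sf(T, df)

# toy usage: two Gaussian samples generated with the same covariance structure
rng = np.random.default_rng(2)
m, dt = 100, 1.0 / 100
t = np.arange(m) * dt
phi1, phi2 = np.sqrt(2) * np.sin(2 * np.pi * t), np.sqrt(2) * np.cos(2 * np.pi * t)
X = np.outer(2.0 * rng.standard_normal(40), phi1) + np.outer(rng.standard_normal(40), phi2)
Y = np.outer(2.0 * rng.standard_normal(40), phi1) + np.outer(rng.standard_normal(40), phi2)
print(covariance_test(X, Y, K=2, dt=dt))         # should usually not reject
```

The denominator uses the empirical variances of the Fourier coefficients with respect to the pooled eigenbasis, so the statistic is compared with the chi-square distribution with K(K+1)/2 degrees of freedom, as in the theorem above.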
In the case of the test statistic T, we apply a log transformation to the diagonal terms of the sum in Equation (3), and Fisher’s z-transformation to the off-diagonal terms to obtain a Panaretos, Kraus, and Maddocks: Second-Order Functional Comparisons and DNA Geometry 675 test statistic with the same asymptotic distribution as T (an immediate corollary to Theorem 1), T∗ = K k=1 n1n2 N (logλk,N X,XY − logλk,N Y,XY)2 2 + 1≤j 0. The asymptotic approximations for the distributions of the test statistics investigated hold for Gaussian processes. Departures from this assumption will affect the limiting law of the statistics. In simulations we observed that the test derived under the Gaussian assumption used in a non-Gaussian case becomes conservative when the scores have lighter tails than the normal distribution and anticonservative in the opposite case. Our tests are based on sums of squares of components which are asymptotically normal independent variables. When the data are not Gaussian, these components have asymptotically a multivariate normal distribution with unknown covariance structure. The limiting covariance matrix can be estimated and a chi-square test statistic can be based on the corresponding quadratic form (see also Horváth, Hušková, and Kokoszka 2010 for a similar approach in a different context). Some simulations showed that the convergence to the limiting distribution might be slow and one has to use only a small value of K, especially for the offdiagonal test. Of course, testing whether a process is Gaussian is a research project in itself, but informal qq-plots constructed for the Karhunen–Loève coefficients of the minicircle data did not reveal any noteworthy departures from normality. For the benefit of the doubt, however, we also employed permutation tests based on our test statistics, with similar results—but with slightly more inflated p-values (Panaretos and Kraus 2009). APPENDIX Proof of Theorem 1 Introduce the notation Xif := Xi,f Xi and Yif = Yi,f Yi, so that Rn X = n−1 i Xi and Rn Y = n−1 i Yi. These are viewed as random elements of the Hilbert space of Hilbert–Schmidt operators acting on L2[0,1]. Under the hypothesis H0 :RX = RY , the collections {Xi} and {Yi} are iid random operators with mean RX = RY and common covariance S := E[Xi ⊗Xi]−RX ⊗RX = E[Yi ⊗Yi]−RY ⊗RY , where ⊗ denotes the tensor product, (u ⊗ v)w = v,w Hu for any elements v,w,u of a Hilbert space (H, ·,· H). In addition, our moment assumptions imply that E Xi 2 HS < ∞. We may, therefore, apply the Hilbert space central limit theorem (e.g., Bosq 2000, theorem 2.7) to conclude that √ n1(R n1 X − RX) w −→ Z1 and √ n2(R n2 Y − RY) w −→ Z2 as n1,n2 → ∞, where Z1 and Z2 are independent Gaussian random operators with mean 0 and covariance operator S. Now, given i,j, consider the sequence of random variables W i,j N = n1n2/N(R n1 X − R n2 Y )sgn[ ϕi,N XY ,ϕi ]ϕi,N XY , sgn[ ϕ j,N XY ,ϕj ]ϕ j,N XY . On the one hand, the strong law in Hilbert space implies that RN XY − RX HS a.s. −→ 0 under the hypothesis H0. Consequently, convergence also occurs with probability 1 in the strong operator topology, so that by Bosq (2000, lemma 4.3) sgn[ ϕk,N XY ,ϕk ]ϕk,N XY − ϕk L2 a.s. −→ 0 ∀k ≥ 1. (A.1) On the other hand, as N → ∞ with n1/N → θ ∈ (0,1) we will have n2 N √ n1R n1 X − n1 N √ n2R n2 Y w −→ √ 1 − θZ1 − √ θZ2 = Z , (A.2) with Z a zero-mean Gaussian random operator with covariance S. 
Combining Equations (A.1) and (A.2) with the Hilbert space Slutsky lemma establishes that, for all i,j ∈ {1,...,K}, W i,j N w −→ Z ϕi,ϕj . For the next step, we note that Z , being a Gaussian process itself, also admits a Karhunen–Loève decomposition, with respect to the eigenfunctions of S. These eigenfunctions can be retrieved directly from the definition of S and the Karhunen–Loève expansion of the typical X process, X = i √ λiξiϕi. Defining the operator ijf := ϕi,f ϕj, we immediately see that X = i,j λiλjξiξj ij and RX = j λj jj. Hence, upon recalling that the {ξi} are an iid standard Gaussian array we may write S = E[X ⊗ X ] − RX ⊗ RX = i,j,q,p λiλjλpλqE[ξiξjξpξq] ij ⊗ qp − i,j λiλj ii ⊗ jj = i=j λiλj ii ⊗ jj + i=j λiλj ij ⊗ ji + i=j λiλj ij ⊗ ij + i 3λ2 i ii ⊗ ii − i λ2 i ii ⊗ ii − i=j λiλj ii ⊗ jj = 2 i λ2 i ii ⊗ ii + i=j λiλj( ij ⊗ ji + ij ⊗ ij), since E[ξiξjξpξq] is 1 whenever pairs of indices are equal but not all indices are totally coincident, 3 when all indices are equal, and zero Panaretos, Kraus, and Maddocks: Second-Order Functional Comparisons and DNA Geometry 681 otherwise. Regrouping the summation by adding the terms that are symmetric with respect to their indices, we further obtain S = 2 i λ2 i ii ⊗ ii + i n. It follows that Z ϕk,ϕk iid ∼ N(0,2λ2 k) independently of Z ϕm, ϕn iid ∼ N(0,λmλn), m = n. Consequently, we have 1 2 Z ϕk,ϕk 2 λ2 k iid ∼ χ2 1 , independently of 1 2 Z ϕm,ϕn 2 + Z ϕn,ϕm 2 λmλn = Z ϕm,ϕn 2 λmλn ∼ χ2 1 . The continuous mapping theorem now implies that 1 2 (W ij N)2 + (W ji N)2 λiλj = n1n2 2N K i=1 K j=1 (R n1 X − R n2 Y )ϕi,N XY ,ϕ j,N XY 2 λiλj w −→ χ2 K(K+1)/2. To complete the proof, we note that n1 N λ k,n1 X,XY + n2 N λ k,n2 Y,XY p −→ θλk + (1 − θ)λk = λk ∀k ∈ {1,...,K}, so that the result follows from the application of Slutsky’s lemma. SUPPLEMENTAL MATERIALS Additional plots and tables and detailed study: Additional plots and tables are available in a supplementary file. In addition, the supplementary file contains a more detailed study of the problem of comparing the complete spectrum, extending the discussion in the last part of Section 3.2. (Supplement.pdf) [Received April 2009. Revised December 2009.] REFERENCES Adler, R. J. (1990), An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. Lecture Notes and Monographs Series, Hayward: Institute of Mathematical Statistics. [673] Amzallag, A., Vaillant, C., Jacob, M., Unser, M., Bednar, M., Kahn, J. D., Dubochet, J., Stasiak, A., and Maddocks, J. H. (2006), “3D Reconstruction and Comparison of Shapes of DNA Minicircles Observed by Cryo-Electron Microscopy,” Nucleic Acids Research, 34 (18), e125. [670,678] Arnold, V. I. (1989), Mathematical Methods of Classical Mechanics, New York: Springer. [671] Benko, M., Härdle, W., and Kneip, A. (2009), “Common Functional Principal Components,” The Annals of Statistics, 37, 1–34. [670] Berkes, I., Gabrys, R., Horváth, L., and Kokoszka, P. (2009), “Detecting Changes in the Mean of Functional Observations,” Journal of the Royal Statistical Society, Ser. B, 71, 927–946. [670,678] Bosq, D. (2000), Linear Processes in Function Spaces, New York: Springer. [675,680] Cardot, H., Ferraty, F., Mas, A., and Sarda, P. (2003), “Testing Hypotheses in the Functional Linear Model,” Scandinavian Journal of Statistics, 30 (1), 241–255. [670] Cuevas, A., Febrero, M., and Fraiman, R. (2004), “An ANOVA Test for Functional Data,” Computational Statistics and Data Analysis, 47, 111–122. [670] Dauxois, J., Pousse, A., and Romain, Y. 
(1982), “Asymptotic Theory for the Principal Component Analysis of a Random Vector Function: Some Applications to Statistical Inference,” Journal of Multivariate Analysis, 12, 136– 154. [673,678] Fan, J., and Lin, S.-K. (1998), “Tests of Significance When the Data Are Curves,” Journal of the American Statistical Association, 93, 1007–1021. [670] Ferraty, F., and Vieu, P. (2006), Nonparametric Functional Data Analysis, New York: Springer. [672] Gabrys, R., and Kokoszka, P. (2007), “Portmanteau Test of Independence for Functional Observations,” Journal of the American Statistical Association, 102, 1338–1348. [670] Gasser, T., and Kneip, A. (1995), “Searching for Structure in Curve Samples,” Journal of the American Statistical Association, 90, 1179–1188. [671] Gervini, D. (2008), “Robust Functional Estimation Using the Median and Spherical Principal Components,” Biometrika, 95 (3), 587–600. [672] Gervini, D., and Gasser, T. (2004), “Self-Modelling Warping Functions,” Journal of the Royal Statistical Society, Ser. B, 66 (4), 959–971. [671] Giri, N. (1968), “On Tests of the Equality of Two Covariance Matrices,” The Annals of Mathematical Statistics, 39, 275–277. [673] Grenander, U. (1981), Abstract Inference, New York: Wiley. [670,678] Hagerman, P. J. (1988), “Flexibility of DNA,” Annual Review Biophysics and Biophysical Chemistry, 17, 265–286. [670] Hall, P., and Hosseini-Nassab, M. (2006), “On Properties of Functional Principal Components Analysis,” Journal of the Royal Statistical Society, Ser. B, 68 (1), 109–126. [673] 682 Journal of the American Statistical Association, June 2010 Hall, P., and Van Keilegom, I. (2007), “Two Sample Tests in Functional Data Analysis Starting From Discrete Data,” Statistica Sinica, 17, 1511–1531. [670] Horváth, L., Hušková, M., and Kokoszka, P. (2010), “Testing the Stability of the Functional Autoregressive Process,” Journal of Multivariate Analysis, 101 (2), 352–367. [670,680] Jacob, M., Blu, T., Vaillaint, C., Maddocks, J. H., and Unser, M. (2006), “3-D Shape Estimation of DNA Molecules From Stereo Cryo-Electron Micrographs Using a Projection Steerable Snake,” IEEE Transations on Image Processing, 15 (1), 214–227. [671] Kiefer, J., and Schwartz, R. (1965), “Admissible Bayes Character of T2-Test, R2-Test, and Other Fully Invariant Tests for Classical Multivariate Normal Problems,” The Annals of Mathematical Statistics, 36, 747–770. [673] Ledwina, T. (1994), “Data-Driven Version of Neyman’s Smooth Test of Fit,” Journal of the American Statistical Association, 89, 1000–1005. [677] Panaretos, V. M., and Kraus, D. (2009), “Second Order Comparison of Gaussian Processes With Applications to DNA Shape Analysis,” Technical Report 01-09, Chair of Mathematical Statistics, EPFL. [680] Pillai, K. C. S. (1955), “Some New Test Criteria in Multivariate Analysis,” The Annals of Mathematical Statistics, 26, 117–121. [673] Ramsay, J. O., and Silverman, B. W. (2002), Applied Functional Data Analysis: Methods and Case Studies, New York: Springer. [673] (2005), Functional Data Analysis, New York: Springer. [672,673,675] Rice, J., and Silverman, B. W. (1991), “Estimating the Mean and Covariance Structure Nonparametrically When the Data Are Curves,” Journal of the Royal Statistical Society, Ser. B, 53, 233–243. [676] Roy, S. N. (1953), “On a Heuristic Method of Test Construction and Its Use in Multivariate Analysis,” The Annals of Mathematical Statistics, 24, 220– 238. [673] Shen, Q., and Faraway, J. 
(2004), "An F Test for Linear Models With Functional Responses," Statistica Sinica, 14, 1239–1257. [670]
Tang, R., and Müller, H. G. (2008), "Pairwise Curve Synchronization for Functional Data," Biometrika, 95 (4), 875–889. [671]
Tolstorukov, M. Y., Virnik, K. M., Adhya, S., and Zhurkin, V. B. (2005), "A-Tract Clusters May Facilitate DNA Packaging in Bacterial Nucleoid," Nucleic Acids Research, 33 (12), 3907–3918. [670]
Vilar, J. M. G., and Leibler, S. (2003), "DNA Looping and Physical Constraints on Transcription Regulation," Journal of Molecular Biology, 331 (5), 981–989. [670]
Yao, F., Müller, H. G., and Wang, J. L. (2005a), "Functional Data Analysis of Sparse Longitudinal Data," Journal of the American Statistical Association, 100, 577–590. [676]
(2005b), "Functional Linear Regression Analysis for Longitudinal Data," The Annals of Statistics, 33, 2873–2903. [676]

Supplemental File: Second-Order Comparison of Gaussian Random Functions and the Geometry of DNA Minicircles

This supplementary note contains additional plots and tables in Section 1. In addition, Section 2 contains a more detailed study of the problem of comparing the complete spectrum, extending the discussion in the last part of Section 3.2 in the main body of the paper.

1 Supplementary Figures and Tables

This section contains figures and a table not presented in the main body of the paper. The first two figures contain plots of the projected aligned curves onto their principal axes of inertia, including their superimposition. The third figure contains scree plots with respect to the mixed eigenbasis for the two groups separately, as well as jointly. The last figure depicts the Normal QQ plots of the Karhunen–Loève residuals, as described in the discussion section of the paper. Finally, a complete table containing the results of the simulations for level and power corresponding to Section 4 is also given. In addition to the main test statistic proposed in the paper, the complete table also presents simulations for the diagonal form of the statistic (which compares only the eigenvalues). It is observed that when the difference lies only in the eigenvalues, this test statistic performs more powerfully, as would be expected. However, in the cases where differences also lie in the eigenfunctions, it is outperformed by the full version of the test statistic.

Figure 1: Projection of DNA curves on the first principal plane (panels: TATA, all, CAP; axes PAI2 versus PAI3). Five removed outlying observations plotted in green. Mean curves (yellow and cyan) computed without outlying observations.

Figure 2: Coordinates of DNA curves on the principal axes of inertia (panels: Principal Axes 1–3 against arclength, for TATA, all and CAP).
Five removed outlying observations plotted in green. Mean curves (yellow and cyan) computed without outlying observations.

Figure 3: Empirical variances (scree plot), proportions and cumulative proportions of variance explained by components (panels for PAI3, PAI2, PAI1 and PAI2,3) for the TATA (blue lines with circles) and CAP (red with diamonds) group and for both groups together (black with squares).

Figure 4: QQ plots corresponding to the centred Fourier coefficients when projecting onto the first four empirical eigenfunctions for each sample of curves, respectively. The exact distribution of these quantities will not be Gaussian, even if the processes are Gaussian. However, asymptotically, their distribution will be Gaussian. There do not appear to be systematic deviations, except for the plot corresponding to the third Fourier coefficient in the TATA group, which seems to suggest lighter upper tails as compared to the Gaussian.

Table 1: Empirical rejection probabilities at the nominal level 5 %, sample size n1 = n2 = 50, number of replications 5000 for A, 1000 for B–I. Here, uX = (vX, wX) (resp. uY) and K∗ is the automatic truncation choice given by the penalised fit criterion.
   Parameters                           Test   K = 1   K = 2   K = 3   K = 4     K∗
A  uX = (12, 7, 0.5, 9, 5, 0.3)         T      0.045   0.049   0.044   0.044   0.047
   uY = (12, 7, 0.5, 9, 5, 0.3)         T∗     0.051   0.056   0.057   0.056   0.059
                                        T1     0.045   0.046   0.045   0.047   0.047
                                        T∗1    0.051   0.054   0.056   0.061   0.061
B  uX = (14, 7, 0.5, 6, 5, 0.3)         T      0.422   0.264   0.185   0.150   0.148
   uY = (8, 7, 0.5, 6, 5, 0.3)          T∗     0.443   0.315   0.223   0.174   0.175
                                        T1     0.422   0.317   0.265   0.219   0.222
                                        T∗1    0.443   0.350   0.306   0.267   0.267
C  uX = (15, 10, 0.5, 4, 3, 0.3)        T      0.186   0.331   0.218   0.169   0.167
   uY = (11, 6, 0.5, 4, 3, 0.3)         T∗     0.201   0.366   0.269   0.207   0.208
                                        T1     0.186   0.380   0.312   0.279   0.273
                                        T∗1    0.201   0.420   0.358   0.317   0.314
D  uX = (12, 7, 0.5, 9, 3, 0.3)         T      0.040   0.204   0.836   0.973   0.962
   uY = (12, 7, 0.5, 2, 5, 0.3)         T∗     0.047   0.221   0.848   0.984   0.980
                                        T1     0.040   0.202   0.766   0.803   0.799
                                        T∗1    0.047   0.217   0.783   0.822   0.820
E  uX = (12, 7, 0.5, 9, 3, 0.3)         T      0.047   0.246   0.644   0.964   0.962
   uY = (12, 7, 0.5, 3, 9, 0.3)         T∗     0.055   0.267   0.686   0.976   0.975
                                        T1     0.047   0.227   0.477   0.597   0.594
                                        T∗1    0.055   0.250   0.509   0.620   0.617
F  uX = uY = (12, 7, 4, 0.5, 0.3, 0.1)  T      0.257   0.693   0.909   1.000   1.000
   δX = (0.15, 0.15, 0.15)              T∗     0.273   0.706   0.916   1.000   1.000
                                        T1     0.257   0.474   0.521   0.567   0.637
                                        T∗1    0.273   0.496   0.544   0.594   0.655
G  uX = (12, 7, 0.5, 8, 6, 0.3)         T      0.042   0.040   0.054   1.000   1.000
   uY = (12, 7, 0.5, 8, 0, 0.3)         T∗     0.047   0.048   0.068   1.000   1.000
                                        T1     0.042   0.047   0.051   1.000   1.000
                                        T∗1    0.047   0.061   0.062   1.000   1.000
H  uX = (12, 7, 0.5, 9, 5, 0.3)         T      0.044   0.140   0.500   1.000   1.000
   uY = (12, 7, 0.5, 0, 5, 0.3)         T∗     0.049   0.154   0.520   1.000   1.000
                                        T1     0.044   0.139   0.478   0.992   0.992
                                        T∗1    0.049   0.155   0.497   0.993   0.993
I  Brownian motion versus               T      0.719   0.608   0.483   0.377   0.493
   Ornstein–Uhlenbeck process           T∗     0.731   0.644   0.532   0.443   0.546
                                        T1     0.719   0.627   0.547   0.476   0.551
                                        T∗1    0.731   0.666   0.596   0.542   0.595

2 Comparing the Full Spectrum

The test procedure developed in the paper employs an optimal finite dimensional reduction in order to regularise the problem of testing. This is motivated by a Parseval decomposition of the Hilbert–Schmidt distance between the two operators,

‖R_X − R_Y‖²_HS = Σ_{k=1}^{K} ‖(R_X − R_Y) φ_k^{XY}‖²_{L²} + ε,

where ε can be made arbitrarily small by appropriate choice of K. By making such a choice, the statistic will be (eventually) able to detect departures from the null hypothesis unless one operator is contained within a ball of small radius centred at the other operator; in this latter case, the test will still be able to detect the difference (eventually), except if this small difference lies completely at the high frequency end of the spectrum (in which case, for all practical purposes, the difference is irrelevant). We are willing to tolerate this small level of "bias", in order to control the overall type II error of the problem. Comparison of the higher order terms of the operator spectrum on the basis of a finite sample is an ill-defined estimation problem: the fast decay of the spectrum means that we are attempting to compare extremely small quantities that have variance roughly proportional to their magnitude. In addition, the estimators of higher order eigenfunctions will be characterised by very large integrated mean squared errors (available bounds grow for fixed N depending inversely on the rate of decay of the spectrum). Therefore, by trying to increase K in order to eliminate the small type II error introduced by the truncation, we are in effect causing an overall blow-up of the type II error.
If one nevertheless wishes to compare even the finest differences in the spectrum, then one needs to let K grow to infinity along with N, K = KN and modify the test statistic so as to obtain a Gaussian limit. Regularisation now manifests itself by the imposition of an allowed rate of growth of KN . That is, a rate of growth of K relative to N that does not 7 allow overwhelming instabilities due to the growing K. As one might expect, this growth will depend inversely on the rate of decay of the true eigenvalues (a lot of data is required to compare the finest details of the two procsses). Inevitably, in fact, this rate will be rather slow due to the following: (a) Although the truncation level will grow as KN , the number of terms being compared is K2 N . (b) While these K2 summation terms do become independent as N grows (allowing for a CLT phenomenon) no mixing concept applies. In effect, this means that one has to look at the convergence in distribution to independence of a random vector of increasing dimension (= K2 N ). For any fixed dimension, the weak convergence will be at a rate of N−1/2 . Therefore, if one wishes to use Lp norms in order to use the Hilbert structure of the problem, KN must grow slow enough to allow the N−1/2 rate to compensate for the K2 N rate of increase in dimension. (c) This required “global convergence” to independence is regulated by the convergence of the empirical eigenfunctions to the true ones; this in turn depends on the spacings between the true eigenvalues: the rate of convergence of the Kth empirical eigenfunction behaves like N−1/2 λ−1 K . Therefore, when we let K grow, it has to be at rate slow enough, to allow N−1/2 to annihilate the blow-up of the inverse eigenvalues. The above heuristics are made precise in the proof of the next theorem, which provides a sufficient regularisation rate for asymptotically comparing the whole spectrum of infinite rank processes. Theorem 1. Let {Xn}n1 n=1 and {Yn}n2 n=1 be two collections of zero mean iid continuous Gaussian random functions indexed by the interval [0, 1] and taking values in Rd , possessing covariance operators RX and RY . Suppose that both operators are of infinite rank and have distinct eigenvalues. Let Rn1 X and Rn2 Y denote the empirical covariance operators based on 8 {Xn}n1 n=1 and {Yn}n2 n=1. For N = n1 + n2, let RN XY denote the empirical covariance operator of the pooled collection, and { ˆϕk,N XY }N k=1 the corresponding eigenfunctions. Finally, let ˆλk,n1 X,XY , ˆλk,n2 Y,XY denote the empirical variance of the kth Fourier coefficient of {Xn}n1 n=1 and {Yn}n2 n=1, respectively, with respect to the eigenfunctions { ˆϕn,K XY }N n=1. Assuming that E[ X1 4 L2 ] < ∞, E[ Y1 4 L2 ] < ∞, and n1/N → θ ∈ (0, 1) as N = n1 + n2 → ∞, it follows that, under the hypothesis H0 : RX = RY , SN := n1n2 2N KN (KN + 1)/2 KN i=1 KN j=1 (Rn1 X − Rn2 Y ) ˇϕi,N XY , ˇϕj,N XY 2 − KN (KN + 1) 2 w −→ N(0, 1), as N → ∞, for any KN ↑ ∞ such that K7 N λ −3/2 3KN /2 = o( √ N), where ˇϕk,N XY = ˆϕk,N XY n1 N ˆλk,n1 X,XY + n2 N ˆλk,n2 Y,XY . Proof of Theorem 2. Let {ZNk} denote the triangular array of random variables defined as ZNk := 1 KN (KN + 1)/2 n1n2 N (Rn1 X − Rn2 Y ) ˇϕ i(k),N XY , ˇϕ j(k),N XY 2 − 1 , i(k) = j(k) and ZNk := 1 KN (KN + 1)/2 n1n2 2N (Rn1 X − Rn2 Y ) ˇϕ i(k),N XY , ˇϕ i(k),N XY 2 − 1 , otherwise, where (i(k), j(k)) is the the kth element of the index array {(i, j) : i ≤ j ≤ KN }, when enumerating row-wise. Clearly, for κN = KN (KN + 1)/2, SN = κN k=1 ZNk. 
9 Write ZN := (n1n2/N)1/2 (Rn1 X − Rn2 Y ) and define ZNk := n1n2 N (Rn1 X − Rn2 Y )sgn[ ˇϕ i(k),N XY , ˇϕi(k) ] ˇϕ i(k),N XY , sgn[ ˇϕ j(k),N XY , ˇϕj(k) ] ˇϕ j(k),N XY , i(k) = j(k) and ZNk := n1n2 2N (Rn1 X − Rn2 Y )sgn[ ˇϕ i(k),N XY , ˇϕi(k) ] ˇϕ i(k),N XY , sgn[ ˇϕ i(k),N XY , ˇϕi(k) ] ˇϕ i(k),N XY , otherwise, where we use the notation ˇϕk := λ −1 2 k ϕk. The corresponding natural filtration is denoted by FN,k := σ(ZNm; 1 ≤ m ≤ k), and notice that {ZNk} is also adapted to the filtration {FN,k}. Finally, we will write ZNj := (ZN1, . . . , ZNj) (resp. ZNj). We will show that (A) κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 P −→ 0. (B) κN k=1 Var ZNk1{|ZNk|≤1}|FN,k−1 P −→ 1. (C) κN k=1 P[|ZNk| > |FN,k−1] P −→ 0, ∀ > 0. The conclusion will then follow from an “almost-martingale” central limit theorem for triangular arrays, Shorack (5, Thm. 12.2). Fix some N, let d = κN , and let ζ ∼ Nd(0, I). Letting d∞ denote the Kolmogorov metric, we obtain d∞ ZNd, ζ ≤ d∞ ZNd, n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 + d∞ n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 , ζ First we concentrate on the second term of the right hand side. From the proof of Theorem 1 and P´olya’s theorem we know that this term converges to zero. In fact, recalling that Rn1 X = n−1 1 ni i=1 Xi (resp. Rn2 Y ) and that the ϕk are the eigenfunctions of the common covariance operator, the convergence can be seen to be due to the standard multidimensional 10 central limit theorem. We therefore have the following Berry-Esseen upper bound (e.g. DasGupta (2, Cor. 11.1)), d∞ n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 , ζ ≤ Cd 1 4 √ N . Turning our attention to the first term in our triangle inequality, and letting νi(k) := sgn[ ˇϕ i(k),N XY , ˇϕi(k) ], we note that E ZNd − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 1 = = d k=1 E ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) where, for every 1 ≤ k ≤ d we have ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) = ZN νi(k) ˇϕ i(k),N XY , νj(k) ˇϕ j(k),N XY − ZN ˇϕi(k), ˇϕj(k) = ZN νi(k) ˇϕ i(k),N XY , νj(k) ˇϕ j(k),N XY − ZN νi(k) ˇϕ i(k),N XY , ˇϕj(k) + ZN νi(k) ˇϕ i(k),N XY , ˇϕj(k) − ZN ˇϕi(k), ˇϕj(k) = ZN νi(k) ˇϕ i(k),N XY , νj(k) ˇϕ j(k),N XY − ˇϕj(k) + ZN νi(k) ˇϕ i(k),N XY − ˇϕi(k) , ˇϕj(k) = ZN νi(k) ˇϕ i(k),N XY , νj(k) ˇϕ j(k),N XY − ˇϕj(k) + ZN ˇϕj(k), νi(k) ˇϕ i(k),N XY − ˇϕi(k) ≤ ZN νi(k) ˇϕ i(k),N XY L2 νj(k) ˇϕ j(k),N XY − ˇϕj(k) L2 + ZN ˇϕj(k) L2 νi(k) ˇϕ i(k),N XY − ˇϕi(k) L2 ≤ ZN HS νi(k) ˇϕ i(k),N XY L2 νj(k) ˇϕ j(k),N XY − ˇϕj(k) L2 + ZN HS ˇϕj(k) L2 νi(k) ˇϕ i(k),N XY − ˇϕi(k) L2 = ZN HS νj(k) ˇϕ j(k),N XY − ˇϕj(k) L2 + νi(k) ˇϕ i(k),N XY − ˇϕi(k) L2 Here we have used the Cauchy-Schwartz inequality and the fact that ZN is a bounded 11 operator. By the triangle inequality we now obtain ZN HS νj(k) ˇϕ j(k),N XY − ˇϕj(k) L2 + νi(k) ˇϕ i(k),N XY − ˇϕi(k) L2 ≤ ZN HS νj(k) ˇϕ j(k),N XY − νj(k)λ −1/2 j(k) ˆϕ j(k),N XY L2 + νj(k)λ −1/2 j(k) ˆϕ j(k),N XY − ˇϕj(k) L2 + νi(k) ˇϕ i(k),N XY − νi(k)λ −1/2 i(k) ˆϕ i(k),N XY L2 + νi(k)λ −1/2 i(k) ˆϕ i(k),N XY − ˇϕi(k) L2 = ZN HS (ˆλ −1/2 j(k) − λ −1/2 j(k) ) + λ −1/2 j(k) νj(k) ˆϕ j(k),N XY − ϕj(k) L2 +(ˆλ −1/2 i(k) − λ −1/2 i(k) ) + λ −1/2 i(k) νi(k) ˆϕ i(k),N XY − ϕi(k) L2 where we have used the simplified notation ˆλi(k) = n1 N ˆλ i(k),n1 X,XY + n2 N ˆλ i(k),n2 Y,XY . We now apply the inequality given in Bosq (1, Lem. 
4.3) and obtain ZN HS (ˆλ −1/2 j(k) − λ −1/2 j(k) ) + (ˆλ −1/2 i(k) − λ −1/2 i(k) ) + λ −1/2 j(k) νj(k) ˆϕ j(k),N XY − ϕj(k) L2 + λ −1/2 i(k) νi(k) ˆϕ i(k),N XY − ϕi(k) L2 ≤ ZN HS (ˆλ −1/2 j(k) − λ −1/2 j(k) ) + (ˆλ −1/2 i(k) − λ −1/2 i(k) ) λ −1/2 j(k) 2 √ 2 max (λj(k)−1 − λj(k))−1 , (λj(k) − λj(k)+1)−1 RN XY − RX HS + λ −1/2 i(k) 2 √ 2 max (λi(k)−1 − λi(k))−1 , (λi(k) − λi(k)+1)−1 RN XY − RX HS Recapitulating, we have obtained ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) ≤ ZN HS (ˆλ −1/2 j(k) − λ −1/2 j(k) ) + (ˆλ −1/2 i(k) − λ −1/2 i(k) ) λ −1/2 j(k) 2 √ 2 max (λj(k)−1 − λj(k))−1 , (λj(k) − λj(k)+1)−1 RN XY − RX HS 12 + λ −1/2 i(k) 2 √ 2 max (λi(k)−1 − λi(k))−1 , (λi(k) − λi(k)+1)−1 RN XY − RX HS Now we take expectations on both sides, expand the right hand side, and repeatedly apply the Cauchy-Schwartz inequality (with respect to the mean-square norm) to obtain E ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) ≤ E ZN 2 HS E(ˆλ −1/2 j(k) − λ −1/2 j(k) )2 + E ZN 2 HS E(ˆλ −1/2 i(k) − λ −1/2 i(k) )2 +λ −1/2 j(k) 2 √ 2 max (λj(k)−1 − λj(k))−1 , (λj(k) − λj(k)+1)−1 E ZN 2 HS E RN XY − RX 2 HS +λ −1/2 i(k) 2 √ 2 max (λi(k)−1 − λi(k))−1 , (λi(k) − λi(k)+1)−1 E ZN 2 HS E RN XY − RX 2 HS We note first that, by Minkowski’s inequality, E ZN 2 HS is bounded above for all N, by definition of the random operator ZN . Next, E(ˆλ−1 i(k) − λ−1 i(k))2 and E(ˆλ−1 i(k) − λ−1 i(k))2 are, asymptotically in N, of the order of O(λ −1/2 i(k) N−1/2 ) and so are also of the order of O(λ −1/2 i(d) N−1/2 ), when k ≤ d. This can be seen by applying the Delta method to the CLT given in Dauxois et. al (3, Prop. 8). Finally, E RN XY − RX 2 HS is asymptotically of the order of O(N−1/2 ) by the CLT in Hilbert Space (Bosq (1, Thm 2.7)). Now by definition of i(k) and j(k), we have that i(d)[i(d) + 1]/2 = j(d)[j(d) + 1]/2 = d, so that it holds that λi(k) = λ√ 8d+1−1 2 ≥ λ3 √ d 2 . Combining all the above, we arrive at E ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) = O λ −3/2 3 √ d/2 N−1/2 . so that E ZNd − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 1 = O λ −3/2 3 √ d/2 N−1/2 d . 13 Letting dW denote the L1-Wasserstein distance between two probability measures, we have (e.g. Gibbs & Su (4)), d∞(GN,d, HN,d) ≤ (1 + hN,d ∞) dW (GN,d, HN,d) ≤ (1 + hN,d ∞) E ZNd − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 1 = (1 + hN,d ∞)O λ −3/4 3 √ d/2 N−1/4 d1/2 . where HN,d is the distribution function of n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 , GN,d is the distribution function of ZNd, and hN,d is the density function of HN,d. But hN,d is the density of a difference of two independent random vectors, each of which is in turn the sum of n1 and n2 iid random vectors, respectively. Thus, letting h [1] d and h [2] d be the respective densities, and by symmetry, we have, hN,d ∞ = h [1] d,n1 ∗ . . . ∗ h [1] d,n1 n1 times ∗ h [2] d,n2 ∗ . . . ∗ h [2] d,n2 n2 times ∞ ≤ h [1] d,n1 ∗ . . . ∗ h [1] d,n1 n1 times 1 h [2] d,n2 ∗ . . . ∗ h [2] d,n2 n2 times ∞ = h [2] d,n2 ∗ . . . ∗ h [2] d,n2 n2 times ∞ Now it is immediate that h [2] d,n2 ∗ . . . ∗ h [2] d,n2 ∞ ≤ h[2] n2 ∗ . . . ∗ h[2] n2 ∞, where h [2] n2 is the marginal density of n1n2 2N ( 1 n2 X1) ˇϕi(1), ˇϕj(1) . But it must the case that h [2] n2 ∗ . . . ∗ h [2] n2 ∞ be bounded above, since n2 i=1 n1n2 2N ( 1 n2 Xi) ˇϕi(1), ˇϕj(1) is a sequence of variables with diffuse laws converging weakly to a non-degenerate Gaussian. We are thus in a position to conclude that d∞ ZNd, ζ = O λ −3/4 3 √ d/2 N−1/4 d1/2 . 
(1) 14 Now recall that, with probability one, E ZNk1{|ZNk|≤1}|FN,k−1 = +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN }FeZNk| eZN,k−1 (dx|ZN,k−1) where he have used standard notation for conditional distribution functions. It follows that, given ζ a standard Gaussian random variable, E ZNk1{|ZNk|≤1}|FN,k−1 − E 1 √ κN (ζ2 − 1)1{|ζ2−1|≤ √ κN } = +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN }FeZNk| eZN,k−1 (dx|ZN,k−1) − +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN }Fζ(dx) = +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN } F eZN,k−1 eZNk| eZN,k−1 − Fζ (dx) with the alternative notation F eZN,k−1 eZNk| eZN,k−1 (x) ≡ FeZNk| eZN,k−1 (x|ZN,k−1). From (1) we have that for ζ ∼ Nk(0, I), d∞(ZNk, ζ) = O λ −1/3 3 √ d/2 N−1/4 k1/2 , so by Lemma 1 (see below), given any z ∈ Rk−1 , sup x∈R Fz eZNk| eZN,k−1 (x) − Fζ(x) = O λ −3/4 3 √ d/2 N−1/4 k1/2 and so given z ∈ Rk−1 +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN } Fz eZNk| eZN,k−1 − Fζ (dx) = O λ −3/4 3 √ κN /2N−1/4 k1/2 κ 1/4 N . Consequently, for {ζk} an iid sequence of standard Gaussian variables, and for all ω ∈ Ω, κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 − E 1 √ κN (ζ2 k − 1)1{|ζk|≤ √ κN } = O   κ 7/4 N N1/4λ 3/4 3 √ κN /2   = O   K 7/2 N N1/4λ 3/4 3 √ κN /2   15 And, since K7 N λ −3/2 3 √ 2KN (KN +1) 2 ≤ K7 N λ −3/2 3KN 2 = o √ N , it follows from our assumptions that the quantity above converges to zero almost certainly. But, on the other hand, κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 ≤ κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 − E 1 √ κN (ζ2 k − 1)1{|ζk|≤ √ κN } + κN k=1 E 1 √ κN (ζ2 k − 1)1{|ζk|≤ √ κN } with the last term obviously converging to zero as N → ∞ so that condition (A) is fulfilled. We now turn our attention to condition (B). By definition: κN k=1 Var ZNk1{|ZNk|≤1}|FN,k−1 = κN k=1 E Z2 Nk1{|ZNk|≤1}|FN,k−1 − κN k=1 E2 ZNk1{|ZNk|≤1}|FN,k−1 That the second term converges to zero almost surely follows from our proof of condition (A). Hence, it suffices to concentrate on the first term. Following the same steps as with (A), we may write +∞ −∞ (x2 − 1) 2 2κN 1{|x2−1|≤ √ 2κN } Fz eZNk| eZN,k−1 − Fζ (dx) = O   K 3/2 N N1/4λ 3/4 3 √ κN /2   This in turn imples that, with probability one, κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 − E 1 √ κN (ζ2 k − 1)1{|ζk|≤ √ κN } N→∞ −→ 0. 16 Finally, we see that κN k=1 E Z2 Nk1{|ZNk|≤1}|FN,k−1 = κN k=1 E Z2 Nk1{|ZNk|≤1}|FN,k−1 − E 1 2κN (ζ2 k − 1)2 1{|ζk|≤ √ κN } + κN k=1 E 1 2κN (ζ2 k − 1)2 1{|ζk|≤ √ κN } with the last term clearly converging to 1 almost certainly. This establishes condition (B). Finally, we concentrate on condition (C). By definition, P[|ZNk| > |FN,k−1] = 1 − E [1 {|ZNk| < } |FN,k−1] = 1 + E 1 |ζ2 − 1| < √ κN − E [1 {|ZNk| < } |FN,k−1] −E 1 |ζ2 − 1| < √ κN = E 1 |ζ2 − 1| < √ κN − E [1 {|ZNk| < } |FN,k−1] + P[|ζ2 − 1| > √ κN ] It is clear from our analysis of (A) and (B) that κN k=1 E 1 |ζ2 − 1| < √ κN − E [1 {|ZNk| < } |FN,k−1] a.s. −→ 0. Finally, we have κN k=1 P[|ζ2 − 1| > √ κN ] = κN P[|ζ2 − 1| > √ κN ] = O  κN e−(1+ √ κN ) 1/2 1 + √ κN 1/4   N→∞ −→ 0 by the tail decay properties of the Gaussian distribution. This completes the proof. Lemma 1. Assume that Fn is a sequence of distribution functions on Rd converging weakly 17 to a standard Gaussian distribution function Φd , at a rate n in the Kolmogorov distance, sup x∈Rd |Fn(x) − Φd (x)| = O( n). Letting d = p + q, and given y ∈ Rq , we have sup x∈Rp |Fn(x|y) − Φq (x)| = O( n). Proof. By definition, and by our uniform bound, given any y ∈ Rq we have that sup x∈Rp |Fn(x|y)Fn(y) − Φp (x)Φq (y)| = sup x∈Rp |Fn(x, y) − Φd (x, y)| = O( n). 
Now divide across by Φq (y), and obtain sup x∈Rp Fn(x|y) Fn(y) Φq(y) − Φp (x) = O( n) (2) By assumption of the theorem, it must also be that |Fn(y) − Φq (y)| = O( n). In turn, this implies that Fn(y) Φq(y) − 1 = O( n), (3) for if this were not the case, for every α > 0 and M ≥ 1, there would exist and m ≥ M such that Fm(y) Φq(y) − 1 > α Φq(y) | m|, or equivalently, for every α > 0 and M ≥ 1, there would exist and m ≥ M such that |Fm(y) − Φq (y)| > α| m|, 18 which would contradict the fact that supu |Fn(u) − Φq (u)| ∈ O( n). Now conditions (2) and (3) allow us to complete the proof by applying the triangle inequality: d∞ (Fn(·|y), Φp) ≤ d∞ Fn(·|y), Fn(y) Φq(y) Fn(·|y) + d∞ Fn(y) Φq(y) Fn(·|y), Φp since d∞ Fn(·|y), Fn(y) Φq(y) Fn(·|y) = sup x∈Rp Fn(x|y) − Fn(y) Φq(y) Fn(x|y) = 1 − Fn(y) Φq(y) sup x∈Rp |Fn(x|y)| = 1 − Fn(y) Φq(y) = O( n) References [1] Bosq, D.(2000). Linear processes in function spaces. Springer. [2] DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer. [3] Dauxois, J. Pousse, A. & Romain, Y. (1982). Asymptotic theory for the principal component analysis of a random vector function: some applications to statistical inference. Journal of Multivariate Analysis, 12: 136–154. [4] Gibbs, A.L. & Su, F.E. (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3): 419–435. [5] Shorack, G. R. (2000). Probability for Statisticians. Springer. 19 B. Dispersion operators and resistant second-order functional data analysis By David Kraus and Victor M. Panaretos Biometrika, 99(4):813–832, 2012 DOI: 10.1093/biomet/ass037 52 Biometrika (2012), 99, 4, pp. 813–832 doi: 10.1093/biomet/ass037 C 2012 Biometrika Trust Advance Access publication 26 August 2012 Printed in Great Britain Dispersion operators and resistant second-order functional data analysis BY DAVID KRAUS AND VICTOR M. PANARETOS Institute of Mathematics, Ecole Polytechnique F´ed´erale de Lausanne, 1015 Lausanne, Switzerland david.kraus@epfl.ch victor.panaretos@epfl.ch SUMMARY Inferences related to the second-order properties of functional data, as expressed by covariance structure, can become unreliable when the data are non-Gaussian or contain unusual observations. In the functional setting, it is often difficult to identify atypical observations, as their distinguishing characteristics can be manifold but subtle. In this paper, we introduce the notion of a dispersion operator, investigate its use in probing the second-order structure of functional data, and develop a test for comparing the second-order characteristics of two functional samples that is resistant to atypical observations and departures from normality. The proposed test is a regularized M-test based on a spectrally truncated version of the Hilbert–Schmidt norm of a score operator defined via the dispersion operator. We derive the asymptotic distribution of the test statistic, investigate the behaviour of the test in a simulation study and illustrate the method on a structural biology dataset. Some key words: Covariance operator; Karhunen–Lo`eve expansion; M-estimation; Resistant test; Spectral truncation; Two-sample testing. 1. INTRODUCTION The second-order structure of a random function is key to understanding the nature of the functional observations that it induces, as it is inextricably linked with the smoothness properties of the stochastic fluctuations of the function. 
Given a suitable random function in a separable Hilbert space, e.g., L2[0, 1], these second-order properties are encapsulated in the covariance operator. The link with the smoothness properties of the random function is then given by the Karhunen–Lo`eve expansion (e.g., Adler, 1990), which provides an optimal Fourier representation of the random function, using a basis comprised by the eigenfunctions of this operator. Consequently, a significant part of functional data analysis has concentrated on estimating the covariance operator, and employing its spectral decomposition in order to probe the smoothness properties of the functional data; see Bosq (2000), Dauxois et al. (1982), Hall & Hosseini-Nasab (2006), Ramsay & Silverman (2005), Gervini (2006), Hall et al. (2006) and Yao & Lee (2006), to name but a few. A natural inference problem is that of comparing the covariance structures of two samples of functional data, in order to decide whether they share the same fluctuation properties. Aspects of this problem were considered in Benko et al. (2009), who employed a bootstrap procedure to compare subsets of eigenfunctions or eigenvalues of the two samples in a financial context. The more global problem of testing whether two samples share the same covariance operator was investigated in the Gaussian case by Panaretos et al. (2010), motivated by the study of mechanical properties of DNA, and subsequently by Boente et al. (2011) through atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 814 DAVID KRAUS AND VICTOR M. PANARETOS a simulation-based approach. In a slightly different setting, Gabrys & Kokoszka (2007) and Horv´ath et al. (2010) investigated second-order tests to detect the presence or change of serial correlation in functional data. The goal of this paper is to study the problem of second-order inference in a more general setting. We focus on situations where the data are not Gaussian, and indeed may be characterized by the presence of influential observations. That we do not use the word outlier is deliberate: in the functional case, observations can significantly impact the empirical covariance operator, though they may not be outlying. The infinite-dimensional nature of the data means that an observation can be atypical in many ways, the deviation from the mean being only one; observations close to the mean may contain unusual frequency components. Detection of such observations via exploratory techniques may be nontrivial (Sun & Genton, 2011). Such influential observations might significantly influence the estimation of the covariance, and, even more profoundly, the quality of the estimators of its spectrum. For these reasons, robustified estimates of the spectrum have been proposed, based on the spectra of robust estimators of the covariance operator. Locantore et al. (1999) proposed the use of the spectrum of the socalled spherical covariance operator in a discretized setting (Boente & Fraiman, 1999). Gervini (2008) introduced the functional median and further studied the properties of the spherical covariance spectrum for functional data concentrated on an unknown finite-dimensional hyperplane. Bali et al. (2012) adapted the projection-pursuit method of Li & Chen (1985) in the functional case. The sensitivity of the empirical covariance operator and its spectrum to the presence of influential observations can have an impact on testing procedures for the covariance operator. 
This is already observed in the finite-dimensional case (Layard, 1974; Olson, 1974), where deviations from a Gaussian assumption, or the presence of influential observations, can completely ruin a testing procedure even in one dimension (Box, 1953; Hampel et al., 1986). Finitedimensional robust or resistant tests for covariance matrices cannot be directly extended to the functional case, as they often depend on the assumption of an invertible empirical covariance, which will by default be violated in the functional case for all sample sizes (Tiku & Balakrishnan, 1985; O’Brien, 1992; Zhang et al., 1991; Anderson, 2006). Even if a pseudo-inverse operator is employed, one immediately runs into the problem of ill-posedness. To cope with these issues, this paper introduces a class of operators that we term dispersion operators that are implicitly defined through a variational problem, motivated by M-estimators of location for the tensor product of the centred functional observations. It is then proposed that these operators be used as proxies for the covariance operator, when inferences on the second-order structure are to be drawn for non-Gaussian and potentially contaminated functional samples. The implicit definition of a dispersion operator gives rise to a score equation, as the dispersion operator is a zero of the Fr´echet derivative of the variational problem with respect to the operator argument. This functional score equation is then used as a basis to construct a test for the second-order comparison of two functional samples. The test is based on the distance of the functional score equation under the null hypothesis from zero, measured by an appropriately renormalized Hilbert–Schmidt distance. 2. SECOND-ORDER INFERENCE BASED ON THE DISPERSION OPERATOR 2·1. Covariance operators To describe the second-order properties of a random element X in a separable Hilbert space of functions H, often taken to be L2[0, 1], with norm · and inner product ·, · , one typically considers the covariance operator of X, C : H → H, defined as C ( f ) = E{ f, X − μ (X − μ)}; atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 815 here μ = E(X) represents the mean of the function X. For example, in the case H ≡ L2[0, 1], with inner product f, g = 1 0 f (t)g(t) dt, the covariance operator is represented as an integral operator C ( f ) = 1 0 r(·, s) f (s) ds, where r(s, t) = E[{X(s) − μ(s)}{X(t) − μ(t)}] stands for the covariance kernel of the process X. For the purposes of this paper, it will be more fruitful to think of the covariance operator as an operator related to tensor products on H, rather than through the sample path perspective based on the covariance kernel. In particular, we will think of the covariance operator as C = E{(X − μ) ⊗ (X − μ)}, where ⊗ stands for the tensor product on H: for f, g ∈ H, f ⊗ g defines an operator on H through ( f ⊗ g)(h) = g, h f , where h ∈ H. In this setting, and provided that E( X 2) < ∞, the covariance operator C can itself be thought of as an element of a Hilbert space, the space HS(H, H) of Hilbert–Schmidt operators acting on H. This is the space of linear operators R on H such that R HS = ∞ k=1 Rek 2 1/2 < ∞, where {ek} is any orthonormal basis of H. Here, · HS defines a norm on HS(H, H), corresponding to the inner product R1, R2 HS = ∞ k=1 R1ek, R2ek . 
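To fix ideas, the tensor-product and Hilbert–Schmidt objects just introduced take a simple coordinate form once curves are represented by coefficient vectors in an orthonormal basis, which is the representation used later in the computational appendix. The following is a minimal sketch under that assumption; all object names are illustrative and not taken from any accompanying code.

```r
# Minimal coordinate sketch (assumption: rows of X are coefficient vectors of the
# curves in a common orthonormal basis of H, so inner products become Euclidean).
set.seed(1)
n <- 20; p <- 5
X <- matrix(rnorm(n * p), n, p)            # illustrative coefficient data

# the tensor product (f tensor g)(h) = <g, h> f becomes the rank-one matrix f g^T
f <- rnorm(p); g <- rnorm(p); h <- rnorm(p)
all.equal(drop(outer(f, g) %*% h), sum(g * h) * f)   # TRUE

# empirical covariance operator as the average of centred tensor products
Xc   <- sweep(X, 2, colMeans(X))
Chat <- crossprod(Xc) / n                  # (1/n) sum_i (X_i - Xbar)(X_i - Xbar)^T

# Hilbert-Schmidt norm and inner product reduce to the Frobenius norm and
# the entrywise sum of products of the coefficient matrices
hs_norm  <- sqrt(sum(Chat^2))
hs_inner <- function(A, B) sum(A * B)
```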
In what follows, we will usually omit the subscript HS, as the nature of the norm or inner product employed, whether it is an operator or an element norm, will be clearly implied from the space where its argument belongs. In this Hilbert–Schmidt setting, the covariance operator can be seen as the operator C ∈ HS(H, H) that solves the variational problem min R∈HS(H,H) E{ (X − μ) ⊗ (X − μ) − R 2 }. The sample counterpart of the covariance operator, the empirical covariance operator, ˆCn = 1 n n i=1 (Xi − ¯X) ⊗ (Xi − ¯X), can be represented as the solution to the problem min R∈HS(H,H) 1 n n i=1 (Xi − ¯X) ⊗ (Xi − ¯X) − R 2 , where X1, . . . , Xn is a collection of independent and identically distributed copies of X, and ¯X = n−1 n i=1 Xi stands for their empirical mean. This being essentially a least squares problem, both the empirical covariance operator and methods based on it will be sensitive to the presence of atypical observations in the dataset X1, . . . , Xn. In fact, it can also be seen that the empirical covariance operator admits a Gaussian maximum likelihood estimator interpretation, in a Cram´er–Wold sense: if X is assumed Gaussian, then ˆCn is the unique element of HS(H, H) atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 816 DAVID KRAUS AND VICTOR M. PANARETOS such that, for every f ∈ H, f, ˆCn f is the unique maximum likelihood estimator of the variance of f, X . The law of X is completely determined by the laws of the collection { f, X : f ∈ H}, and of course f, X is Gaussian with mean f, μ and variance f, C f . The basic strategy of this paper will be to obtain procedures pertaining to the second-order structure of X that are more resistant to departures from normality and to the presence of influential observations by replacing the squared norm in the variational problem defining the covariance by a less sensitive loss function. This gives rise to a new class of second-order characteristics, which we call dispersion operators. 2·2. Dispersion operators Let P be a distribution on the separable Hilbert space H and let X be a random element with this distribution. The usual covariance is the integral of the operator P(x; μ) = (x − μ) ⊗ (x − μ), x ∈ H, with respect to P. This suggests that a dispersion operator could be defined as an M-estimator of the location of P(X; μ). Let ρ be a nonnegative, differentiable, strictly increasing and convex function on R+ 0 with ρ(0) = 0. We define the ρ-dispersion operator of the distribution P as R(P) = arg min R∈HS(H,H) M(P; R, μ), (1) where M(P; R, μ) = EP[ρ{ P(X; μ) − R } − ρ{ P(X; μ) }] = [ρ{ P(x; μ) − R } − ρ{ P(x; μ) }] dP(x). (2) In the definition of the dispersion operator, μ is chosen to be some suitable element of H with the interpretation of a location parameter. It is natural to use μ equal to the ρ-centre μ(P) = arg min μ∈H L(P; μ), where L(P; μ) = EP{ρ( X − μ ) − ρ( X )} = {ρ( x − μ ) − ρ( x )} dP(x). Equivalently, one may define μ(P) and R(P) as solutions to score equations. The objective functionals L(P; μ) and M(P; R, μ) are real-valued functionals defined on the Hilbert spaces H and HS(H, H), respectively. The corresponding scores are their Fr´echet derivatives, that is, linear functionals on the corresponding Hilbert space that can be uniquely identified with an element of that Hilbert space. 
Specifically, the centre μ(P) is the solution to the functional equation G(P; μ) = 0, where the element G(P; μ) = ∂ ∂μ L(P; μ) = EP ρ ( X − μ ) X − μ (μ − X) = ρ ( x − μ ) x − μ (μ − x) dP(x) atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 817 of H determines the Fr´echet derivative of L with respect to μ. The dispersion operator is defined as the solution to the operator equation G (P; R, μ) = O, (3) where O is the zero operator on H and the operator G (P; R, μ) = ∂ ∂R M(P; R, μ) = EP ρ { P(X; μ) − R } P(X; μ) − R {R − P(X; μ)} = ρ { P(x; μ) − R } P(x; μ) − R {R − P(x; μ)} dP(x) determines the Fr´echet derivative of M with respect to R. The empirical dispersion operator based on the sample X1, . . . , Xn is the dispersion operator of the empirical distribution ˆP of the sample, that is, R( ˆP). The empirical dispersion operator can be in general computed around any element μ ∈ H; in practice, one naturally uses the empirical centre μ( ˆP), i.e., the centre of the empirical distribution. PROPOSITION 1. Let P be a distribution on the separable Hilbert space H that is not concentrated on a line in H or on four points of H. Assume that ρ is nonnegative, strictly increasing on [0, ∞) and convex. Then, the objective function M(P; R, μ) as a functional of R is strictly convex for any μ ∈ H and thus the ρ-dispersion operator around μ exists and is unique. Proposition 1 holds without any moment assumptions because the subtraction of ρ{ P(X; μ) } and ρ( X ) in the definition of M(P; R, μ) and L(P; μ), respectively, guarantees the existence and finiteness of the objective functions. Under fairly weak further assumptions, we may also deduce that the empirical dispersion operator is well defined and consistent. COROLLARY 1. Let X1, . . . , Xn be independent random elements with law P that has no discrete component and is such that the probability that X1, . . . , Xn be collinear is zero (n 3). Then, for n 5, the empirical ρ-dispersion operator corresponding to X1, . . . , Xn exists and is almost surely unique. Moreover, if ˆμ is consistent for a location parameter μ, then the empirical dispersion operator around ˆμ is itself consistent for the dispersion operator around μ. We remark, for example, that the empirical functional median, i.e., the empirical centre corresponding to ρ(u) = u, was proven to be consistent for its theoretical counterpart in Gervini (2008). In fact, in the setting of Corollary 1, this result can be extended to location parameters corresponding to strictly increasing convex ρ-functions. It is seen from (1) or (3) that the ρ-dispersion operator is self-adjoint. Moreover, from the spectral decomposition found in Proposition 2, it will follow that the ρ-dispersion operator is positive semidefinite. Although many results derived in this paper are valid for a wide class of functions ρ, the choice ρ(u) = uq for some q > 0 is especially attractive as the resulting centre is scale invariant and the dispersion is scale equivariant. For general ρ, it would be more appropriate to use a suitably studentized version of the objective functions; to this end, one can insert a preliminary estimator of the trace into the objective function. We now provide explicit formulae for two main choices of the ρ-function. atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 818 DAVID KRAUS AND VICTOR M. 
PANARETOS When choosing ρ(u) = u2, the score determining the ρ-dispersion operator equals G (P; R, μ) = EP[2{R − P(X; μ)}]. Thus, R(P) can be found explicitly as R(P) = EP{P(X; μ)}. As the score for the ρ-centre is G(P; μ) = EP{2(μ − X)}, the solution is μ(P) = EP(X). Hence, the dispersion operator is the usual covariance operator. The choice ρ(u) = u is expected to place less emphasis on influential observations and result in more resistant procedures. The corresponding score operators for the dispersion and centre are G (P; R, μ) = EP R − P(X; μ) R − P(X; μ) , G(P; μ) = EP μ − X μ − X . The parameter μ(P) has been studied by a number of authors under different names in the multivariate as well as functional settings. In the multivariate context Chaudhuri (1996) calls μ(P) the geometric median; other authors (Serfling, 2004; Sirki¨a et al., 2009) use the name spatial median and some authors (Huber & Ronchetti, 2009; Fritz et al., 2012) use the term L1-centre or L1-median. In the functional setting, μ(P) was studied by Locantore et al. (1999) and by Gervini (2008), who calls it the functional or spatial median. We use the term spatial median for μ(P) and, similarly, we call R(P) the spatial dispersion operator. To clarify the terminology, we recall that S (P) = EP (X − μ) ⊗ (X − μ) X − μ 2 is called the spherical covariance operator (Locantore et al., 1999). Unlike the parameters under the L2-type loss function, the spatial median and spatial dispersion are not available explicitly. Their empirical counterparts ˆμ = μ( ˆP) and ˆR = R( ˆP) can, however, be obtained numerically, employing a Newton–Raphson algorithm, as explained in the Appendix. The score function ρ (u) = quq−1 corresponding to ρ(u) = uq is unbounded unless q = 1. Therefore, the estimator of the spatial dispersion operator, q = 1, is resistant, whereas other choices are nonresistant due to the effect of outliers, q > 1, or inliers, q < 1. Although the dispersion operator is in general different from the covariance operator unless ρ(u) = u2, it carries useful information on second-order properties of the distribution. There is an interesting link between the spectra of the dispersion and covariance operator. Let X admit the Karhunen–Lo`eve expansion X = μ + ∞ k=1 λ 1/2 k βkϕk, where β1, β2, . . . are zeromean unit-variance uncorrelated random variables, {λk : k 1} are the nonincreasing nonnegative eigenvalues and {ϕk : k 1} are the complete orthonormal eigenfunctions of the covariance operator C (P) = EP{(X − μ) ⊗ (X − μ)} = ∞ k=1 λkϕk ⊗ ϕk. We now investigate the eigendecomposition of the theoretical ρ-dispersion operator R(P) defined via M-estimation as the solution to (3). The main result is as follows. PROPOSITION 2. Assume that the Fourier coefficient sequence {βk}∞ k=1 has a joint distribution that is invariant under the change of the sign of any component. Then, the dispersion operator R(P) has the same eigenfunctions as the covariance operator C (P), i.e., there exists a nonnegative sequence {δk}∞ k=1 such that R(P) = ∞ k=1 δkϕk ⊗ ϕk. Furthermore, the eigenvalues δ1, δ2, . . . satisfy the conditions δk = λk E ρ [{ i (δi −λi β2 i )2+ i |=l λi λlβ2 i β2 l }1/2] { i (δi −λi β2 i )2+ i |=l λi λlβ2 i β2 l }1/2 β2 k E ρ [{ i (δi −λi β2 i )2+ i |=l λi λlβ2 i β2 l }1/2] { i (δi −λi β2 i )2+ i |=l λi λlβ2 i β2 l }1/2 (k = 1, 2, . . .). 
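Proposition 2 can be probed numerically. The sketch below, which is an illustration rather than the authors' code, generates coefficient data with known eigenstructure and compares the eigenvectors of the empirical covariance with those of the empirical spatial dispersion; for brevity the dispersion is computed by a simple Weiszfeld-type reweighting iteration (the geometric median of the tensor products in the Hilbert–Schmidt norm), not by the quasi-Newton method described later in the Appendix, and the centre is simplified to the sample mean.

```r
# Numerical illustration of the shared-eigenfunction property (a sketch):
# coefficients of X in an orthonormal basis, covariance vs. spatial dispersion.
set.seed(2)
n <- 300; p <- 4
Q <- qr.Q(qr(matrix(rnorm(p * p), p)))              # orthonormal "eigenfunctions"
lambda <- c(4, 2, 1, 0.5)
B <- matrix(rt(n * p, df = 5) / sqrt(5 / 3), n, p)  # sign-symmetric, unit-variance scores
X <- B %*% diag(sqrt(lambda)) %*% t(Q)

m  <- colMeans(X)                                   # centre simplified to the mean here
Z  <- lapply(seq_len(n), function(i) tcrossprod(X[i, ] - m))
Ch <- Reduce(`+`, Z) / n                            # empirical covariance operator

# spatial dispersion = geometric median of the Z_i in the Hilbert-Schmidt norm,
# computed by a Weiszfeld-type reweighting (a fixed number of iterations suffices here)
w    <- function(R) sapply(Z, function(Zi) 1 / sqrt(sum((Zi - R)^2)))
Rhat <- Ch
for (it in 1:200) {
  wi   <- w(Rhat)
  Rhat <- Reduce(`+`, Map(`*`, Z, wi)) / sum(wi)
}

# absolute inner products between the two sets of eigenvectors
round(abs(crossprod(eigen(Ch,   symmetric = TRUE)$vectors,
                    eigen(Rhat, symmetric = TRUE)$vectors)), 2)
```

If the proposition, and the ordering conjecture discussed below, hold in this setting, the printed matrix should be close to the identity.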
atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 819 A similar result relating the covariance operator and the spherical covariance operator S (P) was obtained by Gervini (2008, Theorem 3) who showed that, under the assumption of exchangeability of the coefficient sequence, both operators have the same eigenfunctions in the same order; see also Marden (1999) and Boente & Fraiman (1999). Our proposition shows that the ρ-dispersion operator also has the same set of eigenfunctions. We conjecture that, potentially under further assumptions, the order of the eigenfunctions is also the same; computational experiments back this conjecture. Gervini (2008) assumed that the Karhunen–Lo`eve expansion has only finitely many terms, i.e., that the distribution is concentrated on a finite-dimensional subspace, whereas our results hold even for processes with infinite series expansions. On the other hand, Gervini (2008) needed no moment assumptions, whereas we need to assume finite second moments: without moment assumptions the convergence of an infinite Karhunen–Lo`eve series is not guaranteed, while a finite sum is always well defined regardless of the properties of the random summands. 2·3. The two-sample test Having defined the notion of a dispersion operator, we now construct a two-sample secondorder test based upon it. Let X1, . . . , Xn1 and Y1, . . . , Yn2 be two independent random samples from distributions P1, P2 on H, whose ρ-centres are μ(P1), μ(P2) and ρ-dispersion operators are R(P1), R(P2). The goal is to test the null hypothesis H0: R(P1) = R(P2) against the general alternative H1: R(P1) |= R(P2). Note that μ(P1), μ(P2) can be equal or different, as neither H0 nor H1 specifies their relation. We propose to employ the general idea of score tests, that is, to base the test on the estimating score for the general model, without assuming H0, evaluated at the null estimate of the parameter. As the centres μ(P1), μ(P2) are not restricted under the null hypothesis, they can be estimated separately by minimizing L( ˆP1; μ1), L( ˆP2; μ2), i.e., by solving G( ˆP1; μ1) = 0, G( ˆP2; μ2) = 0, respectively. Denote μ( ˆPj ) by ˆμj ( j = 1, 2). On the other hand, the null estimator of the dispersion is based on both samples. As we now have two samples, we need to extend our notation to cover situations with two distributions, empirical or theoretical, mixed at proportions a and 1 − a for a ∈ (0, 1). We denote M(P1, P2, a; R1, R2, μ1, μ2) = aM(P1; R1, μ1) + (1 − a)M(P2; R2, μ2). The common null value R of the dispersion operator is estimated by ˆR, which minimizes M( ˆP1, ˆP2, an; R, R, ˆμ1, ˆμ2) where an = n1/n with n = n1 + n2. Equivalently, ˆR solves G ( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2) = O, the null estimating equation, where G (P1, P2, a; R, μ1, μ2) = aG (P1; R, μ1) + (1 − a)G (P2; R, μ2). Using the reparameterization R = (R1 + R2)/2, T = (R1 − R2)/2, we have R1 = R + T , R2 = R − T and we need to test H0: T = O against H1: T |= O. For the test, we need the score in the general model ∂ ∂(R, T )T M( ˆP1, ˆP2, an; R + T , R − T , ˆμ1, ˆμ2) = G ( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2) B( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2) where B(P1, P2, a; R, μ1, μ2) = aG (P1; R, μ1) − (1 − a)G (P2; R, μ2). The score test is based on this general score at the null estimator. When evaluated at (R, T ) = ( ˆR, O), the score is zero in the first component. Thus, the test can be based on the second component B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2). 
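For completeness, the chain-rule step behind the two-block score displayed above can be written out as follows; this is a routine verification in the notation of this section, with the dispersion score operator written here as \mathcal{G} and M denoting M( ˆP1, ˆP2, an; R + T, R − T, ˆμ1, ˆμ2).

```latex
% chain-rule step behind the two-block score
\frac{\partial}{\partial R}\,M = a_n\,\mathcal{G}(\hat P_1;\,R+T,\,\hat\mu_1)
                               + (1-a_n)\,\mathcal{G}(\hat P_2;\,R-T,\,\hat\mu_2),
\qquad
\frac{\partial}{\partial T}\,M = a_n\,\mathcal{G}(\hat P_1;\,R+T,\,\hat\mu_1)
                               - (1-a_n)\,\mathcal{G}(\hat P_2;\,R-T,\,\hat\mu_2).
```

Setting T = O turns these into G( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2) and B( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2), and the former vanishes at R = ˆR by the null estimating equation, which is why only the second component is informative.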
atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 820 DAVID KRAUS AND VICTOR M. PANARETOS When the null hypothesis holds, the score operator B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) is expected to be close to the zero operator, otherwise it should be far from the zero operator. To perform the test, we need to measure the distance of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) from the zero operator and assess the significance of the resulting test statistic. One way to measure the distance of the score operator from zero is to use its Hilbert–Schmidt norm. A drawback of this approach is that the resulting statistic does not have a tractable asymptotic distribution. The score operator turns out to be asymptotically Gaussian, but its Hilbert– Schmidt norm is not asymptotically distribution-free. In the context of comparison of covariance operators, Boente et al. (2011) use a simulation procedure to approximate the distribution of the statistic. Another idea is to mimic the standard procedure from settings where the parameter of interest is Euclidean. In such settings, the difference of the score vector from zero is measured with the help of a quadratic form involving the score vector and the inverse of its covariance matrix. The quadratic statistic is usually asymptotically chi-square distributed and the null hypothesis is then rejected when the value of the statistic is significantly large. In the functional context, the score B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) is infinite dimensional. Due to the noninvertibility of its covariance operator, one cannot construct a quadratic statistic. We overcome this problem by regularizing the score operator using spectral truncation. The test object B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) is an element of the space of operators HS(H, H). Recall that HS(H, H) is a Hilbert space with inner product defined as A1, A2 = ∞ k=1 A1ek, A2ek = ∞ j=1 ∞ k=1 ej , A1ek ej , A2ek , A1, A2 ∈ HS(H, H), where {ek : k = 1, 2, . . . } is an arbitrary complete orthonormal basis of H. For any complete orthonormal basis {Ek : k = 1, 2, . . . } of HS(H, H), an operator A ∈ HS(H, H) and the square of its Hilbert–Schmidt norm can be written as A = ∞ k=1 A , Ek Ek, A 2 = ∞ k=1 A , Ek 2 . Instead of this infinite series, one can use a truncated version. If U ⊂ HS(H, H) is a suitably chosen finite-dimensional linear subspace with an orthonormal basis {U1, . . . , UL}, then instead of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) 2 one can use πU B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) 2 = B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)πU 2 = L l=1 B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2), Ul 2 , where πU is the projection onto the subspace U . That is, the test can be based on a score vector with components Sl = B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2), Ul (l = 1, . . . , L). (4) One particular way of choosing the basis elements Ul is to derive them from a basis of the Hilbert space H. If U is a K-dimensional linear subspace of H with an orthonormal basis {u1, . . . , uK }, atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 821 then one may use the L = K(K + 1)/2 orthonormal operators of the form Ujk = u j ⊗ u j ( j = k), (u j ⊗ uk + uk ⊗ u j )/21/2 ( j < k). (5) There is yet another way of motivating the above truncation. Instead of measuring the difference of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) from zero on the entire Hilbert space H, we can measure how it differs from the zero operator when attention is restricted to the linear subspace U. 
More precisely, instead of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2), we use the operator πU B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)πU , where πU is the projection operator on U. Its squared Hilbert– Schmidt norm πU B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)πU 2 = K j=1 K k=1 u j , B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)uk 2 is a truncated version of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) 2 = ∞ j=1 ∞ k=1 ej , B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)ek 2 , where {ej : j = 1, 2, . . . } is any complete orthonormal basis of H. The resulting scores Sjk = u j , B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)uk (1 j k K) are equivalent to (4) with Ul of the form (5). It is natural to use the basis operators of the form (5) with u1, . . . , uK being the first K eigenfunctions of the dispersion operator R because, in light of Mercer’s theorem, they carry the main portion of information about the dispersion operator. In practice, the eigenfunctions of R are not known, so one uses the eigenfunctions of the pooled sample estimator ˆR. The number of components K can be selected as the minimal number for the cumulative proportion of dispersion explained by the subspace to exceed a certain threshold, e.g., 80% of the trace of the corresponding pooled sample dispersion operator. The proportion of dispersion, corresponding to the eigenvalues of the dispersion operator, is in general not equivalent to the proportion of variability, corresponding to the eigenvalues of the covariance operator. To construct the test statistic, instead of simply summing squares of the terms Sl of the form (4), one combines them in a quadratic form reflecting their covariance structure. The formal test will be based on the asymptotic distribution of the test statistic. Let n1, n2 be such that n1 → ∞, n2 → ∞ and an = n1/n → a ∈ (0, 1). Assume that G(Pj ; μ) 2, G (Pj ; R, μ) 2 ( j = 1, 2) are finite. Let the function ρ: R+ 0 → R+ 0 be twice differentiable, strictly increasing, and convex with ρ(0) = 0. Assume that the laws P1, P2 satisfy the conditions of Corollary 1 and the expectations EPj {ρ ( X − μ )2}, EPj [ρ { P(X; μ) − R }2], EPj {ρ ( X − μ )}, EPj [ρ { P(X; μ) − R }] and EPj ρ ( X − μ ) X − μ , EPj ρ { P(X; μ) − R } P(X; μ) − R ( j = 1, 2) are finite. Assume that the derivatives D(Pj ; μ), D(Pj ; R, μ), D(Pj ; R, μ) given in (A1)– (A3) in the Appendix exist for j = 1, 2. Let S be a score vector of length L of the form (4) for some linearly independent operators Ul = U (n) l . Let the operators Ul be either nonrandom, independent of n, or convergent in probability to some nonrandom limits, up to a possible sign ambiguity in the sense that there atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 822 DAVID KRAUS AND VICTOR M. PANARETOS exist some operators U ∞ l such that | U (n) l , U ∞ l | converges to 1. In this set-up, we have the following theorem. THEOREM 1. Under the null hypothesis H0 : R(P1) = R(P2), the score n1/2B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) converges weakly to a mean zero Gaussian random operator with covariance operator, which can be consistently estimated by W( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) given in (A5) in the Appendix. The asymptotic distribution of the score vector n1/2S is L-variate zero-mean Gaussian with a covariance matrix that is consistently estimated by a matrix W with entries Wj,l = Uj , W( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)Ul ( j,l = 1, . . . , L). The test statistic T = nST W−1S asymptotically follows a χ2 distribution with L degrees of freedom. We now deal with the two main cases, spatial and L2-type, explicitly. 
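Before turning to those two cases in detail, the following minimal sketch shows how the truncated score vector, its estimated covariance matrix W and the chi-square statistic of Theorem 1 fit together, specialized for concreteness to the L2 case ρ(u) = u², whose explicit score and covariance expressions appear just below. It assumes curves given by coefficients in a common orthonormal basis, uses the 80% trace threshold mentioned above, and the function name and interface are illustrative, not the authors' code.

```r
# L2-type score test from coefficient matrices X (n1 x p) and Y (n2 x p);
# rows are coefficient vectors in a common orthonormal basis (illustrative sketch).
l2_score_test <- function(X, Y, threshold = 0.8) {
  n1 <- nrow(X); n2 <- nrow(Y); n <- n1 + n2; a <- n1 / n
  Xc <- sweep(X, 2, colMeans(X)); Yc <- sweep(Y, 2, colMeans(Y))
  R1 <- crossprod(Xc) / n1              # empirical covariance operators
  R2 <- crossprod(Yc) / n2
  R0 <- a * R1 + (1 - a) * R2           # pooled null estimator
  B  <- 4 * a * (1 - a) * (R2 - R1)     # score operator in the L2 case

  # basis operators U_{jk} built from the leading eigenvectors of the pooled estimator
  eg <- eigen(R0, symmetric = TRUE)
  K  <- which(cumsum(eg$values) / sum(eg$values) >= threshold)[1]
  u  <- eg$vectors[, 1:K, drop = FALSE]
  Ul <- list()
  for (j in 1:K) for (k in j:K) {
    Ul[[length(Ul) + 1]] <- if (j == k) tcrossprod(u[, j]) else
      (tcrossprod(u[, j], u[, k]) + tcrossprod(u[, k], u[, j])) / sqrt(2)
  }
  L <- length(Ul)
  S <- sapply(Ul, function(U) sum(U * B))   # score vector <U_l, B>

  # residual operators P(X_i; mu1) - R1 and P(Y_i; mu2) - R2
  Z1 <- lapply(seq_len(n1), function(i) tcrossprod(Xc[i, ]) - R1)
  Z2 <- lapply(seq_len(n2), function(i) tcrossprod(Yc[i, ]) - R2)
  proj <- function(Zs, U) sapply(Zs, function(Z) sum(U * Z))
  W <- matrix(0, L, L)
  for (j in 1:L) for (l in j:L) {
    w <- 16 * a * (1 - a) *
      ((1 - a) * mean(proj(Z1, Ul[[j]]) * proj(Z1, Ul[[l]])) +
       a       * mean(proj(Z2, Ul[[j]]) * proj(Z2, Ul[[l]])))
    W[j, l] <- W[l, j] <- w
  }
  stat <- n * sum(S * solve(W, S))          # T = n S' W^{-1} S, chi-square with L df
  list(statistic = stat, df = L, p.value = pchisq(stat, df = L, lower.tail = FALSE), K = K)
}
```

For the spatial case the score operator and its covariance estimator do not have such closed forms, so the corresponding quantities are assembled from the expressions that follow.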
In the spatial case, ρ(u) = u, we test the null hypothesis that the spatial dispersion operators are equal in both samples. The score operator takes the form B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) = 1 n n1 i=1 ˆR − P(Xi ; ˆμ1) ˆR − P(Xi ; ˆμ1) − 1 n n2 i=1 ˆR − P(Yi ; ˆμ2) ˆR − P(Yi ; ˆμ2) . The Fr´echet derivatives D(P; μ), D(P; R, μ) involved in the covariance operator of the score are D(P; μ) = EP 1 X − μ I − (X − μ) ⊗ (X − μ) X − μ 2 , D(P; R, μ) = EP 1 P(X; μ) − R I − {P(X; μ) − R} ⊗ {P(X; μ) − R} P(X; μ) − R 2 , and the derivative D(P; R, μ) evaluated at f ∈ H is D(P; R, μ) f = EP −Q(X; μ) f P(X; μ) − R + P(X; μ) − R, Q(X; μ) f P(X; μ) − R 3 {P(X; μ) − R} . When the L2 approach, ρ(u) = u2, is employed, the hypothesis to be tested states that the covariance operators in both samples are equal. The null estimator of R takes the form ˆR = an ˆR1 + (1 − an) ˆR2, that is, the pooled covariance estimator. The test score operator equals B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) = an2( ˆR − ˆR1) − (1 − an)2( ˆR − ˆR2) = 4an(1 − an)( ˆR2 − ˆR1), which is a multiple of the difference of the empirical covariance operators. So, the test is equivalent to a Wald-type test proposed by Panaretos et al. (2010). This is different from the spatial test for which the score does not simplify to the difference of the spatial dispersions, so the score test differs from the Wald test. To compute the covariance operator of the test score, we first notice that D(P; R, μ) = −2 EP{Q(X; μ)} equals zero at μ = μ(P) = EP(X); see (A4) in the Appendix. Consequently, the fact that the centres of the two distributions must be estimated does not affect the asymptotic distribution, as could be expected. Also, D(P; R, μ) = 2I. Hence, after straightforward calculations, the estimator of the covariance operator of the test operator is W( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) = 4an(1 − an){(1 − an)J( ˆP1; ˆR, ˆμ1) + anJ( ˆP2; ˆR, ˆμ2)} atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 823 = 16an(1 − an) × (1 − an) 1 n1 n1 i=1 {P(Xi ; ˆμ1) − ˆR1} ⊗ {P(Xi ; ˆμ1) − ˆR1} + an 1 n2 n2 i=1 {P(Yi ; ˆμ2) − ˆR2} ⊗ {P(Yi ; ˆμ2) − ˆR2} . In Panaretos et al. (2010), the limiting covariance of the L2 score for the Wald-type test was investigated in the special case of Gaussian data and a simpler formula was found. 3. A SIMULATION STUDY In order to investigate the performance of the testing procedure introduced in § 2·3, we generate random samples of size n1, n2 of curves of the form X(t) = μ1(t) + 10 k=1 λ 1/2 1k a1k21/2 sin{2πk(t + γ1k)} + 10 k=1 ν 1/2 1k b1k21/2 cos{2πk(t + δ1k)}, Y(t) = μ2(t) + 10 k=1 λ 1/2 2k a2k21/2 sin{2πk(t + γ2k)} + 10 k=1 ν 1/2 2k b2k21/2 cos{2πk(t + δ2k)}, where the coefficients ajk, bjk are mutually independent random variables with zero-mean and unit variance. Three symmetric coefficient distributions are considered: normal, uniform and t5, all scaled to have unit variance. As the test procedures are invariant with respect to the location shift of one or both samples, we set μ1(t) = μ2(t) = 0. Unless stated otherwise, we set γjk = δjk = 0 in all situations. We perform the nonresistant L2 test and the proposed spatial dispersion test at the nominal level α = 0·05. The sample sizes are n1 = n2 = 50. 
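For orientation, here is a minimal sketch of the curve-generating model just described; the function name, grid and the example spectra in the last two lines are illustrative rather than the exact settings of the study, and coefficient distributions other than the normal can be substituted by replacing rnorm with another standardized generator.

```r
# Sketch of the simulation model: sums of sine and cosine components with random
# zero-mean, unit-variance scores (names and example parameter values illustrative).
simulate_sample <- function(n, lambda, nu, gamma = 0, delta = 0,
                            t = seq(0, 1, length.out = 101), rgen = rnorm) {
  K <- length(lambda)
  gamma <- rep(gamma, length.out = K); delta <- rep(delta, length.out = K)
  X <- matrix(0, n, length(t))
  for (i in seq_len(n)) {
    a <- rgen(K); b <- rgen(K)
    for (k in seq_len(K)) {
      X[i, ] <- X[i, ] +
        sqrt(lambda[k]) * a[k] * sqrt(2) * sin(2 * pi * k * (t + gamma[k])) +
        sqrt(nu[k])     * b[k] * sqrt(2) * cos(2 * pi * k * (t + delta[k]))
    }
  }
  X
}
# e.g. two samples of 50 curves with illustrative eigenvalue sequences
X <- simulate_sample(50, lambda = (1:10)^(-3), nu = 0.4^(1:10))
Y <- simulate_sample(50, lambda = (1:10)^(-3), nu = 0.4^(1:10))
```

The remaining design choices, dimension reduction and contamination, are described next.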
The basis of the subspace for dimension reduction consists of several leading eigenfunctions of the pooled sample estimator of the dispersion operator; that is, the pooled sample empirical covariance for the L2 test and the pooled sample empirical spatial dispersion for the spatial test. The number of components K included in the basis is selected as the minimal number needed to explain at least 80% of the dispersion. We first study the behaviour of the test procedures under the null hypothesis. We set λ1k = λ2k = k−3 and ν1k = ν2k = (1/3)k. We begin with uncontaminated samples to verify that the tests maintain the prescribed nominal level. The first row of Table 1 shows that, in general, the asymptotic distribution approximates the distribution of both test statistics reasonably well. The asymptotic approximation for the L2 method is slightly less accurate and tends to be liberal for distributions with light tails, i.e., normal and uniform. Next we simulate datasets contaminated by atypical observations. Mean contamination, i.e., observations whose mean is different from the mean of the central distribution, usually impacts the level more seriously than pure covariance contamination, i.e., observations with the same mean but different covariance structure. Thus, we focus on mean contamination, i.e., outliers, in the study of the resistance of the level. In one or both samples, mj out of nj observations were replaced by observations that have mean function μcont j instead of μj and the same covariance structure as the original distribution. We consider various distances of the contamination distribution from the central distribution and various contamination proportions, as indicated in Tables 1 atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 824 DAVID KRAUS AND VICTOR M. PANARETOS Table 1. Empirical rejection probabilities (%) at the nominal level α = 5% under the null hypothesis. Samples of size n1 = n2 = 50 are contaminated by m1, m2 observations with mean functions μcont 1 , μcont 2 , respectively, and the same covariance structure as the central distribution. Estimates are based on 2000 simulation runs Normal t5 Uniform m1 μcont 1 (t) m2 μcont 2 (t) L2 Spatial L2 Spatial L2 Spatial 0 0 7·1 5·0 5·4 5·3 7·8 4·6 5 1 5 1·5 − 3 sin(πt) 9·2 6·6 8·2 6·4 10·0 4·6 5 1·5 5 1·5 − 3 sin(πt) 14·4 6·4 14·6 6·8 14·6 4·6 5 2·5 5 1·5 − 3 sin(πt) 22·9 6·0 23·0 7·2 23·0 5·1 5 1 5 2 − 4 sin(πt) 11·2 7·2 10·3 7·7 11·7 5·2 5 1·5 5 2 − 4 sin(πt) 18·8 7·2 19·8 7·8 20·0 5·4 5 2·5 5 2 − 4 sin(πt) 30·4 7·2 32·4 8·2 30·8 6·4 5 1 5 2·5 − 5 sin(πt) 14·1 8·2 14·0 8·0 15·0 6·4 5 1·5 5 2·5 − 5 sin(πt) 25·9 8·2 25·4 8·4 27·8 6·5 5 2·5 5 2·5 − 5 sin(πt) 41·8 8·3 46·4 9·0 42·4 7·2 5 1 0 7·4 6·0 6·4 5·4 8·6 5·0 5 1·5 0 12·6 5·9 11·2 5·7 13·4 4·6 5 2·5 0 19·0 6·1 17·8 6·0 17·8 4·7 0 5 1·5 − 3 sin(πt) 9·0 6·0 7·2 6·6 9·8 5·6 0 5 2 − 4 sin(πt) 12·3 6·8 10·8 7·7 13·0 6·6 0 5 2·5 − 5 sin(πt) 16·4 7·6 14·4 8·7 16·8 7·6 Table 2. Empirical rejection probabilities (%) at the nominal level α = 5% under the null hypothesis. Samples of size n1 = n2 = 50 are contaminated by m1, m2 observations with mean functions μcont 1 (t) = 1·5, μcont 2 (t) = 2 − 4 sin(πt), respectively, and the same covariance structure as the central distribution. 
Estimates are based on 2000 simulation runs

          m1 = m, m2 = 0     m1 = 0, m2 = m     m1 = m2 = m
 m        L2      Spatial    L2      Spatial    L2       Spatial
 0        7·1     5·0        7·1     5·0        7·1      5·0
 1        7·0     5·4        6·7     5·1        7·2      5·6
 2        6·8     5·0        7·5     5·4        7·8      5·6
 3        6·9     5·3        8·7     5·6        8·4      6·2
 4        8·4     6·2        10·7    6·2        11·2     6·4
 5        12·6    5·9        12·3    6·8        18·8     7·2
 6        24·8    6·5        14·8    7·5        39·2     8·1
 7        57·8    7·4        17·2    8·6        71·6     10·2
 8        89·2    7·9        20·8    9·2        93·0     17·6
 9        99·0    11·9       24·7    11·4       99·0     28·2
10        99·8    18·4       28·2    13·6       100·0    42·7

and 2. We consider only atypical observations that are not very far from the central distribution. These are the most insidious because they are often hidden in the main, apparently typical part of the dataset, do not stand out and thus are not easily identified visually, yet they often have a devastating impact on the behaviour of the nonresistant test. To illustrate this, we plot in Fig. 1 typical simulated samples with m1 = 5, μcont 1 (t) = 1·5 and m2 = 5, μcont 2 (t) = 2 − 4 sin(πt). When looking at the plots, one would be unable to identify the atypical observations if they were not highlighted. Visually, many of them do not seem to be very different from most curves, whereas some curves from the central distribution could be considered unusual.

Fig. 1. Simulated contaminated samples. (a) Samples with m1 = 5 atypical observations with μcont 1 (t) = 1·5; (b) samples with m2 = 5 atypical observations with μcont 2 (t) = 2 − 4 sin(πt). Atypical observations plotted in bold.

Table 1 shows that the proposed spatial test is much more resistant to contamination than the L2-type test. For instance, notice that for m1 = m2 = 5, i.e., 10% contamination of both samples, the level of the spatial test in all situations considered is only slightly inflated, while the actual level of the L2-type test exceeds 40%. Similarly, if one of the samples contains five atypical observations and the other is not contaminated, i.e., 10% contamination of one sample and 5% contamination overall, the spatial test rejects with probability close to the nominal level, while the level of the L2-type test is as high as 19%. As the magnitude of atypical observations increases, the true level of the L2 test, unlike that of the spatial one, increases dramatically. Comparing the behaviour of the tests across the various coefficient distributions, we observe no important differences. The higher resistance of the spatial method is also documented in Table 2, where the dependence of the level on the amount of contamination is studied for Gaussian data. The spatial procedure can tolerate much more contamination than can the L2-type method. Now we focus on the behaviour of the tests under alternatives. We consider five alternative scenarios. Under all of them, the parameters of the distribution of the first sample are λ1k = k−3 and ν1k = (2/5)k. The parameters of the second sample are as follows. Under scenario I, we have λ2k = 1·6λ1k and ν2k = 1·6ν1k (k = 1, . . . , 10), so the samples differ only in scale; their covariance structure is otherwise the same. Under scenario II, we use λ21 = 1·5, ν21 = 0·8, and λ2k = λ1k and ν2k = ν1k (k = 2, . . . , 10), so the covariance operators differ in the two leading eigenvalues, which however correspond to the same eigenfunctions. Scenario III has λ2k = λ1k (k = 1, . . . , 10) and ν21 = 0·2, ν22 = 0·35 and ν2k = ν1k (k = 3, . . .
, 10); here the difference is on the second and third eigenvalues whose corresponding eigenfunctions are the same but in the opposite order. Under scenario IV, we set λ22 = λ13, λ23 = λ12, ν22 = ν13, ν23 = ν12 and λ2k = λ1k, ν2k = ν1k (k /∈ {2, 3}), so the difference occurs further down in the spectrum; eigenfunctions with indices 3, 4, 5, 6 are permuted, the leading two eigen-elements do not differ. Under scenario V, we use λ2k = λ1k, ν2k = ν1k and γ2k = δ2k = 0·15 (k = 1, . . . , 10); in this case, the whole eigenbases are different but the eigenvalues remain the same in both samples. First, we compare the power of the proposed spatial method with the L2-type method for samples without contamination. Table 3 shows that in most cases the power of the spatial test is lower than the power of the L2-type test for distributions with light tails. The lower efficiency of the spatial method is the price we pay for its increased resistance. Both methods have comparable power in the heavy tailed case under most scenarios. Under scenario IV the spatial method outperforms the L2-type method. This is due to the automatic selection of K: for instance in the normal case, for the L2-type test K equals 3 in 91 percent of cases while, for the spatial test, K equals 4 in 96 percent of cases; as the covariance operators differ on the third to sixth eigen-elements, K equal to 4 captures more of the difference. atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 826 DAVID KRAUS AND VICTOR M. PANARETOS Table 3. Empirical rejection probabilities (%) at the nominal level α = 5% under various alternative scenarios for samples of size n1 = n2 = 50 without contamination. Estimates are based on 1000 simulation runs Normal t5 Uniform L2 Spatial L2 Spatial L2 Spatial I 55 40 28 30 93 62 II 53 29 28 22 92 48 III 74 53 36 38 99 85 IV 38 61 24 53 49 73 V 76 58 53 51 96 72 Table 4. Empirical rejection probabilities (%) of the spatial test at the nominal level α = 5% under various alternative scenarios for samples of size n1 = n2 = 50 contaminated by m1, m2 atypical observations. Estimates are based on 1000 simulation runs Contamination m1 m2 I II III IV V configuration 0 0 40 29 53 61 58 A 5 5 12 16 57 64 59 5 0 34 25 54 62 58 0 5 15 16 56 63 61 B 5 5 29 22 36 39 55 5 0 33 28 46 74 55 0 5 40 28 49 34 57 C 5 5 24 18 34 39 52 5 0 32 22 43 50 62 0 5 31 24 43 49 48 Next, we investigate the impact of contamination on the power of the spatial test; we do not study the L2-type test as we have seen before that its level is unreliable for contaminated data. The goal is to study if and how contamination can decrease the power. Similarly to the null scenario, here we also observed that mean contamination usually increases the rejection probability. Therefore, it is more interesting to contaminate data with curves with atypical covariance structure. We experimented with many configurations of atypical observations such that it is difficult to identify them visually and found that often even covariance contamination increases the rejection probability. Nevertheless, we were able to find some configurations for which we observed a decrease of the power in some situations. The central distributions follow the same scenarios I–V as before with normally distributed coefficients. Contamination configurations are as follows. Under configuration A, the contamination distribution has λcont 1k = 1·4λ1k, νcont 1k = 1·4ν1k, λcont 2k = 0·25λ2k and νcont 2k = 0·25ν2k (k = 1, . . . 
, 10), other parameters of the contamination distribution are the same as for the central distribution. Under configuration B, we set λcont 1k = 0·3λ1k and λcont 2k = 0·3λ2k (k = 1, . . . , 10), νcont 1k = 0·3ν1k and νcont 2k = 0·3ν2k (k = 3, . . . , 10), and νcont 11 = νcont 21 = 1 and νcont 12 = νcont 22 = 0·9, while other parameters remain unchanged. Under configuration C, atypical observations in the first sample follow the central distribution of the second sample and atypical observations in the second sample follow the central distribution of the first sample. The simulation results are presented in Table 4. We report only configurations with some detrimental effect on the power, while many configurations not reported here do not have such an effect. Under configuration A, we can see a decrease of the rejection probability for scenarios I and II. Configuration A was specifically designed to decrease the power under scenario I: atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 827 TATA −0·2 −0·1 0·0 0·1 0·2 −0·2 −0·2 −0·1 0 0·1 0·2 −0·2 −0·1 0 0·1 0·2 −0·1 0·0 0·1 0·2 CAP Principal axis of inertia 2 Principal axis of inertia 2 Principalaxisofinertia3 Principalaxisofinertia3 Fig. 2. Projection of DNA minicircle curves on the first principal plane spanned by the second and third principal axis of inertia. Atypical observations plotted in bold. atypical observations deviate from the central distribution against the direction of the alternative; specifically, both the central and contamination distributions have proportional covariance operators but in the opposite direction. A similar phenomenon is seen for scenario II, where the directions of the alternative and of the contamination distribution are in a similar relationship. On the other hand, we observe no important effect of contamination of type A under scenarios III–V because in these cases atypical observations do not go against the alternative. Under configuration B, the power decreases mainly for scenarios III and IV. Configuration B downweights components other than the first and second cosine component, where it puts higher weight equal for both samples. As these are components carrying an important part of the difference between the covariances, one expects some decrease of the rejection probability, especially under scenarios III and IV. Under configuration C, the two samples are partly mixed, i.e., one sample contaminates the other sample and vice versa. This blurs the difference and somewhat decreases the power under some of the scenarios. 4. AN ILLUSTRATION: DNA MINICIRCLE DATA We illustrate the proposed methods on a dataset consisting of reconstructed three-dimensional electron microscope images of loops called minicircles obtained from short strands of DNA (Amzallag et al., 2006). The dataset contains 99 DNA minicircles of two types, TATA, 65 observations, and CAP, 34 observations, with identical base-pair sequences, except for a short subsequence where they differ. The main question is whether this difference affects the flexibility properties of the DNA minicircles. One way to formalize the flexibility properties is through the fluctuation pattern around the mean minicircle shape. This naturally leads one to consider twosample second-order functional comparisons. DNA minicircles are closed curves in R3. In the original dataset, each curve was randomly rotated and shifted in R3 and had no starting point and no orientation. In Panaretos et al. 
(2010), an alignment procedure based on the moment of inertia tensor was used as a means of alignment of the curves in a common coordinate system. Figure 2 shows projections of aligned curves on the plane spanned by the two principal axes of inertia. Using inverse weights induced by Gervini’s (2008) spatial median, Panaretos et al. (2010) identified five unusual curves, possible outliers, and removed them from the analysis of the covariance structure. These atypical curves, plotted in thick lines in Fig. 2, are visibly different from the remaining curves. Panaretos et al. (2010) analysed the data without the atypical observations using a test comparing empirical covariance operators under the assumption that atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 828 DAVID KRAUS AND VICTOR M. PANARETOS the curves are Gaussian. Under this assumption, they observed significant differences at the 5% level. These differences were highly significant with a numerically zero p-value, when the comparison was restricted to the eigenvalues of the covariance operators; the corresponding empirical eigenfunctions suggested that the eigenfunction structure of the two operators was very similar. Taking advantage of the results in the present paper, we may run an L2-type test without assuming normality. When doing so, with the atypical observations still removed, the p-value of the L2-type score test of the equality of covariance operators equals 0·023 with the dimension of the subspace on which the test operator is projected equal to K = 6, suggesting persistence of the effect, independently of a Gaussian assumption. Instead of removing apparently atypical observations manually, one might also wish to run an analysis on the complete dataset. However, the performance of L2-type procedures was seen to be highly unstable in the presence of atypical observations, such as the ones in the present dataset, see Tables 1 and 2. By contrast, the spatial dispersion test was seen to maintain a level close to nominal in our simulations, especially in outlier scenarios similar to the one in the minicircle data. There may be further influential observations lurking in the sample. For this reason, we applied the score test based on the spatial dispersion operator, using the full minicircle dataset. In contrast to the other procedures, this yielded the p-value 0·353 indicative of a lack of significant differences in the spatial dispersions. The value of K was selected as the minimal number of components needed to explain 80% of the trace of the underlying null dispersion estimator. No further outliers were detected by the resistant test. The discordance between the L2 and spatial tests is probably due to the reduced efficiency of the resistant procedure when the two samples share common eigenfunctions, as seems to be the case in the minicircle dataset; recall that the dispersion operator shares the same eigenfunctions with the covariance operator, possibly up to order. It was seen in our simulations that, in general, though the level of the spatial test was conserved, in the presence of influential observations its power was appreciably reduced when differences were only in the eigenvalues, i.e., under scenarios I and II in Table 4, as compared to scenarios where differences exist between the eigenfunctions, too, i.e., scenarios III–V in Table 4. 
Moreover the present framework does not immediately yield a special version of the test that would concentrate only on the eigenvalue structure; the complete structure of the operator is taken into account. ACKNOWLEDGEMENT We thank the editor, associate editor, and two anonymous referees for their extensive, constructive, and in-depth comments and suggestions. This research was supported in part by the European Research Council. SUPPLEMENTARY MATERIAL Supplementary material available at Biometrika online includes proofs of Proposition 1, Corollary 1, Proposition 2, Theorem 1 and a technical lemma needed in the proof of Theorem 1. APPENDIX Computation Assume that the observations Xi ∈ H are represented as linear combinations of some known fixed basis elements ψj , that is, Xi = p j=1 ξi j ψj . This representation is usually obtained by a least squares procedure, possibly with smoothing, from some form of discrete original observations of Xi . The exact form of the original data depends on the particular application. For instance, when H is a functional, L2 , space indexed by one-dimensional time, the original data usually consist of observations Xi (tk) (k = 1, . . . , m) for a grid atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 829 of points t1 < · · · < tm. Now suppose that the original data are observed discretely but exactly, i.e., without noise; later we explain how to handle noisy discrete observations. The methods proposed in this paper have the advantage that all required quantities and operations can be expressed in terms of basis coefficients; thus, from the computational point of view the task is multivariate. To estimate the centre, it is enough to find the vector of coefficients mj in its basis expansion μ = p j=1 mj ψj . Similarly, for the dispersion operator, we need to find the matrix of coefficients Rj j in the expansion R = p j=1 p j =1 Rj j ψj ⊗ ψj . For simplicity, we first assume that the basis ψ1, . . . , ψp is orthonormal. Then, the norm in the objective function for μ is simply the norm of the coefficient vector, i.e., Xi − μ 2 = ξi − m 2 = p j=1(ξi j − mj )2 , and the score operator G( ˆP; μ) is equivalent to the p-vector 1 n n i=1 ρ ( ξi − m ) ξi − m (m − ξi ). The Hilbert–Schmidt norm in the objective function for R is the Frobenius norm of the coefficient matrix, i.e., P(Xi ; μ) − R 2 = (ξi − m)(ξi − m)T − R 2 = p j=1 p j =1 {(ξi j − mj )(ξi j − m j ) − Rj j }2 , and the score operator G ( ˆP; R, μ) is equivalent to the p × p matrix 1 n n i=1 ρ { (ξi − m)(ξi − m)T − R } (ξi − m)(ξi − m)T − R {R − (ξi − m)(ξi − m)T }. For the two-sample test, the operator B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) and the basis elements Ul for dimension reduction are equivalent to matrices, and the score components Sl are computed as their inner products. Similarly, all quantities involved in the covariance matrix of the score vector are computed in a multivariate setting. When the basis ψ1, . . . , ψp is not orthonormal, one simply multiplies each coefficient vector ξi by the matrix A1/2 where A has entries aj j = ψj , ψj , and performs all computations, i.e., estimation of the centre and dispersion, eigen-decomposition and the two-sample test, with these transformed multivariate inputs. This corresponds to switching from the original basis to the orthonormal basis A−1/2 (ψ1, . . . , ψp)T . 
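A minimal sketch of this change of basis follows; the Gram matrix square root is computed here by an eigendecomposition, and all names are illustrative.

```r
# xi: n x p matrix of coefficients in a (possibly non-orthonormal) basis psi_1,...,psi_p
# A:  p x p Gram matrix with entries <psi_j, psi_j'>
to_orthonormal <- function(xi, A) {
  e <- eigen(A, symmetric = TRUE)
  Ahalf    <- e$vectors %*% diag(sqrt(e$values), nrow(A)) %*% t(e$vectors)      # A^{1/2}
  Ainvhalf <- e$vectors %*% diag(1 / sqrt(e$values), nrow(A)) %*% t(e$vectors)  # A^{-1/2}
  list(coef = xi %*% Ahalf,   # coefficients with respect to the orthonormal basis
       back = Ainvhalf)       # used to map estimated centres and eigenfunctions back
}
```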
If needed, the centre and the eigenfunctions can then be obtained in the original basis by multiplying their coefficient vectors by A−1/2 and in the dispersion by multiplying its coefficient matrix by A−1/2 from both sides. We refer to Ramsay & Silverman (2005, § 8.4.2) for a detailed explanation of a similar problem of computing functional principal components from coefficients with respect to a general non-orthonormal basis. To estimate the centre and dispersion one solves the corresponding multivariate optimization problem. If ρ(u) = u2 , the solutions are the sample mean and covariance matrix of the coefficient vectors; otherwise an iterative procedure is used. We use the Broyden–Fletcher–Goldfarb–Shanno quasi-Newton method implemented in the R package (R Development Core Team, 2012) in the function optim, initialized by the componentwise median of ξi for the centre and the componentwise median of (ξi − m)(ξi − m)T for the dispersion. This numerical procedure was reliable and reasonably fast in our experiments. This is in agreement with a detailed study of the numerical performance of various algorithms for the spatial median presented by Fritz et al. (2012). In functional settings one can directly use the functional values on a grid of points instead of computing with basis coefficients. The basis approach is slightly more general than the discretization approach because it can be used for any separable Hilbert space, not only a functional space, and in the functional atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 830 DAVID KRAUS AND VICTOR M. PANARETOS case it does not require a common grid for all functions. Standard software for functional data analysis, such as the fda package in R, uses basis representations of data. In many applications, the original functional values on a grid of points are observed with noise. In such situations, some degree of smoothing is necessary for the reconstruction of the underlying functional data. Ramsay & Silverman (2005, Chapter 5) describe how roughness penalties can be used to compute the basis coefficients of the functions. After this preliminary step, our methods can be applied to the reconstructed curve, i.e., their basis coefficients, as described above. In the case of the spatial median, Gervini (2008, pp. 589–590) proposes an alternative method to deal with noise in discretely observed functions. Rather than on denoising and reconstructing the curves, his procedure is based on removing the bias, which is due to the errors, in the norm in the objective function with the help of a consistent estimate of the variance of the errors. He uses this idea in connection with numerical integration on a grid, but it can be adapted to the basis approach as well. However, this method is less practical for second-order problems, as one would also need to estimate higher order moments of the errors and use convoluted formulae to remove the bias from the norm in the objective functional. Technical material We now derive several key expressions pertaining to the assumptions, statement and discussion of Theorem 1. We use the script font, e.g., D, J , I , for linear operators on H, i.e., linear mappings H → H, the fraktur font, e.g., D, J, I, H, W, for linear operators on Hilbert–Schmidt operators on H, i.e., linear mappings HS(H, H) → HS(H, H), and the blackboard bold font, e.g., D, J, H, Q, for linear operators from H to Hilbert–Schmidt operators on H, i.e., linear mappings H → HS(H, H). 
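Returning briefly to the computation described above, the following is a minimal sketch of the quasi-Newton estimation of the spatial centre and spatial dispersion from coefficient vectors in an orthonormal basis, initialized at componentwise medians as described; the constant terms of the objective functions are dropped since they do not affect the minimizers, and the function names and interfaces are illustrative, not the authors' code.

```r
# xi: n x p matrix of basis coefficients (orthonormal basis assumed)
spatial_centre <- function(xi) {
  obj <- function(m) sum(sqrt(rowSums(sweep(xi, 2, m)^2)))   # sum_i ||xi_i - m||
  optim(apply(xi, 2, median), obj, method = "BFGS")$par      # componentwise-median start
}

spatial_dispersion <- function(xi, m = spatial_centre(xi)) {
  p  <- ncol(xi); up <- upper.tri(diag(p), diag = TRUE)
  Z  <- lapply(seq_len(nrow(xi)), function(i) tcrossprod(xi[i, ] - m))
  tomat <- function(theta) { R <- matrix(0, p, p); R[up] <- theta
                             R[lower.tri(R)] <- t(R)[lower.tri(R)]; R }
  obj <- function(theta) { R <- tomat(theta)                 # sum_i ||(xi_i-m)(xi_i-m)^T - R||
                           sum(sapply(Z, function(Zi) sqrt(sum((Zi - R)^2)))) }
  start <- apply(simplify2array(Z), c(1, 2), median)[up]     # componentwise-median start
  tomat(optim(start, obj, method = "BFGS")$par)
}
```

The derivative operators needed for the covariance estimator of Theorem 1 are collected next.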
First, we introduce certain derivatives in the Fr´echet sense as follows. Denote by I and I the identity operators on H and HS(H, H), respectively. The derivative D(P; μ) = ∂ ∂μ G(P; μ) = EP ρ ( X − μ ) X − μ I + ρ ( X − μ ) X − μ 2 − ρ ( X − μ ) X − μ 3 P(X; μ) (A1) is a linear mapping from H to H. The derivative D(P; R, μ) = ∂ ∂R G (P; R, μ) = EP ρ { P(X; μ) − R } P(X; μ) − R I + ρ { P(X; μ) − R } P(X; μ) − R 2 − ρ { P(X; μ) − R } P(X; μ) − R 3 P(X; R, μ) , (A2) where we denote P(x; R, μ) = {P(x; μ) − R} ⊗ {P(x; μ) − R}, is a linear mapping from HS(H, H) to HS(H, H). We define D(P; R, μ) = ∂ ∂μ G (P; R, μ), (A3) which is a linear mapping from H to HS(H, H). To compute it, we first compute Q(x; μ) = ∂ ∂μ P(x; μ). We consider its value at some f ∈ H, i.e., we investigate the operator Q(x; μ) f ∈ HS(H, H). This is done through its coordinate representation as follows. For any g1, g2 ∈ H, we have g1, {Q(x; μ) f }g2 = g1, ∂ ∂μ P(x; μ) f g2 = ∂ ∂μ g1, P(x; μ)g2 f = ∂ ∂μ ( x − μ, g1 x − μ, g2 ) f = −( x − μ, g2 g1 + x − μ, g1 g2) f = − x − μ, g2 g1, f − x − μ, g1 g2, f . (A4) atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 831 Then, the derivative of G (P; R, μ) with respect to μ evaluated at f ∈ H is D(P; R, μ) f = − EP ρ { P(X; μ) − R } P(X; μ) − R Q(X; μ) f − EP ρ { P(X; μ) − R } P(X; μ) − R 2 − ρ { P(X; μ) − R } P(X; μ) − R 3 × P(X; μ) − R, Q(X; μ) f {P(X; μ) − R} . We set D0(P1, P2, a; R, μ1, μ2) = aD(P1; R, μ1) + (1 − a)D(P2; R, μ2), D1(P1, P2, a; R, μ1, μ2) = aD(P1; R, μ1) − (1 − a)D(P2; R, μ2). Next, using the notation f ⊗2 = f ⊗ f for f ∈ H and A ⊗2 = A ⊗ A for A ∈ HS(H, H), we define J (P; μ) = EP ρ ( X − μ ) X − μ (μ − X) − G(P; μ) ⊗2 J(P; R, μ) = EP ρ { P(X; μ) − R } P(X; μ) − R {R − P(X; μ)} − G (P; R, μ) ⊗2 and J(P; R, μ) = EP ρ { P(X; μ) − R } P(X; μ) − R {R − P(X; μ)} − G (P; R, μ) ⊗ ρ ( X − μ ) X − μ (μ − X) − G(P; μ) . Next, we denote H1(P1, P2, a; R, μ1, μ2) = I − D1(P1, P2, a; R, μ1, μ2)D0(P1, P2, a; R, μ1, μ2)−1 , H1(P1, P2, a; R, μ1, μ2) = H1(P1, P2, a; R, μ1, μ2)D(P1; R, μ1)D(P1; μ1)−1 , H2(P1, P2, a; R, μ1, μ2) = I + D1(P1, P2, a; R, μ1, μ2)D0(P1, P2, a; R, μ1, μ2)−1 , H2(P1, P2, a; R, μ1, μ2) = H2(P1, P2, a; R, μ1, μ2)D(P2; R, μ2)D(P2; μ2)−1 , where I stands for the identity operator on HS(H, H). Finally, we set W(P1, P2, a; R, μ1, μ2) = aW1(P1, P2, a; R, μ1, μ2) + (1 − a)W2(P1, P2, a; R, μ1, μ2), (A5) where W1(P1, P2, a; R, μ1, μ2) = H1(P1, P2, a; R, μ1, μ2)J(P1; R, μ1)H1(P1, P2, a; R, μ1, μ2)∗ − H1(P1, P2, a; R, μ1, μ2)J(P1; R, μ1)H1(P1, P2, a; R, μ1, μ2)∗ − H1(P1, P2, a; R, μ1, μ2)J(P1; R, μ1)∗ H1(P1, P2, a; R, μ1, μ2)∗ + H1(P1, P2, a; R, μ1, μ2)J (P1; R, μ1)H1(P1, P2, a; R, μ1, μ2)∗ with ∗ denoting adjoint operators, and W2(P1, P2, a; R, μ1, μ2) is defined analogously with H2, H2 in place of H1, H1, respectively, and P2 instead of P1 in J, J, J . atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 832 DAVID KRAUS AND VICTOR M. PANARETOS REFERENCES ADLER, R. J. (1990). An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 12. Hayward: Institute of Mathematical Statistics. AMZALLAG, A., VAILLANT, C., JACOB, M., UNSER, M., BEDNAR, J., KAHN, J. D., DUBOCHET, J., STASIAK, A. & MADDOCKS, J. H. (2006). 3D reconstruction and comparison of shapes of DNA minicircles observed by cryoelectron microscopy. Nucleic Acids Res. 34, e125. ANDERSON, M. J. (2006). 
Distance-based tests for homogeneity of multivariate dispersions. Biometrics 62, 245–53. BALI, L., BOENTE, G., TYLER, D. E. & WANG, J.-L. (2012). Robust functional principal components: A projectionpursuit approach. Ann. Statist. 39, 2852–82. BENKO, M., H¨ARDLE, W. & KNEIP, A. (2009). Common functional principal components. Ann. Statist. 37, 1–34. BOENTE, G. & FRAIMAN, R. (1999). Comment on a paper by Locantore et al. Test 8, 28–35. BOENTE, G., RODRIGUEZ, D. & SUED, M. (2011). Testing the equality of covariance operators. In Recent Advances in Functional Data Analysis and Related Topics, Ed. F. Ferraty, pp. 49–53. Heidelberg: Physica-Verlag. BOSQ, D. (2000). Linear Processes in Function Spaces: Theory and Applications. New York: Springer. BOX, G. E. P. (1953). Non-normality and tests on variances. Biometrika 40, 318–35. CHAUDHURI, P. (1996). On a geometric notion of quantiles for multivariate data. J. Am. Statist. Assoc. 91, 862–72. DAUXOIS, J., POUSSE, A. & ROMAIN, Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. J. Mult. Anal. 12, 136–54. FRITZ, H., FILZMOSER, P. & CROUX, C. (2012). A comparison of algorithms for the multivariate L1-median. Comp. Statist., to appear. doi: 10.1007/s00180-011-0262-4. GABRYS, R. & KOKOSZKA, P. (2007). Portmanteau test of independence for functional observations. J. Am. Statist. Assoc. 102, 1338–48. GERVINI, D. (2006). Free-knot spline smoothing for functional data. J. R. Statist. Soc. B 68, 671–87. GERVINI, D. (2008). Robust functional estimation using the median and spherical principal components. Biometrika 95, 587–600. HALL, P. & HOSSEINI-NASAB, M. (2006). On properties of functional principal components analysis. J. R. Statist. Soc. B 68, 109–26. HALL, P., M¨ULLER, H.-G. & WANG, J.-L. (2006). Properties of principal component methods for functional and longitudinal data analysis. Ann. Statist. 34, 1493–517. HAMPEL, F. R., RONCHETTI, E. M., ROUSSEEUW, P. J. & STAHEL, W. A. (1986). Robust Statistics. New York: Wiley. HORV ´ATH, L., HUˇSKOV ´A, M. & KOKOSZKA, P. (2010). Testing the stability of the functional autoregressive process. J. Mult. Anal. 101, 352–67. HUBER, P. J. & RONCHETTI, E. M. (2009). Robust Statistics. Hoboken: Wiley. LAYARD, M. W. J. (1974). A Monte Carlo comparison of tests for equality of convariance matrices. Biometrika 61, 461–5. LI, G. & CHEN, Z. (1985). Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo. J. Am. Statist. Assoc. 80, 759–66. LOCANTORE, N., MARRON, J. S., SIMPSON, D. G., TRIPOLI, N., ZHANG, J. T. & COHEN, K. L. (1999). Robust principal component analysis for functional data. Test 8, 1–73. MARDEN, J. I. (1999). Some robust estimates of principal components. Statist. Prob. Lett. 43, 349–59. O’BRIEN, P. C. (1992). Robust procedures for testing equality of covariance matrices. Biometrics 48, 819–27. OLSON, C. L. (1974). Comparative robustness of six tests in multivariate analysis of variance. J. Am. Statist. Assoc. 69, 894–08. PANARETOS, V. M., KRAUS, D. & MADDOCKS, J. H. (2010). Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. J. Am. Statist. Assoc. 105, 670–82. R DEVELOPMENT CORE TEAM (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org. RAMSAY, J. & SILVERMAN, B. W. (2005). Functional Data Analysis. New York: Springer. 
SERFLING, R. (2004). Nonparametric multivariate descriptive measures based on spatial quantiles. J. Statist. Plan. Infer. 123, 259–78. SIRKI ¨A, S., TASKINEN, S., OJA, H. & TYLER, D. E. (2009). Tests and estimates of shape based on spatial signs and ranks. J. Nonparam. Statist. 21, 155–76. SUN, Y. & GENTON, M. G. (2011). Functional boxplots. J. Comp. Graph. Statist. 20, 316–334. TIKU, M. L. & BALAKRISHNAN, N. (1985). Testing the equality of variance-covariance matrices the robust way. Commun. Statist. A 14, 3033–51. YAO, F. & LEE, T. C. M. (2006). Penalized spline models for functional principal component analysis. J. R. Statist. Soc. B 68, 3–25. ZHANG, J., PANTULA, S. G. & BOOS, D. D. (1991). Robust methods for testing the pattern of a single covariance matrix. Biometrika 78, 787–95. [Received April 2011. Revised May 2012] atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Biometrika (2012), xx, x, pp. 1–8 C 2007 Biometrika Trust Printed in Great Britain Supplementary file: Dispersion operators and resistant second-order functional data analysis BY DAVID KRAUS AND VICTOR M. PANARETOS Section de Math´ematiques, Ecole Polytechnique F´ed´erale de Lausanne, EPFL Station 8, 1015 Lausanne, Switzerland david.kraus@epfl.ch victor.panaretos@epfl.ch SUMMARY This supplementary file contains proofs of Proposition 1, Corollary 1, Proposition 2, Theorem 1 and a technical lemma needed in the proof of Theorem 1. Equations in this supplement are numbered (S1), (S2), . ..; equation numbers such as (1), (2), . .. or (A1), (A2), . .. refer to the main body of the paper. PROOF OF PROPOSITION 1 It suffices to prove that the finitely-valued objective functional M(P; R, µ) given in equation (2) in the paper admits a unique minimizer on the space of Hilbert–Schmidt operators acting on H. By the triangle inequality, monotonicity and convexity of ρ we have that EP(ρ[ P(X; µ) − {λR + (1 − λ)R } ] − ρ{ P(X; µ) }) ≤ EP[ρ{λ P(X; µ) − R + (1 − λ) P(X; µ) − R } − ρ{ P(X; µ) }] ≤ λ EP[ρ{ P(X; µ) − R } − ρ{ P(X; µ) }] + (1 − λ) EP[ρ{ P(X; µ) − R } − ρ{ P(X; µ) }] for any λ ∈ [0, 1] and arbitrary Hilbert–Schmidt operators R, R . Notice that since ρ is strictly increasing, the first inequality is strict unless P(X; µ) − R and P(X; µ) − R are collinear almost surely. Equivalently, the inequality is strict whenever the distribution of P(X; µ) is not concentrated on the line {tR + (1 − t)R : t ∈ R}. We now investigate what this condition means geometrically in the space H. First, notice that as the rank of P(X; µ) is 1, the rank of tR + (1 − t)R has to be 1 also. Now we distinguish two cases. First, if R, R are collinear, then the line is of the form {αR : α ∈ R}, which by the condition on the rank is {αu ⊗ u : α ∈ R} for some u ∈ H. Since P(X; µ) is positive semidefinite, we in fact have {αu ⊗ u : α ≥ 0}. Thus, the operator P(X; µ) lying on this line is equivalent to X lying on the line {µ + βu : β ∈ R}. Second, if R, R are not collinear, then operators of the form tR + (1 − t)R have rank 1 for at most two values of t. To see this, notice that the rank condition implies that for all i < j, det t Rii Rij Rji Rjj + (1 − t) Rii Rij Rji Rjj = 0, where Rij = ei, Rej , Rij = ei, R ej . This system of quadratic equations has at most two solutions. 
Thus, the set {tR + (1 − t)R : t ∈ R} reduces at most to the set {α1u1 ⊗ u1, α2u2 ⊗ 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 2 KRAUS, D., PANARETOS, V. M. u2} for some nonnegative α1, α2 and some u1, u2 ∈ H. Hence, the operator P(X; µ) belonging to this set is equivalent to X belonging to the set of at most four points {µ ± β1u1, µ ± β2u2}. Therefore, if the distribution P is not concentrated on a line or on four points, the objective function to be minimized is strictly convex. It follows that the minimum of the functional exists and is unique. PROOF OF COROLLARY 1 The empirical version of the functional defining the dispersion operator is the expectation with respect to the empirical distribution ˆP. Under our assumptions on P, the empirical distribution ˆP is almost surely not concentrated on a line or on four points. Therefore, strict convexity, and thus existence and uniqueness, follows with probability 1 by applying Proposition 1 to the empirical distribution ˆP. Consistency then follows from strict convexity and the consistency of ˆµ, using standard arguments. PROOF OF PROPOSITION 2 Consider R of the form ∞ k=1 δkϕk ⊗ ϕk for some sequence δ1, δ2, . . . We will prove that such an operator solves the estimating equation (5) showing that R and C have the same set of eigenfunctions, and that the sequence δ1, δ2, . . . satisfies the condition (6). We investigate the coordinates of the left-hand side of (5), with the aim of showing that the values ϕj, EP ρ { R − P(X; µ) } R − P(X; µ) {R − P(X; µ)} ϕk (S1) are zero for all j, k. By the orthonormality of ϕ1, ϕ2, . . . , we have that R − P(X; µ) 2 = ∞ k=1 δkϕk ⊗ ϕk − ∞ j=1 ∞ k=1 λ 1/2 j λ 1/2 k βjβkϕj ⊗ ϕk 2 = k (δk − λkβ2 k)2 + k=j λjλkβ2 j β2 k. First, we compute the off-diagonal coordinates with j = k. The first summand in (S1) is zero because ϕj, Rϕk = 0. To show that the second summand in (S1) is zero, we use the fact that, by assumption, the sequence {siβi}∞ i=1 with si = (−1)1{i=j} has the same joint distribution as {βi}∞ i=1. Compute Ajk = ϕj, EP ρ { R − P(X; µ) } R − P(X; µ) P(X; µ) ϕk = E ρ [{ i(δi − λiβ2 i )2 + i=l λiλlβ2 i β2 l }1/2] { i(δi − λiβ2 i )2 + i=l λiλlβ2 i β2 l }1/2 λ 1/2 j λ 1/2 k βjβk = E ρ ([ i{δi − λi(siβi)2}2 + i=l λiλl(siβi)2(slβl)2]1/2) [ i{δi − λi(siβi)2}2 + i=l λiλl(siβi)2(slβl)2]1/2 λ 1/2 j λ 1/2 k sjβjskβk = −Ajk. Thus, Ajk = 0. Therefore, the operator R is diagonalized by the same functions ϕ1, ϕ2, . . . as C . By computing the diagonal coordinates with j = k in (5) we obtain (6). 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 Supplementary file: Dispersion operators and resistant functional data analysis 3 A TECHNICAL LEMMA LEMMA 1. Under the assumptions of Theorem 1, (a) the linear operator D(P; µ) defined in equation (A1) is a bijection of H onto itself, it is bounded and has bounded inverse, (b) the linear operator D(P; R, µ) defined in equation (A2) is a bijection of HS(H, H) onto itself, it is bounded and has bounded inverse. Proof. We prove part (a); the proof of part (b) is similar. The proof uses and extends the steps of the proof of Lemma 1 (iii) of Gervini (2008) modified for the present context of general ρ and generalized to the case of infinitely many components in the Karhunen–Lo`eve expansion. 
Recall that D(P; µ) = EP ρ ( X − µ ) X − µ I + ρ ( X − µ ) X − µ 2 − ρ ( X − µ ) X − µ 3 P(X; µ) ; see the appendix of the main body of the paper. To show that D(P; µ) is a bijection, we need to find for any h ∈ H a unique element f ∈ H such that D(P; µ)f = h. The set of orthonormal eigenfunctions {ϕk}∞ k=1 of C can be extended to an orthonormal basis of H by possibly adding some functions {ψk}q k=1 with q finite or infinite or zero. It is then enough to verify the relation D(P; µ)f = h in terms of the Fourier coefficients of both sides with respect to the basis {ϕk}∞ k=1 ∪ {ψk}q k=1, i.e., to show that D(P; µ)f, ϕk = h, ϕk for all k = 1, 2, . . . and D(P; µ)f, ψk = h, ψk for all k = 1, . . . , q. As D(P; µ)f, ϕk = f, D(P; µ)ϕk and D(P; µ)f, ψk = f, D(P; µ)ψk , we first investigate D(P; µ)ϕk and D(P; µ)ψk. We begin by exploring the structure of the operator D(P; µ). We can rewrite EP ρ ( X − µ ) X − µ 3 P(X; µ) = EP(˜ε ⊗ ˜ε), where ˜ε = ρ ( X − µ )1/2 X − µ 3/2 (X − µ) = ∞ k=1 λ 1/2 k ρ ( X − µ )1/2 X − µ 3/2 βkϕk = ∞ k=1 ˜λ 1/2 k ˜βkϕk (S2) with ˜λk = λk EP ρ ( X − µ ) X − µ 3 β2 k , ˜βk = ρ ( X − µ )1/2 X − µ 3/2 βk EP ρ ( X − µ ) X − µ 3 β2 k 1/2 . Thus, we need to find the covariance operator of ˜ε. The series expansion (S2) of ˜ε is a Karhunen– Lo`eve expansion because the coefficients ˜βk have zero mean and unit variance and are uncorrelated (which follows from the fact that the distribution of {βk} is invariant under the change of the sign of any component). Therefore, since EP( ˜ε 2) < ∞, which follows immediately from the assumption that EP ρ ( X − µ ) X − µ < ∞, 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 4 KRAUS, D., PANARETOS, V. M. the operator of interest, as the covariance operator of ˜ε, takes the form EP ρ ( X − µ ) X − µ 3 P(X; µ) = ∞ k=1 ˜λkϕk ⊗ ϕk = ∞ k=1 EP ρ ( X − µ ) X − µ 3 λkβ2 k ϕk ⊗ ϕk. Using analogous arguments for ˙ε = ρ ( X − µ )1/2 X − µ (X − µ), we can show that EP ρ ( X − µ ) X − µ 2 P(X; µ) = ∞ k=1 EP ρ ( X − µ ) X − µ 2 λkβ2 k ϕk ⊗ ϕk. Hence, we finally obtain D(P; µ) in the form D(P; µ) = EP ρ ( X − µ ) X − µ I + ∞ k=1 EP ρ ( X − µ ) X − µ 2 − ρ ( X − µ ) X − µ 3 λkβ2 k ϕk ⊗ ϕk. Therefore, for k = 1, 2, . . . we have D(P; µ)ϕk = EP ρ ( X − µ ) X − µ ϕk + EP ρ ( X − µ ) X − µ 2 − ρ ( X − µ ) X − µ 3 λkβ2 k ϕk and, for k = 1, . . . , q, we have D(P; µ)ψk = EP ρ ( X − µ ) X − µ ψk. Thus, we obtain D(P; µ)f, ϕk = νk f, ϕk (k = 1, 2, . . . ), D(P; µ)f, ψk = η f, ψk (k = 1, . . . , q), where νk = EP ρ ( X − µ ) X − µ + λk EP ρ ( X − µ ) X − µ 2 − ρ ( X − µ ) X − µ 3 β2 k (k = 1, 2, . . . ) and η = EP ρ ( X − µ ) X − µ . So f, the candidate for D(P; µ)−1h, should have Fourier coefficients f, ϕk , f, ψk satisfying the system of equations νk f, ϕk = h, ϕk (k = 1, 2, . . . ), η f, ψk = h, ψk (k = 1, . . . , q). To be able to write f, ϕk = h, ϕk /νk, we need to show that νk (k = 1, 2, . . . ) and η are nonzero and finite. Then, f will be uniquely determined by the formula f = ∞ k=1 h, ϕk νk ϕk + q k=1 h, ψk η ψk 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 Supplementary file: Dispersion operators and resistant functional data analysis 5 provided that f is a well-defined element of H, that is, f 2 = ∞ k=1 h, ϕk 2 ν2 k + q k=1 h, ψk 2 η2 < ∞. 
(S3) We assumed that η < ∞ and we immediately see that η > 0 because ρ is strictly increasing. We now deal with νk (k = 1, 2, . . . ). We will show that there exist 0 < a ≤ b < ∞ such that νk ∈ [a, b] for all k = 1, 2, . . . First we establish the lower bound a. Using the Karhunen–Lo`eve expansion (S2) we can rewrite EP ρ ( X − µ ) X − µ = EP( ˜ε 2 ) = ∞ k=1 ˜λk = ∞ k=1 λk EP ρ ( X − µ ) X − µ 3 β2 k . (S4) Each term in the series on the right hand side of (S4) is obviously positive and by finiteness of the left hand side it is finite, and thus the differences EP ρ ( X − µ ) X − µ − λk EP ρ ( X − µ ) X − µ 3 β2 k , (S5) which appear in the expression for νk, are positive and bounded away from zero by a constant a. The remaining term λk EP ρ ( X − µ ) X − µ 2 β2 k (S6) appearing in νk is nonnegative as ρ ≥ 0 because ρ is convex. It follows that νk ≥ a for all k = 1, 2, . . . Now we find the upper bound b. By applying the same idea as in (S4) to ˙ε, we obtain EP{ρ ( X − µ )} = ∞ k=1 λk EP ρ ( X − µ ) X − µ 2 β2 k . (S7) In view of (S7), the terms (S6) are smaller than or equal to EP{ρ ( X − µ )}. The differences (S5) are smaller than EP ρ ( X − µ ) X − µ . Therefore, we have that νk ≤ b for all k = 1, 2, . . . with b = EP ρ ( X − µ ) X − µ + EP{ρ ( X − µ )}. Finally, it remains to show (S3), which is now straightforward because f 2 = ∞ k=1 h, ϕk 2 ν2 k + q k=1 h, ψk 2 η2 ≤ ∞ k=1 h, ϕk 2 + q k=1 h, ψk 2 min(a, η) = h 2 min(a, η) < ∞. This shows that f is a well defined element of H and thus the linear operator D(P; µ) is a bijection of H onto itself. It also shows that the inverse D(P; µ)−1 is a bounded operator. Hence also the operator D(P; µ) is bounded by the bounded inverse theorem or by direct verification. 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 6 KRAUS, D., PANARETOS, V. M. Remark: As νk are bounded away from zero and bounded from above, the operator D(P; µ) is only a small perturbation of a multiple of the identity. This gives an intuitive explanation why it inherits its bijectivity and boundedness. PROOF OF THEOREM 1 It is enough to prove the weak convergence of n1/2B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2). The weak convergence of the vector with components Sl will then follow directly from Slutsky’s theorem. The continuous mapping theorem and Slutsky’s theorem will then establish the weak convergence of the statistic T. Applying a Taylor expansion (Nelson, 1969, Theorem 6, p. 12) of B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) around the true values of the parameters yields n1/2 B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) = n1/2 B(ˆP1, ˆP2, an; R, µ1, µ2) + D1(ˆP1, ˆP2, an; R , µ1, µ2)n1/2 ( ˆR − R) + a1/2 n D(ˆP1; R , µ1)n 1/2 1 (ˆµ1 − µ1) − (1 − an)1/2 D(ˆP2; R , µ2)n 1/2 2 (ˆµ2 − µ2), (S8) where D1(P1, P2, a; R, µ1, µ2) = ∂ ∂R B(P1, P2, a; R, µ1, µ2) = aD(P1; R, µ1) − (1 − a)D(P2; R, µ2) and D(P; R, µ) = ∂ ∂R G (P; R, µ), D(P; R, µ) = ∂ ∂µ G (P; R, µ). See the Appendix in the main body of the paper for explicit formulae. We now turn to develop certain asymptotic representations for ˆµ1, ˆµ2 and ˆR. Using the Taylor expansion, law of large numbers and consistency of ˆµ1 we get 0 = n 1/2 1 G(ˆP1; ˆµ1) = n 1/2 1 G(ˆP1; µ1) + D(ˆP1; µ† 1)n 1/2 1 (ˆµ1 − µ1) = n 1/2 1 G(ˆP1; µ1) + D(P1; µ1)n 1/2 1 (ˆµ1 − µ1) + oP (1), where the term oP (1) is due to the fact that we replace D(ˆP1; µ1) by its limit D(P1; µ1). 
From this and an analogous expansion for µ2 we obtain n 1/2 1 (ˆµ1 − µ1) = −D(P1; µ1)−1 n 1/2 1 G(ˆP1; µ1) + oP (1), n 1/2 2 (ˆµ2 − µ2) = −D(P2; µ2)−1 n 1/2 2 G(ˆP2; µ2) + oP (1). (S9) The existence of the bounded inverse operators in the above equations, as well as of other inverse operators appearing later in the proof, is shown in Lemma 1. The Taylor expansion of the estimating score for R around the true values is O = n1/2 G (ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) = n1/2 G (ˆP1, ˆP2, an; R, µ1, µ2) + D0(ˆP1, ˆP2, an; R‡ , µ‡ 1, µ‡ 2)n1/2 ( ˆR − R) + a1/2 n D(ˆP1; R‡ , µ‡ 1)n 1/2 1 (ˆµ1 − µ1) + (1 − an)1/2 D(ˆP2; R‡ , µ‡ 2)n 1/2 2 (ˆµ2 − µ2), 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 Supplementary file: Dispersion operators and resistant functional data analysis 7 where D0(P1, P2, a; R, µ1, µ2) = aD(P1; R, µ1) + (1 − a)D(P2; R, µ2). This yields n1/2 ( ˆR − R) = −D0(P1, P2, a; R, µ1, µ2)−1 {n1/2 G (ˆP1, ˆP2, an; R, µ1, µ2) + a1/2 n D(P1; R‡ , µ‡ 1)n 1/2 1 (ˆµ1 − µ1) + (1 − an)1/2 D(P2; R‡ , µ‡ 2)n 1/2 2 (ˆµ2 − µ2)} + oP (1); (S10) here again the term oP (1) is present because we replace the empirical distributions by their theoretical counterparts in D0 and D. The different Taylor expansions we have used contain various elements denoted by , †, ‡ which lie on the line segments between the true and estimated corresponding parameters. We will replace all of these elements by the true values of the parameters. Due to the consistency of the estimators, the difference between a quantity at the true value of the parameters and at a value on the line segment between the true value and the estimator converges in probability to zero. Moreover, the quantities involving elements marked with , † or ‡ are always multiplied by a term that is bounded in probability (by its convergence in distribution which will be seen later). Hence, the change we make by replacing the elements marked with , † or ‡ by their true values is asymptotically negligible. The reason for doing this is that we obtain simpler formulas. Denote H1(P1, P2, a; R, µ1, µ2) = I − D1(P1, P2, a; R, µ1, µ2)D0(P1, P2, a; R, µ1, µ2)−1 , H1(P1, P2, a; R, µ1, µ2) = H1(P1, P2, a; R, µ1, µ2)D(P1; R, µ1)D(P1; µ1)−1 , H2(P1, P2, a; R, µ1, µ2) = I + D1(P1, P2, a; R, µ1, µ2)D0(P1, P2, a; R, µ1, µ2)−1 , H2(P1, P2, a; R, µ1, µ2) = H2(P1, P2, a; R, µ1, µ2)D(P2; R, µ2)D(P2; µ2)−1 , where I stands for the identity operator on HS(H, H). Inserting (S9) and (S10) into (S8), we obtain n1/2 B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) = a1/2 n H1(P1, P2, a; R, µ1, µ2)n 1/2 1 G (ˆP1; R, µ1) − a1/2 n H1(P1, P2, a; R, µ1, µ2)n 1/2 1 G(ˆP1; µ1) − (1 − an)1/2 H2(P1, P2, a; R, µ1, µ2)n 1/2 2 G (ˆP2; R, µ2) + (1 − an)1/2 H2(P1, P2, a; R, µ1, µ2)n 1/2 2 G(ˆP2; µ2) + oP (1). The term oP (1) is due to the fact that we have replaced the quantities marked with , †, ‡ by their true counterparts. By the central limit theorem for Hilbert spaces (Bosq, 2000, Theorem 2.7), the operators n 1/2 1 G (ˆP1; R, µ1), n 1/2 1 G(ˆP1; µ1) jointly converge in distribution to a zero-mean Gaussian random variable in HS(H, H) × H. 
The asymptotic covariance operator of n 1/2 1 G (ˆP1; R, µ1), i.e., an operator on operators on H, can be estimated by the empirical covariance J(ˆP1; ˆR, ˆµ1), where J(P; R, µ) = EP ρ { P(X; µ) − R } P(X; µ) − R {R − P(X; µ)} − G (P; R, µ) ⊗2 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 8 KRAUS, D., PANARETOS, V. M. with the notation A ⊗2 = A ⊗ A for A ∈ HS(H, H), the asymptotic covariance operator of n 1/2 1 G(ˆP1; µ1), i.e., an operator on H, can be estimated by J (ˆP1; ˆµ1), where J (P; µ) = EP ρ ( X − µ ) X − µ (µ − X) − G(P; µ) ⊗2 with f⊗2 = f ⊗ f for f ∈ H, and the asymptotic cross-covariance operator of n 1/2 1 G (ˆP1; R, µ1) and n 1/2 1 G(ˆP1; µ1), i.e., an operator from H to operators on H, can be estimated by J(ˆP1; ˆR, ˆµ1), where J(P; R, µ) = EP ρ { P(X; µ) − R } P(X; µ) − R {R − P(X; µ)} − G (P; R, µ) ⊗ ρ ( X − µ ) X − µ (µ − X) − G(P; µ) . Similarly, n 1/2 2 G (ˆP2; R, µ2), n 1/2 2 G(ˆP2; µ2) jointly converge in distribution to a zero-mean Gaussian random element with covariance estimators analogous to those mentioned above for the sample from P1. As the samples are independent, all four random variables jointly converge in distribution. Finally, it follows by Slutsky’s theorem that the test operator n1/2B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) is asymptotically distributed as a zero-mean Gaussian operator whose covariance operator can be consistently estimated by W(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) = anW1(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) + (1 − an)W2(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2), where W1(P1, P2, a; R, µ1, µ2) = H1(P1, P2, a; R, µ1, µ2)J(P1; R, µ1)H1(P1, P2, a; R, µ1, µ2)∗ − H1(P1, P2, a; R, µ1, µ2)J(P1; R, µ1)H1(P1, P2, a; R, µ1, µ2)∗ − H1(P1, P2, a; R, µ1, µ2)J(P1; R, µ1)∗ H1(P1, P2, a; R, µ1, µ2)∗ + H1(P1, P2, a; R, µ1, µ2)J (P1; R, µ1)H1(P1, P2, a; R, µ1, µ2)∗ with ∗ denoting adjoint operators, and W2(P1, P2, a; R, µ1, µ2) is defined analogously with H2, H2 in place of H1, H1, respectively, and P2 instead of P1 in J, J, J . REFERENCES BOSQ, D. (2000). Linear Processes in Function Spaces: Theory and Applications. New York: Springer. GERVINI, D. (2008). Robust functional estimation using the median and spherical principal components. Biometrika 95, 587–600. NELSON, E. (1969). Topics in Dynamics. I: Flows. Princeton: Princeton University Press. C. Components and completion of partially observed functional data By David Kraus Journal of the Royal Statistical Society. Series B. Statistical Methodology, 77(4):777–801, 2015 DOI: 10.1111/rssb.12087 81 © 2014 Royal Statistical Society 1369–7412/15/77777 J. R. Statist. Soc. B (2015) 77, Part 4, pp. 777–801 Components and completion of partially observed functional data David Kraus University Hospital Lausanne, Switzerland [Received June 2013. Final revision July 2014] Summary. Functional data are traditionally assumed to be observed on the same domain. Motivated by a data set of heart rate temporal profiles, we develop methodology for the analysis of incomplete functional samples where each curve may be observed on a subset of the domain and unobserved elsewhere.We formalize this observation regime and develop the fundamental procedures of functional data analysis for this framework: estimation of parameters (mean and covariance operator) and principal component analysis. 
Principal scores of a partially observed function cannot be computed directly and we solve this challenging issue by estimating their best predictions as linear functionals of the observed part of the trajectory. Next, we propose a functional completion procedure that recovers the missing part by using the observed part of the curve. We construct prediction intervals for principal scores and bands for missing parts of trajectories. The prediction problems are seen to be ill-posed inverse problems; regularization techniques are used to obtain a stable solution. A simulation study shows the good performance of our methods. We illustrate the methods on the heart rate data and provide practical computational algorithms and theoretical arguments and proofs of all results.

Keywords: Functional data analysis; Incomplete observation; Inverse problem; Prediction; Principal component analysis; Regularization

Address for correspondence: David Kraus, Institute of Social and Preventive Medicine, University Hospital Lausanne, Route de la Corniche 10, Lausanne 1010, Switzerland. E-mail: kraus.stat@gmail.com

1. Introduction
Contemporary data sets often consist of data units that are complex objects, such as functions, curves or images; see, for example, Ramsay and Silverman (2005), Ferraty and Vieu (2006), Ferraty and Romain (2011) and Horváth and Kokoszka (2012). It is standard in the field of functional data analysis to assume that all functions are observed on the same domain. In this paper, we develop methods of analysis for functional data that are observed incompletely in the sense that each function might be observed only on a subset of the domain, whereas no information about the curve is available on the complement of this subset.

Our work is motivated by an ambulatory blood pressure monitoring data set that is part of the 'Swiss kidney project on genes in hypertension' (Pruijm et al., 2013), which is a multicentre cross-sectional study focusing on the role of kidney function and genes in blood pressure regulation and hypertension. In ambulatory blood pressure monitoring, participants wear a calibrated automatic device that is programmed to record systolic and diastolic blood pressure and heart rate at frequent intervals during 24 h (every 15 min during the day and every 30 min during the night). Ideally, this design should provide enough information for each continuous temporal profile to be reconstructed by standard smoothing techniques; the resulting sample of curves would then be analysed by traditional methods of functional data analysis. In reality, however, some values have not been measured and the time points corresponding to unobserved values form series (intervals) of non-negligible length. There are two main reasons why no measurements are available for certain periods: first is the participant's discomfort (the participants can remove the device when they feel uncomfortable) and second is the failure of the device to take measurements. However, there are series of frequent, properly recorded measurements. It is therefore possible to reconstruct the underlying profiles in continuous time on these periods.

[Fig. 1. (a) Subset of the sample of heart rate profiles and (b) several curves in detail; horizontal axes: time in hours (20–26), vertical axes: heart rate in beats per minute.]

Fig. 1(a) displays a subset of 685 heart rate profiles (values in beats per minute); we focus on the time interval [20, 26] (i.e. from 8 p.m. of one day to 2 a.m.
next day) that is of particular medical interest because it is the transition period between the day and night regime. In Fig. 1(b), we plot separately four profiles to illustrate the type of available data: whereas some curves (dotted and chain curves) are observed completely (on the entire domain [20, 26]), other curves (the two broken curves) have unobserved periods. The percentage of incomplete functions is 31% for blood pressure profiles and 44% for heart rate profiles. This is a considerable fraction of the data, and we therefore wish to avoid removing the incomplete curves from the analysis. The partial observation regime that we encounter in this data set is of general interest in applications as often, despite the failure to observe the curves in some regions, there is enough observed information in the rest of the domain. The mechanism that causes the absence of data can be random, like in our data, but the curves may also be partially observed by design. Moreover, data need not necessarily be curves indexed by time; methods that we develop can be extended to more general object data subject to incomplete observation, such as partially observed images, spatial curves or surfaces. Hence this kind of functional data is worth systematic investigation. Interestingly and surprisingly, this observation pattern, however natural and likely to occur in many applications it is, has received relatively little attention in the literature. James et al. (2000) and James and Hastie (2001) used parametric mixed effects models for principal components analysis and classification of partially observed curves. Bugni (2012) developed a goodness-of-fit test under circumstances that were similar to those of our paper. Delaigle and Hall (2013) dealt with classification of functional data when only fragments of curves are available. Liebl (2013) studied low rank extensions of curves observed on subdomains. Goldberg et al. (2014) propose a prediction procedure for the continuation of a low rank functional observation. In this paper we introduce a formal framework for analysing incompletely observed functional data and develop basic non-parametric, fully functional (infinite dimensional) inferential Partially Observed Functional Data 779 procedures. When exploring functional data, one often finds interesting information in their covariance structure; see Ramsay and Silverman (2005) for some examples and, for example, Benko et al. (2009), Sangalli et al. (2009) or Panaretos et al. (2010) for other illustrations. Therefore, we first focus on the main building blocks of the analysis of the second-order properties: estimation of the covariance operator and principal component analysis. We propose an estimator of the covariance operator and its eigenvalues and eigenfunctions for partially observed functions and derive their properties. We deal with the estimation of projections (principal scores) of individual incomplete functions which is especially challenging. We develop a procedure that enables us to predict the value of a principal score of a function when only a fragment of the function is available and direct computation is thus impossible. Next, we propose a method that can recover the unobserved part of the function from the observed part, using the information about the distribution of the data that it learns from the sample. We develop automatic procedures for the selection of the tuning parameter of the method that is based on generalized cross-validation for incompletely observed functions. 
We quantify the uncertainty of the predictions of unobserved quantities and provide approximate prediction regions (intervals and bands) covering the unobserved random quantity with high probability. Simulations confirm the usefulness and good performance of the methodology proposed. Both the prediction of principal scores and the reconstruction of an incomplete function or its derivatives are important problems. Principal scores are key elements in the exploration of complex data and can be used as input quantities in many inferential procedures. Their usefulness in the multivariate setting is well described, for example, in Krzanowski (2000) and Jolliffe (2002). In the functional context Ramsay and Silverman (2005) provided some real data examples illustrating how principal scores help to understand the properties of the data. Further applications can be found in Ramsay and Silverman (2002) and Ramsay et al. (2009). Horv´ath and Kokoszka (2012) have given a comprehensive account of the utility of principal scores in procedures like two-sample tests, linear and non-linear regression, clustering and classification, time series analysis or change point analysis. In this paper, we shall see in Section 6 that the first three principal components of the heart rate profiles and their derivatives explain a large proportion of the total variability and are sufficiently flexible to describe interesting features of the curves. Hence the corresponding scores provide an effectively reduced representation of the complex individual heart rate profiles. To perform graphical or formal analyses of the scores, we need to be able to compute them, which is not straightforward in the partial observation regime. Also, when an individual curve, surface or image is observed incompletely, one is interested in visualizing and studying the shape of the missing part, for instance to forecast the continuation of the natural or social process that is described by the functional variable. Our paper provides solutions to these problems by developing methods that predict unobserved quantities via their conditional expectation given the observed data. In addition to their direct application to data, these methods will be an important tool in future research: for instance, advanced techniques of missing data analysis in the multivariate setting involve conditional expectations in some form, and our results will be helpful in extending them to the functional case. To our knowledge, no results of the kind that we provide here exist for functional data that are fully (densely in practice) observed on subsets of the domain. A related but different (in terms of applicability, used methods and achievable results) type of imperfectly observed functional data was studied by Yao et al. (2005a) who considered sparsely observed functions, i.e. situations where only a few observed values are available for each function, making it impossible to reconstruct each curve from these values. Our approach is novel in that it enables us, under the assumed observation regime, to investigate some genuinely functional aspects of the data. From the theoretical point of view, exploiting the continuous time nature of the observed data, we can 780 D. Kraus obtain stronger results than in the sparse regime. For example, the rates of convergence of estimators of parameters (the covariance operator and eigenelements) are parametric, unlike with sparsely observed data (see also Hall et al. (2006)). 
Also, the consistency result for our functional completion procedure is fully functional, whereas the restrictions of the sparse regime enabled Yao et al. (2005a) to achieve pointwise or finite dimensional convergence of the reconstructed trajectory. From the practical perspective, an important advantage of our method is that derivatives can be readily analysed in our setting whereas with methods for sparsely observed functions it is complicated. The method of Liu and M¨uller (2009) is a variant of that of Yao et al. (2005a) that can deal with derivatives in the sparse regime to some extent. Although the method of Liu and M¨uller (2009) can reconstruct derivatives, it does not provide insight into their covariance structure because it neither estimates the covariance operator of the derivatives nor performs principalcomponentanalysisofthederivatives(itisbasedonderivativesofeigenfunctionsrather than on eigenfunctions of derivatives). Since derivatives describe the dynamics of the underlying real world process, the analysis of derivatives, and especially of the principal sources of their variability, is often revealing in many applications, including the one we consider in this paper. Mathematically, the problem that we need to solve for the computation of unobserved quantities (prediction of principal scores or reconstruction of missing parts of trajectories) is seen to be an ill-posed inverse problem (e.g. Groetsch (1993)), and regularization techniques need to be applied. Such problems previously appeared in the literature on complete functional data mainly in the area of functional regression modelling; see, for example, Cardot et al. (1999, 2007), M¨uller and Stadtm¨uller (2005), Cai and Hall (2006), Hall and Horowitz (2007) or He et al. (2010). Inverse problems similar to those which we encounter here also arise in connection with functional canonical correlations (e.g. He et al. (2003)) or with tests of hypotheses on parameters of functional data (e.g. Mas (2007), Horv´ath et al. (2010, 2013), Aston and Kirch (2012), Kraus and Panaretos (2012) and Jaruˇskov´a (2013)). Our problem is related to the task of prediction that was previously studied in the literature on functional time series; see, for example, Bosq (2000), Antoniadis and Sapatinas (2003) or Kargin and Onatski (2008). None of these references, however, assumes the partial observation pattern that we consider in this paper. The paper is organized as follows. In Section 2 we formalize the mechanism of partial observation of functional data and deal with the estimation of the mean function and covariance operator. Section 3 develops principal component analysis for incompletely observed functions. In Section 4, a method is proposed to reconstruct the missing part of a partially observed curve. Sections 5 and 6 present a simulation study and a data example. Appendix A contains proofs of the main theoretical results (theorems 1 and 2). A supplementary document available on line contains proofs of propositions 1–4 and a detailed description of computational procedures. The programs that were used to analyse the data and some example data can be obtained from http://wileyonlinelibrary.com/journal/rss-datasets 2. Partially observed functional data Functional data X1,:::, Xn are seen as independent identically distributed random variables in the separable Hilbert space of square integrable functions on a bounded domain. 
Without loss of generality, we consider the space $L^2([0,1])$ with inner product $\langle f, g\rangle = \int_0^1 f(t) g(t)\,dt$, $f, g \in L^2([0,1])$, and norm $\|f\| = \langle f, f\rangle^{1/2}$. It is possible to extend our results to vector-valued functions or more general domains for applications with spatial curves, surfaces, images etc.

In traditional functional data analysis, it is assumed that the functions $X_1,\ldots,X_n$ are observed on the whole interval $[0, 1]$. We consider situations where each curve $X_i$ is observed only on a subset of $[0,1]$. Specifically, let the observation periods be $O_i \subset [0,1]$, $i = 1,\ldots,n$. Then the observed data for the $i$th curve are $X_i(t)$, $t \in O_i$. (In practice, the raw data are most often in the form of possibly noisy observations on a dense grid of points in $O_i$, which enables us to assume that the curves are observed fully in $O_i$, as is explained by Hall et al. (2006).) We collectively denote the observed part of the curve as $X_{iO_i}$, which can be seen as a random element of the space $L^2(O_i)$. The values of $X_i$ on the complement of $O_i$, $M_i = [0,1] \setminus O_i$, are not observed; the missing part of the trajectory is denoted as $X_{iM_i}$.

The observation periods $O_i$, $i = 1,\ldots,n$, are modelled as random subsets of $[0,1]$. We assume that each realization of $O_i$ is the union of a finite number of intervals. This assumption is not restrictive for practical applications, although some generalizations are probably possible. We assume that the observation periods are independent of the functions $X_1,\ldots,X_n$, i.e. the data are missing completely at random. (Under this assumption, the observation periods can also be seen as fixed when inference is made about the curves.)

The main characteristics of the distribution that generates the data are the mean function and the covariance operator. Let the mean function be $\mu = E(X_1)$. The covariance operator $R: L^2([0,1]) \to L^2([0,1])$ is defined as
\[
Rf = E\{\langle f, X_1 - \mu\rangle (X_1 - \mu)\} = \int_0^1 \rho(\cdot, t) f(t)\,dt,
\]
where $\rho(s,t) = \mathrm{cov}\{X_1(s), X_1(t)\}$ is the covariance kernel of the stochastic process $X_1$.

Like in the multivariate case, the mean function $\mu$ at point $t \in [0,1]$ can be estimated by the sample mean of observed values at this point. Formally, the estimator can be written as
\[
\hat\mu(t) = \frac{J(t)}{\sum_{i=1}^n O_i(t)} \sum_{i=1}^n O_i(t) X_i(t),
\]
where the notation $O_i(t)$ is used for the indicator $1_{O_i}(t)$ and $J(t) = 1[\sum_{i=1}^n O_i(t) > 0]$. The values of $X_i(t)$ are available only if $O_i(t) = 1$; otherwise, the contribution $O_i(t) X_i(t)$ in the sum above is zero. The term $J(t)$ is included to avoid division by 0: if $J(t) = 0$, the estimate of the mean is 0 (or arbitrary, as such situations vanish asymptotically).

The estimator $\hat R$ of the covariance operator $R$ is defined through an estimator of its covariance kernel $\rho$. We estimate $\rho(s,t)$ by the sample covariance computed from all complete pairs of functional values at $s$ and $t$. The estimator equals
\[
\hat\rho(s,t) = \frac{I(s,t)}{\sum_{i=1}^n U_i(s,t)} \sum_{i=1}^n U_i(s,t) \{X_i(s) - \hat\mu_{st}(s)\}\{X_i(t) - \hat\mu_{st}(t)\}, \qquad (1)
\]
where $U_i(s,t) = O_i(s) O_i(t)$ and $I(s,t) = 1[\sum_{i=1}^n U_i(s,t) > 0]$. The estimator of the mean function used here is
\[
\hat\mu_{st}(s) = \frac{I(s,t)}{\sum_{i=1}^n U_i(s,t)} \sum_{i=1}^n U_i(s,t) X_i(s),
\]
i.e., for the computation of the covariance at $(s,t)$, functional values are centred at the sample mean computed from complete pairs. (It is also possible to centre by the estimator $\hat\mu$ that was introduced before; all results remain valid when $\hat\mu$ is used in place of $\hat\mu_{st}$.)

The sample covariance operator computed from incomplete functions may be indefinite. This is similar to the multivariate setting.
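The estimators $\hat\mu$ and $\hat\rho$ just defined are easy to compute when the curves are recorded on a common grid. The following R sketch is an illustration only, not the code accompanying the paper; the function names and the encoding of the observation indicators via NA values are our assumptions.

```r
## Minimal grid-based sketch of hat{mu} and hat{rho}: X is an n x m matrix of curves
## evaluated on a common grid, with NA where a curve is unobserved, so that the
## indicator O_i(t) corresponds to !is.na(X[i, t]).
pobs_mean <- function(X) {
  mu <- colMeans(X, na.rm = TRUE)     # average of the values observed at each grid point
  mu[is.nan(mu)] <- 0                 # J(t) = 0: no curve observed at t, set estimate to 0
  mu
}

pobs_cov <- function(X) {
  O  <- 1 * !is.na(X)                 # observation indicators O_i(t)
  Xz <- X; Xz[is.na(Xz)] <- 0
  N  <- crossprod(O)                  # N[s, t] = number of complete pairs at (s, t)
  S  <- crossprod(Xz)                 # sum_i U_i(s, t) X_i(s) X_i(t)
  M  <- crossprod(O, Xz)              # M[s, t] = sum_i U_i(s, t) X_i(t)
  Np <- pmax(N, 1)                    # guard against division by zero where I(s, t) = 0
  mu_st <- M / Np                     # pairwise means hat{mu}_{st}
  rho <- S / Np - t(mu_st) * mu_st    # pairwise-complete covariance kernel, cf. (1)
  rho[N == 0] <- 0
  rho
}
```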
However, unlike with multivariate data, our experience in the functional context is that this problem is unimportant in practice because negative eigenvalues occur far in the tail of the spectrum and are small in comparison with the leading eigenvalues. The corresponding high frequency features of the data are practically never of interest. If needed, the estimate ˆR can be modified by setting negative eigenvalues equal to 0. It is seen that ˆμ.t/ is an unbiased estimator of μ.t/. Similarly, if we subtract 1 in the denominator of ˆρ.s,t/, the estimator becomes unbiased for ρ.s,t/. For the estimators ˆμ and ˆR to be consistent, we need to assume that the observation pattern asymptotically provides enough information. For the mean function, the right assumption is that there exists δ1 >0 such that sup t∈[0,1] P n−1 n i=1 Oi.t/ δ1 =O.n−2 / as n→∞: .2/ Similarly, for the covariance operator, we need the stronger assumption that there exists δ2 >0 such that sup .s,t/∈[0,1]2 P n−1 n i=1 Ui.s,t/ δ2 =O.n−2 / as n→∞: .3/ Assumption (2) is satisfied, for example, when the observation sets O1,:::,On are independent and identically distributed and π0 =inft∈[0,1]P{O1.t/=1}>0. To see this, set δ1 =π0=2 and use Hoeffding’s inequality to show that sup t∈[0,1] P n−1 n i=1 Oi.t/ δ1 exp.−π2 0n=2/: Analogously, assumption (3) is satisfied when we further assume that inf.s,t/∈[0,1]2 P{U1.s,t/= 1}>0. Under these weak assumptions, we obtain a consistency result as follows. Proposition 1. (a) LetE. X1 2/<∞andassumption(2)besatisfied.ThenE. ˆμ−μ 2/=O.n−1/forn→∞. (b) Let E. X1 4/ < ∞ and assumption (3) be satisfied. Then E. ˆR − R 2 2/ = O.n−1/ for n→∞ (here · 2 denotes the Hilbert–Schmidt norm). Note that the properties of the estimators are unaffected by the fact that the functions are observedonlypartially.Thefull(dense)observationregime,albeitonlyonsubsetsofthedomain, preserves the convergence rates that are known for complete functional data (see Bosq (2000) or Horv´ath and Kokoszka (2012) for results in the traditional setting). 3. Principal component analysis 3.1. Estimation of eigenfunctions and eigenvalues Probably the most fundamental method for functional data is functional principal component analysis. It provides insight into the complex covariance structure of functional data and is used to identify main sources of variability and to quantify their importance and to reduce the dimension of the data. The theoretical foundation of functional principal component analysis is the Karhunen– Lo`eve theorem (e.g. Bosq (2000), theorem 1.5) stating that there are random variables βij and non-random functions ϕj such that the stochastic process Xi admits the decomposition Partially Observed Functional Data 783 Xi.t/=μ.t/+ ∞ j=1 βij ϕj.t/, t ∈[0,1], where the series converges in mean square, uniformly in t. Here ϕj,j = 1,2,:::, are the orthonormal eigenfunctions of the operator R and βij, j = 1,2,:::, are uncorrelated mean 0 variables with variances λj, where λ1 λ2 :::>0 are the eigenvalues of R. Functional principal component analysis is the empirical version of the Karhunen–Lo`eve expansion that aims to estimate the elements involved in the expansion. For background information on this classical topic, we refer to Ramsay and Silverman (2005), chapter 8, for an introduction from an applied perspective, and to Dauxois et al. (1982), Bosq (2000) or Hall and Hosseini-Nasab (2006) for theoretical studies. 
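In the partial observation setting of this paper, the empirical version of the Karhunen–Loève elements can be obtained from the eigendecomposition of the estimated covariance kernel of Section 2 (this anticipates the estimators $\hat\lambda_j$, $\hat\varphi_j$ discussed in the next paragraph). The sketch below reuses the hypothetical helpers from the previous sketch and shows one grid-based way to do this; the rescaling by the grid spacing $h$ approximates the $L^2([0,1])$ inner product by a Riemann sum.

```r
## Continuing the sketch above: empirical eigenvalues and eigenfunctions of hat{R} from
## the eigendecomposition of the estimated kernel on the grid, with grid spacing h.
pobs_fpca <- function(X, grid) {
  h   <- mean(diff(grid))
  rho <- pobs_cov(X)                          # estimated covariance kernel from above
  eg  <- eigen(rho * h, symmetric = TRUE)
  list(mean      = pobs_mean(X),
       values    = pmax(eg$values, 0),        # truncate small negative eigenvalues at 0
       harmonics = eg$vectors / sqrt(h))      # columns: hat{phi}_j with unit L^2 norm
}
```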
In the case of completely observed functional data, to estimate the eigenvalues λj and eigenfunctions ϕj, one performs eigendecomposition of the usual sample covariance operator. When the functions are observed partially, we can proceed similarly and define the estimators ˆλj and ˆϕj as the eigenvalues and eigenfunctions of the operator ˆR given by the kernel ˆρ in equation (1). It turns out that the asymptotic properties of the empirical eigenvalues and eigenfunctions remain unchanged by the incompleteness of the observed functions. The following proposition shows that, first, the empirical eigenvalues are consistent estimators of the true eigenvalues and this consistency is uniform over all indices and, second, the empirical eigenfunctions are consistent estimators of the true eigenfunctions, up to the usual sign ambiguity. Proposition 2. Let E. X1 4/ < ∞ and assumption (3) be satisfied. Then E[supj∈N{| ˆλj − λj|2}] = O.n−1/. If moreover all eigenvalues of R have multiplicity 1, then E. ˆϕj − ˆsjϕj 2/ =O.n−1/ forall j ∈N, where ˆsj = sgn ˆϕj,ϕj . The rates of convergence are parametric because of the full observation regime on subsets; the situation is different from that of sparsely observed functions, where the estimators of eigenelements (constructed differently) converge at non-parametric rates (Yao et al., 2005a; Hall et al., 2006). 3.2. Estimation of principal component scores In principal component analysis, one is usually interested not only in estimating the eigenfunctions and eigenvalues but also in the estimation of the principal component scores βij = Xi −μ,ϕj , i=1,:::,n, j =1,2,:::, representing the individual co-ordinates of each curve with respect to the eigenbasis (the expression of the feature ϕj for the ith observation). The leading principal scores provide the optimal finite dimensional representation of each curve and can be further analysed by traditional techniques. In the standard situation of complete functional data, the scores are easily estimated by ˆβij = Xi − ˆμ, ˆϕj . When the functional observations are incomplete, the direct computation of Xi − ˆμ, ˆϕj is impossible because the last term in the expression Xi − ˆμ, ˆϕj = XiOi − ˆμOi , ˆϕjOi + XiMi − ˆμMi , ˆϕjMi is not available. In this equation the subscript Oi or Mi denotes the restriction of the corresponding function to the ith observed or missing period respectively. We develop a procedure to estimate the missing quantity XiMi − ˆμMi , ˆϕjMi from the observed data. 784 D. Kraus First, we consider the population version of the problem. Let the function X with mean 0 and covariance operator R be observed on the set O and missing on M. For the following considerations, the sets O and M, which are independent of X, can be regarded as non-random; equivalently, derivations can be made conditionally on them. The goal is to predict βjM = XM, ϕjM from the observed part XO. It is a standard fact that, in terms of the meansquared prediction error, the best approximation of βjM by a functional of XO is the conditional expectation E.βjM|XO/. The conditional expectation may be a non-linear functional of the condition and thus difficult to estimate. Therefore, we propose to look for the best linear prediction corresponding to a continuous linear functional of the observed curve. This is equivalent to the best linear approximation of the conditional expectation. By the Riesz representation theorem, a continuous linear functional takes the form aj,XO , where aj is an element of L2.O/. 
The best continuous linear prediction of βjM equals ˜βjM = ˜aj,XO , where ˜aj solves the infinite dimensional optimization problem min aj ∈L2.O/ E{.βjM − aj,XO /2 }: .4/ The objective functional can be rewritten as E{.βjM − aj,XO /2 }=E{ ϕjM,XM 2 −2 ϕjM,XM aj,XO + aj,XO 2 } = ϕjM, RMMϕjM −2 ϕjM, RMOaj + aj, ROOaj , where ROO is the covariance operator of XO and RMO is the cross-covariance operator of XM and XO. It is obvious that the objective functional is convex. If a minimizer exists, it can be found by setting the derivative equal to 0. The derivatives in this context are in the Fr´echet sense. In particular, we see that @ @aj E{.βjM − aj,XO /2 }=−2rj +2ROOaj, where rj = ROMϕjM with ROM = RÅ MO (the asterisk denotes the adjoint operator). Thus we need to solve the equation ROOaj =rj: .5/ We recognize that this is a linear inverse problem where we need to recover the function aj ∈ L2.O/ from its image through the linear operator ROO. Let λOOk, k = 1,2,:::, be the decreasing positive eigenvalues and ϕOOk the corresponding orthonormal eigenfunctions of the operator ROO. By comparing the coefficients of the leftand right-hand side of equation (5) with respect to the basis ϕOOk, we arrive at the system of equations λOOk aj,ϕOOk = rj,ϕOOk ,k =1,2,:::. This suggests that a candidate for the solution is ˜aj = ∞ k=1 rj,ϕOOk λOOk ϕOOk, .6/ i.e. ˜aj =R−1 OOrj. This is a valid solution, if it is an element of L2.O/, i.e. if ∞ k=1 rj,ϕOOk 2 λ2 OOk <∞: .7/ Partially Observed Functional Data 785 This condition is known in the theory of inverse problems as Picard’s condition. A solution to the inverse problem (5) exists if and only if condition (7) is satisfied. Condition (7) is equivalent to the condition ∞ k=1 corr.βjM, XO,ϕOOk /2 var. XO,ϕOOk / <∞, .8/ which has a clear interpretation. It states that the missing variable βjM must not be strongly correlated with complicated, high frequency components of the observed function. The variability of these components must be sufficiently large to provide enough information for the prediction of βjM. The precise balance between the complexity of the correlation of the unobserved score with the predictor components and the variability of the predictor components is quantified by the requirement on the series above to converge. In the Gaussian case, the conditional expectation of βjM given the principal scores XO, ϕOOk ,k=1,2,:::, is an infinite linear combination of these scores (an almost surely convergent infinite series). One can show this by conditioning on finitely many components (this multivariate conditional expectation is linear) and applying L´evy’s 0–1 law (Kallenberg (2002), theorem 7.23) to obtain the limit. The infinite sum of variances of terms in this series converges, which is equivalent to the convergence of Σ∞ k=1 rj,ϕOOk 2 λOOk or Σ∞ k=1 corr.βjM, XO, ϕOOk /2. If, moreover, condition (7) or (8) is satisfied, then the coefficients in the infinite linear combination for the conditional expectation form an l2-sequence; hence the conditional expectation is continuous in the condition. From now on, to guarantee the existence of a continuous solution to condition (5), we assume that condition (7) holds. If it is a priori known that the conditional expectation E.βjM|XO/ is a continuous linear functional of XO, then condition (7) is automatically satisfied. The operator ROO is a compact operator with infinite dimensional range; therefore, its inverse R−1 OO is not bounded (i.e. not continuous). 
Consequently, small perturbations of $r_j$ may lead to large perturbations of $\tilde a_j = R_{OO}^{-1} r_j$. It is seen from equation (6) that an overall small change of $r_j$ may result in an arbitrarily large change of $\tilde a_j$, if the change of $r_j$ occurs on a coefficient with a sufficiently high index $k$; the division by a sufficiently small eigenvalue may enormously magnify the perturbation. In other words, the solution $\tilde a_j = R_{OO}^{-1} r_j$ is extremely unstable and the inverse problem (5) is ill posed. It is important for a solution to be stable with respect to perturbations of the right-hand side $r_j$ because $r_j$ is unknown and needs to be estimated. With an estimated right-hand side, the solution to the inverse problem may be arbitrarily far from the true solution no matter how accurate the estimate is. This is true even when $R_{OO}$ is known. Moreover, the operator $R_{OO}$ is not known either; its estimate has finite rank and therefore is not invertible in $L^2(O)$.

To obtain a stable solution, one needs to use regularization, i.e. to modify the ill-posed inverse problem in such a way that it becomes well posed with a stable solution. We use ridge regularization. Instead of problem (5), we solve the problem $R_{OO}^{(\alpha)} a_j = r_j$ with $R_{OO}^{(\alpha)} = R_{OO} + \alpha I_O$, where $\alpha > 0$ and $I_O$ is the identity operator on $L^2(O)$. The inverse $R_{OO}^{(\alpha)-1}$ of the bounded operator $R_{OO}^{(\alpha)}$ is bounded and therefore the solution $\tilde a_j^{(\alpha)} = R_{OO}^{(\alpha)-1} r_j$ is stable. Denote the regularized best linear prediction of $\beta_{jM}$ by $\tilde\beta_{jM}^{(\alpha)} = \langle \tilde a_j^{(\alpha)}, X_O\rangle$. The stability of the solution increases with $\alpha$, but the bias of the solution increases also because the problem becomes more different from the original problem; conversely, with $\alpha$ decreasing, the solution becomes closer to the exact but unstable solution of the original problem.

We now turn to the practical, empirical version of the problem of computation of principal scores from partially observed functional data. We have a sample of $n$ functions $X_{1O_1},\ldots,X_{nO_n}$ observed on the sets $O_1,\ldots,O_n$. The mean function $\mu$ and the covariance operator $R$ are estimated by $\hat\mu$ and $\hat R$ introduced in Section 2. The principal score of the $i$th curve with respect to the $j$th eigenfunction is estimated by $\hat\beta_{ij}^{(\alpha)} = \hat\beta_{ijO_i} + \hat\beta_{ijM_i}^{(\alpha)}$, where $\hat\beta_{ijO_i} = \langle X_{iO_i} - \hat\mu_{O_i}, \hat\varphi_{jO_i}\rangle$ and $\hat\beta_{ijM_i}^{(\alpha)} = \langle \hat a_{ij}^{(\alpha)}, X_{iO_i} - \hat\mu_{O_i}\rangle$. Here the function $\hat a_{ij}^{(\alpha)} = \hat R_{O_iO_i}^{(\alpha)-1} \hat r_{ij}$ solves the empirical regularized inverse problem $\hat R_{O_iO_i}^{(\alpha)} a_{ij} = \hat r_{ij}$, where $\hat R_{O_iO_i}^{(\alpha)} = \hat R_{O_iO_i} + \alpha I_{O_i}$ with $\hat R_{O_iO_i}$ being an integral operator on $L^2(O_i)$ with kernel equal to the restriction of the kernel $\hat\rho$ of $\hat R$ (see equation (1)) to $O_i \times O_i$, and $\hat r_{ij} = \hat R_{O_iM_i} \hat\varphi_{jM_i}$ with $\hat R_{O_iM_i}$ defined analogously by restriction of $\hat\rho$ to $O_i \times M_i$.

We are ready to state the main convergence result that justifies this method. The difference between the regularized estimator $\hat\beta^{(\alpha)}_{ijM_i}$ and the best linear prediction $\tilde\beta_{ijM_i}$ can be decomposed into the sum of the estimation error for the regularized prediction and the approximation error due to regularization, i.e. $\hat\beta^{(\alpha)}_{ijM_i} - \tilde\beta_{ijM_i} = (\hat\beta^{(\alpha)}_{ijM_i} - \tilde\beta^{(\alpha)}_{ijM_i}) + (\tilde\beta^{(\alpha)}_{ijM_i} - \tilde\beta_{ijM_i})$. We show that, when the amount of regularization decreases at a suitable rate as the sample size increases, both terms converge to 0 in $L^2(P)$ and thus the regularized estimator of the prediction is consistent.

Theorem 1. Let $E(\|X_1\|^4) < \infty$, assumption (3) be satisfied, all eigenvalues of $R$ have multiplicity 1 and condition (7) be satisfied for $O_i$ and $M_i$ in place of $O$ and $M$ respectively. Then
\[
E\{(\hat\beta^{(\alpha)}_{ijM_i} - \tilde\beta_{ijM_i})^2\} \le O(\alpha^{-3}) O(n^{-1}) + O(\alpha)
\]
as $\alpha \to 0$ and $n \to \infty$.
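Before turning to the asymptotic behaviour of $\hat\beta^{(\alpha)}_{ij}$, the following grid-based R sketch makes the estimator concrete for a single curve and a single eigenfunction. It reuses the hypothetical helpers introduced earlier and is an illustration under our discretization assumptions, not the paper's supplied program.

```r
## Illustrative ridge-regularized score prediction hat{beta}^{(alpha)}_{ij} for one curve
## xi (values on the grid, NA where unobserved), given hat{mu} (mu), the estimated kernel
## (rho) and one eigenfunction phi_j from the earlier sketches.
pred_score <- function(xi, mu, rho, phi_j, grid, alpha) {
  h <- mean(diff(grid))
  O <- which(!is.na(xi)); M <- which(is.na(xi))     # observed / missing grid points
  xc <- xi[O] - mu[O]                               # observed part of X_i - hat{mu}
  beta_O <- h * sum(xc * phi_j[O])                  # <X_iO - mu_O, phi_jO>
  if (length(M) == 0) return(beta_O)                # complete curve: nothing to predict
  r <- h * rho[O, M, drop = FALSE] %*% phi_j[M]     # hat{r}_ij = hat{R}_OM phi_jM
  A <- h * rho[O, O] + alpha * diag(length(O))      # discretized hat{R}_OO + alpha I_O
  a <- solve(A, r)                                  # hat{a}^{(alpha)}_ij
  beta_M <- h * sum(a * xc)                         # <hat{a}^{(alpha)}_ij, X_iO - mu_O>
  beta_O + beta_M
}
```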
Hence, if $\alpha = \alpha_n$ is such that $\alpha_n \to 0$ and $\alpha_n n^{1/3} \to \infty$ as $n \to \infty$, then $\hat\beta^{(\alpha_n)}_{ijM_i}$ is a consistent estimator of the best linear prediction $\tilde\beta_{ijM_i}$ of $\beta_{ijM_i}$.

Sometimes one is interested in estimating other linear functionals than the principal score $\langle X_i - \mu, \varphi_j\rangle$. Our consistency results remain valid when $\hat\varphi_{jO_i}$ is replaced by an arbitrary random or fixed function $\hat f_{O_i} \in L^2(O_i)$ such that $E(\|\hat f_{O_i} - f_{O_i}\|^2) = O(n^{-1})$ for some deterministic $f_{O_i} \in L^2(O_i)$.

Note that theorem 1 has no strong assumptions. Picard's condition (7) is a basic assumption that is required in all inverse problems to guarantee the existence of a solution. Except for this standard requirement, no other condition on the rate of decrease of the eigenvalues $\lambda_{O_iO_ik}$ is needed. This is because we estimate the prediction $\langle \tilde a_{ij}, X_{iO_i}\rangle$ rather than the prediction functional $\tilde a_{ij}$ itself. Intuitively, the integration in $\langle \tilde a_{ij}, X_{iO_i}\rangle$ brings additional smoothness; the exact way that this happens is seen in the proof of theorem 1. In a related context of prediction in functional linear regression, it was observed by Cai and Hall (2006) and Cardot et al. (2007) that weaker assumptions are needed and stronger results can be obtained when the focus is on prediction rather than on the estimation of the regression functional.

The inverse problem is similar to that solved in the functional linear model (Cardot et al., 1999, 2007; Hall and Horowitz, 2007). However, the way that we arrive at it differs from the functional linear model because, owing to the incompleteness of observations, there is, for instance, no collection of response–covariate pairs in the present situation.

As an alternative to ridge regularization, one may consider the spectral truncation approach. Both methods have their advantages and disadvantages. For instance, it is known that the behaviour of spectral cut-off methods depends on the spacings between the eigenvalues of the operator to be inverted, which makes them less robust with respect to situations with similar or even identical eigenvalues (see Hall and Horowitz (2007)). Indeed, in a preliminary analysis of our motivating data set we observed some very similar estimated eigenvalues. There is also an important computational advantage of the ridge method. For this method, one needs to solve only a linear equation with $\hat R^{(\alpha)}_{O_iO_i}$, which is very easy and fast. In contrast, the spectral truncation approach requires computing the eigendecomposition of $\hat R_{O_iO_i}$ and projecting on the corresponding subspace. This is computationally more demanding, especially since it must be done repeatedly for each function, because different suboperators $\hat R_{O_iO_i}$ of $\hat R$ corresponding to different functions have different spectral decompositions. Yet another approach may be based on smoothing, for instance, by penalizing the roughness of the solution of the inverse problem.

3.3. Regularization parameter selection
Theorem 1 shows that, for an appropriate choice of $\alpha_n$, the estimator $\hat\beta^{(\alpha_n)}_{ijM_i}$ is consistent for the best prediction $\tilde\beta_{ijM_i}$. Theorem 1, however, does not give a practical recommendation on how to select the regularization parameter. It is desirable to have an automatic, data-driven selection procedure.

Since the parameter $\alpha$ is difficult to interpret directly, we first translate it into more comprehensible values. By analogy with ridge regression or various standard smoothing techniques, we define the number of effective degrees of freedom as the trace of the covariance operator of the predictors composed with its regularized inverse, i.e.
ˆR .α/−1 OiOi ˆROiOi /= ∞ k=1 ˆλOiOik ˆλOiOik +α , .9/ which is a decreasing function of α. Unlike in standard situations the covariance operator here is computed from partially observed data. Another way to measure the amount of regularization is the proportion of retained variability like in classical principal component analysis using, for example, tr. ˆROiOi ˆR .α/−1 OiOi ˆROiOi ˆR .α/−1 OiOi ˆROiOi / tr. ˆROiOi / = ∞ k=1 ˆλ 3 OiOik=. ˆλOiOik +α/2 ∞ k=1 ˆλOiOik .10/ or a similar quantity. One can determine α such that the effective degrees of freedom equal some value or the proportion of retained variability exceeds some threshold. These quantities, however, do not measure the predictive performance of the regularized solution. A universal recipe for situations of this type is to use generalized cross-validation. In traditional settings, the generalized cross-validation score is the residual sum of squares (a measure of goodness of fit) divided by a decreasing function of the effective degrees of freedom (a penalty included to avoid underregularization). The residual sum of squares is the sum of squared differences of the response variables and their predictions, which in our case are ˆβkjMi = XkMi − ˆμMi , ˆϕjMi and ˆβ .α/ kjMi = ˆa .α/ ij ,XkOi − ˆμOi , k =1,:::,n, respectively. In the situation of partially observed functions, the pair of the response variable ˆβkjMi and the explanatory variable XkOi is not available for all individuals k =1,:::,n. The idea is, therefore, to consider the set of completely observed functions with indices C ={k :1 k n, 1 0 Ok.t/dt =1}. If this set is reasonably large, we can compute the residual sum of squares over the complete functions rssij.α/= k∈C . ˆβkjMi − ˆβ .α/ kjMi /2 : The cross-validation score for the regularized estimation of the jth score of the ith function is gcvij.α/= rssij.α/ {1−.1=|C|/dfi.α/}2 , where |C| is the number of complete functions. One selects the value of α that minimizes this quantity. Separate values of the regularization parameter are used for each function and each score. 788 D. Kraus 3.4. Prediction uncertainty For a statistical procedure to be useful, it is important to quantify its uncertainty, i.e. to assess how far ˆβ .αn/ ijMi can be from βijMi . The following proposition answers these questions. Proposition 3. Let the assumptions of theorem 1 be satisfied and let αn →0 and αnn1=4 →∞ as n→∞. Then ˆβ .αn/ ijMi −βijMi is asymptotically distributed as ˜βijMi −βijMi , which is a zero-mean random variable with variance that can be consistently estimated by ˆv2 ij = ˆϕjMi ,. ˆRMiMi − ˆRMiOi ˆR .αn/−1 OiOi ˆROiOi ˆR .αn/−1 OiOi ˆROiMi / ˆϕjMi : If the distribution of the data is Gaussian, then the limiting variable is Gaussian. The assumptions of this proposition are similar to those of the consistency result of theorem 1, except that a slower rate of convergence of the regularization parameter to 0 is needed to estimate the limiting variance consistently. The prediction uncertainty, as expressed by the variance ˆv2 ij, does not converge to 0 as the sample size converges to ∞. This is because the situation is a prediction problem rather than an estimation problem in the sense that we try to recover a random variable rather than a non-random parameter. Thus, although increasing the sample size eventually removes the uncertainty due to unknown estimated quantities (the mean function and covariance operator) and regularization, there is a fundamental uncertainty that cannot be removed asymptotically. 
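To make Sections 3.2–3.4 concrete, the following R sketch (not the paper's code) implements the ridge-regularized score prediction together with the plug-in prediction variance of Proposition 3 for curves evaluated on an equidistant grid. The objects Rhat, muhat, phihat_j and lambdahat_j stand for the estimates of Section 2 evaluated on the grid; these names and the grid discretization are assumptions of this illustration.

```r
## Ridge-regularized prediction of a principal score from a fragment.
## x: curve values on the grid (NA on the missing part), obs: logical observation
## indicator, delta: grid spacing used to approximate integrals by sums.
predict_score <- function(x, obs, Rhat, muhat, phihat_j, lambdahat_j, alpha, delta) {
  mis  <- !obs
  zO   <- x[obs] - muhat[obs]                     # centred observed fragment X_O - mu_O
  ROO  <- Rhat[obs, obs]
  ROM  <- Rhat[obs, mis]
  RMM  <- Rhat[mis, mis]
  phiO <- phihat_j[obs]
  phiM <- phihat_j[mis]

  A <- ROO * delta + alpha * diag(sum(obs))       # matrix version of R_OO + alpha I
  r <- ROM %*% phiM * delta                       # right-hand side r_ij = R_OM phi_jM
  a <- solve(A, r)                                # regularized solution a_ij^(alpha)

  beta_obs <- sum(zO * phiO) * delta              # observed part of the score
  beta_mis <- sum(a * zO) * delta                 # predicted missing part

  # plug-in prediction variance (cf. Proposition 3)
  q  <- solve(A, ROO %*% a * delta)
  v2 <- sum(phiM * (RMM %*% phiM)) * delta^2 - sum(phiM * (t(ROM) %*% q)) * delta^2
  v2 <- max(v2, 0)

  list(score     = beta_obs + beta_mis,
       pred_sd   = sqrt(v2),
       rel_error = sqrt(v2 / lambdahat_j),        # relative error (12)
       interval  = beta_obs + beta_mis + c(-1, 1) * qnorm(0.975) * sqrt(v2))
}
```

The ridge term is added after the covariance matrix has been scaled by the grid spacing, which is one way of discretizing the operator equation consistently with the grid representation described in the on-line supplement.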
In other words, the knowledge of the principal score will never be precise, if the functional observation is incomplete, and the limits of accuracy of the prediction are given by the asymptotic variance v2 ij. We refer to Didericksen et al. (2012) for an interesting discussion of similar questions in somewhat related prediction problems in the context of functional time series. Proposition 3 immediately enables us to construct a prediction interval for the score. Assume that a Gaussian distribution is a good approximation for the distribution of the data. Then Iij;η =. ˆβ .αn/ ij −z1−η=2 ˆvij, ˆβ .αn/ ij +z1−η=2 ˆvij/, .11/ where z1−η=2 is the .1−η=2/-quantile of the standard normal distribution, is a prediction interval for βij with asymptotic coverage probability 1−η, i.e. P.βij ∈Iij;η/→1−η as n→∞. Since principal component analysis is often used as a dimension reduction procedure and the resulting principal scores are subsequently analysed by traditional techniques, it is useful to have a measure of reliability of the computed scores. The true score βij is a random variable with variance estimated by ˆλj. The predicted score ˆβ .αn/ ij can be seen as the true score contaminated by error with variance estimated by ˆv2 ij. One can define the relative error ˆvij= ˆλ 1=2 j , .12/ which is the ratio of the error variability and the natural intrinsic variability of the score. This value, lying between 0 and 1, can be used as an indicator of observations that are too uncertain, and the scores whose relative error exceeds a certain threshold (e.g. 0.2) can be excluded from the subsequent analysis. The uncertainty will be high when the association between the missing part of the score and the observed fragment is weak. The high uncertainty of predictions due to a small amount of observed information is one example of situations where we must be cautious. Another such case could be when missingness is very frequent in certain regions or the overlap of observation periods is not sufficiently frequent because then the precision of the estimation of the covariance function will be locally reduced, and consequently the prediction procedure may be less accurate. The performance Partially Observed Functional Data 789 of generalized cross-validation may also be negatively influenced. Yet another problem could arise when the data are not missing at random (e.g. when missingness is more likely to occur when functional values are high). In such cases, missing functional chunks may be indeed very insidious because important features of the data distribution may be lost. Furthermore, the presence of functional outliers can be a complication as they may be more difficult to detect when only fragments are available. 4. Functional completion 4.1. Reconstruction of incomplete functions It is natural to ask whether it is possible to recover not only the missing part of a principal score (and thus to compute the score of an incomplete function) like in Section 3 but also the whole missing part of the trajectory (and thus to reconstruct the whole functional variable). The answer is positive. In the population version of the problem, the best prediction of XM by a function of XO in the sense of the mean integrated prediction squared error is the conditional expectation E.XM|XO/. It is in general a non-linear operator from L2.O/ to L2.M/ and, similarly to the case of principal scores, we consider its best continuous linear approximation. 
Assuming for simplicity that the functional variable has mean 0, the minimization problem to be solved is min A: A ∞<∞ E. XM −AXO 2 /, where the solution is looked for in the class of continuous (bounded) linear operators from L2.O/ to L2.M/ (by · ∞ we denote the operator norm). We see (by Fr´echet differentiation or direct computation) that solving this minimization is equivalent to solving the (normal) equation AROO =RMO. This suggests the solution ˜A=RMOR−1 OO and the best linear prediction of XM in the form ˜XM = ˜AXO. From now on, we assume the existence of a bounded solution, i.e. we assume that RMOR−1 OO ∞ <∞. Similarly to the case of principal scores, the inverse problem to be solved is ill posed. Using ridge regularization we obtain the solution ˜A .α/ =RMOR .α/−1 OO . The regularized best linear prediction equals ˜X .α/ M = ˜A .α/ XO. Practically, when the sample X1O1 ,:::,XnOn is observed on the subsets O1,:::,On, we replace the covariance operator by its estimate and set ˆA .α/ i = ˆRMiOi ˆR .α/−1 OiOi . The mean function needs to be estimated as well. For the ith curve, the best linear prediction of XiMi is estimated by ˆX .α/ iMi = ˆμMi + ˆA .α/ i .XiOi − ˆμOi /: To prove the consistency, we assume not only that the solution to the inverse problem (the prediction operator) is bounded but that it is Hilbert–Schmidt. We have a result as follows. Theorem 2. Let E. X1 4/<∞, assumption (3) be satisfied and RMiOi R−1 OiOi 2 <∞. Then E. ˆX .α/ iMi − ˜XiMi 2 / O.α−3 /O.n−1 /+O.α/ as α → 0 and n → ∞. Hence, if α = αn such that αn → 0 and αnn1=3 → ∞ as n → ∞, then ˆX .αn/ iMi is a consistent estimator of the best linear prediction ˜XiMi of XiMi . Note that our consistency result is genuinely functional. It is different from theorem 3 of Yao et al. (2005a) where it was possible to obtain only a pointwise consistent estimator of the functional variable. The reason is that we assume that the functions are observed fully (or densely in practice) on subsets of the domain whereas Yao et al. (2005a) worked in a sparse 790 D. Kraus observation regime. In other words, we can achieve stronger results because our data contain more information. The assumption that the prediction operator ˜Ai =RMiOi R−1 OiOi is Hilbert–Schmidt ( ˜Ai 2 < ∞) which is needed for the proof is a strengthening of the basic assumption on the continuity of ˜Ai ( ˜Ai ∞ < ∞). Assumptions of this type were used in related contexts of, for example, prediction in functional time series (Bosq (2000), chapter 8, and Kargin and Onatski (2008)) and the functional linear model (Yao et al., 2005b; He et al., 2010). It seems possible to replace this assumption by a combination of the condition ˜Ai ∞ <∞ and a condition on the eigenvalue sequence λOiOik such that the regularization error can be controlled. The condition ˜Ai 2 < ∞ can be written explicitly in terms of the covariance structure of the principal scores of the observed and unobserved part of the function. If the eigendecompositions of ROiOi and RMiMi are ROiOi = ∞ k=1 λOiOikϕOiOik ⊗ϕOiOik, RMiMi = ∞ k=1 λMiMikϕMiMik ⊗ϕMiMik (where ‘⊗’ stands for the tensor product: .f ⊗g/u= g,u f), then we can write RMiOi = ∞ j=1 ∞ k=1 γMiOijkϕMiMij ⊗ϕOiOik, where γMiOijk = ϕMiMij, RMiOi ϕOiOik = cov. XMi −μMi ,ϕMiMij , XOi −μOi ,ϕOiOik /. Then the operator ˜Ai is Hilbert–Schmidt whenever ∞ j=1 ∞ k=1 γ2 MiOijk λ2 OiOik <∞, which is equivalent to ∞ j=1 λMiMij ∞ k=1 corr. 
XMi −μMi ,ϕMiMij , XOi −μOi ,ϕOiOik /2 λOiOik <∞: It is seen that this condition combines conditions for the prediction of XMi −μMi ,ϕMiMij , j =1, 2,::: (compare the inner series above with condition (8)). 4.2. Selection of the regularization parameter To understand the amount of regularization corresponding to α, we can use the effective degrees of freedom or the proportion of retained variability as defined in equations (9) and (10) respectively. For the selection of α automatically balancing the stability and accuracy of the prediction of XiMi , we propose a similar cross-validation procedure to that in Section 3.3 for principal scores. The residual sum of squares for the prediction of trajectories on Mi computed for the completely observed curves in the sample is rssi.α/= k∈C XkMi − ˆX .α/ kMi 2 : The value of α that is used for the prediction of a function on Mi from its observation on Oi minimizes Partially Observed Functional Data 791 gcvi.α/= rssi.α/ {1−.1=|C|/dfi.α/}2 : 4.3. Uncertainty and prediction bands Theorem 2 shows that ˆX .αn/ iMi consistently estimates the best linear prediction ˜XiMi . We are now interested in the variation of ˆX .αn/ iMi around the target quantity: the unobserved function XiMi . Proposition 4. Let the assumptions of theorem 2 be satisfied and let αn →0 and αnn1=4 →∞ as n→∞. Then ˆX .αn/ iMi −XiMi is asymptotically distributed (in the sense of weak convergence of probability measures on L2.[0,1]// as the mean 0 stochastic process ˜XiMi −XiMi . The limiting covariance operator is consistently estimated (with respect to the Hilbert–Schmidt norm) by ˆVi = ˆRMiMi − ˆRMiOi ˆR .αn/−1 OiOi ˆROiOi ˆR .αn/−1 OiOi ˆROiMi : If the data are Gaussian, then the limiting stochastic process is Gaussian. The trace of ˆVi quantifies the total amount of uncertainty of the linear prediction of XiMi . It approaches 0 as the Lebesgue measure of the missing region Mi approaches 0, i.e. as we approach a completely observed function. When the measure of the observation period Oi converges to 0, the total prediction uncertainty converges to the trace of ˆR, which corresponds to the situation of no information about the ith curve. The scale invariant ratio tr. ˆVi/1=2 =tr. ˆR/1=2 .13/ measures the relative prediction error, i.e. the amount of uncertainty about the ith curve as a proportion of the total spread of the distribution of the functional random variable. 1 minus this value corresponds to the reduction of uncertainty that is achieved by the best linear prediction and can be seen as a measure of performance of the completion procedure. Alternatively, we can use ˆRMiMi instead of ˆR in the denominator in the relative prediction error, leading to the ratio of the uncertainty about the missing trajectory when the prediction method is used versus the uncertainty that there would be about XiMi if we ignored the observed part. We use the asymptotic distribution of ˆX .αn/ iMi −XiMi for the construction of prediction bands for the unobserved part of the trajectory, i.e. regions containing the curve XiMi with high probability. We consider bands of the form {.t,x/: ˆX .αn/ iMi .t/−c1−η ˆh.t/ x ˆX .αn/ iMi .t/+c1−η ˆh.t/, t ∈Mi}, .14/ where ˆh is a function that consistently estimates some limiting function h that is bounded away from zero, and c1−η is the .1 − η/-quantile of the random variable supt∈Mi | ˜XiMi .t/ − XiMi .t/|=h.t/. This band has asymptotic coverage 1 − η. 
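Before turning to the choice of the width function ĥ, the completion step itself can be sketched in the same grid representation as above. Again, Rhat and muhat denote the estimated covariance matrix and mean on the grid, and the function below is an illustrative sketch rather than the paper's implementation.

```r
## Regularized reconstruction of the missing part of a curve and the estimated
## covariance V_i of the prediction error (Proposition 4).
complete_curve <- function(x, obs, Rhat, muhat, alpha, delta) {
  mis <- !obs
  zO  <- x[obs] - muhat[obs]                       # centred observed fragment
  ROO <- Rhat[obs, obs]; ROM <- Rhat[obs, mis]; RMM <- Rhat[mis, mis]
  A   <- ROO * delta + alpha * diag(sum(obs))      # matrix version of R_OO + alpha I

  # regularized best linear prediction mu_M + R_MO (R_OO + alpha I)^{-1} (X_O - mu_O)
  x_mis <- muhat[mis] + t(ROM) %*% solve(A, zO) * delta

  # kernel of V_i = R_MM - R_MO (R_OO + alpha I)^{-1} R_OO (R_OO + alpha I)^{-1} R_OM
  B <- solve(A, ROM)
  V <- RMM - delta^2 * t(ROM) %*% solve(A, ROO %*% B)

  list(prediction   = as.vector(x_mis),
       pointwise_sd = sqrt(pmax(diag(V), 0)),      # basis for variable-width bands (14)
       rel_error    = sqrt(sum(diag(V)) / sum(diag(Rhat))))   # relative error (13)
}
```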
One can choose ˆh = 1, leading to a band with constant width, but typically one prefers a band whose width at time t reflects the uncertainty of the prediction of the missing function at t. We use ˆh.t/=max{ ˆh0, ˆvi.t/} where ˆvi.t/ is the estimated standard deviation of the limiting predictive distribution at time t, i.e. the square root of the diagonal of the kernel of ˆVi, and ˆh0 is a threshold guaranteeing that the limiting function h is bounded away from 0. For example, the choice ˆh0 = 0:2 supt∈Mi ˆvi.t/ works well in practice. If the distribution of the data can be considered as Gaussian, the quantile c1−η can be computed by simulation as follows. Generate a large number of independent realizations of the Gaussian process with mean 0 and covariance operator ˆVi, divide them by ˆh.t/, compute the maxima of their absolute values and determine the .1 − η/-quantile of this sample. The 792 D. Kraus simulation of the trajectories and the computation of the maxima are performed on a fine grid of points. Note that the width of the band does not converge to 0 because it is a prediction band, i.e. it must contain, with high probability, a random function. We conlude this section with a theoretical remark. Although the prediction bands proposed work well in practice, as is documented in the simulation study in Section 5, for a strictly rigorous justification arguments based on proposition 4 (which is a consequence of theorem 2) need to be extended. Proposition 4 guarantees the convergence in distribution in the sense of the topology of the L2-norm of the Hilbert space L2.[0,1]/. This justifies the construction of prediction regions in the form of balls in L2.[0,1]/ which, however, are not practical because they cannot be plotted. For prediction bands, the convergence is needed in the sense of the uniform topology. For this, we need to leave the geometric world of L2.[0,1]/ and to switch to the space of continuous functions C.[0, 1]/. Under modified assumptions (which would include conditions on sample paths, such as H¨older continuity), it seems possible to prove the convergence in the uniform topology. We do not pursue this theoretical study but give arguments indirectly justifying the use of the bands. Suppose that the asymptotic approximation that is suggested by theorem 2 and proposition 4 is considered applicable if the L2-distance from the limiting variable is sufficiently small. The probability that this L2-distance exceeds some ">0 is, in light of Chebyshev’s inequality, bounded as P. ˆX .αn/ iMi − ˜XiMi 2 2 >"/ "−2E. ˆX .αn/ iMi − ˜XiMi 2 2/. However, convergence in the L2-norm does not imply uniform convergence because large deviations may occur on a small set of arguments. Let us compute the Lebesgue measure γ of the set where | ˆX .αn/ iMi − ˜XiMi | deviates more than " from 0. We compute γ.{t : | ˆX .αn/ iMi .t/ − ˜XiMi .t/| > "}/ "−2 ˆX .αn/ iMi − ˜XiMi 2 2 by using Chebyshev’s inequality. Taking expectations on both sides, we obtain on the right-hand side the same bound as before. Hence, if the bound is considered to be sufficiently small for the asymptotic approximation in the L2-norm to be applicable, then also the expected Lebesgue measure of the set of large pointwise deviations will be negligible. 5. 
Simulations

A simulation study was designed to address the following goals: to investigate the performance of generalized cross-validation as a selector of the regularization parameter, to verify the validity and accuracy of the prediction intervals and bands, and to explore the effect of the observation pattern. We generate random samples of curves of the form
$$X(t) = \sum_{k=1}^{100} 2^{1/2} \nu_k^{1/2} \xi_k \cos(2\pi k t) + \sum_{k=1}^{100} 2^{1/2} \omega_k^{1/2} \eta_k \sin(2\pi k t), \quad t \in [0,1],$$
where $\xi_k$ and $\eta_k$ are independent standard normal variables and the eigenvalues are of the form $\nu_k = 3^{-(2k-1)}$ and $\omega_k = 3^{-2k}$. The three most important components represent 67%, 22% and 7% of the total variability. For each curve we independently generate a random period on which the curve is not observed, and the functional values on this period are removed. For the $i$th function, the missing period $M_i$ is simulated in the form $M_i = [C_i - E_i, C_i + E_i] \cap [0,1]$ with $C_i = d\, U_{i,1}^{1/2}$ and $E_i = f\, U_{i,2}$, where $d$ and $f$ are parameters and $U_{i,1}$ and $U_{i,2}$ are independent variables uniformly distributed on $[0,1]$ (a small implementation sketch of this design is given below). The performance of our procedures is measured on one curve in the sample, say $X_1$. For this curve, we use a fixed (non-random) missing period to guarantee that values computed in different simulation runs have the same meaning. In all simulations, we use $L = 1000$ repetitions.

For the first two sets of simulations, we set $d = 1.4$ and $f = 0.2$. This leads to an observation pattern with similar characteristics to those in our motivating data set. The cross-sectional probability of observation ranges from 99% at time 0 to 79% at time 1. The percentage of complete curves is 39%. The median length of the missing period (given that the curve has a missing period) is 0.15. For the curve $X_1$, on which the performance is measured, we set $M_1 = (0.4, 0.7)$.

First, we investigate the performance of generalized cross-validation based on complete curves. As a measure of quality of the prediction of a missing quantity, we use the mean-squared prediction error (MSPE), which is the average over all simulation runs of the squared distances between the predicted value and the true value, i.e., $L^{-1} \sum_{l=1}^{L} \bigl(\hat\beta^{(\alpha)[l]}_{1jM_1} - \hat\beta^{[l]}_{1jM_1}\bigr)^2$ for the $j$th score and $L^{-1} \sum_{l=1}^{L} \bigl\|\hat X^{(\alpha)[l]}_{1M_1} - X^{[l]}_{1M_1}\bigr\|^2$ for the missing part of the trajectory, where the superscript $[l]$ indicates that the value pertains to the $l$th generated sample.
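As referenced above, here is a minimal R sketch of the data-generating process of this simulation study; the grid size and the coding of missing values by NA are choices made for this illustration, not taken from the paper.

```r
## Generate n partially observed curves of the form used in the simulation study.
simulate_sample <- function(n, d = 1.4, f = 0.2, ngrid = 200) {
  tt <- (seq_len(ngrid) - 0.5) / ngrid               # equidistant grid on [0, 1]
  K  <- 1:100
  nu <- 3^(-(2 * K - 1))                             # eigenvalues of the cosine part
  om <- 3^(-2 * K)                                   # eigenvalues of the sine part
  basis <- cbind(sqrt(2) * cos(2 * pi * outer(tt, K)),
                 sqrt(2) * sin(2 * pi * outer(tt, K)))
  scores <- matrix(rnorm(n * 2 * length(K)), n) * rep(sqrt(c(nu, om)), each = n)
  X <- scores %*% t(basis)                           # n curves evaluated on the grid

  # remove the random period M_i = [C_i - E_i, C_i + E_i] from each curve
  Ci <- d * sqrt(runif(n)); Ei <- f * runif(n)
  for (i in seq_len(n)) X[i, tt > Ci[i] - Ei[i] & tt < Ci[i] + Ei[i]] <- NA
  list(t = tt, X = X)
}
```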
Table 1 shows values of the mean-squared prediction error for the first three principal scores and for the missing part of the trajectory. Table 1 also includes the variability of the target quantities (i.e., the true eigenvalues for the scores and the trace of the true covariance operator $R$ for the trajectory) to put the values into context.

Table 1. Performance of the generalized cross-validation selection procedure†

                                       MSPE for α = c·α_gcv and the following values of c:   Median degrees of freedom
Target quantity (variability)     n      0.04     0.2       1       5      25                for α = α_gcv
Score 1 (333)                   100      1.91    1.55    1.32    1.61    3.78                      7.68
                                500      0.60    0.44    0.36    0.42    1.07                     12.73
Score 2 (111)                   100      0.46    0.37    0.35    0.44    0.80                      8.61
                                500      0.16    0.13    0.12    0.15    0.27                     13.71
Score 3 (37)                    100      1.45    1.13    0.95    1.08    2.00                      8.62
                                500      0.48    0.34    0.28    0.29    0.53                     13.71
Missing trajectory (500)        100     10.07    7.90    6.95    8.24   15.16                      7.98
                                500      4.04    2.79    2.24    2.30    3.48                     15.02

†MSPE and the variability of the target quantity are multiplied by 1000.

The mean-squared prediction error is reported for α set to the value selected by generalized cross-validation and to values slightly smaller or larger, in the form of multiples of the selected value. We see that the method successfully approximates the best value of α and can be recommended as the tuning-parameter selector. The accuracy increases with increasing sample size n; however, it should be noted that the mean-squared prediction error cannot converge to 0 because there is always some uncertainty due to the randomness of the target quantity, as discussed in Sections 3.4 and 4.3. The last column of Table 1 reports the median of the effective degrees of freedom corresponding to the selected value of α. It is seen that in all cases the typical number of degrees of freedom is in a reasonable relation to the sample size.

The second set of simulations explores the properties of the approximate distribution of the deviation of the prediction from the predicted quantity that is established in Propositions 3 and 4. We simulate from the same distribution and observation pattern as before. The regularization parameter is selected by generalized cross-validation. We consider prediction intervals and bands of the form (11) and (14), respectively, with nominal coverage 95%. We compute bands with both constant and variable width, as discussed in Section 4.3. Empirical coverage probabilities (i.e., the percentage of cases in which the unobserved quantity was covered by the constructed region) are reported in Table 2. We see that the proposed intervals and bands have coverage close to the nominal level and, therefore, provide useful information on the probable values of the scores or the missing trajectory. Table 2 also reports the median of the relative error measures (12) and (13). For instance, we can see that the approximate distribution is relatively more spread out for less variable (higher-index) scores. This is in line with the conclusions from Table 1, where we observed a similar relationship between the MSPE and the variability of the target quantity. Hence the relative error measures (12) and (13), which can be computed from the data, seem to be valuable indicators of the accuracy of the reconstruction procedure.

Table 2. Empirical coverage of prediction regions (intervals for scores; bands with constant and variable width for curves) and the median relative error measure

           Score 1               Score 2               Score 3              Missing trajectory
  n   Coverage  Median      Coverage  Median      Coverage  Median      Coverage    Coverage    Median
        (%)     relative      (%)     relative      (%)     relative    (constant   (variable   relative
                error                 error                 error       width, %)   width, %)   error
100    97.2     0.073        95.2     0.056        94.5     0.143         94.3        96.7      0.123
500    97.4     0.042        95.0     0.036        96.3     0.092         94.2        98.4      0.07

Table 3. Standardized mean-squared prediction error for different observation patterns

                              Score 1          Score 2          Score 3       Missing trajectory
      Observation             Observation pattern of the sample:
  n   pattern (X1)              A      B         A      B         A      B         A      B
100        I                  0.022  0.045     0.052  0.093     0.035  0.067     0.040  0.075
           II                 0.039  0.073     0.078  0.128     0.107  0.155     0.076  0.136
500        I                  0.006  0.013     0.018  0.031     0.010  0.023     0.013  0.024
           II                 0.019  0.027     0.037  0.051     0.060  0.076     0.035  0.051

In the last set of simulations, we study the effect of the observation pattern on the accuracy of our methods.
We vary the amount of observed information both for X1 (whose characteristics are to be reconstructed) and for the whole sample (which is used to learn the reconstruction procedure). Two settings are used for the missing period of X1: I, M1 = .0:4, 0:7/; II, M1 = .0:4,0:9/. For the simulation of the missing periods of other curves in the sample, we simulate Mi of the form given earlier in this section, with parameter pairs A, d = 1:4 and f = 0:2, and B, d = 1:4 and f = 0:5. Basic characteristics of the observation pattern for A were discussed before; for B, the cross-sectional observation probability varies from 95% at t = 0 to 50% at t = 1, 21% of curves are complete and the average length of missing periods (among incomplete curves) is 0.29. Configuration IA was used in the first two sets of simulations; other combinations contain less observed information. Results are reported in Table 3 where mean-squared prediction errors are presented after standardization by the true variance of the predicted quantity, i.e. by the variance of the missing part of the score, var.β1jM1 /, or by the trace of the covariance operator of the missing part of the trajec- Partially Observed Functional Data 795 tory, tr.RM1M1 /; after this standardization it is possible to compare values under pattern I with their counterparts computed under II. We see that the precision of estimation decreases as the amount of observed information (either on the curve of interest or on the sample) decreases. 6. An illustration: ambulatory blood pressure monitoring data Heart rate profiles displayed in Fig. 1 and their first derivative plotted in Fig. 2 were obtained from raw observations by penalized spline smoothing described in the supplementary file that is available on line. The curves were registered by shifting the individual timescales so that every person’sbedtimeis23(i.e.11p.m.);individualbedtimeswereavailablefromaquestionnaire.The methodology that is developed in this paper requires that the observation periods be independent of the curves. The expert opinion is that this is a realistic assumption; in addition, we performed exploratory graphical checks that did not indicate any problem with regard to this assumption. From the shape of the mean functions of the profiles and their first derivatives it is obvious that on average heart rate profiles have a decreasing shape in this part of the day and they decrease fastest around the bed time. We wish to understand the main sources of variability between individual heart rate profiles. In Fig. 3 we plot the first three eigenfunctions of the profiles and of their derivatives as perturbations of the mean shape (see Ramsay and Silverman (2005), section 8.3.1) i.e. we plot the mean profile plus and minus a suitable multiple of each eigenfunction (the eigenfunctions are multiplied by 0:9 ˆλ 1=2 j ). For the profiles, we see that the most important component is the global level of heart rate, followed by a component describing the difference between the day and night values and a component that can be interpreted as a time shift. In terms of the first derivative, the first component quantifies the global level of the speed of decrease, the second component captures a shift in time and the third characterizes whether the individual’s heart rate decreases rather suddenly or more gradually. The first three components explain a large proportion of the total variability and provide enough flexibility to capture individual shape features, e.g. 
the increasing trend of some curves in regions where the mean and most curves decrease.

[Figure 2 appears here. Fig. 2: (a) subset of the sample of the first derivatives of heart rate profiles and (b) several curves in detail.]

[Figure 3 appears here. Fig. 3: (a)–(c) first three eigenfunctions of heart rate profiles and (d)–(f) of their first derivative, plotted as perturbations of the mean: (a) principal component 1, 87.2%; (b) principal component 2, 9.3%; (c) principal component 3, 2.1%; (d) principal component 1, 59.5%; (e) principal component 2, 33.8%; (f) principal component 3, 4.5%.]

Let us now focus on the individual level. To illustrate our prediction method for principal scores, we first consider the curve that is plotted as short dashes in Figs 1(b) and 2(b). The functional values are missing on a subset of the time interval and hence the principal scores cannot be computed directly. They can, however, be predicted. We give the results for the profile only (one can proceed analogously for the first derivative). The predicted values for the first three components are (−28.7, 2.9, −1.9). Their prediction standard deviations quantifying the uncertainty are (1.7, 2.3, 1.8). Mainly for the first two components, these are relatively small compared with the standard deviations of the intrinsic variability (24.0, 7.8, 3.7) (the square roots of the eigenvalues); the corresponding relative errors are (0.07, 0.29, 0.48). It is not surprising that the best precision is achieved for the first component: this component dominates the spectrum and is quite simple (roughly constant), so even a fraction of the curve provides a relatively large amount of information about the score.

Next, we illustrate the method on the completely observed function plotted as the chain curves in Figs 1(b) and 2(b), from which we artificially remove observations in the time interval [23.75, 26]. Using the remaining part for the prediction, we estimate the scores by (5.84, 4.43, 4.18) (with prediction standard deviations (2.12, 2.68, 2.01)), which is quite close to the true values (5.76, 4.55, 4.32) computed from the complete curve (recall, however, that there will always be some random non-vanishing discrepancy between the predicted and true values because we predict random variables by their conditional expectations).

Finally, we illustrate the functional reconstruction procedure. In Fig. 4 we plot the two curves (and their derivatives) that we considered before and the reconstructed missing parts along with 95% prediction bands. For the originally complete function (Figs 4(b) and 4(d)), we chose a difficult scenario: the missing period is relatively large (2.25 h) and it contains a non-trivial change of shape of the curve, mainly in terms of the first derivative, which is decreasing in the observed region and increasing in the missing period. However, it is seen that the completion procedure can recover the missing part of the information, as the predicted curve (thick) approximates the true function (thin) very well. It is interesting that our method captures to some extent the presence of a local minimum in the first derivative. This illustrates the usefulness of the reconstruction procedure: without it, important shape features like this would be concealed from the analyst.

[Figure 4 appears here. Fig. 4: (a), (b) observed and reconstructed heart rate profiles and (c), (d) derivatives, along with 95% prediction bands, for (a), (c) an incompletely observed curve and (b), (d) a complete curve with an artificially introduced missing period.]

At first glance, some of the bands may seem wide, but one needs to keep in mind that they are prediction (not confidence) bands and, therefore, must cover the random trajectory (rather than a non-random function) with high probability. The uncertainty of the completion is in fact not large in proportion to the intrinsic variability of the stochastic process: the relative error is 0.10 and 0.11 for the curves in Figs 4(a) and 4(c), and 4(b) and 4(d), respectively. A referee pointed out that the prediction bands for the derivatives are narrower than those for the curves. This is not a general phenomenon: it is possible to construct simple examples with prediction bands for derivatives that are wider than those for curves, or examples with no such inequality. Differentiation is an operation that changes the covariance structure of functional data in a complex manner.

We compared our method with that of Yao et al. (2005a) applied to the raw heart rate values (not preprocessed by smoothing). Although their method was primarily developed for sparsely observed curves, it can also be used in our situation. The main results regarding the covariance structure of the profiles were similar for both methods. The proportion of variance explained by the first three principal components was 82.9%, 10.8% and 3.4%. The first three eigenfunctions had a similar shape and interpretation with both methods. There was a high degree of agreement between the principal scores obtained by the two methods. The method of Liu and Müller (2009) can reconstruct derivatives. However, our method seems to be the only currently available method that can perform principal component analysis of derivatives under incompleteness. This is an important asset of our method over the other approach, provided that the data are sufficiently dense on subsets of the domain.

Acknowledgements

This work was done within the 'Swiss kidney project on genes in hypertension', which is a collaboration between Murielle Bochud (Principal Investigator), M. Burnier, O. Devuyst, P.-Y. Martin, M. Mohaupt, F. Paccaud, A. Péchère-Bertschi, B. Vogt, D. Ackermann, H. Alwan, Y. Bouatou, N. Dhayat, G. Ehret, I. Guessous, P. Monney, M.-E. Mueller, B. Ponte, M. Pruijm, S. Reverdin, P. Vuistiner, Z. Kutalik and S. Estoppey. The project was funded by the Swiss National Science Foundation. Special thanks are given to Murielle Bochud for her support and interest, and for her understanding of the importance of methodological developments in statistics. The hospitality of the Institute of Social and Preventive Medicine Lausanne is gratefully acknowledged. I am also grateful to the Joint Editor, the Associate Editor and two referees for their interesting comments and encouragement.

Appendix A: Main proofs

Here we prove Theorems 1 and 2. Propositions 1–4 are proven in the supplementary document that is available online. Recall that we denote by $\|\cdot\|$ the $L^2$-norm of square integrable functions on a domain $S$ that is obvious from the context ($S$ will be $[0,1]$ or $O_i$ or $M_i$).
For linear operators, the symbols · ∞ and · 2 are used for the operator norm and the Hilbert–Schmidt norm respectively, where the operator will be a mapping between L2 .S1/ and L2 .S2/ with S1 and S2 that is obvious from the context. For definitions of basic notions from operator theory, we refer to Bosq (2000). A.1. Proof of theorem 1 We neglect the fact that the data are centred by the estimated mean function and assume that the mean is known and equal to 0. The result remains valid when the curves are centred empirically, as the additional terms are negligible. It is enough to prove the inequality in the statement of the theorem; the remaining assertions follow easily. We write | ˆβ .α/ ijMi − ˜βijMi | | ˆβ .α/ ijMi − ˜β .α/ ijMi |+| ˜β .α/ ijMi − ˜βijMi |, which is a decomposition into the estimation error and approximation error. If we show that both errors converge in L2 .P/ to 0, the result will follow. We denote the approximation error A1 =| ˜β .α/ ijMi − ˜βijMi | and compute E.A2 1/=E{ XiOi , ˜a.α/ ij − ˜a2 iji } = R 1=2 OiOi . ˜a.α/ ij − ˜aij/ 2 = R 1=2 OiOi .R.α/−1 OiOi −R−1 OiOi /rij 2 = ∞ k=1 λOiOik 1 λOiOik +α − 1 λOiOik 2 rij, ϕOiOik 2 =α ∞ k=1 αλOiOik .λOiOik +α/2 rij, ϕOiOik 2 λ2 OiOik =O.α/, where λOiOik and ϕOiOik are the eigenvalues and eigenfunctions of ROiOi and the result follows from the fact that αλOiOik=.λOiOik +α/2 1 and Picard’s condition (7). Let us turn to the estimation error | ˆβ .α/ ijMi − ˜β .α/ ijMi |. The computation of expectations is complicated by the fact that the quantities ˆROiOi and ˆrij are obtained from the whole sample including the ith function and thus are dependent on the ith function. We overcome this complication by first considering a modified problem with estimates of ROiOi and rij independent of the ith function and then showing Partially Observed Functional Data 799 that this modification is asymptotically negligible. Specifically, we introduce ˆβ .α/ ijMi.−i/ = ˆR .α/−1 OiOi.−i/ ˆrij.−i/ with ˆR .α/ OiOi.−i/ = ˆROiOi.−i/ +αIOi and ˆrij.−i/ = ˆROiMi.−i/ ˆϕjMi.−i/. Here ˆROiOi.−i/ and ˆROiMi.−i/ are suboperators of the estimated covariance operator ˆR.−i/ that is computed from all functions except the ith, and ˆϕjMi.−i/ is a subfunction of the jth eigenfunction ˆϕj.−i/ of ˆR.−i/. We decompose | ˆβ .α/ ijMi − ˜β .α/ ijMi | as follows: | ˆβ .α/ ijMi − ˜β .α/ ijMi | | ˆβ .α/ ijMi − ˆβ .α/ ijMi.−i/|+| ˆβ .α/ ijMi.−i/ − ˜β .α/ ijMi |, .15/ and we show that both terms converge in L2 .P/ to 0. For the second term on the right-hand side in inequality (15), A2 =| ˆβ .α/ ijMi.−i/ − ˜β .α/ ijMi |, we have E.A2 2/=E{E.| ˆβ .α/ ijMi.−i/ − ˜β .α/ ijMi |2 |{XkOk :k =i}/} =E{E.| XiOi , ˆa.α/ ij.−i/ − ˜a.α/ ij 2 |{XkOk :k =i}/} =E{ R 1=2 OiOi . ˆa.α/ ij.−i/ − ˜a.α/ ij / 2 }: Using the definitions of ˆa.α/ ij.−i/ and ˜a.α/ ij and the triangle inequality, we obtain R 1=2 OiOi . ˆa.α/ ij.−i/ − ˜a.α/ ij / R 1=2 OiOi ˆR .α/−1 OiOi.−i/. ˆROiMi.−i/ −ROiMi / ˆϕjMi.−i/ + R 1=2 OiOi ˆR .α/−1 OiOi.−i/ROiMi . ˆϕjMi.−i/ − ˆsjϕjMi / + R 1=2 OiOi . ˆR .α/−1 OiOi.−i/ −R.α/−1 OiOi /ROiMi.−i/ϕjMi with ˆsj =sgn ˆϕj.−i/, ϕj . Denote these three terms A21, A22 and A23 respectively. We see that A21 R 1=2 OiOi ∞ ˆR .α/−1 OiOi.−i/ ∞ ˆROiMi.−i/ −ROiMi ∞ ˆϕjMi.−i/ : Here, R 1=2 OiOi ∞ is a finite constant, ˆR .α/−1 OiOi.−i/ ∞ α−1 and ˆϕjMi.−i/ ˆϕj.−i/ =1. Using proposition 1 we obtain E.A2 21/ α−2 O.n−1 /. For the term A22 we have the bound A22 R 1=2 OiOi ∞ ˆR .α/−1 OiOi.−i/ ∞ ROiMi ∞ ˆϕjMi.−i/ − ˆsjϕjMi : In light of proposition 2, we see that E. ˆϕjMi.−i/ − ˆsjϕjMi 2 / E. 
ˆϕj.−i/ − ˆsjϕj 2 /=O.n−1 /. This implies that E.A2 22/ α−2 O.n−1 /. For the term A23, first note that ˆR .α/−1 OiOi.−i/ −R.α/−1 OiOi =R.α/−1 OiOi .R.α/ OiOi − ˆR .α/ OiOi.−i// ˆR .α/−1 OiOi.−i/ =R.α/−1 OiOi .ROiOi − ˆROiOi.−i// ˆR .α/−1 OiOi.−i/: Therefore, we see that A23 R 1=2 OiOi R.α/−1 OiOi ∞ ˆROiOi.−i/ −ROiOi ∞ ˆR .α/−1 OiOi.−i/ ∞ ROiMi ∞ ˆϕjMi.−i/ : The first, third and fifth term are dominated by α−1=2 , α−1 and 1 respectively. The fourth term is a finite constant. Using these bounds and proposition 1 we obtain E.A2 23/ α−3 O.n−1 /. Hence with the help of the Cauchy–Schwarz inequality we finally obtain that E.A2 2/ α−3 O.n−1 /. It remains to analyse the first term on the right-hand side of inequality (15). It reflects the effect of omitting the ith observation in the estimation. As this effect is of order O.n−2 / in terms of mean-squared difference, this term is negligible compared with the second term. In particular, it can be shown that E{. ˆβ .α/ ijMi − ˆβ .α/ ijMi.−i//2 } α−3 O.n−2 /. We omit the technical details. A.2. Proof of theorem 2 To simplify the proof of theorem 2 we assume that the mean is known to be 0 and no centring is performed. The difference due to the estimation of the mean is of negligible order in comparison with other terms. Similarly to the proof of theorem 1, we split the prediction error into the estimation error and regularization error as follows: ˆX .α/ iMi − ˜XiMi ˆX .α/ iMi − ˜X .α/ iMi + ˜X .α/ iMi − ˜XiMi : For the regularization error we compute 800 D. Kraus E. ˜X .α/ iMi − ˜XiMi 2 /= . ˜A .α/ i − ˜Ai/R 1=2 OiOi 2 2 = αRMiOi R−1 OiOi R.α/−1 OiOi R 1=2 OiOi 2 2 α RMiOi R−1 OiOi 2 2 α1=2 R.α/−1 OiOi R 1=2 OiOi 2 ∞ =α ˜Ai 2 2 sup k∈N α1=2 λ 1=2 OiOik λOiOik +α 2 O.α/: We turn to the estimation error. Similarly to the proof of theorem 1 we avoid the dependence between ˆA .α/ i and XiOi in ˆX .α/ iMi = ˆA .α/ i XiOi by considering ˆX .α/ iMi.−i/ = ˆA .α/ i.−i/XiOi , where the estimator of the covariance operator in the prediction operator is replaced by its analogue based on all curves except the ith. The difference is negligible in comparison with the remaining terms; for an analogous discussion see the proof of theorem 1. The modified estimation error equals E. ˆX .α/ iMi.−i/ − ˜X .α/ iMi 2 /=E{ . ˆRMiOi.−i/ ˆR .α/−1 OiOi.−i/ −RMiOi R.α/−1 OiOi /R 1=2 OiOi 2 2} E{ . ˆRMiOi.−i/ −RMiOi / ˆR .α/−1 OiOi.−i/R 1=2 OiOi 2 + RMiOi . ˆR .α/−1 OiOi.−i/ −R.α/−1 OiOi /R 1=2 OiOi 2}2 : The proof is complete on computing E{ . ˆRMiOi.−i/ −RMiOi / ˆR .α/−1 OiOi.−i/R 1=2 OiOi 2 2} E. ˆRMiOi.−i/ −RMiOi 2 2 ˆR .α/−1 OiOi.−i/ 2 ∞ R 1=2 OiOi 2 ∞/ E. ˆRMiOi.−i/ −RMiOi 2 2α−2 λOiOi1/ =α−2 O.n−1 /, E{ RMiOi . ˆR .α/−1 OiOi.−i/ −R.α/−1 OiOi /R 1=2 OiOi 2 2} E. RMiOi 2 ∞ ˆR .α/−1 OiOi.−i/ 2 ∞ × ˆROiOi.−i/ −ROiOi 2 2 R.α/−1 OiOi R 1=2 OiOi 2 ∞/ RMiOi 2 ∞α−2 E. ˆROiOi.−i/ −ROiOi 2 2α−1 / =α−3 O.n−1 /: References Antoniadis, A. and Sapatinas, T. (2003) Wavelet methods for continuous-time prediction using Hilbert-valued autoregressive processes. J. Multiv. Anal., 87, 133–158. Aston, J. A. D. and Kirch, C. (2012) Detecting and estimating changes in dependent functional data. J. Multiv. Anal., 109, 204–220. Benko, M., H¨ardle, W. and Kneip, A. (2009) Common functional principal components. Ann. Statist., 37, 1–34. Bosq, D. (2000) Linear Processes in Function Spaces. New York: Springer. Bugni, F. A. (2012) Specification test for missing functional data. Econmetr. Theor., 28, 959–1002. Cai, T. T. and Hall, P. (2006) Prediction in functional linear regression. Ann. 
Statist., 34, 2159–2179. Cardot, H., Ferraty, F. and Sarda, P. (1999) Functional linear model. Statist. Probab. Lett., 45, 11–22. Cardot, H., Mas, A. and Sarda, P. (2007) CLT in functional linear regression models. Probab. Theor. Reltd Flds, 138, 325–361. Dauxois, J., Pousse, A. and Romain, Y. (1982) Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. J. Multiv. Anal., 12, 136–154. Delaigle, A. and Hall, P. (2013) Classification using censored functional data. J. Am. Statist. Ass., 108, 1269–1283. Didericksen, D., Kokoszka, P. and Zhang, X. (2012) Empirical properties of forecasts with the functional autoregressive model. Computnl Statist., 27, 285–298. Ferraty, F. and Romain, Y. (eds) (2011) The Oxford Handbook of Functional Data Analysis. Oxford: Oxford University Press. Ferraty, F. and Vieu, P. (2006) Nonparametric Functional Data Analysis. New York: Springer. Goldberg, Y., Ritov, Y. and Mandelbaum, A. (2014) Predicting the continuation of a function with applications to call center data. J. Statist. Planng Inf., 147, 53–65. Partially Observed Functional Data 801 Groetsch, C. W. (1993) Inverse Problems in the Mathematical Sciences. Braunschweig: Vieweg. Hall, P. and Horowitz, J. L. (2007) Methodology and convergence rates for functional linear regression. Ann. Statist., 35, 70–91. Hall, P. and Hosseini-Nasab, M. (2006) On properties of functional principal components analysis. J. R. Statist. Soc. B, 68, 109–126. Hall, P., M¨uller, H.-G. and Wang, J.-L. (2006) Properties of principal component methods for functional and longitudinal data analysis. Ann. Statist., 34, 1493–1517. He, G., M¨uller, H.-G. and Wang, J.-L. (2003) Functional canonical analysis for square integrable stochastic processes. J. Multiv. Anal., 85, 54–77. He, G., M¨uller, H.-G., Wang, J.-L. and Yang, W. (2010) Functional linear regression via canonical analysis. Bernoulli, 16, 705–729. Horv´ath, L., Huˇskov´a, M. and Kokoszka, P. (2010) Testing the stability of the functional autoregressive process. J. Multiv. Anal., 101, 352–367. Horv´ath, L. and Kokoszka, P. (2012) Inference for Functional Data with Applications. New York: Springer. Horv´ath, L., Kokoszka, P. and Reeder, R. (2013) Estimation of the mean of functional time series and a twosample problem. J. R. Statist. Soc. B, 75, 103–122. James, G. M. and Hastie, T. J. (2001) Functional linear discriminant analysis for irregularly sampled curves. J. R. Statist. Soc. B, 63, 533–550. James, G. M., Hastie, T. J. and Sugar, C. A. (2000) Principal component models for sparse functional data. Biometrika, 87, 587–602. Jaruˇskov´a, D. (2013) Testing for a change in covariance operator. J. Statist. Planng Inf., 143, 1500–1511. Jolliffe, I. T. (2002) Principal Component Analysis. New York: Springer. Kallenberg, O. (2002) Foundations of Modern Probability. New York: Springer. Kargin, V. and Onatski, A. (2008) Curve forecasting by functional autoregression. J. Multiv. Anal., 99, 2508–2526. Kraus, D. and Panaretos, V. M. (2012) Dispersion operators and resistant second-order functional data analysis. Biometrika, 99, 813–832. Krzanowski, W. J. (2000) Principles of Multivariate Analysis. Oxford: Oxford University Press. Liebl, D. (2013) Modeling and forecasting electricity spot prices: a functional data perspective. Ann. Appl. Statist, 7, 1562–1592. Liu, B. and M¨uller, H.-G. (2009) Estimating derivatives for samples of sparsely observed functions, with application to online auction dynamics. J. Am. Statist. 
Ass., 104, 704–717. Mas, A. (2007) Testing for the mean of random curves: a penalization approach. Statist. Inf. Stoch. Processes, 10, 147–163. M¨uller, H.-G. and Stadtm¨uller, U. (2005) Generalized functional linear models. Ann. Statist., 33, 774–805. Panaretos, V. M., Kraus, D. and Maddocks, J. H. (2010) Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. J. Am. Statist. Ass., 105, 670–682. Pruijm, M., Ponte, B., Ackermann, D., Vuistiner, P., Paccaud, F., Guessous, I., Ehret, G., Eisenberger, U., Mohaupt, M., Burnier, M., Martin, P.-Y. and Bochud, M. (2013) Heritability, determinants and reference values of renal length: a family-based population study. Eur. Radiol., 23, 2899–2905. Ramsay, J. O., Hooker, G. and Graves, S. (2009) Functional Data Analysis with R and MATLAB. New York: Springer. Ramsay, J. O. and Silverman, B. W. (2002) Applied Functional Data Analysis. New York: Springer. Ramsay, J. O. and Silverman, B. W. (2005) Functional Data Analysis. New York: Springer. Sangalli, L. M., Secchi, P., Vantini, S. and Veneziani, A. (2009) A case study in exploratory functional data analysis: geometrical features of the internal carotid artery. J. Am. Statist. Ass., 104, 37–48. Yao, F., M¨uller, H.-G. and Wang, J.-L. (2005a) Functional data analysis for sparse longitudinal data. J. Am. Statist. Ass., 100, 577–590. Yao, F., M¨uller, H.-G. and Wang, J.-L. (2005b) Functional linear regression analysis for longitudinal data. Ann. Statist., 33, 2873–2903. Supporting information Additional ‘supporting information’ may be found in the on-line version of this article: ‘Supplementary document: Components and completion of partially observed functional data’. Supplementary document: Components and completion of partially observed functional data David Kraus Institute of Social and Preventive Medicine, University Hospital Lausanne, Switzerland Summary. This supplementary document describes computational details of the proposed methods and provides proofs of Propositions 1, 2, 3 and 4. 1. Computation 1.1. Preliminary steps In most applications, functional data are observed at discrete time points and are possibly subject to measurement error, so it is necessary to preprocess the raw data using smoothing techniques to obtain functions or their derivatives. In the context of partially observed functional data, the measurement time points are located only in observation periods Oi, while there are no measurements in missing periods Mi. We assume that the measurement points are dense in the observation periods, so that it is possible to apply smoothing techniques to obtain the functional values of the ith curve from the measured values of this curve. We use spline smoothing with a roughness penalty, as described in Ramsay and Silverman (2005, Chapter 5), but other methods like kernel smoothing can be used as well. In our experience, a simple approach works well: we apply the smoothing procedure to all values measured for the ith curve but use the computed smooth curve only for t ∈ Oi (ignoring it on Mi where measurements are not available to make it reliable). In practice, the observation and missing periods are typically not given (because they are not designed) and one needs to define them. For instance, one can define Mi to consist of the periods before the first and after the last measurement time and of all gaps between two consecutive measurement times that are larger than a certain threshold g. 
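A possible implementation of this rule is sketched below; representing the observation period O_i as a union of closed intervals, and the names used, are conventions of this illustration only.

```r
## Determine the observation period of one curve from its measurement times,
## treating gaps longer than g as part of the missing period M_i.
observed_intervals <- function(times, g) {
  times <- sort(times)
  gaps  <- diff(times)
  cut   <- which(gaps > g)                    # gaps over which we do not smooth
  starts <- times[c(1, cut + 1)]
  ends   <- times[c(cut, length(times))]
  # O_i is the union of these intervals; M_i is its complement in [0, 1],
  # including the periods before the first and after the last measurement.
  data.frame(start = starts, end = ends)
}

## Example: measurements with a gap around 0.5 and threshold g = 0.1
## observed_intervals(c(seq(0, 0.4, by = 0.05), seq(0.7, 1, by = 0.05)), g = 0.1)
```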
The value of g is the largest length of intervals without measurements over which we are willing to smooth. The choice of g depends on the particular setting; in general, if, for example, one considers K equidistant points in [0, 1] (e.g., K = 10) as the minimum reliable design for smoothing on the whole domain [0, 1], then g = 1/K seems reasonable. Sometimes, registration of functional data is needed. Shift registration (Ramsay and Silverman, 2005, Section 7.2) is easy to implement for incomplete functions: in the registration criterion the sample mean of partially observed functions is computed by the method described in the next subsection and the distance of each shifted curve from the sample mean is computed by numerical integration over the observed period of the curve; the criterion is minimised by the Procrustes method as usual. Methods based on warping can be modified similarly but further investigation of their performance is needed. 1.2. Principal component analysis, functional reconstruction For practical computation we must use finite dimensional representations of functions and operators. Two traditional approaches exist: we can use either basis expansions or evaluation on a grid 2 David Kraus of points. It is difficult to use the basis approach in our situation because incompletely observed functions are available on different subsets of the time domain. The grid approach is more suited for this type of data since it works directly with time arguments. Let tk = (k−0.5)/d, k = 1, . . . , d be a fine grid of equidistant points on which all functions and kernels of integral operators will be evaluated. Denote by xi the d-dimensional vector of values of Xi at points tk; this vector contains missing values on components corresponding to tk ∈ Mi while for tk ∈ Oi, its values are obtained by evaluation of the spline representation of Xi. Denote by X the (n × d)-dimensional data matrix with xi, i = 1, . . . , n in rows. The vector m of values of the mean function µ on the grid is estimated by ˆm equal to the vector of column means of X computed from available (not missing) data in each column. The covariance kernel ρ of the operator R evaluated on the grid corresponds to the (d × d)-matrix R with entries Rkl = ρ(tk, tl) and is estimated by the sample covariance matrix ˆR with entry ˆRkl computed from the data matrix X using all complete pairs of observations in columns k, l. To estimate the eigenvalues and eigenfunctions, one performs eigen-decomposition of the matrix ˆR. Denote ∆ = 1/d, the distance between the points of the grid. If the eigenvalues and eigenvectors of ˆR are ˆκj and ˆuj, j = 1, . . . , d, then the eigenvalues of the operator ˆR are ˆλj = ˆκj∆ and the corresponding eigenfunctions ˆϕj evaluated on the grid are ˆfj = ˆuj∆−1/2 . The observed part ˆβijOi = XiOi − ˆµOi , ˆϕjOi of the jth principal score of the ith curve is computed by numerical quadrature as ˆβijOi = xiOi − ˆmOi ,ˆfjOi ∆, where the latter inner product is the usual inner product of vectors and the vectors with subscript Oi are subvectors of the original vectors consisting of elements with indices k such that tk ∈ Oi. Within the grid representation, the evaluation of an integral operator B in the sense of numerical integration corresponds to matrix multiplication: for a function h, Bh is computed as Bh∆, where the vector h and the matrix B are the values of h and of the kernel of B on the grid. 
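A compact R sketch of the grid computations just described is as follows. It assumes that the data matrix X is n × d with NA on the missing periods; the pairwise-complete covariance used by cov is one possible version of the estimator of Section 2 and is an assumption of this illustration.

```r
## Mean, covariance and eigen-decomposition from incompletely observed curves
## evaluated on an equidistant grid of d points on [0, 1].
fpca_incomplete <- function(X) {
  d     <- ncol(X)
  delta <- 1 / d                                      # grid spacing
  m     <- colMeans(X, na.rm = TRUE)                  # mean from available data per column
  R     <- cov(X, use = "pairwise.complete.obs")      # covariance from complete pairs
  eig   <- eigen(R, symmetric = TRUE)
  lambda <- eig$values * delta                        # eigenvalues of the operator R-hat
  phi    <- eig$vectors / sqrt(delta)                 # eigenfunctions evaluated on the grid
  list(mean = m, cov = R, values = lambda, functions = phi, delta = delta)
}
```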
From a purely computational point of view, even linear operators that have no integral representation may be represented by matrices. In particular, the identity operator I used in ridge regularisation is represented by the matrix I equal to the identity matrix divided by ∆; indeed, its value at h is Ih∆ = h, thus it maps the argument on itself. The regularised operator ˆR (α) OiOi is represented by the matrix ˆR (α) OiOi = ˆROiOi +αIOi , where the subscript Oi denotes the submatrix corresponding to grid points in Oi. Analogously, the operators ˆRMiMi , ˆRMiOi etc. are given by the corresponding submatrices of ˆR. Then the matrix representation of the prediction operator ˆA (α) i is computed as ˆA (α) i = ˆROiMi ˆR (α)−1 OiOi ∆−1 . The regularised prediction of the missing part of the principal score and of the missing part of the trajectory can be computed as ˆβ (α) ij = ˆA (α) i (xiOi − ˆmOi )∆,ˆfjMi ∆, ˆx (α) iMi = ˆA (α) i (xiOi − ˆmOi )∆ + ˆmMi . The covariance operator ˆVi for the missing trajectory is obtained as ˆVi = ˆRMiMi − ˆA (α) i ˆROiOi ˆA (α)T i ∆2 and the variance for the score is ˆv2 ij = ˆfjMi , ˆVi ˆfjMi ∆2 . The effective degrees of freedom can be computed directly using the series in (9) truncated at d terms, with the eigenvalues ˆλOiOik of ˆROiOi obtained from the eigenvalues of the matrix ˆROiOi like in the case of those of ˆR discussed above. Alternatively, one can use the matrix trace formula trace( ˆR (α)−1 OiOi ˆROiOi ∆−1 )∆. The computation of the residual sum of squares for scores Supplementary document: Partially observed functional data 3 is straightforward; in the case of trajectories, the squared norms of functions are computed as the squared norms of vectors, multiplied by ∆. The generalised cross-validation score can be minimised numerically by a Newton-type iterative procedure. In particular, we use the method “L-BFGS-B” available in the function optim in the R package (R Core Team, 2013). For the reliability of the optimisation procedure, we found it useful to scale the input parameters: the minimisation is run with (xi − m)/s in place of xi (and, consequently, with ˆR/s2 in place of ˆR, ˆλOiOij/s2 in place of ˆλOiOij etc.); once the optimal value of α is found, it is multiplied by s2 to return to the original scale and perform other computations with original data. The value s2 = ˆλOiOi1 works well. The evaluation of the generalised crossvalidation score can be unstable for very small values of α. Therefore, we run the minimisation routine with a lower limit for α, namely with α0 = max(ε1/2 , α∗), where ε is the value of machine epsilon and α∗ is such that the effective degrees of freedom equal n/4 (which is a reasonable upper bound for the number of free parameters). We initialise the iterative procedure with α equal to max(¯λOiOi , α0) where ¯λOiOi is the average of the eigenvalues ˆλOiOij. 2. Proofs 2.1. Proof of Proposition 1 We use the notation Zi = Xi − µ. For part (a), denote ¯µ(t) = J(t)µ(t) and write E ˆµ − µ 2 ≤ E( ˆµ − ¯µ + ¯µ − µ )2 = E ˆµ − ¯µ 2 + 2 E( ˆµ − ¯µ ¯µ − µ ) + E ¯µ − µ 2 . (1) The first term on the right-hand side of (1) equals E J n i=1 Oi n i=1 OiZi 2 = n−2 1 0 n j=1 n k=1 E n2 J(t) ( n i=1 Oi(t))2 Oj(t)Zj(t)Ok(t)Zk(t) dt = n−2 1 0 n j=1 E n2 J(t)Oj(t) ( n i=1 Oi(t))2 E Zj(t)2 dt, where the last equality follows from the independence of (O1, . . . , On) and (Z1, . . . , Zn), and from the independence of Zj and Zk for j = k. 
Rewrite the first expectation in the integrand as E n2 J(t)Oj(t) ( n i=1 Oi(t))2 1[n−1 n i=1 Oi(t)>δ1] + E n2 J(t)Oj(t) ( n i=1 Oi(t))2 1[n−1 n i=1 Oi(t)≤δ1] . For all t ∈ [0, 1], the first summand is bounded from above by δ−2 1 while the second summand is dominated by n2 supt∈[0,1] P(n−1 n i=1 Oi(t) ≤ δ1). Hence we see that E ˆµ − ¯µ 2 ≤ n−1 δ−2 1 + n2 sup t∈[0,1] P n−1 n i=1 Oi(t) ≤ δ1 E Z1 2 = O(n−1 ). For the last term in (1), we obtain 1 0 E(J(t) − 1)µ(t)2 dt = 1 0 P n i=1 Oi(t) = 0 µ(t)2 dt 4 David Kraus ≤ sup t∈[0,1] P n−1 n i=1 Oi(t) ≤ δ1 µ 2 = O(n−2 ). The second term on the right-hand side of (1) is dominated by 2(E ˆµ− ¯µ 2 )1/2 (E ¯µ−µ 2 )1/2 ≤ O(n−1 ). Putting these results together completes the proof of part (a). The proof of part (b) is similar. Rewrite ˆR − R = ( ˆR − ˇR) + ( ˇR − ¯R) + ( ¯R − R), (2) where ˇR and ¯R are integral operators with kernels ˇρ(s, t) = I(s, t) n i=1 Ui(s, t) n i=1 Ui(s, t)Zi(s)Zi(t), and ¯ρ(s, t) = I(s, t)r(s, t). The first term on the right-hand side of (2) reflects the effect of estimation of the mean. By direct computation, we see that E ˆR − ˇR 2 2 = E [0,1]2 I(s, t){ˆµst(s) − µ(s)}2 {ˆµst(t) − µ(t)}2 dsdt = E [0,1]2 I(s, t) ( n i=1 Ui(s, t))4 n i=1 Ui(s, t)Zi(s) 2 n i=1 Ui(s, t)Zi(t) 2 dsdt. Developing the sums in the integrand and using the independence of the functions and observation indicators and the Cauchy–Schwarz inequality, we can show that the above quantity is dominated by n−2 [0,1]2 E n2 I(s, t) ( n i=1 Ui(s, t))2 {(E Z1(s)4 E Z1(t)4 )1/2 + ρ(s, t)2 }dsdt ≤ O(n−2 ), where the last inequality is due to the fact that the first expectation in the integrand is bounded by δ−2 2 +n2 sup(s,t)∈[0,1]2 P(n−1 n i=1 Ui(s, t) ≤ δ2), which can be shown by manipulations similar to those in part (a). Next, analogously to part (a) we obtain for the second and third term on the right-hand side of (2) that E ˇR − ¯R 2 2 ≤ n−1 δ−2 2 + n2 sup (s,t)∈[0,1]2 P n−1 n i=1 Ui(s, t) ≤ δ2 E Z1 ⊗ Z1 − R 2 2 = O(n−1 ) (here ⊗ denotes the tensor product) and E ¯R − R 2 2 ≤ O(n−2 ). Combining these bounds we obtain the assertion of part (b). 2.2. Proof of Proposition 2 Lemma 4.2 of Bosq (2000) and the inequality between the operator norm and Hilbert–Schmidt norm yield that |ˆλj − λj| ≤ ˆR − R ∞ ≤ ˆR − R 2 for all j. The first result then follows from part (b) of Proposition 1. For the second part, Lemma 4.3 of Bosq (2000) gives the inequality Supplementary document: Partially observed functional data 5 ˆϕj − ˆsjϕj ≤ aj ˆR − R ∞, where aj is a constant depending on the eigenvalue spacings. Note that this lemma is formulated in Bosq (2000) for fully observed functions but an inspection of the proof shows that the inequality holds for any two compact linear operators in place of ˆR, R. This inequality, the dominance of the Hilbert–Schmidt norm over the operator norm and part (b) of Proposition 1 complete the proof. 2.3. Proof of Proposition 3 Rewrite ˆβ (αn) ijMi − βijMi = (ˆβ (αn) ijMi − ˜βijMi ) + (˜βijMi − βijMi ) and use Theorem 1 to obtain the first part of the proposition. Compute v2 ij = var(˜βijMi − βijMi ) = ϕjMi , RMiMi ϕjMi − ϕjMi , RMiOi R−1 OiOi ROiMi ϕjMi . The convergence in probability of ˆϕjMi , ˆRMiMi ˆϕjMi to ϕjMi , RMiMi ϕjMi is a direct consequence of Propositions 1 and 2. The last term in the expression for v2 ij and the corresponding term in the estimator ˆv2 ij equal ˜aij, ROiOi ˜aij , ˆa (αn) ij , ˆROiOi ˆa (αn) ij , respectively. 
In their difference ˆa (αn) ij , ( ˆROiOi − ROiOi )ˆa (αn) ij + ( ˆa (αn) ij , ROiOi ˆa (αn) ij − ˜aij, ROiOi ˜aij ), the convergence of the second term to zero was shown in the proof of Theorem 1. For the first term we compute | ˆa (αn) ij , ( ˆROiOi − ROiOi )ˆa (αn) ij | ≤ ˆROiOi − ROiOi ∞ ˆa (αn) ij 2 ≤ OP (n−1/2 )α−2 n ˆROiMi 2 ∞ → 0. This completes the proof of the consistency of ˆv2 ij. The remaining assertions are obvious. 2.4. Proof of Proposition 4 We can rewrite ˆX (αn) iMi − XiMi = ( ˆX (αn) iMi − ˜XiMi ) + ( ˜XiMi − XiMi ). Due to Theorem 2, the L2 -norm of the first term on the right-hand side converges to 0 in probability. The second term is the limiting stochastic process. The consistency of the covariance estimator can be proven like in the proof of Proposition 3. The assertion for the Gaussian case follows immediately from the fact that the limiting process is a linear function of Xi. References Bosq, D. (2000). Linear Processes in Function Spaces. Springer, New York. R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer, New York. D. Classification of functional fragments by regularized linear classifiers with domain selection By David Kraus and Marco Stefanucci Biometrika, 106(1):161–180, 2019 DOI: 10.1093/biomet/asy060 112 Biometrika (2019), 106, 1, pp. 161–180 doi: 10.1093/biomet/asy060 Printed in Great Britain Advance Access publication 17 December 2018 Classification of functional fragments by regularized linear classifiers with domain selection By DAVID KRAUS Department of Mathematics and Statistics, Masaryk University, Kotláˇrská 2, 61137 Brno, Czech Republic david.kraus@mail.muni.cz AND MARCO STEFANUCCI Department of Statistical Sciences, Sapienza University of Rome, Piazzale Aldo Moro 5, 00185 Roma, Italy marco.stefanucci@uniroma1.it Summary We consider classification of functional data into two groups by linear classifiers based on one-dimensional projections of functions. We reformulate the task of finding the best classifier as an optimization problem and solve it by the conjugate gradient method with early stopping, the principal component method, and the ridge method. We study the empirical version with finite training samples consisting of incomplete functions observed on different subsets of the domain and show that the optimal, possibly zero, misclassification probability can be achieved in the limit along a possibly nonconvergent empirical regularization path. We propose a domain extension and selection procedure that finds the best domain beyond the common observation domain of all curves. In a simulation study we compare the different regularization methods and investigate the performance of domain selection. Our method is illustrated on a medical dataset, where we observe a substantial improvement of classification accuracy due to domain extension. Some key words: Classification; Conjugate gradient; Domain selection; Functional data; Partial observation; Regularization; Ridge method. 1. Introduction We consider classification of a functional observation into one of two groups. Classification of functional data is a rich, longstanding topic and is comprehensively surveyed in Baíllo et al. (2011b). 
Delaigle & Hall (2012a) showed that depending on the relative geometric positions of the difference of the group means, representing the signal, and the covariance operator, summarizing the structure of the noise, certain classifiers can have zero misclassification probability. This remarkable phenomenon, called perfect classification, is a special property of the infinite-dimensional setting and cannot occur in the multivariate context, except in degenerate cases. Delaigle & Hall (2012a) showed that a particularly simple class of linear classifiers, based on a carefully chosen one-dimensional projection of the function to be classified, can achieve this optimal error rate either exactly or in the limit along a sequence of approximations. Berrendero et al. (2018) further elucidated the perfect classification phenomenon from the point c 2018 Biometrika Trust Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 162 D. Kraus AND M. Stefanucci of view of the Feldman–Hájek dichotomy between mutual singularity and absolute continuity of two Gaussian measures on abstract spaces with respect to each other. Motivated by these findings, we reformulate the problem of determining the best classifier as a quadratic optimization problem on a function space or, equivalently, a linear inverse problem. These problems are ill-posed; however, unlike with most inverse problems, this is not a complication but rather an advantage in the sense that the more ill-posed the problem is, the better the optimal misclassification probability. We use regularization techniques, such as the method of conjugate gradients with early stopping and ridge regularization, to solve the optimization problem, obtaining a class of regularized linear classifiers. The optimal misclassification rate is the limit along the regularization path of solutions which themselves may not converge. We study the empirical version of the problem, where the objective function in the constrained minimization must be estimated from finite training data, and make two contributions. First, we show that it is possible to construct an empirical regularization path towards the possibly nonexistent unconstrained solution such that the classification error converges to its best value, possibly zero. We do this for conjugate gradient, principal component and ridge classification in a truly infinite-dimensional manner, in the sense that the convergence takes place along a path with decreasing regularization and holds without restrictions on the mean difference between classes. Second, all our methods and theory are developed in the setting of partially observed functional data, where trajectories are observed only on subsets of the domain. This type of incomplete data, also called functional fragments, is increasingly common in applications; see, for example, Bugni (2012), Delaigle & Hall (2013), Liebl (2013), Goldberg et al. (2014), Kraus (2015), Delaigle & Hall (2016) and Gromenko et al. (2017). The principal difficulty for inference with fragments is that temporal averaging is precluded by the incompleteness of the observed functions. Our formulation as an optimization problem enables us to overcome this issue under certain assumptions, because only averaging across individuals in the training data is needed, and not individual curves. 
Since the observation domains may vary in the training sample and the new curve to be classified may be observed on a different subset, it is natural to ask which domain should be used. We propose a domain selection strategy that looks for the best classifier with domain ranging from a minimum common domain to the entire domain of the function to be classified. For various methods of selecting the best observation points, see Ferraty et al. (2010), Delaigle et al. (2012), Pini & Vantini (2016), Berrendero et al. (2018) and Stefanucci et al. (2018). Our simulation study confirms that domain selection can considerably reduce the misclassification rate. Further simulations compare the performances of the three types of regularization. Among other findings, this study shows that the principal component and conjugate gradient classifiers often achieve comparable error rates but that the latter usually needs a lower dimension of the regularization subspace, in agreement with a theoretical result we provide. Application to a dataset on the geometric features of the internal carotid artery in patients with and without aneurysm demonstrates the utility of our proposed approach. These data consist of trajectories observed on intervals of different lengths. Previous analyses of the data used the common domain of all curves in classification. With our results we can include information beyond this minimum domain, which leads to a substantial drop in the error rate of discrimination between risk groups. General references on functional data analysis include Ramsay & Silverman (2005) and Horváth & Kokoszka (2012). Further relevant references are Cuesta-Albertos et al. (2007) for other methods based on one-dimensional projections, Berrendero et al. (2016) for variable selection in classification, Bongiorno & Goia (2016) and Dai et al. (2017) for classification beyond the Gaussian setting, and Cuevas (2014) for an overview. Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 163 2. Regularized linear classification 2.1. Projection classifiers We regard functional observations as random elements of the separable Hilbert space L2(I) of square-integrable functions on a compact domain I equipped with inner product f , g = I f (t)g(t) dt and norm f = f , f 1/2. In most applications I is an interval and the observations are curves, but our results can be extended to other objects, such as surfaces or images. We consider classification of a Gaussian random function, X , into one of two groups of Gaussian random functions: group 0 has mean μ0; group 1 has mean μ1. Both groups have covariance operator R defined as the integral operator (Rf )(·) = I ρ(· , t)f (t) dt with kernel ρ(s, t) = cov{X (s), X (t)}. In this section we assume that μ0, μ1 and R are known, which corresponds to the asymptotic situation with an infinite training sample. To simplify the presentation we assume throughout the paper that the new observation to be classified may come from either of the two classes with equal prior probability. The general case is treated in the Supplementary Material. Like Delaigle & Hall (2012a) we consider the class of centroid classifiers that are based on one-dimensional projections of the form X , ψ , where ψ is a function in L2(I). If X belongs to group j (j = 0, 1), the distribution of X , ψ is normal with mean μj, ψ and variance ψ, Rψ . Denote the corresponding Gaussian densities by fψ,j. 
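As a small numerical check of this projection step, the following R sketch simulates group-1 curves on a grid and verifies that the Riemann-sum approximation of the projection has mean ⟨μ1, ψ⟩ and variance ⟨ψ, Rψ⟩. The grid tt, the squared-exponential kernel (borrowed from the simulations of § 5.1), the mean mu1 and the direction psi are all illustrative assumptions, not quantities from the original analysis.

set.seed(1)
tt    <- seq(0, 1, length.out = 100)      # observation grid on I = [0, 1]
Delta <- tt[2] - tt[1]                    # grid spacing
R     <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))  # illustrative covariance kernel
mu1   <- 0.6 * tt                         # hypothetical group-1 mean; group 0 has mean 0
psi   <- sin(2 * pi * tt)                 # an arbitrary projection direction

## Simulate group-1 curves X ~ N(mu1, R) on the grid and project them
L <- chol(R + 1e-10 * diag(length(tt)))   # small jitter for numerical stability
X <- sweep(matrix(rnorm(5000 * length(tt)), 5000) %*% L, 2, mu1, "+")
proj <- drop(X %*% psi) * Delta           # Riemann sums approximating <X, psi>

## Empirical moments of the projections versus <mu1, psi> and <psi, R psi>
c(mean(proj), sum(mu1 * psi) * Delta)
c(var(proj),  drop(psi %*% R %*% psi) * Delta^2)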
The optimal classifier based on X , ψ assigns X to the class Cψ(X ) given by Cψ(X ) = 1{fψ,1( X ,ψ )/fψ,0( X ,ψ )>1} = 1{ X −μ0,ψ 2− X −μ1,ψ 2>0} = 1{Tψ (X )>0}, where Tψ(X ) = X − ¯μ, ψ μ, ψ with ¯μ = (μ0 + μ1)/2 and μ = μ1 − μ0. The misclassification probability of this classifier is D(ψ) = P0{Cψ(X ) = 1}/2 + P1{Cψ(X ) = 0}/2 = P0( X − ¯μ, ψ μ, ψ > 0) = P0( X − μ0, ψ > | μ, ψ |/2) = 1 − | μ, ψ | 2 ψ, Rψ 1/2 , (1) where Pj is the distribution of curves in group j and is the standard normal cumulative distribution function. To find the best function ψ, one would ideally like to maximize |Z(ψ)|, where Z(ψ) = μ, ψ ψ, Rψ 1/2 . Similarly to Delaigle & Hall (2012a) and Berrendero et al. (2018), we see that if R−1/2μ < ∞, then by the Cauchy–Schwarz inequality, | μ, ψ | ψ, Rψ 1/2 = | R−1/2μ, R1/2ψ | ψ, Rψ 1/2 R−1/2μ R1/2ψ ψ, Rψ 1/2 = R−1/2 μ . (2) If, moreover, R−1μ < ∞, then the equality is achieved for ψ = R−1μ. For this choice of ψ, or anymultipleofit,theprobabilityofmisclassificationis1− ( R−1/2μ /2),whichispositivedue Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 164 D. Kraus AND M. Stefanucci to the finiteness of R−1/2μ , which can be seen as the signal-to-noise ratio. If R−1/2μ < ∞, then regardless of whether R−1μ < ∞ or not, two Gaussian measures with mean difference μ and covariances R are mutually absolutely continuous and 1− ( R−1/2μ /2) is the Bayes error for distinguishing them, i.e., the lowest possible misclassification probability for this problem among all possible classifiers (Berrendero et al., 2018). If R−1/2μ < ∞ but R−1μ = ∞, then the Bayes risk cannot be achieved by a projection classifier based on a bounded linear functional of the form X , ψ for some ψ ∈ L2(I). One can, however, use the theory of reproducing kernel Hilbert spaces to define a linear classifier that achieves the Bayes risk. We do not pursue this line of development here because, as will be seen in § 2.2, approximations in the form of projections can asymptotically achieve the Bayes risk. The maximization of |Z(ψ)| can be solved as the task of maximizing μ, ψ subject to ψ, Rψ = 1. Using Lagrange multipliers μ, ψ + λ(1 − ψ, Rψ ) and taking the Fréchet derivative with respect to ψ, one obtains the equation 2λRψ = μ. Solutions for all λ > 0, if they exist, i.e., if R−1μ < ∞, yield the same optimal misclassification probability. Without loss of generality we take λ = 1/2. Thus, minimizing the error rate translates into the unconstrained quadratic optimization problem to maximize μ, ψ − ψ, Rψ /2, or minimize ψ, Rψ /2 − μ, ψ , (3) i.e., into the linear problem Rψ = μ. 2.2. Regularization Ifψ = R−1μdoesnotexistinL2(I),i.e., R−1μ = ∞,thereisnomaximizerof|Z(ψ)|.One can instead consider an approximating, regularized problem that can be solved. Regularization is typically used to solve, in a stable way, ill-posed inverse problems for which a solution exists. In such contexts, the path of regularized solutions converges to the solution to the problem of interest. Here it may be that no solution exists, but paths of regularized solutions towards the possibly nonexistent solution still turn out to be useful, since the misclassification probability converges to the optimal value along these paths. If a solution exists, one can approximate it by an iterative numerical method. This approach can also be used when no solution exists. The idea is to construct a sequence of iterations of an appropriate numerical optimization method. 
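In a discretized implementation these quantities take a simple form. The following R sketch, under the same illustrative grid, kernel and mean difference as above, evaluates the error (1) for a given direction and solves the grid analogue of Rψ = μ; the small constant added to the diagonal is an assumption made because the discretized operator is severely ill-conditioned, which already anticipates the role of regularization.

tt    <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
R     <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))   # kernel rho(s, t)
mu    <- 0.6 * tt                                               # illustrative mean difference
D <- function(psi) {                                            # misclassification probability (1)
  num <- abs(sum(mu * psi) * Delta)                             # |<mu, psi>|
  den <- sqrt(drop(psi %*% R %*% psi) * Delta^2)                # <psi, R psi>^{1/2}
  1 - pnorm(num / (2 * den))
}
## The operator equation R psi = mu becomes the linear system (R Delta) psi = mu;
## the raw system is numerically singular, so a tiny ridge is added to the diagonal.
psi_inv <- solve(R * Delta + 1e-6 * diag(length(tt)), mu)
D(mu)        # naive direction psi = mu
D(psi_inv)   # typically far below D(mu); cf. the bound in (2)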
The number of steps taken along this divergent sequence towards the nonexistent solution can be seen as a regularization parameter. The conjugate gradient method is particularly suitable for this situation. The first m steps of the conjugate gradient method applied to the linear inverse problem Rψ = μ, or equivalently to the minimization of the quadratic functional ψ, Rψ /2 − μ, ψ , are described in Algorithm 1. This formulation is based on the multivariate version in Phatak & de Hoog (2002, § 5), where one can find further references and details on how applying the conjugate gradient method to the normal equations in linear regression leads to partial least squares regression. The functions νj are conjugate directions in the sense that νj, Rνk = 0 for j |= k, and the functions ζj are called residuals in numerical analysis and are orthogonal, i.e., ζj, ζk = 0 for j |= k. In step j, the algorithm moves from the current approximate solution ˆψCG j along the conjugate direction νj with step length hj that minimizes the quadratic objective. The residual is then updated to ζj+1. The new conjugate direction νj+1 is obtained by projecting the residual ζj+1 onto the orthogonal complement of the span of the previous conjugate directions, where orthogonality is in the sense of the inner product · , R(·) . Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 165 Algorithm 1. Conjugate gradient regularized classification direction. Initialize ψCG 0 = 0, ν0 = ζ0 = μ Repeat for j = 0, . . . , m − 1 hj = νj, ζj / νj, Rνj ψCG j+1 = ψCG j + hjνj ζj+1 = μ − RψCG j+1 (= ζj − hjRνj) gj = − ζj+1, Rνj / νj, Rνj νj+1 = ζj+1 + gjνj Output ψCG m The conjugate gradient approach is an example of dimension reduction regularization. The method solves the minimization problem (3) with ψ restricted to the Krylov subspace Km(R, μ) spanned by μ, Rμ, . . . , Rm−1μ and also by the first m conjugate directions νj or the first m residuals ζj; that is, it seeks to minimize ψ, Rψ /2 − μ, ψ subject to ψ ∈ Km(R, μ). The projection direction that solves this minimization is ψCG m . Another popular choice is to minimize ψ, Rψ /2 − μ, ψ subject to ψ ∈ Em(R), where Em(R) is the subspace spanned by the first m eigenfunctions, ϕ1, . . . , ϕm, of R in the spectral decomposition R = ∞ j=1 λjϕj ⊗ ϕj, with λ1 λ2 · · · > 0 being the eigenvalues. The solution ψPC m = m j=1 λ−1 j μ, ϕj ϕj gives the principal component classifier of Delaigle & Hall (2012a). In general one can minimize ψ, Rψ /2 − μ, ψ subject to ψ ∈ Sm, where Sm is the mdimensional subspace generated by some functions s1, . . . , sm such that the sj (j = 1, 2, . . . ) generate the range of R. Let Pm be the projection operator that projects onto Sm, and let Rm = PmRPm and R− m = PmR−1Pm. Then the solution of the regularized minimization problem is ψm = R− m μ. More explicitly, considering solutions of the form ψm = m j=1 cjsj leads to the m-variate minimization of cT Qc/2 − uT c where the matrix Q is such that Qjk = sj, Rsk and the vector u has components uj = μ, sj , i.e., to the solution with coefficients c = Q−1u. In the case of the Krylov subspace, the iterative conjugate gradient method given in Algorithm 1 is, however, preferred because the matrix Q is ill-conditioned. We can also take another approach to regularization, based on ridge regression. 
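Before turning to the ridge approach, a direct grid transcription of Algorithm 1 and of the principal component direction may make the comparison concrete. The sketch below reuses the illustrative grid, kernel and mean difference of the previous sketches; the function names and the choice m = 1, …, 6 are assumptions for illustration only.

tt <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
R  <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))
mu <- 0.6 * tt
ip  <- function(f, g) sum(f * g) * Delta            # <f, g> as a Riemann sum
Rop <- function(f) drop(R %*% f) * Delta            # (R f)(t) on the grid

cg_direction <- function(m) {                       # m steps of Algorithm 1
  psi <- numeric(length(mu)); nu <- zeta <- mu
  for (j in seq_len(m)) {
    Rnu  <- Rop(nu)
    h    <- ip(nu, zeta) / ip(nu, Rnu)              # step length h_j
    psi  <- psi + h * nu
    zeta <- zeta - h * Rnu                          # residual zeta_{j+1} = mu - R psi_{j+1}
    g    <- -ip(zeta, Rnu) / ip(nu, Rnu)
    nu   <- zeta + g * nu                           # next conjugate direction
  }
  psi
}
pc_direction <- function(m) {                       # psi_m^PC = sum_j lambda_j^{-1} <mu, phi_j> phi_j
  e   <- eigen(R * Delta, symmetric = TRUE)
  phi <- e$vectors[, 1:m, drop = FALSE] / sqrt(Delta)
  drop(phi %*% (drop(crossprod(phi, mu)) * Delta / e$values[1:m]))
}
D <- function(psi)                                  # misclassification probability (1)
  1 - pnorm(abs(ip(mu, psi)) / (2 * sqrt(ip(psi, Rop(psi)))))
## For each m the conjugate gradient error is typically no larger than the principal component error
sapply(1:6, function(m) c(CG = D(cg_direction(m)), PC = D(pc_direction(m))))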
Optimizing the misclassification probability in a ball with radius θ1/2 leads to the task of minimizing ψ, Rψ /2− μ, ψ subject to ψ 2 θ or, equivalently, minimizing ψ, Rψ /2− μ, ψ + α ψ 2/2, where α 0 is a regularization parameter. The solution is ψR α = R−1 α μ, where Rα = R + αI and I denotes the identity operator. Despite its practical performance and amenability to theoretical analysis, the functional ridge classifier does not seem to have been considered before. There is an important difference between the conjugate gradient method and the other approaches.While the principal component and ridge methods regularize the problem without the main goal in mind, the conjugate gradient approach greedily follows the goal of optimal classification. Indeed, the conjugate gradient method as an iterative optimization procedure constructs the regularization path focusing on the minimization of the misclassification probability, whereas the other approaches regularize by modifying the operator to be inverted regardless of the goal. Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 166 D. Kraus AND M. Stefanucci From a computational point of view the conjugate gradient method is simplest because it does not require inversion or eigendecomposition. 2.3. Properties of regularization paths While ψm, the solution regularized by a subspace constraint, in general need not converge as m → ∞ since a solution to the unconstrained minimization problem may not exist, the misclassification probability associated with the linear classifier given by ψm converges along the regularization path. The following and all other results are proved in the Appendix. Proposition 1. The misclassification probability of the regularized linear classifier based on ψm = R− m μ converges to 1 − ( R−1/2μ /2) as m → ∞. This result holds regardless of whether the unconstrained minimization problem (3) has a solution, i.e., regardless of whether R−1μ < ∞. The limiting misclassification probability is positive if R−1/2μ < ∞ or zero if R−1/2μ = ∞. As discussed earlier, the optimal error is achieved exactly by the one-dimensional projection onto ψ = R−1μ, when R−1μ < ∞. Even when R−1μ = ∞, both of the dimension reduction techniques, namely the conjugate gradient and principal component methods, and also ridge regularization as we will soon see, achieve the optimal limiting error rate along a possibly nonconvergent path of one-dimensional projection directions. It is natural to investigate and compare how quickly the misclassification rate approaches the limit for the two main types of subspace regularization. It turns out that the conjugate gradient classifier, being a greedy, goal-oriented procedure, performs as well as or better than the principal component classifier with the same dimension. Proposition 2. Regardless of whether the optimal misclassification probability can be achieved exactly or along a regularization path, i.e., whether R−1μ < ∞ or R−1μ = ∞, and regardless of whether the optimal misclassification probability is zero or positive, i.e., whether R−1/2μ = ∞ or R−1/2μ < ∞, the misclassification probability of the principal component classifier using m components is higher than or equal to the misclassification probability of the m-step conjugate gradient classifier. Phatak & de Hoog (2002, § 6.2) showed in the multivariate setting that ‘PLS fits closer than PCR’. 
In infinite dimensions, in the context of kernel partial least squares, Blanchard & Krämer (2010, Theorem 1) showed that the partial least squares solution is closer to the true solution of the inverse problem than is the principal component solution with the same number of components. Unlike these results, our Proposition 2 does not assume the existence of a solution and instead focuses on the values of the misclassification probability. Although Proposition 2 suggests that the conjugate gradient method will typically use fewer components than the principal component method to achieve the best result, the resulting misclassification probability with the best number of components need not be better. We address this in the simulation study. A similar phenomenon was previously studied in the literature on partial least squares in finite dimensions and in the functional setting by Febrero-Bande et al. (2017). As in the case of subspace regularization, below we obtain the convergence of the error probability of the ridge classifier, whether or not the unconstrained minimization problem (3) has a solution, i.e., regardless of whether R−1μ < ∞. The limiting misclassification probability is positive if R−1/2μ < ∞ or zero if R−1/2μ = ∞. Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 167 Proposition 3. The misclassification probability of the regularized linear classifier based on ψR α = R−1 α μ converges to 1 − ( R−1/2μ /2) as α → 0+. 3. Empirical classifiers for fragmentary functions 3.1. Construction of classifiers with incomplete training samples So far we have assumed that the parameters of each group are known. We now present the empirical version with a finite training dataset, and show that under regularity conditions such classifiers can achieve asymptotically the same optimal error rate as if there were infinite training data. We aim to do this not only in the case of fully observed functions but also in the case of incomplete curves. Incompleteness can occur in the training data, with each curve possibly observed on a different domain, as well as in the new curve that we wish to classify. One strategy would be to consider all curves on the intersection of their observation domains, if it is nonempty. However, such a restriction can be too severe and is unnecessary. We will construct classifiers that use the observed new curve on a set I, which may be its entire observation set or a subset thereof, without requiring that all training curves be completely observed on I. For group j let there be a training sample consisting of nj curves, Xj1, . . . , Xjnj . The training data are assumed to be mutually independent. Curves may be observed incompletely, with values known only on a subset Oji of the domain and with no information about the values on the complement. The observation domains are assumed to be independent of the curves and consist of a finite union of intervals. We let Oji(t) denote the indicator of the curve Xji being observed at time t. Similarly, let Uji(s, t) indicate observation at times s and t, i.e., Uji(s, t) = Oji(s)Oji(t). The mean μj of group j can be estimated by the cross-sectional average ˆμj(t) = 1{Nj(t)>0} Nj(t) nj i=1 Oji(t)Xji(t) (j = 0, 1), where Nj(t) = nj i=1 Oji(t) is the total number of observed curves in group j at time t. The covariance kernel ρ(s, t) can be estimated by the empirical covariance using pairwise complete observations of groupwise centred curves. 
Formally, the estimator is ˆρ(s, t) = M1(s, t) ˆρ1(s, t) + M2(s, t) ˆρ2(s, t) M1(s, t) + M2(s, t) , where Mj(s, t) = nj i=1 Uji(s, t) and ˆρj(s, t) = 1{Mj(s,t)>0} Mj(s, t) nj i=1 Uji(s, t){Xji(s) − ˆμjst(s)}{Xji(t) − ˆμjst(t)} with ˆμjst(s) = 1{Mj(s,t)>0}Mj(s, t)−1 nj i=1 Uji(s, t)Xji(s). If Nj(t) = 0 or Mj(s, t) = 0, the estimators are defined as ˆμj(t) = 0 or ˆρj(s, t) = 0, respectively. This happens with asymptotically vanishing probability under Assumption 1 below. Suppose that the new independent curve to be classified, Xnew, is observed on the domain Onew. Let us fix the target domain I ⊆ Onew on which we aim to apply the classifier to Xnew. The empirical classifier ˆC ˆψ trained on partially observed curves is defined like the theoretical one, with unknown quantities replaced by their estimators. It assigns Xnew restricted to I to the class Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 168 D. Kraus AND M. Stefanucci ˆC ˆψ(Xnew) = 1{ ˆT ˆψ (Xnew)>0}, where ˆT ˆψ(Xnew) = Xnew − ˜μ, ˆψ ˆμ, ˆψ . Here ˜μ = ( ˆμ0 + ˆμ1)/2 and ˆμ = ˆμ1 − ˆμ0, with ˆμj being the estimators defined above restricted to I. The projection direction ˆψ is one of ˆψCG m , ˆψPC m or ˆψR α , constructed respectively by conjugate gradient, principal component or ridge regularization applied to ˆμ and ˆR, where ˆR is the integral operator with kernel ˆρ(s, t) introduced above, restricted to I × I. All methods discussed in the previous section can be formulated in terms of the population parameters, i.e., the mean difference and covariance operator, and not in terms of individual observations in the training set. The population parameters can be consistently estimated by averaging individual observations, whereas temporal averaging of individual curves, for example in inner products, is impossible due the incompleteness of the observed functions. In particular, the conjugate gradient method can be applied to fragmentary training data, whereas the usual algorithms for multivariate or functional partial least squares, such as those in De Jong (1993), Hastie et al. (2009,Algorithm 3.3) and Delaigle & Hall (2012b, § 4.2 andAppendixA.2), involve the computation of certain scores, i.e., inner products, for individual curves. 3.2. Asymptotic behaviour along the empirical regularization path We aim to study the behaviour of classifiers on incomplete training samples of increasing size with decreasing amounts of regularization. Previous asymptotic results in related settings include those of Delaigle & Hall (2013), who established the consistency of empirical principal component classifiers based on partially observed training data. In the setting of complete curves, Berrendero et al. (2018) used dimension reduction regularization by evaluation of curves at a finite set of arguments; they proved consistency of the empirical version but did not study the asymptotics for decreasing amounts of regularization, i.e., they did not consider letting the dimension grow. Baíllo et al. (2011a) studied optimal classifiers for Gaussian measures based on Radon–Nikodym derivatives and investigated the performance of their empirical version in the special class of processes with triangular covariance functions. In contrast, all of our methods, including the ridge approach not considered previously, have been developed for fragmentary training samples and shown to achieve the Bayes error rate for general Gaussian processes along the empirical regularization path, as we now explain. 
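To make the construction concrete, the following R sketch builds these estimators from fragments stored as rows of a matrix with NA outside the observation domain, and then forms the empirical ridge classifier on a target domain I. The data-generating step, the fixed value of α (in practice chosen by crossvalidation, § 3.3) and all object names are illustrative assumptions; the pairwise-complete sample covariance used here differs from the estimator above only in its M versus M − 1 divisor.

set.seed(2)
tt  <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
rho <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))
L   <- chol(rho + 1e-10 * diag(100))
gen_frag <- function(n, mu) {                    # fragments observed on [0, E], E ~ U(0.5, 1)
  X <- sweep(matrix(rnorm(n * 100), n) %*% L, 2, mu, "+")
  for (i in 1:n) X[i, tt > runif(1, 0.5, 1)] <- NA
  X
}
X0 <- gen_frag(50, rep(0, 100))                  # group 0
X1 <- gen_frag(50, 0.3 * dbeta(tt, 5, 5))        # group 1, illustrative mean difference

## Cross-sectional group means and pooled pairwise-complete covariance estimate
mhat <- function(X) { m <- colMeans(X, na.rm = TRUE); m[is.nan(m)] <- 0; m }
mu0 <- mhat(X0); mu1 <- mhat(X1)
M0 <- crossprod(!is.na(X0)); M1 <- crossprod(!is.na(X1))     # pairwise counts M_j(s, t)
S0 <- cov(X0, use = "pairwise.complete.obs"); S0[is.na(S0)] <- 0
S1 <- cov(X1, use = "pairwise.complete.obs"); S1[is.na(S1)] <- 0
Rhat <- (M0 * S0 + M1 * S1) / pmax(M0 + M1, 1)               # pooled estimate of rho(s, t)

## Empirical ridge classifier on the target domain I = [0, 0.75]
idx  <- which(tt <= 0.75)
mud  <- (mu1 - mu0)[idx]
psi  <- solve(Rhat[idx, idx] * Delta + 0.1 * diag(length(idx)), mud)   # hat psi^R_alpha
xnew <- drop(sweep(matrix(rnorm(100), 1) %*% L, 2, 0.3 * dbeta(tt, 5, 5), "+"))  # new group-1 curve,
xnew <- xnew[idx]                                                      # assumed observed at least on I
Tnew <- sum((xnew - (mu0 + mu1)[idx] / 2) * psi) * Delta * sum(mud * psi) * Delta
as.integer(Tnew > 0)                             # predicted group label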
The following assumptions will be needed for the derivation of asymptotic properties of empirically trained regularized linear classifiers. Assumption 1. The distributions in groups j = 0, 1 satisfy EPj ( X 4) < ∞. Assumption 2. For a domain I, there exists δ > 0 such that the observation patterns in training samples j = 0, 1 satisfy, as nj → ∞, sup (s,t)∈I×I pr n−1 j Mj(s, t) > δ = O(n−2 j ). Assumption 1 guarantees the consistency of the empirical mean and covariance operator for samples of completely observed curves; see, for example, Bosq (2000) or Horváth & Kokoszka (2012). Kraus (2015, Proposition 1) showed, under the additional Assumption 2 with I equal to the entire domain of the curves, that the root-n consistency of the sample mean and covariance restricted to I continues to hold in the fragmentary setting. In particular, it follows that ˆμj − μj = Op(n −1/2 j ) and hence ˆμ − μ = Op(n−1/2) for n = min(n0, n1) → ∞, and also that ˆR − R ∞ = Op{(n0 + n1)−1/2}, where · ∞ is the operator norm. When I is a subset of Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 169 the domain, analogous results hold for the restrictions of the functions and integral kernels to I. Assumption 2 means that at all pairs of time-points there is an asymptotically nonnegligible fraction of observed values. Assumption 2 is less restrictive than the requirement that there be complete curves in the sample. It can be satisfied, for example, in situations where the observed curves consist of several shorter fragments. If the assumption is not satisfied because the data contain only one short fragment per curve, other estimation methods can be used; see, for example, Delaigle & Hall (2016) and Descary & Panaretos (2019). We now study the asymptotic behaviour of the empirical classifier when the number mn of steps of the conjugate gradient algorithm grows as the training sample size grows. Under certain conditions on the regularization path, we establish the convergence of the misclassification probability of the conjugate gradient classifier trained on collections of functional fragments to the same optimal limit as for the theoretical conjugate gradient classifier with an infinite training sample, regardless of whether the limiting error rate is zero or positive and regardless of whether the limit can be theoretically achieved exactly or along the path. Theorem 1. Suppose that Assumption 1 holds. Assume that n = min(n0, n1) → ∞ and mn → ∞ in such a way that mn Cn1/2 for some C > 0 and n−1/2 ω−1 mn γ (mn) + n−1 ω−3 mn → 0, (4) where ωmn is the smallest eigenvalue of the mn×mn matrix H with entries hjk = κj, Rκk for κj = Rj−1μ and the mn-vector γ (mn) is defined as γ (mn) = H−1d with d being the mn-vector having components dj = μ, κj . Then the misclassification probability of the empirical regularized linearclassifierbasedon ˆψCG mn convergesinprobabilitytotheoptimalmisclassificationprobability 1 − ( R−1/2μ /2). Condition (4) guarantees that the number of components does not grow too fast in relation to the growing number of training observations and to the increased ill-conditioning of the theoretical problem. Condition (4) is analogous to (5.10) in Delaigle & Hall (2012b) for partial least squares. The vector γ (mn) contains the coefficients of the theoretical regularized solution ψCG mn with respect to the non-orthogonal basis κ1, . . . , κmn of the Krylov subspace Kmn (R, μ), i.e., ψmn = mn j=1 γ (mn) j κj. 
The eigenvalues of H are called the Ritz values in numerical analysis. For details on connections with partial least squares see Lingjærde & Christophersen (2000). In the proof given in the Appendix we use the results of Delaigle & Hall (2012b) on the consistency of partial least squares regression for functional data. These results were obtained for situations that differ from our setting in several ways. In particular, we work with functional fragments instead of complete curves, the conjugate gradient path differs from partial least squares regression, e.g., inthe groupcentringinthe estimation ofthe covariance, and we do notrequirethat the population inverse problem, Rψ = μ in our context, have a solution. However, inspection of the underlying technical arguments in Delaigle & Hall (2012b) shows that appropriate analogous results can be obtained and used in our setting, as we explain in the proof. Next, we show that the empirically trained principal component classifier with an increasing number of components asymptotically achieves the optimal misclassification probability. Theorem 2. Suppose thatAssumption 1 holds.Assume that n = min(n0, n1) → ∞ and mn → ∞ in such a way that λ4 mn n → ∞ and λ2 mn n( mn j=1 aj)−2 → ∞, where a1 = 23/2(λ1 − λ2)−1 and aj = 23/2max{(λj−1 − λj)−1, (λj − λj+1)−1} for j = 2, 3, . . . . Then the misclassification Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 170 D. Kraus AND M. Stefanucci probability of the empirical regularized linear classifier based on ˆψPC mn converges in probability to the optimal misclassification probability 1 − ( R−1/2μ /2). The conditions on the principal component regularization path are the same as in the case of functional principal component regression (Cardot et al., 1999). Unlike in the functional linear model, it is not assumed that the inverse problem has a solution, since the goal is not to estimate the possibly nonexistent bounded linear regression functional. Finally, the empirical ridge classifier with finite training data asymptotically attains the same optimal error rate as its theoretical counterpart. Unlike for the conjugate gradient and principal component classifiers, the conditions on the ridge path classifier do not involve parameters of the distributions because no subspace is constructed. Theorem 3. Suppose that Assumption 1 holds. Assume that n = min(n0, n1) → ∞ and αn → 0+ in such a way that α4 nn → ∞. Then the misclassification probability of the empirical regularized linear classifier based on ˆψR αn converges in probability to the optimal misclassification probability 1 − ( R−1/2μ /2). 3.3. Selection of the regularization parameter The regularization parameter can be selected by minimizing an estimate of the misclassification probability. We use leave-one-out crossvalidation. The Supplementary Material provides details of crossvalidation in the presence of incomplete curves. The best value of the regularization parameter is searched for over a grid of values, such as the values corresponding to integer degrees of freedom up to some maximum value. The number of degrees of freedom for the subspace methods is the dimension of the subspace, and for the ridge method it is defined as the trace of ( ˆR + αI )−1 ˆR, i.e., n0+n1 j=1 ˆλj/(ˆλj + α) where ˆλj are the eigenvalues of ˆR. The maximum number of degrees of freedom we use is one fifth of the number of curves. 4. 
Domain selection To classify the new curve Xnew observed on Onew, we apply the classifier on the target domain I ⊆ Onew, the choice of which we now consider. One possibility would be to restrict attention to the intersection of the observation domains of all curves, say I0, if it is nonempty. An obvious drawback of this approach is that one can lose discriminatory power because any differences between the classes may be more pronounced outside I0. An advantage of our approach is its capability of working with incomplete curves, since the empirical construction of the projection direction requires only the estimation of μ and R on the target domain. Hence one can look at a domain larger than I0. A natural choice is the largest subset of Onew that contains enough data for estimation of the classifier, i.e., satisfies Assumption 2, and contains enough functions for validation in the crossvalidation procedure, i.e., has a sufficiently large set V. In this way one hopes to capture the widest range of shapes of the group difference. On the other hand, it could be that not even this maximal domain, Imax, will lead to the best classification accuracy, because one includes more uncertainty in the estimation due to the missing values and because the mean difference may not be important in the added part of the domain. Therefore, it seems reasonable to also consider intermediate choices between I0 and Imax. Here we present a domain selection strategy for the most common case of interval observation sets. The idea, worked out in detail in Stefanucci et al. (2018), is to construct the classifier on a series of intervals that range from the common domain I0 to the maximal domain Imax, extending the working interval by a fixed percentage at each step. More formally, we consider a sequence Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 171 of nested intervals I0 ⊂ I1 ⊂ · · · ⊂ Ik ⊂ · · · ⊂ IK = Imax, starting from I0 and ending in IK = Imax, and build the classifier on each interval. The regularization parameter for the kth domain is selected by crossvalidation as described in the Supplementary Material. Among these K + 1 candidates we select the one that minimizes the crossvalidation estimate of error. The search strategy can be extended by considering larger systems of candidate domains; for example, one could vary the two endpoints independently. The idea can be generalized to other situations, such as non-interval observation sets, multivariate functional data or functions indexed by multivariate arguments. In each situation one needs to define a meaningful system of domains and optimize the crossvalidation score over the system. 5. Simulations 5.1. Behaviour of regularized classifiers on complete data In this section we illustrate the behaviour of the three estimators of ψ in different settings. We consider Gaussian processes on [0, 1] with covariance kernel ρ(s, t) = exp(−|s − t|2/0.01) and mean function depending on the group label. Group 0 has mean μ0(t) = 0 in each setting. Group 1 has mean μ1(t) = μ(t), for which we consider eight different forms: (i) ct, (ii) c(t−0.5)2, (iii) c(t−0.5)3, (iv) c sin(20t), (v) cϕ1(t), (vi) cϕ10(t), (vii) cb(t; 5, 5), and (viii) cb(t; 2, 6), where ϕj is the jth eigenfunction of the kernel ρ and b(t; α, β) = tα−1(1 − t)β−1 is the beta density. In each case the parameter c is selected to yield a reasonable misclassification rate. 
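A sketch of this data-generating mechanism in R is given below; the constants multiplying the mean functions are illustrative placeholders rather than the values used in the study, and only a few of the eight forms of μ(t) are spelled out.

set.seed(3)
tt  <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
rho <- outer(tt, tt, function(s, t) exp(-abs(s - t)^2 / 0.01))   # covariance kernel
eig <- eigen(rho * Delta, symmetric = TRUE)
phi <- eig$vectors / sqrt(Delta)                 # eigenfunctions phi_j of the kernel
mu_forms <- list(                                # a few of the eight forms of mu(t)
  linear    = 1.0 * tt,                          # (i), with an illustrative constant c
  eigen1    = 0.5 * phi[, 1],                    # (v)
  eigen10   = 0.5 * phi[, 10],                   # (vi)
  beta_asym = 0.3 * dbeta(tt, 2, 6)              # (viii), b(t; 2, 6) up to normalization
)
L <- chol(rho + 1e-10 * diag(100))               # simulate Gaussian curves on the grid
gen_group <- function(n, mu) sweep(matrix(rnorm(n * 100), n) %*% L, 2, mu, "+")
X0 <- gen_group(50, rep(0, 100))                 # group 0: mean zero
X1 <- gen_group(50, mu_forms$eigen1)             # group 1 under setting (v)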
In each of 5000 repetitions we generated 50 curves from each group and evaluated them on a grid of 100 equispaced points in [0, 1]. We also generated a new observation that could arise from group 0 or group 1 with equal probability. Then we constructed the regularized classification direction by the principal component, conjugate gradient and ridge methods with m degrees of freedom and predicted the label of the new observation. We considered m = 1, . . . , 20, corresponding to a reasonable minimum of five observations per degree of freedom. Figure 1 shows the misclassification proportion over the 5000 repetitions as a function of m for the eight different choices of μ(t). As expected, the conjugate gradient method performs well in all settings and is not much affected by the shape of μ(t). By contrast, the performance of the principal component classifier depends strongly on μ(t). To see this, consider the two extreme situations in settings (v) and (vi). The classification error of the principal component approach is close to that of the conjugate gradient method in case (v), where μ(t) is the first eigenfunction, but is much higher at lower dimensions in case (vi), where μ(t) is the tenth eigenfunction. In the latter case, the principal component method reaches the same level of error as the conjugate gradient method only when m = 10 or more. These findings agree with Proposition 2 and with the conclusions of Delaigle & Hall (2012a) and Febrero-Bande et al. (2017), who pointed out that principal components need more degrees of freedom than partial least squares to achieve good performance. In this regard ridge regularization seems to lie between the two subspace methods, but is more similar to the conjugate gradient method in most cases. In particular, it does not completely fail at low degrees of freedom in case (vi), because it does not construct a subspace that could miss the important information; however, it also suffers in this situation, where μ(t) is on the tail of the spectrum, because ridge penalization shrinks higher-index spectral components more than lower-index components. Nevertheless, with sufficiently many degrees of freedom, the three methods behave similarly. Additional simulation results, reported in the Supplementary Material, show that similar conclusions can be drawn when functions have nonsmooth trajectories and that the capability to discriminate between two groups with different means is robust with respect to the assumption of equal covariances. Results for increased training sample size are also provided in the Supplementary Material.

Fig. 1. Misclassification rate (%) versus degrees of freedom for different forms of μ(t): (i) linear, (ii) quadratic, (iii) cubic, (iv) sinusoidal, (v) first eigenfunction, (vi) tenth eigenfunction, (vii) symmetric beta, and (viii) asymmetric beta. The different curves represent the principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers.

Table 1. Misclassification rates (%), with standard errors in parentheses, achieved by classifiers with degrees of freedom selected by crossvalidation in the different settings; for each classifier the numbers in the second row are the minimum misclassification rates

        (i)           (ii)          (iii)        (iv)         (v)          (vi)          (vii)         (viii)
PC   13.0 (0.34)   8.3 (0.28)    1.3 (0.11)   2.5 (0.16)   7.2 (0.26)   7.6 (0.27)   10.7 (0.31)   26.2 (0.44)
      8.1           6.1           0.1          2.2          2.4          7.4           6.1          20.4
CG    8.6 (0.28)   6.5 (0.25)    0.7 (0.09)   2.1 (0.14)   2.6 (0.16)   7.8 (0.27)    6.1 (0.24)   20.9 (0.41)
      8.1           5.7           0.1          2.1          2.2          7.2           5.7          19.9
R     8.4 (0.28)   7.7 (0.27)    0.7 (0.09)   2.2 (0.15)   2.4 (0.15)   7.9 (0.27)    6.1 (0.24)   20.8 (0.41)
      7.9           6.5           0.2          2.0          2.3          7.3           5.7          20.0

PC, principal component classifier; CG, conjugate gradient classifier; R, ridge classifier.

5.2. Performance of crossvalidation for selection of degrees of freedom

We used simulation to investigate the performance of leave-one-out crossvalidation in choosing the correct level of regularization. The settings were the same as in § 5.1, but classification was done using the number of degrees of freedom selected by leave-one-out crossvalidation. We summarize the classification errors in Table 1. Crossvalidation performs well as a selector of the best level of regularization since the misclassification rate in Table 1 is in each case close to the corresponding minimum error in Fig. 1. The principal component method appears to perform worst, while the conjugate gradient and ridge methods have comparable performance. The latter two methods nearly achieve the respective minimum errors. Table 2 reports the mean and median selected degrees of freedom. The principal component method often uses considerably more degrees of freedom than the other methods. This is particularly interesting in case (v), where the mean difference equals the first eigenfunction and so one component should be the best choice in theory.

Table 2. Mean and median (in parentheses) degrees of freedom selected by crossvalidation

        (i)        (ii)         (iii)      (iv)        (v)        (vi)        (vii)      (viii)
PC    8.2 (7)    14.3 (15)    9.9 (9)    10.9 (10)   4.6 (4)    11.9 (11)   5.3 (4)    8.6 (6)
CG    5.4 (3)    10.7 (11)    3.4 (2)     4.5 (2)    2.4 (1)     4.9 (3)    2.7 (1)    8.6 (7)
R     6.4 (3)    11.6 (13)    6.0 (3)     6.1 (4)    2.7 (1)     9.3 (8)    3.4 (1)    6.7 (3)

PC, principal component classifier; CG, conjugate gradient classifier; R, ridge classifier.

Fig. 2. Misclassification rate (%) plotted as a function of the domain extension, for μ(t) being the (a) Be(2, 6), (b) Be(5, 5) or (c) Be(6, 2) density, for the principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers with selected degrees of freedom. Classification is performed on the domains [0, u] with u ∈ [0.5, 0.9], and the error values are plotted against u.
These results again illustrate the general phenomenon that the principal component approach is inappropriate for inference about means, due to the possible lack of informativeness of the principal components about the mean and the extra uncertainty associated with their estimation.

5.3. Missing data and domain extension

We now demonstrate the usefulness of the domain extension approach presented in § 4, using Gaussian processes on [0, 1] with the same covariance as in § 5.1 and considering three scenarios for the mean difference in the form of a multiple of a beta density, (a) b(t; 2, 6), (b) b(t; 5, 5) and (c) b(t; 6, 2), which reflect situations where discrimination due to a peak is in the left, central and right parts of the domain, respectively. We sampled 50 curves from each group on a sequence of 100 equispaced points in [0, 1]. Then we generated endpoints of the observation interval for each curve from the uniform distribution on (0.5, 1); that is, each curve was observed between 0 and the endpoint and treated as missing beyond the endpoint. The new observation had an endpoint sampled between 0.5 and 1. So the first half of [0, 1], I0 = [0, 0.5], was the common observation domain of all curves. We considered extensions of I0 to Ik = [0, 0.5 + 0.05k] (k = 0, . . . , 8). For each interval of this form that was contained in the observation domain of the curve to be classified, we estimated the classifiers, choosing the best degrees of freedom via crossvalidation, and classified the new curve. This procedure was repeated 1000 times. We plot the behaviour of the resulting classification error as a function of the endpoint of the extended domain in Fig. 2. When the peak of the mean difference is in the left part of [0, 1], extending the domain does not lead to better classification. In this case the interval where the means mainly differ corresponds to the part of the domain where all the data are available, and inflating the domain only increases the uncertainty due to missing data. In the second case, the peak of the mean difference is exactly at 0.5, and extending the domain leads to little improvement. The third scenario is the opposite of the first, as the discrimination is mainly in the right part of [0, 1]. In this case, extending the domain reduces the error considerably because good classification is only possible by employing the right part of the domain. The classification error is about 45% when using only I0, but drops to about 20% when using also the part of the interval where the data are partially observed.

Table 3. Misclassification rates (%), with standard errors in parentheses, achieved by classifiers with domain and degrees of freedom selected by crossvalidation in the different settings; the minimum and maximum misclassification rates are given in square brackets

        (a)                        (b)                       (c)
PC   18.1 (0.38) [11.3, 33.7]   11.9 (0.32) [11.4, 15.2]   31.1 (0.46) [21.8, 46.0]
CG   19.6 (0.39) [15.4, 25.7]    7.4 (0.26) [5.6, 9.3]     30.4 (0.46) [19.2, 45.7]
R    22.4 (0.42) [17.2, 22.8]    6.9 (0.25) [5.4, 8.6]     28.4 (0.45) [20.7, 45.9]

PC, principal component classifier; CG, conjugate gradient classifier; R, ridge classifier.

5.4. Performance with selected domain

Domain extension may or may not improve the performance of classifiers, depending on the interplay between the form of the mean difference, the covariance structure and the missingness pattern.
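The selection step itself can be sketched compactly. The R code below mimics the setup of § 5.3 and chooses among the nested domains Ik by a leave-one-out estimate of the error; for brevity it uses only the ridge direction with a fixed α instead of also crossvalidating the degrees of freedom, and it validates only on curves fully observed on the candidate domain, whereas the procedure described in the Supplementary Material also handles incomplete validation curves. All function and object names are illustrative assumptions.

set.seed(4)
tt  <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
rho <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))
L   <- chol(rho + 1e-10 * diag(100))
gen_frag <- function(n, mu) {                      # curves observed on [0, E], E ~ U(0.5, 1)
  X <- sweep(matrix(rnorm(n * 100), n) %*% L, 2, mu, "+")
  for (i in 1:n) X[i, tt > runif(1, 0.5, 1)] <- NA
  X
}
X0 <- gen_frag(50, rep(0, 100))                    # group 0
X1 <- gen_frag(50, 0.4 * dbeta(tt, 6, 2))          # group 1, scenario (c): peak on the right

ridge_rule <- function(X0, X1, idx, alpha = 0.1) { # train a ridge classifier on domain idx
  mhat <- function(X) { m <- colMeans(X, na.rm = TRUE); m[is.nan(m)] <- 0; m }
  pcov <- function(X) { S <- cov(X, use = "pairwise.complete.obs"); S[is.na(S)] <- 0; S }
  m0 <- mhat(X0)[idx]; m1 <- mhat(X1)[idx]
  M0 <- crossprod(!is.na(X0))[idx, idx]; M1 <- crossprod(!is.na(X1))[idx, idx]
  Rh <- (M0 * pcov(X0)[idx, idx] + M1 * pcov(X1)[idx, idx]) / pmax(M0 + M1, 1)
  psi <- solve(Rh * Delta + alpha * diag(length(idx)), m1 - m0)
  function(x)                                      # the Delta factors cancel in the sign of T
    as.integer(sum((x - (m0 + m1) / 2) * psi) * sum((m1 - m0) * psi) > 0)
}
loo_error <- function(X0, X1, idx) {               # leave-one-out error estimate on domain idx
  err <- n <- 0
  for (g in 0:1) for (i in 1:nrow(if (g == 0) X0 else X1)) {
    x <- (if (g == 0) X0 else X1)[i, ]
    if (any(is.na(x[idx]))) next                   # validate only on curves observed on idx
    cl <- ridge_rule(if (g == 0) X0[-i, ] else X0, if (g == 1) X1[-i, ] else X1, idx)
    err <- err + (cl(x[idx]) != g); n <- n + 1
  }
  err / n
}
domains <- lapply(0:8, function(k) which(tt <= 0.5 + 0.05 * k))  # I_0, ..., I_8
cv <- sapply(domains, function(idx) loo_error(X0, X1, idx))
which.min(cv)                                      # index of the selected domain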
In practice, the user is not an oracle with access to misclassification errors for candidate subsets whose estimates are plotted in Fig. 2, and hence would select the best domain by crossvalidation. In Table 3 we report simulation results for classifiers with both domain and degrees of freedom selected by crossvalidation, for the same configurations as in § 5.3. Selection of the domain leads to a considerable improvement of the error rate compared with the worst-performing domain. On the other hand, this improvement has some limitations and a gap remains between the achieved value and the best value; this can be explained by the fact that crossvalidation provides only an estimate of the error, not the true value. 6. AneuRisk data example We apply the proposed method to theAneuRisk dataset from an interdisciplinary project aimed at investigating the effects of blood vessel morphology, blood fluid dynamics and biomechanical properties of the vascular wall on the pathogenesis of cerebral aneurysms. An introduction to the data can be found in Sangalli et al. (2014b). This dataset has previously been analysed in several works that focused on different methodological aspects, such as function and derivative estimation (Sangalli et al., 2009b), exploratory analysis and classification (Sangalli et al., 2009a), and alignment and clustering (Sangalli et al., 2014a), among others. The data consist of measurements of the radius and curvature of the internal carotid artery in a sample of 65 patients, 33 of which have an aneurysm at the bifurcation of the vessel or after it, while the other 32 either have an aneurysm before the bifurcation, which is much less dangerous, or are healthy. The goal is to classify the patients based on the morphology of their internal carotid artery. In this example we work with only one of the observed variables, the radius. The data have previously been pre-processed, registered and smoothed, and are observed on a grid of 2000 points in the interval [−100.3, 5.1], where the argument represents the distance between the observation point and the terminal bifurcation of the internal carotid artery, with positive values indicating points inside the skull. As we can see in Fig. 3, the data are partially observed because Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 175 −100 −80 −60 −40 −20 0 1 2 3 4 Internal carotid artery arc length (mm) Radius(mm) Fig. 3. Radius along the carotid artery from the AneuRisk dataset, along with the mean of the group of subjects with an aneurysm after the bifurcation (dotted) and the mean of the group of subjects with an aneurysm before the bifurcation or without an aneurysm (dashed). Curves for two example subjects are highlighted as solid lines. Note the different start and end points for different subjects in the study. the start and end points are different from subject to subject. All subjects are observed on the subset I0 = [−32.9, −7.4], which corresponds to 24.3% of the whole domain. We first apply the regularized linear classifiers to curves restricted to the common domain I0. The classification error estimated by crossvalidation is 29.2% for the principal component method, 29.2% for the conjugate gradient method, and 32.3% for ridge regularized classification. We compare the above procedure with a different approach consisting of a multivariate classification method applied to principal component scores. 
The covariance kernel is estimated from observations centred to their respective group means, its eigenfunctions are computed, and quadratic discriminant analysis is applied to the inner products of the uncentred curves with the eigenfunctions. This procedure is similar to that in Sangalli et al. (2009a). The best classifier of this type turns out to exhibit a misclassification error of 32.3%, obtained with two eigenfunctions. These values show that in this dataset, when attention is restricted to the common domain I0, our proposed method is comparable to the more standard multivariate technique. Next, we consider classification on extended domains including observed values outside the commondomainI0.WebuildthesequenceofdomainsI0, . . . , IK byenlargingthedomainateach step by 1.25% of the complement of I0. This step size is a compromise between the fineness of the grid and the computational cost. We consider extended domains up to K = 40, corresponding to I40 = [−66.6, −1.2], because not enough subjects have observed values outside this interval for reliable estimation and crossvalidation. All regularized linear classification methods benefit from the domain extension; in particular, the error rate for the principal component method drops from 29.2% to 23.2%, for the conjugate gradient method from 29.2% to 25.8%, and for ridge regularization from 32.3% to 25%. The best domain is I10 = [−41.3, −5.8] for the conjugate gradient method and I11 = [−42.2, −5.7] for the other two methods. The alternative method based on multivariate classification of scores cannot be applied on extended domains since the individual scores of incomplete curves cannot be computed, although they can be predicted (Kraus, 2015). By contrast, the proposed methods are entirely formulated in terms of distributional parameters, which can be consistently estimated from incomplete data, unlike individual quantities. Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 176 D. Kraus AND M. Stefanucci Acknowledgement The AneuRisk data and useful comments were kindly provided by Laura Sangalli. The work of David Kraus was supported by the Czech Science Foundation. We are grateful to two referees, an associate editor and the editor for helpful suggestions and corrections. Supplementary material Supplementary material available at Biometrika online includes the derivation of classifiers under unequal prior class probabilities, algorithmic details of crossvalidation, and additional simulation and real-data results. Appendix Proof of Proposition 1 The misclassification probability for ψm is D(ψm) given in (1). Since ψm ∈ Sm, we compute | μ, ψm | ψm, Rψm 1/2 = μ, R− m μ μ, R− m RR− m μ 1/2 = (R− m )1/2 μ . By Lebesgue’s monotone convergence theorem, the right-hand side converges to R−1/2 μ , finite or infinite, and therefore the limiting misclassification probability that is attained along the regularization path ψm, as m → ∞, is 1 − ( R−1/2 μ /2). Proof of Proposition 2 The conjugate gradient method minimizes the quadratic objective function in the Krylov subspace Km(R, μ) whose elements are in the form η = m−1 k=0 ck Rk μ = p(R)μ, where p is a polynomial of order lower than m. Then η ∈ Km(R, μ) can be written as η = ∞ j=1 p(λj)bjϕj with bj = μ, ϕj . 
The objective function at η equals η, Rη /2 − μ, η = p(R)μ, Rp(R)μ /2 − μ, p(R)μ = ∞ j=1 b2 j {p(λj)2 λj/2 − p(λj)} = ∞ j=1 b2 j 2λj q(λj){q(λj) − 2}, (A1) where q(λ) = p(λ)λ is a polynomial of degree at most m such that q(0) = 0.The conjugate gradient method seeks the polynomial with these properties that minimizes the objective function. To prove the proposition we shall find a polynomial q with the required properties such that the objective function above is smaller than or equal to the objective function for the principal component classifier. The principal component classifier uses ψPC m = m j=1 λ−1 j bjϕj, and the objective function at ψPC m is ψPC m , RψPC m /2 − μ, ψPC m = − m j=1 b2 j 2λj . (A2) Consider the polynomial of degree m, q(λ) = 1 − (−1)m λ − λ1 λ1 · · · λ − λm λm , Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 177 with q(0) = 0. We see that q(λj) = 1 for j = 1, . . . , m, so the first m summands in the series (A1) and (A2) are equal. For j > m we have that 0 q(λj) 2 due to the properties of the eigenvalue sequence; so q(λj){q(λj) − 2} 0 and therefore the corresponding summands in the series (A1) are negative, whereas they are zero in the series (A2). Hence, for this polynomial, ∞ j=1 b2 j 2λj q(λi){q(λi) − 2} − m j=1 b2 j 2λj , and so the objective at the conjugate gradient solution must be smaller than or equal to the objective at the principal component solution. The inequality between the minima of the quadratic objective function implies the inequality between the misclassification probabilities stated in the proposition. Proof of Proposition 3 Proceeding as in the proof of Proposition 1, we need to show that μ, R−1 α μ μ, R−1 α RR−1 α μ 1/2 = ∞ j=1 b2 j λj+α ∞ j=1 λjb2 j (λj+α)2 1/2 −−−→ α→0+ ∞ j=1 b2 j λj 1/2 = R−1/2 μ , where bj = μ, ϕj is the coefficient of μ in the eigenbasis. If ∞ j=1 b2 j /λj < ∞, the convergence follows from Lebesgue’s monotone convergence theorem. Otherwise, we use the inequality ∞ j=1 λjb2 j /(λj + α)2 ∞ j=1 b2 j /(λj + α) to bound the left-hand side expression from below by { ∞ j=1 b2 j /(λj + α)}1/2 , which diverges to infinity again by Lebesgue’s theorem. Proof of Theorem 1 The probability of misclassifying a new observation using the conjugate gradient classifier based on ˆψCG mn is D( ˆψCG mn ) = 1 − {|Z( ˆψCG mn )|/2}. We need to show that the fraction in Z( ˆψCG mn ) converges in probability to R−1/2 μ /2 along the regularization path satisfying the assumptions of the theorem. To deal with the numerator in Z( ˆψCG mn ), one can show that μ, ˆψCG mn − μ, ψCG mn = Op n−1/2 ω−1 mn γ (mn) + n−1 ω−3 mn . (A3) This result follows from an analogue of (5.9) in Theorem 5.3 of Delaigle & Hall (2012b) and intermediate results in the proof of that theorem which can be established in our context. The necessary modifications of the proofs of Theorems 5.1, 5.2 and 5.3 in Delaigle & Hall (2012b) are as follows. All results remain valid for incomplete instead of complete curves, because the proofs depend only on the root-n consistency of the covariance estimators, which holds also for functional fragments (Kraus, 2015, Proposition 1). Moreover, the derivations in Delaigle & Hall (2012b) can be repeated without assuming that the theoretical solution ψ = R−1 μ exists as an element of L2 (I). 
Proof of Theorem 1

The probability of misclassifying a new observation using the conjugate gradient classifier based on $\hat\psi^{\mathrm{CG}}_{m_n}$ is $D(\hat\psi^{\mathrm{CG}}_{m_n})=1-\Phi\{|Z(\hat\psi^{\mathrm{CG}}_{m_n})|/2\}$. We need to show that the fraction in $Z(\hat\psi^{\mathrm{CG}}_{m_n})$ converges in probability to $\|R^{-1/2}\mu\|/2$ along the regularization path satisfying the assumptions of the theorem. To deal with the numerator in $Z(\hat\psi^{\mathrm{CG}}_{m_n})$, one can show that
\[
\langle\mu,\hat\psi^{\mathrm{CG}}_{m_n}\rangle-\langle\mu,\psi^{\mathrm{CG}}_{m_n}\rangle
= O_p\bigl\{n^{-1/2}\omega_{m_n}^{-1}\gamma(m_n)+n^{-1}\omega_{m_n}^{-3}\bigr\}. \tag{A3}
\]
This result follows from an analogue of (5.9) in Theorem 5.3 of Delaigle & Hall (2012b) and intermediate results in the proof of that theorem, which can be established in our context. The necessary modifications of the proofs of Theorems 5.1, 5.2 and 5.3 in Delaigle & Hall (2012b) are as follows. All results remain valid for incomplete instead of complete curves, because the proofs depend only on the root-$n$ consistency of the covariance estimators, which holds also for functional fragments (Kraus, 2015, Proposition 1). Moreover, the derivations in Delaigle & Hall (2012b) can be repeated without assuming that the theoretical solution $\psi=R^{-1}\mu$ exists as an element of $L^2(I)$. Indeed, the proofs in Delaigle & Hall (2012b) are based on stochastic expansions of $\hat R^j\psi=\hat R^jR^{-1}\mu$, in our notation, about $R^j\psi=R^jR^{-1}\mu=R^{j-1}\mu$ and derived quantities, but the same steps can be followed for $\hat R^{j-1}\hat\mu$ about $R^{j-1}\mu$ in our setting. In other words, it can be shown that $\hat\psi^{\mathrm{CG}}_{m_n}$ and $\psi^{\mathrm{CG}}_{m_n}$ converge to each other without assuming that $\psi^{\mathrm{CG}}_{m_n}$ converges. Similarly, for the denominator in $Z(\hat\psi^{\mathrm{CG}}_{m_n})$ we have that
\[
\langle\hat\psi^{\mathrm{CG}}_{m_n},R\hat\psi^{\mathrm{CG}}_{m_n}\rangle-\langle\psi^{\mathrm{CG}}_{m_n},R\psi^{\mathrm{CG}}_{m_n}\rangle
= O_p\bigl\{n^{-1/2}\omega_{m_n}^{-1}\gamma(m_n)+n^{-1}\omega_{m_n}^{-3}\bigr\}. \tag{A4}
\]
This last result is analogous to (7.27) of Delaigle & Hall (2012b), whose proof can be repeated with the same modifications for our situation as before. Therefore, regardless of whether $\|R^{-1}\mu\|$ or $\|R^{-1/2}\mu\|$ is finite or infinite, the theoretical and empirical regularized quantities approach each other at the rates given in (A3) and (A4). The result on $D(\hat\psi^{\mathrm{CG}}_{m_n})$ then follows as in the proof of Proposition 1.

Proof of Theorem 2

We show that $D(\hat\psi^{\mathrm{PC}}_{m_n})=1-\Phi\{|Z(\hat\psi^{\mathrm{PC}}_{m_n})|/2\}$ converges in probability to $1-\Phi(\|R^{-1/2}\mu\|/2)$. The strategy of the proof is similar to that of Theorem 3.1 in Cardot et al. (1999) for the principal component approach to the functional linear model. The difference lies in the incompleteness of the functional data and in that we do not assume that the underlying theoretical inverse problem has a solution. We write
\[
\|\hat\psi^{\mathrm{PC}}_{m_n}-\psi^{\mathrm{PC}}_{m_n}\|
\le \|\hat R^-_{m_n}-R^-_{m_n}\|_\infty\|\hat\mu\|+\|R^-_{m_n}\|_\infty\|\hat\mu-\mu\|.
\]
Proceeding as in the proof of Lemma 5.1 in Cardot et al. (1999), we can show that
\[
\|\hat R^-_{m_n}-R^-_{m_n}\|_\infty
\le \hat\lambda^{-1}_{m_n}\lambda^{-1}_{m_n}\|\hat R-R\|_\infty
+2\lambda^{-1}_{m_n}\|\hat R-R\|_\infty\sum_{j=1}^{m_n}a_j.
\]
Here $\hat\lambda_j$ are the eigenvalues of $\hat R$ in descending order and $\hat\phi_j$ are the corresponding eigenfunctions. In establishing the above inequality one uses the facts that $|\hat\lambda_j-\lambda_j|\le\|\hat R-R\|_\infty$ and $\|\hat\phi_j-\mathrm{sign}\langle\hat\phi_j,\phi_j\rangle\phi_j\|\le a_j\|\hat R-R\|_\infty$, which are known from Bosq (2000, Lemmas 4.2 and 4.3) for the empirical covariance operator from complete curves but hold also for functional fragments; see the proof of Proposition 2 in the supplementary document for Kraus (2015). Since $\|\hat R-R\|_\infty=O_p(n^{-1/2})$, we see that
\[
\hat\lambda^{-1}_{m_n}\lambda^{-1}_{m_n}\|\hat R-R\|_\infty 1_{[\hat\lambda_{m_n}>\lambda_{m_n}/2]}
\le 2\lambda^{-2}_{m_n}\|\hat R-R\|_\infty=\lambda^{-2}_{m_n}O_p(n^{-1/2}).
\]
Since the probability of the event $[\hat\lambda_{m_n}<\lambda_{m_n}/2]$ is bounded by $\lambda^{-2}_{m_n}O(n^{-1})$ and hence converges to 0, it follows that $\hat\lambda^{-1}_{m_n}\lambda^{-1}_{m_n}\|\hat R-R\|_\infty=\lambda^{-2}_{m_n}O_p(n^{-1/2})$. Combining this with the facts that $\|\hat\mu\|=O_p(1)$, $\|R^-_{m_n}\|_\infty=\lambda^{-1}_{m_n}$ and $\|\hat\mu-\mu\|=O_p(n^{-1/2})$ gives
\[
\|\hat\psi^{\mathrm{PC}}_{m_n}-\psi^{\mathrm{PC}}_{m_n}\|
\le \lambda^{-2}_{m_n}O_p(n^{-1/2})+\lambda^{-1}_{m_n}O_p(n^{-1/2})\sum_{j=1}^{m_n}a_j.
\]
Similar arguments can be used in the analysis of the denominator in $Z(\hat\psi^{\mathrm{PC}}_{m_n})$. In conclusion, we obtain that the estimation errors for the quantities in the numerator and denominator converge to zero at the rates
\[
\langle\mu,\hat\psi^{\mathrm{PC}}_{m_n}\rangle-\langle\mu,\psi^{\mathrm{PC}}_{m_n}\rangle
= \lambda^{-2}_{m_n}O_p(n^{-1/2})+\lambda^{-1}_{m_n}O_p(n^{-1/2})\sum_{j=1}^{m_n}a_j, \tag{A5}
\]
\[
\langle\hat\psi^{\mathrm{PC}}_{m_n},R\hat\psi^{\mathrm{PC}}_{m_n}\rangle-\langle\psi^{\mathrm{PC}}_{m_n},R\psi^{\mathrm{PC}}_{m_n}\rangle
= \lambda^{-2}_{m_n}O_p(n^{-1/2})+\lambda^{-1}_{m_n}O_p(n^{-1/2})\sum_{j=1}^{m_n}a_j. \tag{A6}
\]
In light of (A5) and (A6), the asymptotic behaviour of the misclassification probability is driven by the behaviour of the theoretical classifier addressed in Proposition 1.

Proof of Theorem 3

We show that the fraction $|Z(\hat\psi^{\mathrm{R}}_{\alpha_n})|$ converges in probability to $\|R^{-1/2}\mu\|/2$ as $n\to\infty$. For the numerator we write
\[
\langle\mu,\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle-\langle\mu,R^{-1}_{\alpha_n}\mu\rangle
= \langle\mu,(\hat R^{-1}_{\alpha_n}-R^{-1}_{\alpha_n})\hat\mu\rangle
+\langle\mu,R^{-1}_{\alpha_n}(\hat\mu-\mu)\rangle. \tag{A7}
\]
For the first term on the right we find that
\[
|\langle\mu,(\hat R^{-1}_{\alpha_n}-R^{-1}_{\alpha_n})\hat\mu\rangle|
\le \|\mu\|\,\|\hat R^{-1}_{\alpha_n}-R^{-1}_{\alpha_n}\|_\infty\|\hat\mu\|
= \|\mu\|\,\|\hat R^{-1}_{\alpha_n}(\hat R_{\alpha_n}-R_{\alpha_n})R^{-1}_{\alpha_n}\|_\infty\|\hat\mu\|
\le \|\mu\|\,\|\hat R^{-1}_{\alpha_n}\|_\infty\|\hat R_{\alpha_n}-R_{\alpha_n}\|_\infty\|R^{-1}_{\alpha_n}\|_\infty\|\hat\mu\|
= \alpha_n^{-2}O_p(n^{-1/2}),
\]
since $\|\hat R^{-1}_{\alpha_n}\|_\infty\le\alpha_n^{-1}$, $\|R^{-1}_{\alpha_n}\|_\infty\le\alpha_n^{-1}$, $\|\hat\mu\|=O_p(1)$ and $\|\hat R_{\alpha_n}-R_{\alpha_n}\|_\infty=\|\hat R-R\|_\infty=O_p\{(n_0+n_1)^{-1/2}\}$ (Kraus, 2015, Proposition 1). For the second term on the right-hand side of (A7), we obtain
\[
|\langle\mu,R^{-1}_{\alpha_n}(\hat\mu-\mu)\rangle|
\le \|\mu\|\,\|R^{-1}_{\alpha_n}\|_\infty\|\hat\mu-\mu\|
= \alpha_n^{-1}O_p(n^{-1/2}).
\]
The quantity in the denominator of $Z(\hat\psi^{\mathrm{R}}_{\alpha_n})$ can be rewritten as
\[
\langle\hat\psi^{\mathrm{R}}_{\alpha_n},R\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle-\langle\psi^{\mathrm{R}}_{\alpha_n},R\psi^{\mathrm{R}}_{\alpha_n}\rangle
= \langle\hat\psi^{\mathrm{R}}_{\alpha_n}-\psi^{\mathrm{R}}_{\alpha_n},R\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle
+\langle\psi^{\mathrm{R}}_{\alpha_n},R(\hat\psi^{\mathrm{R}}_{\alpha_n}-\psi^{\mathrm{R}}_{\alpha_n})\rangle. \tag{A8}
\]
The first term on the right is
\[
\langle\hat\psi^{\mathrm{R}}_{\alpha_n}-\psi^{\mathrm{R}}_{\alpha_n},R\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle
= \langle\hat R^{-1}_{\alpha_n}\hat\mu-R^{-1}_{\alpha_n}\mu,R\hat R^{-1}_{\alpha_n}\hat\mu\rangle
= \langle R^{-1}_{\alpha_n}(R_{\alpha_n}-\hat R_{\alpha_n})\hat R^{-1}_{\alpha_n}\hat\mu,R\hat R^{-1}_{\alpha_n}\hat\mu\rangle
+\langle R^{-1}_{\alpha_n}(\hat\mu-\mu),R\hat R^{-1}_{\alpha_n}\hat\mu\rangle. \tag{A9}
\]
For the first summand in (A9) we have
\[
|\langle R^{-1}_{\alpha_n}(R_{\alpha_n}-\hat R_{\alpha_n})\hat R^{-1}_{\alpha_n}\hat\mu,R\hat R^{-1}_{\alpha_n}\hat\mu\rangle|
\le \|\hat\mu\|^2\,\|\hat R^{-1}_{\alpha_n}\|_\infty^2\,\|RR^{-1}_{\alpha_n}\|_\infty\|\hat R-R\|_\infty
= \alpha_n^{-2}O_p(n^{-1/2}),
\]
using properties mentioned previously and the fact that $\|RR^{-1}_{\alpha_n}\|_\infty\le 1$, and for the second summand we have
\[
|\langle R^{-1}_{\alpha_n}(\hat\mu-\mu),R\hat R^{-1}_{\alpha_n}\hat\mu\rangle|
\le \|RR^{-1}_{\alpha_n}\|_\infty\|\hat R^{-1}_{\alpha_n}\|_\infty\|\hat\mu\|\,\|\hat\mu-\mu\|
= \alpha_n^{-1}O_p(n^{-1/2}).
\]
Putting these results together, we see that the absolute value of the first term on the right-hand side of (A8) is dominated by $\alpha_n^{-2}O_p(n^{-1/2})$. The second term on the right-hand side of (A8) can be analysed in a similar way to the first two terms on the right-hand side of (A7), with $RR^{-1}_{\alpha_n}\mu$ in place of $\mu$. Thus we bound its absolute value from above by $\alpha_n^{-2}O_p(n^{-1/2})$. These results imply that the estimation errors vanish at the rates
\[
\langle\mu,\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle-\langle\mu,\psi^{\mathrm{R}}_{\alpha_n}\rangle=\alpha_n^{-2}O_p(n^{-1/2}),
\qquad
\langle\hat\psi^{\mathrm{R}}_{\alpha_n},R\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle-\langle\psi^{\mathrm{R}}_{\alpha_n},R\psi^{\mathrm{R}}_{\alpha_n}\rangle=\alpha_n^{-2}O_p(n^{-1/2}).
\]
Hence the empirical classifier has the same limiting error as the theoretical one addressed in Proposition 3.

References

Baíllo, A., Cuevas, A. & Cuesta-Albertos, J. A. (2011a). Supervised classification for a family of Gaussian functional models. Scand. J. Statist. 38, 480–98.
Baíllo, A., Cuevas, A. & Fraiman, R. (2011b). Classification methods for functional data. In The Oxford Handbook of Functional Data Analysis, F. Ferraty & Y. Romain, eds. Oxford: Oxford University Press, pp. 259–97.
Berrendero, J. R., Cuevas, A. & Torrecilla, J. L. (2016). Variable selection in functional data classification: A maxima-hunting proposal. Statist. Sinica 26, 619–38.
Berrendero, J. R., Cuevas, A. & Torrecilla, J. L. (2018). On the use of reproducing kernel Hilbert spaces in functional classification. J. Am. Statist. Assoc. 113, 1210–8.
Blanchard, G. & Krämer, N. (2010). Kernel partial least squares is universally consistent. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh & M. Titterington, eds., vol. 9 of Proceedings of Machine Learning Research. International Joint Conferences on Artificial Intelligence (IJCAI) Organization, pp. 57–64.
Bongiorno, E. G. & Goia, A. (2016). Classification methods for Hilbert data based on surrogate density. Comp. Statist. Data Anal. 99, 204–22.
Bosq, D. (2000). Linear Processes in Function Spaces. New York: Springer.
Bugni, F. A. (2012). Specification test for missing functional data. Economet. Theory 28, 959–1002.
Cardot, H., Ferraty, F. & Sarda, P. (1999). Functional linear model. Statist. Prob. Lett.
45, 11–22. Cuesta-Albertos, J. A., del Barrio, E., Fraiman, R. & Matrán, C. (2007). The random projection method in goodness of fit for functional data. Comp. Statist. Data Anal. 51, 4814–31. Cuevas, A. (2014). A partial overview of the theory of statistics with functional data. J. Statist. Plan. Infer. 147, 1–23. Dai, X., Müller, H.-G. & Yao, F. (2017). Optimal Bayes classifiers for functional data and density ratios. Biometrika 104, 545–60. De Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemomet. Intel. Lab. Syst. 18, 251–63. Delaigle, A. & Hall, P. (2012a). Achieving near perfect classification for functional data. J. R. Statist. Soc. B 74, 267–86. Delaigle, A. & Hall, P. (2012b). Methodology and theory for partial least squares applied to functional data. Ann. Statist. 40, 322–52. Delaigle, A. & Hall, P. (2013). Classification using censored functional data. J. Am. Statist. Assoc. 108, 1269–83. Delaigle, A. & Hall, P. (2016).Approximating fragmented functional data by segments of Markov chains. Biometrika 103, 779–99. Delaigle, A., Hall, P. & Bathia, N. (2012). Componentwise classification and clustering of functional data. Biometrika 99, 299–313. Descary, M.-H. & Panaretos, V. M. (2019). Recovering covariance from functional fragments. Biometrika 106, 145–60. Febrero-Bande, M., Galeano, P. & González-Manteiga, W. (2017). Functional principal component regression and functional partial least-squares regression: An overview and a comparative study. Int. Statist. Rev. 85, 61–83. Ferraty, F., Hall, P. & Vieu, P. (2010). Most-predictive design points for functional data predictors. Biometrika 97, 807–24. Goldberg, Y., Ritov, Y. & Mandelbaum, A. (2014). Predicting the continuation of a function with applications to call center data. J. Statist. Plan. Infer. 147, 53–65. Gromenko, O., Kokoszka, P. & Sojka, J. (2017). Evaluation of the cooling trend in the ionosphere using functional regression with incomplete curves. Ann. Appl. Statist. 11, 898–918. Hastie, T. J., Tibshirani, R. J. & Friedman, J. H. (2009). The Elements of Statistical Learning. New York: Springer, 2nd ed. Horváth, L. & Kokoszka, P. (2012). Inference for Functional Data with Applications. New York: Springer. Kraus, D. (2015). Components and completion of partially observed functional data. J. R. Statist. Soc. B 77, 777–801. Liebl, D. (2013). Modeling and forecasting electricity spot prices: A functional data perspective. Ann. Appl. Statist. 7, 1562–92. LingjÆrde, O. C. & Christophersen, N. (2000). Shrinkage structure of partial least squares. Scand. J. Statist. 27, 459–73. Phatak, A. & de Hoog, F. (2002). Exploiting the connection between PLS, Lanczos methods and conjugate gradients: Alternative proofs of some properties of PLS. J. Chemomet. 16, 361–7. Pini, A. & Vantini, S. (2016). The interval testing procedure: A general framework for inference in functional data analysis. Biometrics 72, 835–45. Ramsay, J. O. & Silverman, B. W. (2005). Functional Data Analysis. New York: Springer, 2nd ed. Sangalli, L. M., Secchi, P. & Vantini, S. (2014a). Analysis of AneuRisk65 data: k-mean alignment. Electron. J. Statist. 8, 1891–904. Sangalli, L. M., Secchi, P. & Vantini, S. (2014b). AneuRisk65: A dataset of three-dimensional cerebral vascular geometries. Electron. J. Statist. 8, 1879–90. Sangalli, L. M., Secchi, P., Vantini, S. & Veneziani, A. (2009a).A case study in exploratory functional data analysis: Geometrical features of the internal carotid artery. J. Am. Statist. Assoc. 104, 37–48. 
Sangalli, L. M., Secchi, P., Vantini, S. & Veneziani, A. (2009b). Efficient estimation of three-dimensional curves and their derivatives by free-knot regression splines, applied to the analysis of inner carotid artery centrelines. J. R. Statist. Soc. C 58, 285–306.
Stefanucci, M., Sangalli, L. M. & Brutti, P. (2018). PCA-based discrimination of partially observed functional data, with an application to AneuRisk65 data set. Statist. Neer. 72, 246–64.

[Received on 22 August 2017. Editorial decision on 2 August 2018]

Supplementary material for "Classification of functional fragments by regularized linear classifiers with domain selection"

BY DAVID KRAUS
Department of Mathematics and Statistics, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic
david.kraus@mail.muni.cz

AND MARCO STEFANUCCI
Department of Statistical Sciences, Sapienza University of Rome, Piazzale Aldo Moro 5, 00185 Roma, Italy
marco.stefanucci@uniroma1.it

SUMMARY

The Supplementary Material provides the derivation of classifiers under unequal prior class probabilities, algorithmic details of cross-validation and additional simulation and real data results.

Some key words: Classification; Conjugate gradients; Domain selection; Functional data; Partial observation; Regularization; Ridge method.

S1. DERIVATIONS UNDER UNEQUAL PRIOR CLASS PROBABILITIES

Let $\pi_j$ be the prior probability of class $j$ ($j=0,1$). The optimal classifier based on the one-dimensional projection $\langle X,\psi\rangle$ assigns $X$ to the class $C_\psi(X)$ given by
\[
C_\psi(X)
= 1_{\{\pi_1 f_{\psi,1}(\langle X,\psi\rangle)>\pi_0 f_{\psi,0}(\langle X,\psi\rangle)\}}
= 1_{\{\langle X-\mu_0,\psi\rangle^2-\langle X-\mu_1,\psi\rangle^2>2\langle\psi,R\psi\rangle\log(\pi_0/\pi_1)\}}
= 1_{\{\langle X-\bar\mu,\psi\rangle\langle\mu,\psi\rangle>\langle\psi,R\psi\rangle\log(\pi_0/\pi_1)\}},
\]
where $\bar\mu=(\mu_0+\mu_1)/2$ and $\mu=\mu_1-\mu_0$. The effect of unequal prior class probabilities is a shift of the decision boundary, and the classifier is invariant with respect to multiplication of $\psi$ by a non-zero constant. Since $\langle X-\bar\mu,\psi\rangle=\langle X-\mu_0,\psi\rangle-\langle\mu,\psi\rangle/2=\langle X-\mu_1,\psi\rangle+\langle\mu,\psi\rangle/2$, the misclassification probability for an observation coming from class 0 or 1 with probabilities $\pi_0$, $\pi_1$ is
\[
\pi_0 P_0\{C_\psi(X)=1\}+\pi_1 P_1\{C_\psi(X)=0\}
= \pi_0 P_0\{(\langle X-\mu_0,\psi\rangle-\langle\mu,\psi\rangle/2)\langle\mu,\psi\rangle>\langle\psi,R\psi\rangle\log(\pi_0/\pi_1)\}
+\pi_1 P_1\{(\langle X-\mu_1,\psi\rangle+\langle\mu,\psi\rangle/2)\langle\mu,\psi\rangle<\langle\psi,R\psi\rangle\log(\pi_0/\pi_1)\}
\]
\[
= \pi_0 P_0\Bigl\{\frac{\langle X-\mu_0,\psi\rangle}{\langle\psi,R\psi\rangle^{1/2}}>\frac{\langle\psi,R\psi\rangle^{1/2}}{|\langle\mu,\psi\rangle|}\log(\pi_0/\pi_1)+\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\Bigr\}
+\pi_1 P_1\Bigl\{\frac{\langle X-\mu_1,\psi\rangle}{\langle\psi,R\psi\rangle^{1/2}}<\frac{\langle\psi,R\psi\rangle^{1/2}}{|\langle\mu,\psi\rangle|}\log(\pi_0/\pi_1)-\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\Bigr\}
\]
\[
= \pi_0\Bigl[1-\Phi\Bigl\{\frac{\langle\psi,R\psi\rangle^{1/2}}{|\langle\mu,\psi\rangle|}\log(\pi_0/\pi_1)+\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\Bigr\}\Bigr]
+\pi_1\Phi\Bigl\{\frac{\langle\psi,R\psi\rangle^{1/2}}{|\langle\mu,\psi\rangle|}\log(\pi_0/\pi_1)-\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\Bigr\}.
\]
Since the function $\pi_0[1-\Phi\{z^{-1}\log(\pi_0/\pi_1)+z/2\}]+\pi_1\Phi\{z^{-1}\log(\pi_0/\pi_1)-z/2\}$ is decreasing in $z>0$, the minimization of the misclassification probability is equivalent to the maximization of $|\langle\mu,\psi\rangle|/\langle\psi,R\psi\rangle^{1/2}$, as in the case of equal prior probabilities discussed in the main body of the paper. If $\|R^{-1/2}\mu\|<\infty$, the upper bound for the above fraction is $\|R^{-1/2}\mu\|$ and the corresponding misclassification probability equals
\[
\pi_0\Bigl[1-\Phi\Bigl\{\frac{\log(\pi_0/\pi_1)}{\|R^{-1/2}\mu\|}+\frac{\|R^{-1/2}\mu\|}{2}\Bigr\}\Bigr]
+\pi_1\Phi\Bigl\{\frac{\log(\pi_0/\pi_1)}{\|R^{-1/2}\mu\|}-\frac{\|R^{-1/2}\mu\|}{2}\Bigr\}.
\]
When $\|R^{-1/2}\mu\|<\infty$, that is, when the Gaussian measures with means $\mu_0$, $\mu_1$ and covariance $R$ are mutually absolutely continuous, this is the optimal misclassification probability among all classifiers, i.e., the Bayes error, as shown in Theorem 2 of Berrendero et al. (2018). The Bayes error is achieved by $\psi=R^{-1}\mu$, if $\|R^{-1}\mu\|<\infty$.
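For curves evaluated on a common grid, the decision rule above can be sketched in a few lines of Python. This is an illustration of ours, not part of the supplement; the trapezoidal discretization of inner products and all argument names are assumptions.

```python
# A minimal sketch of the projection classifier with unequal prior probabilities.
import numpy as np

def inner(f, g, t):
    # L2 inner product approximated by the trapezoidal rule
    return np.trapz(f * g, t)

def classify(x, psi, mu0, mu1, R, t, pi0=0.5, pi1=0.5):
    """Assign x to class 1 iff <x - mubar, psi><mu, psi> > <psi, R psi> log(pi0/pi1)."""
    mu = mu1 - mu0
    mubar = 0.5 * (mu0 + mu1)
    lhs = inner(x - mubar, psi, t) * inner(mu, psi, t)
    Rpsi = np.trapz(R * psi[np.newaxis, :], t, axis=1)   # (R psi)(s) = int R(s,t) psi(t) dt
    rhs = inner(psi, Rpsi, t) * np.log(pi0 / pi1)
    return int(lhs > rhs)
```

With pi0 = pi1 the threshold term vanishes and the rule reduces to the equal-prior classifier of the main paper.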
We can proceed as in the case of equal probabilities and apply regularization techniques to the inverse problem $R\psi=\mu$. All theoretical results presented for the case of equal probabilities can be restated and reproved with the above form of the optimal error rate for the general case, including the situation with $\|R^{-1/2}\mu\|=\infty$, in which case the optimal error rate is zero and the two Gaussian measures are mutually singular. In the empirical version of the problem one either estimates the prior class probabilities by $n_j/(n_0+n_1)$, if the training sample can be seen as a sample from the mixture of populations with these probabilities, or uses some fixed values.

S2. SELECTION OF THE REGULARIZATION PARAMETER AND DOMAIN BY CROSS-VALIDATION

Given the target domain $I$, the regularization method and the regularization parameter, Algorithm S1 describes the estimation of the misclassification probability by cross-validation.

Algorithm S1. Estimation of the misclassification probability by cross-validation
1. Set $V=\{(j,i): j\in\{0,1\},\ i\in\{1,\dots,n_j\},\ O_{ji}\supseteq I\}$.
2. Repeat for $(j,i)\in V$:
   (a) Estimate the mean and covariance function restricted to $I$ using all training functions except $X_{ji}$.
   (b) Estimate the projection direction $\hat\psi$ using the given regularization method and regularization parameter.
   (c) Apply $\hat C_{\hat\psi}$ to the restriction of $X_{ji}$ to $I$ and save the predicted class label to $c_{ji}$.
   (d) Set the misclassification indicator $\delta_{ji}=1_{[c_{ji}\neq j]}$.
3. Output $\sum_{(j,i)\in V}\delta_{ji}/|V|$.

The misclassification probability is estimated for a grid of values of the regularization parameter using Algorithm S1. The value that minimizes the error is selected. When selecting the domain as well, one repeats the above process for each candidate domain in place of $I$. Once the regularization parameter and possibly the domain are selected, the classifier is re-estimated using all training curves and applied to the new curve $X_{\mathrm{new}}$.
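A rough Python sketch of this cross-validation loop is given below. The data representation (value vector, boolean observation mask and class label per curve) and the placeholder fit_classifier are assumptions of ours, standing in for any of the regularized classifiers of the paper; this is not the authors' code.

```python
# Sketch of Algorithm S1 (assumed data layout and placeholder classifier factory).
import numpy as np

def cv_error(curves, domain_mask, fit_classifier, reg_par):
    # curves: list of (values, observed_mask, label); domain_mask marks the target domain I
    V = [k for k, (_, o, _) in enumerate(curves) if np.all(o[domain_mask])]
    errors = []
    for k in V:
        train = [c for j, c in enumerate(curves) if j != k]   # all training curves except X_ji
        clf = fit_classifier(train, domain_mask, reg_par)      # estimates mean, covariance, psi-hat
        x, _, label = curves[k]
        errors.append(int(clf(x[domain_mask]) != label))       # misclassification indicator
    return np.mean(errors) if errors else np.nan
```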
S3. ADDITIONAL SIMULATION RESULTS

S3.1. Processes with non-smooth trajectories

Fig. S1 presents simulation results comparing the behaviour of classifiers on the conjugate gradient, principal component and ridge regularization paths for Gaussian processes with non-smooth trajectories. We considered the Ornstein–Uhlenbeck process with covariance function $\rho(s,t)=\exp(-|s-t|)$. We used the same configurations for the mean difference between the classes as in Subsection 5.1 of the main body of the paper, except in cases (v) and (vi), where the mean difference now was the first and tenth eigenfunction of the Ornstein–Uhlenbeck covariance kernel. The main conclusion of Subsection 5.1 of the paper remains valid in this situation. All three regularization methods reach about the same best error rate, but the conjugate gradient method does so with fewer degrees of freedom than the other methods. The principal component method appears to be less stable than in the case of the smooth process of Subsection 5.1, which can probably be explained by the increased error in the estimation of the eigenfunctions.

[Fig. S1: Misclassification rate (%) versus degrees of freedom for non-smooth processes for different forms of $\mu(t)$, (i) linear, (ii) quadratic, (iii) cubic, (iv) sinusoidal, (v) first eigenfunction, (vi) tenth eigenfunction, (vii) symmetric beta, (viii) asymmetric beta, for principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers.]

S3.2. Behaviour under different covariance operators in groups

The methods presented in the paper are derived under the assumption of equal covariance operators in both groups. Fig. S2 shows simulation results when this assumption is violated. We used Gaussian processes with covariance function $\exp(-|s-t|^2/0.01)$ in one group and $\exp(-|s-t|)$ in the other group. We considered the same scenarios for the mean difference as in Subsection 5.1 of the paper, except for scenarios (v) and (vi), where the mean difference was the first and tenth eigenfunction of the mixture covariance $0.5\exp(-|s-t|^2/0.01)+0.5\exp(-|s-t|)$. We conclude that the findings of Subsection 5.1 are robust with respect to the assumption of equal covariance operators. The principal component classifier again appears to be the least preferable method. Moreover, the error rates in this situation with different covariances lie between the error rates obtained when the two groups both have one of the considered covariance structures. Hence, if there is a difference in the means, unequal covariances do not appear to have a serious negative effect on the performance of the classifiers.

[Fig. S2: Misclassification rate (%) versus degrees of freedom for processes with unequal covariance operators for different forms of $\mu(t)$, (i) linear, (ii) quadratic, (iii) cubic, (iv) sinusoidal, (v) first eigenfunction, (vi) tenth eigenfunction, (vii) symmetric beta, (viii) asymmetric beta, for principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers.]

S3.3. Performance under increasing training sample size

We performed additional simulations to study the effect of the training sample size. Fig. S3 presents results for the same settings as in Subsection 5.1 of the paper but with 100 training observations in each group, twice as many as in the paper. Overall, the misclassification rates in Fig. S3 are slightly lower than in Fig. 1 of the paper due to the reduction of the estimation error. The difference is, however, small, suggesting that at the considered training sample sizes the estimation error is a relatively unimportant part of the total misclassification error.

[Fig. S3: Misclassification rate (%) versus degrees of freedom for 200 training observations for different forms of $\mu(t)$, (i) linear, (ii) quadratic, (iii) cubic, (iv) sinusoidal, (v) first eigenfunction, (vi) tenth eigenfunction, (vii) symmetric beta, (viii) asymmetric beta, for principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers.]
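For reference, Gaussian sample paths with the covariance functions used above can be simulated on a grid as in the following sketch; the grid, sample sizes, jitter term and random seed are our assumptions and are not taken from the supplement.

```python
# Assumed simulation sketch for the covariance kernels of Sections S3.1-S3.2.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
S, T = np.meshgrid(t, t, indexing="ij")

cov_ou = np.exp(-np.abs(S - T))                  # Ornstein-Uhlenbeck kernel (rough paths)
cov_smooth = np.exp(-np.abs(S - T) ** 2 / 0.01)  # smooth kernel used in one group

def sample_paths(cov, n):
    # multivariate normal on the grid; small jitter keeps the matrix numerically valid
    return rng.multivariate_normal(np.zeros(len(t)), cov + 1e-8 * np.eye(len(t)), size=n)

X_rough = sample_paths(cov_ou, 50)
X_smooth = sample_paths(cov_smooth, 50)
```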
S4. PERFORMANCE ON BENCHMARK DATA

We applied the proposed methods to two datasets, referred to as the wheat data and the phoneme data, on which Delaigle & Hall (2012) and Berrendero et al. (2018) previously compared functional classifiers. See these papers for references to the original sources of the data. We repeated with our classifiers their procedure, which consisted of randomly splitting the data into a training set and a test set, building the classifier on the training set and applying it to the test set to compute the proportion of misclassified curves, and repeating this whole process two hundred times to estimate the misclassification rate. Table S1 reports the results. We can see that misclassification rates decrease with increasing training sample size. Overall, on these data all classifiers appear to perform similarly, and similarly to other methods studied in Delaigle & Hall (2012) and Berrendero et al. (2018). The ridge method might seem to perform slightly worse than the other two on the wheat data, but in view of the standard errors we do not over-interpret this and other differences.

Table S1. Misclassification rate (%) and its standard error achieved for wheat and phoneme data

            Training sample size    PC            CG            R
Wheat       30                      0.94 (1.89)   0.93 (2.06)   2.48 (2.79)
            50                      0.36 (1.23)   0.58 (1.84)   1.73 (3.02)
Phoneme     30                      24.1 (4.79)   23.3 (3.87)   22.1 (2.90)
            50                      21.7 (2.76)   21.6 (2.12)   21.0 (2.07)
            100                     20.1 (1.67)   20.1 (1.51)   20.1 (1.55)

PC, principal components; CG, conjugate gradients; R, ridge.

REFERENCES

BERRENDERO, J. R., CUEVAS, A. & TORRECILLA, J. L. (2018). On the use of reproducing kernel Hilbert spaces in functional classification. Journal of the American Statistical Association. To appear.
DELAIGLE, A. & HALL, P. (2012). Achieving near perfect classification for functional data. Journal of the Royal Statistical Society. Series B. Statistical Methodology 74, 267–286.

E.
Inferential procedures for partially observed functional data

By David Kraus
Journal of Multivariate Analysis, 173:583–603, 2019
DOI: 10.1016/j.jmva.2019.05.002

Inferential procedures for partially observed functional data

David Kraus
Department of Mathematics and Statistics, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic

Article history: Received 19 September 2018; Received in revised form 14 May 2019; Accepted 15 May 2019; Available online 27 May 2019.
AMS 2010 subject classifications: primary 62M99; secondary 62G10.
Keywords: Bootstrap; Covariance operator; Functional data; K-sample test; Partial observation; Principal components.

Abstract

In functional data analysis it is usually assumed that all functions are completely, densely or sparsely observed on the same domain. Recent applications have brought attention to situations where each functional variable may be observed only on a subset of the domain while no information about the function is available on the complement. Various advanced methods for such partially observed functional data have already been developed but, interestingly, some essential methods, such as K-sample tests of equal means or covariances and confidence intervals for eigenvalues and eigenfunctions, are lacking. Without requiring any complete curves in the data, we derive asymptotic distributions of estimators of the mean function, covariance operator and eigenelements and construct hypothesis tests and confidence intervals. To overcome practical difficulties with storing large objects in computer memory, which arise due to partial observation, we use the nonparametric bootstrap approach. The proposed methods are investigated theoretically, in simulations and on a fragmentary functional data set from medical research.

1. Introduction

Functional data analysis is an established field [17,28,34,54] with well-developed methodologies for common types of observation of random curves, i.e., full (or dense) and sparse observation regimes. Due to new applications, recent years have seen the emergence of a new type of observation of functional data, called functional fragments or partially observed functional data. For various examples see Bugni [6], Delaigle and Hall [14], Liebl [38], Gellar et al. [21], Goldberg et al. [23], Kraus [35], Delaigle and Hall [15], Gromenko et al. [24], Kneip and Liebl [32], Dawson and Müller [13], Mojirsheibani and Shaw [45], Stefanucci et al. [55], Descary and Panaretos [16], Kraus and Stefanucci [37] or Liebl and Rameseder [40].

Functional data are collections of observations of random elements of a function space, such as curves, images, surfaces or spatio-temporal fields. We consider random functions in a separable Hilbert space. Without loss of generality we work with the space $L^2([0,1])$ of square-integrable functions on $[0,1]$ equipped with the inner product $\langle f,g\rangle=\int_0^1 f(t)g(t)\,\mathrm{d}t$ and norm $\|f\|=\langle f,f\rangle^{1/2}$, but our results are applicable to more general spaces. Partially observed functional data consist of realizations of random functions that are not observed on the entire domain. Each function in the sample may be observed on a different subset of the domain and no information is available on the function values at arguments in the complement of this subset.
For the $i$th functional variable $X_i\in L^2([0,1])$ there is a subset $O_i\subseteq[0,1]$ such that $X_i(t)$ is observed for $t\in O_i$ and not observed for $t\in[0,1]\setminus O_i$. The observation sets may be random, corresponding to data that are missing by happenstance, or non-random for designed experiments. We assume that the observation sets are mutually independent and independent of the curves. We refer to Liebl and Rameseder [40] for a study of the case of dependent missingness.

Although some advanced procedures, such as goodness-of-fit tests, regression, classification and reconstruction methods, have been developed for functional fragments, basic methods of inference about the fundamental characteristics of functional variables are still missing. In particular, the asymptotic distribution of estimators of the mean function and covariance operator, K-sample tests of equal means or covariances, and confidence intervals for eigenvalues and eigenfunctions have not been studied yet in the setting of incomplete functions. Users who wish to perform these basic tasks currently have only one option: to omit the partially observed functions and apply existing procedures to the complete data only. This approach is not only clearly sub-optimal due to a possibly large loss of information and the resulting decay of power and accuracy, but also hardly or totally inapplicable in situations where the data contain few or no complete curves. In this paper, we address this deficiency of existing methodology and develop essential methods of inference about the mean and covariance structure of incomplete functional data.

Random functions are characterized by the mean function $\mu=\mathrm{E}\,X$ and the covariance operator $R: L^2([0,1])\to L^2([0,1])$ defined as $(Rf)(\cdot)=\int_0^1\rho(\cdot,t)f(t)\,\mathrm{d}t$, $f\in L^2([0,1])$, where $\rho(s,t)=\mathrm{cov}\{X(s),X(t)\}$ is the covariance function, assuming it exists. The covariance structure is best understood via principal component analysis, or the eigendecomposition of $R$ in the form
\[
R=\sum_{m=1}^{\infty}\lambda_m\,\phi_m\otimes\phi_m,
\]
where $\lambda_1\ge\lambda_2\ge\cdots\ge 0$ are the eigenvalues, $\phi_m$ are the corresponding orthonormal eigenfunctions, and $(a\otimes b)f=\langle b,f\rangle a$ for $a,b,f\in L^2([0,1])$. For a theoretical background see, e.g., Bosq [5].

We find appropriate assumptions on the observation pattern that enable us to establish the asymptotic distribution of estimators of $\mu$ and $R$. We develop tests for comparing the mean functions in K populations of functional data based on samples of fragments. Next, we propose several tests of equal covariance operators in K samples. We also construct confidence intervals for the eigenvalues and eigenfunctions estimated from incomplete data.

The practical implementation of methods for functional fragments is more complicated than for complete curves. The main difficulty is that temporal averaging (e.g., in inner products for dimension reduction) is impossible due to missing values. This leads to asymptotic distributions whose parameters follow rather complicated formulas. More importantly, since dimension reduction is not possible, the asymptotic distributions are, upon discretization, characterized by large objects (matrices or arrays) that are difficult or even impossible to store and manipulate in computer memory. The bootstrap turns out to be a solution to this problem.
We provide specific algorithms for resampling functional fragments for mean and covariance testing and for confidence intervals for eigenelements. In a simulation study we investigate the performance of the proposed tests, focusing in particular on the impact of missingness on the different tests and on the effect of the interplay between missingness and the form of differences between groups. The study shows that the proposed methods are superior to the currently only available approach based on omitting incomplete curves.

The proposed methodology is applied to a data set of temporal profiles of heart rate. The data consist of several hundred curves recorded by an automatic device during several hours in the evening during the transition from the day to the night regime of heart activity. The profiles are not always available on the entire domain of interest because the device either did not measure or did not record measurements, or the person switched off the device. These fragmentary data were previously analysed in Kraus [35], where further details can be found.

Section 2 develops methods of inference about means in one and K samples. Section 3 deals with tests about covariance operators and with inference about principal components. Section 4 presents bootstrap approximations. Results of the simulation study and the data example are reported in Sections 5 and 6. In the Appendix we provide a central limit theorem for non-identically distributed functional variables needed in the asymptotic analysis of fragments, and proofs of all theorems. Additional simulation results and further results of the data analysis are also provided.

2. Mean inference from incomplete curves

2.1. Estimation of the mean function

In this section we focus on inference about the mean of functional data. Let us first consider estimation of the mean function $\mu$ of a homogeneous population. Let there be $n$ independent functional observations. Each curve $X_i$, $i\in\{1,\dots,n\}$, may be observed incompletely, with values known only for arguments in a subset $O_i\subseteq[0,1]$, with no information on the complement of $O_i$. The observation sets may be non-random or random. They are assumed to be mutually independent and independent of the curves and to consist of a finite union of intervals. We denote by $O_i(t)$ the indicator that the value of $X_i(t)$ is observed. The mean function $\mu(t)$ can be estimated by the cross-sectional average of the available observations
\[
\hat\mu(t)=\frac{J(t)}{N(t)}\sum_{i=1}^n O_i(t)X_i(t),
\]
where $N(t)=\sum_{i=1}^n O_i(t)$ is the number of available observations at time $t$ and $J(t)=1_{[N(t)>0]}$. The estimator is defined to be zero when $N(t)=0$. In Kraus [35, Proposition 1] it was shown that under non-restrictive assumptions on the observation pattern the estimator $\hat\mu$ is consistent for the mean function $\mu$; namely, it was proven that $\mathrm{E}\|\hat\mu-\mu\|^2=O(n^{-1})$ as $n\to\infty$. We now aim to provide the asymptotic distribution of the estimator. The result will be essential in the derivation of the limiting distribution of the test statistics that we construct afterwards.

We denote $\pi_i(t)=\mathrm{E}\,O_i(t)=\Pr\{O_i(t)=1\}$ and $\bar\pi(t)=n^{-1}\sum_{i=1}^n\pi_i(t)$. Furthermore, we denote by $U_i(s,t)=O_i(s)O_i(t)$ the indicator of observing the function values at the pair of arguments $s$ and $t$, and define $\nu_i(s,t)=\mathrm{E}\,U_i(s,t)$, $\bar\nu(s,t)=n^{-1}\sum_{i=1}^n\nu_i(s,t)$ and $M(s,t)=\sum_{i=1}^n U_i(s,t)$.
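For illustration, a minimal Python sketch (our assumption of how fragments might be stored, not the paper's implementation) of the estimators $\hat\mu(t)$, $N(t)$ and $\hat\pi(t)$ for curves discretized on a grid, with np.nan marking unobserved values:

```python
# Assumed NaN-masked storage: X is an (n x q) array, np.nan outside each O_i.
import numpy as np

def mean_fragments(X):
    O = ~np.isnan(X)                    # O_i(t): observation indicators
    N = O.sum(axis=0)                   # N(t): number of curves observed at t
    J = N > 0
    mu_hat = np.zeros(X.shape[1])
    mu_hat[J] = np.nansum(X[:, J], axis=0) / N[J]   # hat mu(t); zero where N(t) = 0
    pi_hat = N / X.shape[0]             # hat pi(t) = N(t)/n
    return mu_hat, N, pi_hat
```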
We need to introduce conditions on the observation pattern as follows.

Condition 1.
(a) Let there be a function $\pi(t)$ such that $\pi_0=\inf_{t\in[0,1]}\pi(t)>0$ and $\sup_{t\in[0,1]}|\bar\pi(t)-\pi(t)|\to 0$ as $n\to\infty$.
(b) Let there be a function $\nu(s,t)$ such that $\bar\nu(s,t)\to\nu(s,t)$ for all $s,t\in[0,1]$.
(c) Let there be a value $\nu_0>0$ such that for each $(s,t)\in[0,1]^2$ either $\nu(s,t)\ge\nu_0$ or $\nu(s,t)=0$, and let $\sup_{(s,t)\in[0,1]^2}|\bar\nu(s,t)-\nu(s,t)|\to 0$ as $n\to\infty$.

Condition (a) guarantees the consistency of the estimator $\hat\mu$, see Kraus [35]. Condition (b) is needed for the weak convergence of the estimator. Condition (c) is needed for consistent estimation of the covariance operator of the limiting distribution. We emphasize that no complete curves are required, since these conditions may be satisfied even when the sample contains only fragments. We illustrate this attractive property in the simulation study in Section 5. When the observation indicators $O_1,\dots,O_n$ are identically distributed, Condition (a) is satisfied if $\pi(t)=\Pr\{O_i(t)=1\}$ is bounded away from zero, Condition (b) is satisfied automatically, and Condition (c) is satisfied if for each $(s,t)\in[0,1]^2$, $\nu(s,t)=\Pr\{O_i(s)=1,O_i(t)=1\}$ is either bounded away from zero or equal to zero. The case of non-identically distributed observation indicators may be relevant, for example, for designed experiments in which non-random, designed observation sets may vary across subjects. By $\|\cdot\|_2$ below we denote the Hilbert–Schmidt norm of an operator.

Theorem 1. Assume that $\mathrm{E}(\|X_1\|^2)<\infty$. Let Conditions 1(a) and 1(b) hold. Then $n^{1/2}\{\hat\mu(\cdot)-\mu(\cdot)\}$ and $N(\cdot)^{1/2}\{\hat\mu(\cdot)-\mu(\cdot)\}$ are asymptotically distributed as mean zero Gaussian processes with covariance operators $K'$, $K$ with kernels
\[
\kappa'(s,t)=\pi(s)^{-1}\pi(t)^{-1}\nu(s,t)\rho(s,t),
\qquad
\kappa(s,t)=\pi(s)^{-1/2}\pi(t)^{-1/2}\nu(s,t)\rho(s,t),
\]
respectively. If, moreover, Condition 1(c) is satisfied, then $K'$ and $K$ can be consistently estimated by the operators $\hat K'$ and $\hat K$ with kernels $\hat\kappa'(s,t)=\hat\pi(s)^{-1}\hat\pi(t)^{-1}\hat\nu(s,t)\hat\rho(s,t)$ and $\hat\kappa(s,t)=\hat\pi(s)^{-1/2}\hat\pi(t)^{-1/2}\hat\nu(s,t)\hat\rho(s,t)$, respectively, i.e., $\mathrm{E}\|\hat K'-K'\|_2^2\to 0$ and $\mathrm{E}\|\hat K-K\|_2^2\to 0$, where $\hat\pi(t)=N(t)/n$, $\hat\nu(s,t)=M(s,t)/n$, $\hat\rho(s,t)$ is the empirical covariance based on all complete pairs of function values at $s,t$, and the value of the kernels is set to 0 whenever $\hat\pi(s)$ or $\hat\pi(t)$ is 0.

The proof of this and the other theorems is provided in the Appendix. Since the observable functional variables may be non-identically distributed due to possibly non-identically distributed observation indicators, the proof uses a central limit theorem for non-identically distributed functional random variables given in the Appendix.

Notice that the covariance kernels $\kappa'(s,t)$ and $\kappa(s,t)$ of the limiting distributions are zero when $\nu(s,t)=0$, regardless of the value of $\rho(s,t)$. Therefore, it is not necessary to estimate $\rho(s,t)$ at such points. This is why Condition 1(c) does not require the function $\nu(s,t)$ to be bounded away from zero on the entire domain $[0,1]^2$, which is needed for the estimation of $R$, as will be seen in Section 3, Condition 2(a). This means that the theorem applies also in the context of short fragments of curves considered, e.g., by Delaigle and Hall [15] or Descary and Panaretos [16], where each curve in the sample is observed on a short interval and no completely observed curves are available.

2.2. Tests of equality of means in several populations

Let us now consider K independent samples of functional data. Let the jth sample ($j\in\{1,\dots,K\}$) consist of independent curves
$X_{j1},\dots,X_{jn_j}$ coming from the same distribution with mean $\mu_j$ and covariance operator $R_j$. The functions may not be observed completely. It is assumed that for each function $X_{ji}$ its values are available on a subset $O_{ji}$. Let the observation subsets be mutually independent and independent of the curves. Our aim is to test the null hypothesis that $\mu_1=\cdots=\mu_K$ against the general alternative that the null does not hold.

The literature on hypothesis testing for means of functional data is rich. See, for example, [2,3,8,9,18,28,39,43,49,52,53,56,57,59]. In the literature on complete functional samples there exist two main approaches to comparing mean functions. One is based on the $L^2$ distance between the means and one uses projections on finite-dimensional subspaces.

The assessment of the hypothesis will be based on the contrasts of the group means and a null estimate of the common mean, i.e., on the differences $\hat\mu_j-\hat\mu$, $j\in\{1,\dots,K\}$. Here we use $\hat\mu_j(t)=J_j(t)N_j(t)^{-1}\sum_{i=1}^{n_j}O_{ji}(t)X_{ji}(t)$, $j\in\{1,\dots,K\}$, with $N_j(t)=\sum_{i=1}^{n_j}O_{ji}(t)$ and $J_j(t)=1_{[N_j(t)>0]}$. The estimator $\hat\mu$ is obtained as a weighted average of the group means in the form $\hat\mu(t)=\sum_{j=1}^K\hat w_j(t)\hat\mu_j(t)$ with weights
\[
\hat w_j(t)=\frac{N_j(t)/\hat r_j^2}{\sum_{k=1}^K N_k(t)/\hat r_k^2},
\]
where $\hat r_j^2=\mathrm{tr}\,\hat R_j$ is the trace of the estimated covariance operator in the jth sample (the estimators $\hat R_j$ are discussed later). The role of the scaling by $\hat r_j^2$ is to account for possibly different covariance structures in the samples. This way of combining estimated means of heteroscedastic samples is inspired by the univariate case and its standard multivariate extensions. If the covariance structures are known to be the same in all samples, the factors $\hat r_j^2$ can be replaced by the trace of an estimator of the common covariance operator, which leads to the estimated mean based on the pooled sample of curves.

The first test we propose is inspired by the method of Cuevas et al. [9], who in the context of fully observed functional data developed an ANOVA test based on the $L^2$ norms of the contrasts of the group means and the pooled sample mean. A two-sample version of the test using the nonparametric bootstrap was proposed by Benko et al. [3]. Horváth et al. [29] studied a two-sample test based on the $L^2$ norm in the context of functional time series. The standardized contrast processes $N_j(\cdot)^{1/2}\{\hat\mu_j(\cdot)-\hat\mu(\cdot)\}/\hat r_j$, $j\in\{1,\dots,K\}$, can be collected into a K-dimensional vector that is a random element of the product space $\{L^2([0,1])\}^K$ with inner product $\langle f,g\rangle=\sum_{j=1}^K\langle f_j,g_j\rangle$ for $f=(f_1,\dots,f_K)^\top$, $g=(g_1,\dots,g_K)^\top$. We use its $L^2$ norm as the test statistic, i.e., we base the test on
\[
T_{L^2}=\sum_{j=1}^K\bigl\|N_j(\cdot)^{1/2}\{\hat\mu_j(\cdot)-\hat\mu(\cdot)\}/\hat r_j\bigr\|^2
=\sum_{j=1}^K\int_0^1 N_j(t)\{\hat\mu_j(t)-\hat\mu(t)\}^2/\hat r_j^2\,\mathrm{d}t \tag{1}
\]
and reject when the value of the statistic is significantly large.
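For curves discretized on a common grid, the weighted pooled mean and the statistic (1) can be sketched as follows. This is an assumed discretization of ours (trapezoidal integration and NaN coding of unobserved values), not the paper's implementation, and the traces $\hat r_j^2$ are supposed to be computed elsewhere.

```python
# Sketch of the L2-norm statistic (1) for K groups of fragments on a grid t.
import numpy as np

def t_l2(groups, r2, t):
    # groups: list of (n_j x q) arrays with np.nan outside the observation sets
    N = [(~np.isnan(X)).sum(axis=0) for X in groups]
    mu = [np.where(Nj > 0, np.nansum(X, axis=0) / np.maximum(Nj, 1), 0.0)
          for X, Nj in zip(groups, N)]
    w = np.array([Nj / r2j for Nj, r2j in zip(N, r2)], dtype=float)
    w /= np.maximum(w.sum(axis=0), 1e-12)             # hat w_j(t)
    mu_pool = sum(wj * muj for wj, muj in zip(w, mu))  # null estimate of the common mean
    return sum(np.trapz(Nj * (muj - mu_pool) ** 2 / r2j, t)
               for Nj, muj, r2j in zip(N, mu, r2))
```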
Another main approach to curve mean testing uses dimension reduction. See, e.g., Aue et al. [2], Horváth and Kokoszka [28] or Horváth et al. [29]. The idea is to focus on a finite number of important features of the infinite-dimensional data. The functional observations are projected on a finite-dimensional subspace and multivariate ANOVA or a similar multivariate procedure is applied to the resulting vectors of Fourier scores. This strategy is not directly applicable in the situation of incompletely observed curves because, unlike in the fully observed case, Fourier scores of functional fragments cannot be computed by numerical integration as inner products of the functional variable and the basis function, since the functional variable is not available on the entire domain.

Let $\hat\psi_1,\dots,\hat\psi_d$ be some linearly independent functions in $L^2([0,1])$. Without loss of generality we assume that they are orthonormal. These functions may be either deterministic or random (estimated from the data). In the construction of our projection tests we use Fourier scores of the standardized contrast processes with respect to the basis functions $\hat\psi_l$. We denote these scores
\[
Q_{jl}=\bigl\langle N_j(\cdot)\{\hat\mu_j(\cdot)-\hat\mu(\cdot)\},\hat\psi_l\bigr\rangle/(\hat r_j n_j^{1/2}),
\qquad j\in\{1,\dots,K\},\ l\in\{1,\dots,d\},
\]
and collect them in the score vector $Q=(Q_{11},\dots,Q_{1d},\dots,Q_{K1},\dots,Q_{Kd})^\top$. The score statistic is the quadratic form
\[
T_d=Q^\top\hat V^- Q, \tag{2}
\]
where $\hat V^-$ is the Moore–Penrose pseudoinverse of the estimated $(Kd)\times(Kd)$ covariance matrix of $Q$ whose entry at the position with index $(jl,km)$ is
\[
\hat V_{jl,km}=\langle\hat\pi_j^{1/2}\hat\psi_l,\hat V_{jk}(\hat\pi_k^{1/2}\hat\psi_m)\rangle
=\int_{[0,1]^2}\hat\pi_j(s)^{1/2}\hat\psi_l(s)\hat v_{jk}(s,t)\hat\psi_m(t)\hat\pi_k(t)^{1/2}\,\mathrm{d}s\,\mathrm{d}t
\]
for $j,k\in\{1,\dots,K\}$, $l,m\in\{1,\dots,d\}$. Here $\hat V_{jk}$ is the covariance operator with kernel
\[
\hat v_{jk}(s,t)=\sum_{l=1}^K\hat r_j^{-1}\{\delta_{jl}-N_j(s)^{1/2}\hat w_l(s)N_l(s)^{-1/2}\}\hat\kappa_l(s,t)\{\delta_{kl}-N_k(t)^{1/2}\hat w_l(t)N_l(t)^{-1/2}\}\hat r_k^{-1}, \tag{3}
\]
where $\delta_{jk}$ is the Kronecker delta. The test rejects for large values of $T_d$.

Analogously to the case of one group considered in Section 2.1, we denote for $j\in\{1,\dots,K\}$, $i\in\{1,\dots,n_j\}$ the following quantities characterizing the observation patterns in each group: $\pi_{ji}(t)=\mathrm{E}\,O_{ji}(t)=\Pr\{O_{ji}(t)=1\}$, $\bar\pi_j(t)=n_j^{-1}\sum_{i=1}^{n_j}\pi_{ji}(t)$, $U_{ji}(s,t)=O_{ji}(s)O_{ji}(t)$, $\nu_{ji}(s,t)=\mathrm{E}\,U_{ji}(s,t)$, $\bar\nu_j(s,t)=n_j^{-1}\sum_{i=1}^{n_j}\nu_{ji}(s,t)$ and $M_j(s,t)=\sum_{i=1}^{n_j}U_{ji}(s,t)$. Under mild assumptions we obtain the asymptotic distribution of both test statistics.

Theorem 2. For $j\in\{1,\dots,K\}$ assume that $n_j\to\infty$, $n_j/(n_1+\cdots+n_K)\to a_j>0$ and $\mathrm{E}\|X_{j1}\|^2<\infty$. Let the observation patterns in each group satisfy Condition 1. Then under the null hypothesis of equal means we obtain the following results:
(i) The test statistic $T_{L^2}$ is asymptotically distributed as $\sum_{k=1}^\infty\gamma_k C_k$, where $C_k$ are independent chi-square distributed variables with one degree of freedom and $\gamma_k$ can be consistently estimated by the eigenvalues of the operator $\hat V$ given in (3).
(ii) Assume that there exist linearly independent non-random functions $\psi_1,\dots,\psi_d$ such that $\|\hat\psi_l-\psi_l\|\to 0$ in probability for $l\in\{1,\dots,d\}$. Then the test statistic $T_d$ is asymptotically chi-square distributed with $(K-1)d$ degrees of freedom.

The test statistic based on the $L^2$ norm is not distribution-free, but the critical values can be obtained straightforwardly by simulation, provided that the eigenvalues of $\hat V$ consistently estimate $\gamma_k$. Similarly, the consistency of the operator $\hat V$ (and hence of the matrix $\hat V$) is needed for the score statistic. The consistency of $\hat V$ is guaranteed by Condition 1(c). It may sometimes happen that $M_j(s,t)$ is low for some $s,t$, making the estimator $\hat V$ less reliable. For this reason, and also for computational reasons, to avoid the estimation of the limiting covariance one can use the bootstrap method, as we describe in Section 4.
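The projection scores $Q_{jl}$ defined earlier in this subsection can be sketched as follows for a user-supplied orthonormal basis evaluated on the grid. This is our illustration, not the paper's code; the inputs mu, mu_pool and N come from the previous sketch, n holds the group sizes $n_j$ and r2 the estimated traces of the group covariance operators.

```python
# Sketch of the scores Q_jl = <N_j(.) {mu_j(.) - mu(.)}, psi_l> / (r_j n_j^{1/2}).
import numpy as np

def projection_scores(mu, mu_pool, N, n, r2, psi, t):
    # psi: (d x q) array of orthonormal basis functions on the grid t
    Q = []
    for muj, Nj, nj, r2j in zip(mu, N, n, r2):
        contrast = Nj * (muj - mu_pool)
        for psi_l in psi:
            Q.append(np.trapz(contrast * psi_l, t) / (np.sqrt(r2j) * np.sqrt(nj)))
    return np.array(Q)   # ordered as (Q_11, ..., Q_1d, ..., Q_K1, ..., Q_Kd)
```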
In the literature on complete functional data, the most common choice of the basis functions for the projection test is derived from principal component analysis (see Horváth and Kokoszka [28] and references therein, or Fremdt et al. [19]). The approach uses several leading eigenfunctions of the pooled sample covariance operator. The motivation for this choice is the property that the first eigenfunctions capture the principal modes of variation, the most important features of random deviations of the functional variables from the mean. Another approach is to use a fixed set of basis functions, such as several elements of the Fourier basis of sines and cosines or several orthonormal Legendre polynomials.

For several reasons we prefer deterministic bases to the basis of eigenfunctions. One drawback of the latter approach is that the principal components of variability may be only weakly related or entirely unrelated (orthogonal) to the differences between the mean functions, resulting in a test that is weak or inconsistent against this alternative. It may of course happen that the deterministic functions we choose are orthogonal to the alternative too, or that the leading eigenfunctions capture the mean differences well. However, with fixed functions it is at least possible to say before the analysis which alternatives can be detected. With principal components it is not known beforehand which departures from the null can be captured, because the eigenfunctions are usually unknown. Moreover, their property of capturing the largest portion of variability, which is typically the main argument for using them, is not exactly what one wishes in mean testing. In fact, one would rather wish to maximize the signal-to-noise ratio or non-centrality, which, for example, in the case of components with equal magnitude of means would mean minimizing variability. In reality, the true interplay between the magnitude of the components of the mean difference and their variability is not known, and we therefore prefer fixed functions.

The choice of the number of basis functions is important with projection methods. For the approach using eigenfunctions, we follow the recommendation of Horváth et al. [29] to use the smallest number of components needed to explain at least 85% of the total variability. For the method using fixed functions, in light of the above discussion of the relation between power and variability, we do not base the choice of d on the explained variability. Instead, we can specify what shape differences we wish to detect and use the corresponding basis functions. For example, using just d = 3 Legendre polynomials, describing constant, monotonic as well as convex or concave non-monotonic differences, seems to be a good choice in many applications.

3. Covariance inference under partial observation

3.1. Asymptotics for the estimated covariance operator and principal components

Given a collection of independent realizations of curves $X_1,\dots,X_n$ with mean function $\mu$ and covariance operator $R$ observed on subsets $O_1,\dots,O_n$, the covariance function $\rho(s,t)$ can be estimated by the empirical covariance using pairwise complete observations, that is, by
\[
\hat\rho(s,t)=\frac{I(s,t)}{M(s,t)}\sum_{i=1}^n U_i(s,t)\{X_i(s)-\hat\mu_{st}(s)\}\{X_i(t)-\hat\mu_{st}(t)\},
\]
where $I(s,t)=1_{[M(s,t)>0]}$ and
\[
\hat\mu_{st}(s)=\frac{1_{[M(s,t)>0]}}{M(s,t)}\sum_{i=1}^n U_i(s,t)X_i(s).
\]
If $M(s,t)=0$, we define $\hat\rho(s,t)=0$ and $\hat\mu_{st}(s)=0$.
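A minimal sketch (again under the assumed NaN-masked storage of fragments, not the paper's implementation) of the pairwise-complete estimators $\hat\rho(s,t)$, $\hat\mu_{st}$ and $M(s,t)$:

```python
# Pairwise-complete covariance estimator for an (n x q) array with np.nan
# outside the observation sets.
import numpy as np

def covariance_fragments(X):
    O = (~np.isnan(X)).astype(float)
    Z = np.where(np.isnan(X), 0.0, X)
    M = O.T @ O                                   # M(s, t): number of complete pairs
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_st_s = (Z.T @ O) / M                 # hat mu_st(s): mean of X(s) over complete pairs
        mean_st_t = (O.T @ Z) / M                 # hat mu_st(t)
        rho = (Z.T @ Z) / M - mean_st_s * mean_st_t
    rho[M == 0] = 0.0                             # convention when no pair is observed
    return rho, M
```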
Under certain assumptions on the observation pattern, the operator $\hat R$ with kernel $\hat\rho(s,t)$ was shown to be a consistent estimator of $R$ in Kraus [35, Proposition 1]. In the theorem below we give the asymptotic distribution under a set of conditions, for which we denote $E_i(s,t,u,v)=O_i(s)O_i(t)O_i(u)O_i(v)$, the indicator that the observation of $X_i$ at the points $s,t,u,v$ is available, and set $\theta_i(s,t,u,v)=\Pr\{E_i(s,t,u,v)=1\}$, $\bar\theta(s,t,u,v)=n^{-1}\sum_{i=1}^n\theta_i(s,t,u,v)$ and $L(s,t,u,v)=\sum_{i=1}^n E_i(s,t,u,v)$.

Condition 2.
(a) Let there be a function $\nu(s,t)$ such that $\nu_0=\inf_{(s,t)\in[0,1]^2}\nu(s,t)>0$ and $\sup_{(s,t)\in[0,1]^2}|\bar\nu(s,t)-\nu(s,t)|\to 0$ as $n\to\infty$.
(b) Let there be a function $\theta(s,t,u,v)$ such that $\bar\theta(s,t,u,v)\to\theta(s,t,u,v)$ for all $s,t,u,v\in[0,1]$.
(c) Let there be a value $\theta_0>0$ such that for each $(s,t,u,v)\in[0,1]^4$ either $\theta(s,t,u,v)\ge\theta_0$ or $\theta(s,t,u,v)=0$, and let $\sup_{(s,t,u,v)\in[0,1]^4}|\bar\theta(s,t,u,v)-\theta(s,t,u,v)|\to 0$ as $n\to\infty$.

Condition (a) means that there are enough observations at all pairs of arguments. The condition is needed for the consistency of $\hat R$; see Kraus [35] for a proof under an essentially equivalent condition. Condition (b) guarantees the weak convergence in the theorem below, and the additional Condition (c) guarantees that the covariance of the asymptotic distribution can be estimated. We stress that these conditions do not require that the data contain any complete curves. They may be satisfied even in situations where all functional observations are fragmentary. When the observation indicators $O_1,\dots,O_n$ are identically distributed, Condition (a) is satisfied if $\nu(s,t)=\Pr\{O_i(s)=1,O_i(t)=1\}$ is bounded away from zero, Condition (b) is satisfied automatically and Condition (c) is satisfied if for each $(s,t,u,v)\in[0,1]^4$, $\theta(s,t,u,v)=\Pr\{O_i(s)=1,O_i(t)=1,O_i(u)=1,O_i(v)=1\}$ is either bounded away from zero or equal to zero.

Theorem 3. Assume that $\mathrm{E}(\|X_1\|^4)<\infty$. Let Conditions 2(a) and 2(b) hold. Then $n^{1/2}(\hat R-R)$ and the operator with kernel $M(\cdot,\cdot)^{1/2}\{\hat\rho(\cdot,\cdot)-\rho(\cdot,\cdot)\}$ are asymptotically distributed as mean zero Gaussian operators whose covariance operators $H'$, $H$ have kernels
\[
\eta'(s,t,u,v)=\nu(s,t)^{-1}\nu(u,v)^{-1}\theta(s,t,u,v)\{\zeta(s,t,u,v)-\rho(s,t)\rho(u,v)\},
\]
\[
\eta(s,t,u,v)=\nu(s,t)^{-1/2}\nu(u,v)^{-1/2}\theta(s,t,u,v)\{\zeta(s,t,u,v)-\rho(s,t)\rho(u,v)\},
\]
respectively, where $\zeta(s,t,u,v)=\mathrm{E}[\{X(s)-\mu(s)\}\{X(t)-\mu(t)\}\{X(u)-\mu(u)\}\{X(v)-\mu(v)\}]$. If, in addition, Condition 2(c) is satisfied, then $H'$ and $H$ can be consistently estimated by the operators $\hat H'$ and $\hat H$ with kernels $\hat\eta'(s,t,u,v)=\hat\nu(s,t)^{-1}\hat\nu(u,v)^{-1}\hat\theta(s,t,u,v)\{\hat\zeta(s,t,u,v)-\hat\rho(s,t)\hat\rho(u,v)\}$ and $\hat\eta(s,t,u,v)=\hat\nu(s,t)^{-1/2}\hat\nu(u,v)^{-1/2}\hat\theta(s,t,u,v)\{\hat\zeta(s,t,u,v)-\hat\rho(s,t)\hat\rho(u,v)\}$, respectively, i.e., $\mathrm{E}\|\hat H'-H'\|_2^2\to 0$ and $\mathrm{E}\|\hat H-H\|_2^2\to 0$, where $\hat\eta'(s,t,u,v)$ and $\hat\eta(s,t,u,v)$ are set to 0 whenever $\hat\nu(s,t)$ or $\hat\nu(u,v)$ is 0, $\hat\theta(s,t,u,v)=L(s,t,u,v)/n$ and $\hat\zeta(s,t,u,v)$ is the empirical fourth central moment of the functional random variable computed using all complete quadruples of function values at the arguments $s,t,u,v$.

The weak convergence in the theorem above is on the separable Hilbert space of Hilbert–Schmidt operators equipped with the Hilbert–Schmidt norm $\|\cdot\|_2$.
The limiting covariance operator H is an operator that maps a Hilbert–Schmidt operator F with kernel f (u, v) to an operator with kernel ∫ 1 0 ∫ 1 0 η(s, t, u, v)f (u, v)dudv, similarly for other objects in the theorem. Next, we study the estimators ˆλm and ˆϕm of the eigenvalues and eigenfunctions of R. The estimators are obtained by the eigendecomposition of ˆR. Their root-n consistency was established by Kraus [35, Proposition 2]. Here we find the approximate distribution of the fluctuation of the estimators around their true counterparts (with appropriate sign for the eigenfunctions as usual). Theorem 4. Assume that E(∥X1∥4 ) < ∞ and R has eigenvalues with multiplicity 1. Let Conditions 2(a) and 2(b) hold. Denote by H ′∞ a random operator following the limiting Gaussian distribution of n1/2 ( ˆR − R) with mean zero and covariance H′ given in Theorem 3. Then, for n → ∞, we obtain the following results: (i) n1/2 (ˆλm − λm) is asymptotically distributed as ⟨H ′∞ ϕm, ϕm⟩, which is a normal variable with mean zero and variance ∫ [0,1]4 ϕm(s)ϕm(t)η′ (s, t, u, v)ϕm(u)ϕm(v)dsdtdudv. (ii) n1/2 ( ˆϕm − ˆsmϕm), where ˆsm = sign⟨ˆϕm, ϕm⟩, is asymptotically distributed as the Gaussian random function QmH ′∞ ϕm, where Qm = ∞∑ k=1 k̸=m ϕk ⊗ ϕk λm − λk . The limiting covariance operator of n1/2 ( ˆϕm − ˆsmϕm) is ∞∑ k=1 k̸=m ∞∑ l=1 l̸=m ϕk ⊗ ϕl (λm − λk)(λm − λl) ∫ [0,1]4 ϕk(s)ϕm(t)η′ (s, t, u, v)ϕm(u)ϕl(v)dsdtdudv. If, additionally, Definition 2(c) is satisfied, then the limiting variance and covariance above can be consistently estimated by plugging-in estimates from Theorem 3. The theorem is proved in the Appendix with the help of perturbation theory. The theorem generalizes the classic results of Dauxois et al. [11] who considered completely observed functions. See Kokoszka and Reimherr [33] for related results for functional time series. In the case of complete Gaussian curves Dauxois et al. [11] showed that the limiting covariance structure of the empirical covariance operator simplifies [see also 46] which eventually leads to a simpler form of the limiting variance of the empirical eigenvalue, namely to 2λ2 m. No such simplification is in general possible in the case of incomplete curves, even if they are Gaussian. Therefore, to make inference about eigenvalues or eigenfunctions, e.g., to construct confidence intervals, one possibility is to estimate the function η′ (s, t, u, v) and use the complicated expressions above for the limiting covariance structure. In Section 4 we provide an alternative approach based on the bootstrap which enables to avoid the possibly unstable estimation of η′ and computer memory demanding storage and manipulation with the estimate. D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 589 3.2. Testing the equality of covariance operators We now study tests for equality of covariance operators of several populations. Let there be K independent samples of partially observed functions with mean µj and covariance Rj in the jth sample, as described in Section 2.2. We aim to test the null hypothesis that R1 = · · · = RK against the general alternative. The general problem of hypothesis testing for covariance operators was previously studied in various contexts by various methods. See, e.g., [3,4,7,20,25,26,30,31, 36,44,46,49–51,57,58]. 
Tests of the null hypothesis of equal covariance operators can be based on the differences between the estimators ˆRj and the null estimator ˆR which is the pooled covariance operator with kernel ˆρ(s, t) = K ∑ j=1 ˆwj(s, t) ˆρj(s, t), where ˆwj(s, t) = Mj(s, t) ∑K k=1 Mk(s, t) . The differences are expressed by the contrast operators with kernels Mj(·, ·)1/2 {ˆρj(·, ·) − ˆρ(·, ·)}. We propose two types of tests measuring the importance of the contrasts: one approach is based on the Hilbert–Schmidt norm of the contrasts and one is based on their projections on a subspace. The first approach is inspired by methods that were previously considered in the case of fully observed functions, e.g., by Boente et al. [4]. The importance of the contrasts is expressed by the Hilbert–Schmidt norm. The test statistic takes the form SHS = K ∑ j=1 ∥Mj(·, ·)1/2 {ˆρj(·, ·) − ˆρ(·, ·)}∥2 2 = K ∑ j=1 ∫ [0,1]2 Mj(s, t){ˆρj(s, t) − ˆρ(s, t)}2 dsdt (4) (in this notation we identify kernels and the corresponding operators). The second approach uses projections of the contrasts onto a finite-dimensional subspace of the space of Hilbert–Schmidt operators. This type of tests was used for complete functions in various settings, e.g., by Horváth et al. [27], Panaretos et al. [46], Panaretos et al. [47], Kraus and Panaretos [36], Fremdt et al. [20], and Jarušková [30]. It is natural to project on the subspace generated by the leading eigenfunctions of ˆR because they carry information about the object of interest, the covariance operator (unlike in the case of mean functions where we prefer to use a fixed basis for the projection test). Let ˆϕ1, . . . , ˆϕd be the first d eigenfunctions of ˆR. Then the operators ˆUlm = { ˆϕl ⊗ ˆϕl, l = m, ( ˆϕl ⊗ ˆϕm + ˆϕm ⊗ ˆϕl)/21/2 , l < m with kernels ˆull(s, t) = ˆϕl(s) ˆϕl(t) and ˆulm(s, t) = {ˆϕl(s) ˆϕm(t) + ˆϕm(s) ˆϕl(t)}/21/2 , l < m form an orthonormal basis of a d(d + 1)/2-dimensional subspace of HS(L2 ([0, 1])). The Fourier coefficients of the projection of the jth standardized contrast on this subspace are Rjlm = ⟨Mj(·, ·){ˆρj(·, ·) − ˆρ(·, ·)}/n 1/2 j , ˆUlm⟩ = ∫ [0,1]2 Mj(s, t){ˆρj(s, t) − ˆρ(s, t)}ˆulm(s, t)dsdt/n 1/2 j . (5) Denote by R the Kd(d + 1)/2-dimensional score vector with components Rjlm, j ∈ {1, . . . , K}, 1 ≤ l ≤ m ≤ d. The test statistic measures the size of the projection of the contrast operators on the subspace. It takes the form Sd = R ˆW− R, (6) where ˆW− is the Moore–Penrose pseudoinverse of the estimator of the asymptotic covariance matrix whose entry with indices (jlm, kpq) is ˆWjlm,kpq = ⟨ˆνj(·, ·)1/2 ˆulm(·, ·), ˆBjk{ˆνk(·, ·)1/2 ˆupq(·, ·)}⟩ = ∫ [0,1]4 ˆνj(s, t)1/2 ˆulm(s, t) ˆβjk(s, t, u, v)ˆupq(u, v)ˆνk(u, v)1/2 dsdtdudv, (7) j, k = 1, . . . , K, 1 ≤ l ≤ m ≤ d, 1 ≤ p ≤ q ≤ d. The kernel of ˆBjk is ˆβjk(s, t, u, v) = K ∑ l=1 {δjl − Mj(s, t)1/2 ˆwl(s, t)Ml(s, t)−1/2 }ˆηl(s, t, u, v) × {δkl − Mk(u, v)1/2 ˆwl(u, v)Ml(u, v)−1/2 }. (8) We now give the asymptotic distribution of the Hilbert–Schmidt and projection statistics. 590 D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 Theorem 5. For j ∈ {1, . . . , K} assume that nj → ∞, nj/(n1 + · · · + nK ) → aj > 0, E ∥Xj1∥4 < ∞ and all eigenvalues of Rj have multiplicity 1. Let the observation patterns in each group satisfy Definition 2. 
Then under the null hypothesis of equal covariance operators we obtain the following results:

(i) The test statistic $S_{\mathrm{HS}}$ is asymptotically distributed as $\sum_{k=1}^\infty \delta_k C_k$, where the $C_k$ are independent chi-square distributed variables with one degree of freedom and the weights $\delta_k$ can be consistently estimated by the eigenvalues of the operator $\hat{\mathcal{B}}$ whose kernels are given in (8).

(ii) The test statistic $S_d$ is asymptotically chi-square distributed with $(K-1)d(d+1)/2$ degrees of freedom.

The asymptotic distribution of $S_{\mathrm{HS}}$ can be approximated by simulation as in Boente et al. [4]. Section 4 presents a practical bootstrap implementation of these tests in which it is not necessary to compute the operator $\hat{\mathcal{B}}$.

Tests based directly on covariance operators are not the only option. As an alternative we explore the approach of Pigoli et al. [50], who argue that although covariance operators are contained in the Hilbert space of Hilbert–Schmidt operators, they do not form a linear subspace, and who propose distances other than those based on the difference of covariances, such as the Procrustes distance and the square root distance. This direction of research was further investigated by Cabassi et al. [7] and Masarotto [44]. One of the proposals of Pigoli et al. [50] was to use the Hilbert–Schmidt distance between square root covariance operators, $d_{\mathrm{sqrt}}(R_1, R_2) = \|R_1^{1/2} - R_2^{1/2}\|_2$. They report good power for a two-sample test of equal covariances in the setting of complete functions based on this distance between estimated operators, $d_{\mathrm{sqrt}}(\hat R_1, \hat R_2)$. We extend this approach to $K$ samples consisting of partially observed functions.

Since the data may contain incomplete functions, the empirical covariance operators $\hat R_j$ used before may have negative eigenvalues. To be able to work with empirical square root covariance operators, we need to modify the covariance estimators to ensure that they are non-negative definite. We use
\[
\hat R_{j+} = \sum_{l=1}^{n_j} (\hat\lambda_{jl})_+ \,\hat\phi_{jl} \otimes \hat\phi_{jl},
\]
where $(\hat\lambda_{jl})_+ = \max(\hat\lambda_{jl}, 0)$ is the positive part of the eigenvalue $\hat\lambda_{jl}$ of $\hat R_j$ and $\hat\phi_{jl}$ is the corresponding eigenfunction. As discussed in Kraus [35], negative eigenvalues are typically of small magnitude in comparison with the leading eigenvalues and are therefore negligible in practice.

For a test statistic, we need to use the distance $d_{\mathrm{sqrt}}$ to define a null estimator of $R$ and contrasts between the group estimators $\hat R_{j+}$ and the null estimator. The common covariance operator can be estimated by
\[
\hat R_{\mathrm{sqrt}} = \left( \frac{\sum_{j=1}^K n_j \hat R_{j+}^{1/2}}{\sum_{j=1}^K n_j} \right)^{\!2},
\]
which is the weighted Fréchet mean of the group-specific operators, i.e., the minimizer with respect to $R$ of $\sum_{j=1}^K n_j\, d_{\mathrm{sqrt}}(\hat R_{j+}, R)^2$. The attained minimum of this objective function,
\[
S_{\mathrm{sqrt}} = \sum_{j=1}^K n_j\, d_{\mathrm{sqrt}}(\hat R_{j+}, \hat R_{\mathrm{sqrt}})^2
= \sum_{j=1}^K \big\| n_j^{1/2} (\hat R_{j+}^{1/2} - \hat R_{\mathrm{sqrt}}^{1/2}) \big\|_2^2, \qquad (9)
\]
can serve as a test statistic for comparing covariance operators in $K$ samples. The statistic summarizes the size of the contrasts between the group and null estimators of the square root covariance operator. Following Pigoli et al. [50], we use resampling to approximate the null distribution of the statistic.

Notice that the contrasts between the group and null estimators in $S_{\mathrm{sqrt}}$ and $S_{\mathrm{HS}}$ are weighted differently. In $S_{\mathrm{HS}}$ we weight the contrast kernels by $M_j(s,t)^{1/2}$, which in the fragmentary setting reflects the accuracy of the estimation of the covariance kernel at each point of $[0,1]^2$ due to the number of observations available at that point. In $S_{\mathrm{sqrt}}$ this would not be meaningful because the square root covariance operator is a function of the entire covariance operator, and thus the accuracy of the estimation of the square root covariance kernel at one point depends also on the numbers of available observations at all other points. We therefore simply weight by $n_j^{1/2}$, reflecting the overall accuracy of the square root covariance estimator. Both $S_{\mathrm{HS}}$ and $S_{\mathrm{sqrt}}$ are the attained minima of the corresponding objective functionals that define the null estimators.
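To make the two statistics concrete, the following hedged R sketch evaluates discretized versions of $S_{\mathrm{HS}}$ in (4) and $S_{\mathrm{sqrt}}$ in (9). It assumes that each group has been summarized by a list with entries `rho` (the estimated covariance kernel on a $q$-point grid), `M` (the pairwise-complete counts) and `n` (the group size), for instance the output of the hypothetical `cov_fragments` helper above; these names and conventions are ours, not the paper's.

```r
# Hedged sketch: discretized versions of S_HS (4) and S_sqrt (9); h is the grid spacing.
psd_sqrt <- function(A, h) {
  e <- eigen(A * h, symmetric = TRUE)
  v <- pmax(e$values, 0)                              # positive-part modification R_{j+}
  (e$vectors %*% (sqrt(v) * t(e$vectors))) / h        # kernel of the square root operator
}

cov_test_stats <- function(fits, h) {
  W    <- Reduce(`+`, lapply(fits, `[[`, "M"))
  rho0 <- Reduce(`+`, lapply(fits, function(f) f$M * f$rho)) / pmax(W, 1)  # pooled kernel
  S_HS <- sum(sapply(fits, function(f) sum(f$M * (f$rho - rho0)^2))) * h^2 # statistic (4)
  roots <- lapply(fits, function(f) psd_sqrt(f$rho, h))
  n     <- sapply(fits, `[[`, "n")
  root0 <- Reduce(`+`, Map(`*`, roots, n)) / sum(n)   # weighted Frechet mean on the sqrt scale
  S_sqrt <- sum(unlist(Map(function(r, nj) nj * sum((r - root0)^2) * h^2, roots, n)))  # (9)
  c(S_HS = S_HS, S_sqrt = S_sqrt)
}
```

The square-root kernel is obtained here by truncating negative eigenvalues before taking the root, mirroring the positive-part modification described above.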
4. Practical implementation and bootstrap approximations

Functional data procedures are implemented in practice by discretization. Functional observations are evaluated at $q$ points of a grid in the domain. Functions then correspond to $q$-vectors (possibly with missing values), operators on the function space correspond to $(q \times q)$-matrices, and operators on operators correspond to four-way arrays with all dimensions equal to $q$. To make inference (tests and confidence intervals), one can use the asymptotic distributions found in the previous section. However, the implementation of such procedures would be excessively demanding in terms of computer memory, especially in the case of covariance inference. For example, when the evaluation grid consists of $q = 100$ points, arrays such as the one corresponding to the fourth moment kernel $\zeta(s,t,u,v)$ contain $q^4 = 10^8$ entries. To compare covariances in, e.g., $K = 3$ samples, one would have to work with an array with $K^2 q^4 = 9 \times 10^8$ entries, whose size already approaches the memory limits of usual computers, even if symmetry is exploited. In the case of multivariate, spatial or image data the number of evaluation points $q$ is typically much larger than for functions of a one-dimensional argument; Aston et al. [1] give an example of acoustic phonetic data with a bivariate, time–frequency argument and $q = 8100$. In conclusion, the size of the objects representing the asymptotic covariance structure for tests or confidence intervals may be far beyond memory limits.
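For orientation, the following lines reproduce the storage arithmetic above for dense double-precision arrays; this is our back-of-the-envelope calculation, not code from the paper.

```r
# Storage required for dense fourth-order arrays with 8-byte doubles (our arithmetic).
q <- 100; K <- 3
q^4 * 8 / 2^30        # one q x q x q x q array: about 0.75 GiB
K^2 * q^4 * 8 / 2^30  # K^2 such arrays for comparing covariances: about 6.7 GiB
8100^4                # number of entries for the q = 8100 example: about 4.3e15, infeasible
```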
Projection covariance tests for complete functions can avoid the computation, storage and manipulation of such large arrays by computing the principal scores of each function with respect to the required low number $d$ of eigenfunctions [20,27,46,47]. The covariance matrix of the scores then depends on easy-to-handle $d$-dimensional four-way arrays instead of large $q$-dimensional four-way arrays. This dimension reduction approach is not applicable in the case of incomplete functions because the principal scores $\langle X_{ji} - \hat\mu_j, \hat\phi_m \rangle$ cannot be computed when $X_{ji}$ is available only on a subset of its domain [they can only be predicted, see 35]. Therefore, even the computation of the projection test statistic (6) is difficult due to the large arrays that the matrix $\hat W$ depends on. The computation of the Hilbert–Schmidt statistic (4) and the square root covariance statistic (9) does not involve large four-way arrays. However, to use the asymptotic distribution of $S_{\mathrm{HS}}$ (see Theorem 5) one needs to estimate the eigenvalues of an operator on operators. Upon discretization and vectorization, this leads to a large eigenproblem of dimension $(Kq^2) \times (Kq^2)$, e.g., $30\,000 \times 30\,000$ for $K = 3$, $q = 100$. Again, dimension reduction cannot be used because of the incomplete functions.

To overcome these difficulties we use the bootstrap. For completely observed functional data, bootstrap tests of equal mean functions or covariance operators were studied by Benko et al. [3] and Paparoditis and Sapatinas [48,49]. In our missing data setting, all bootstrap procedures consist of appropriate resampling of fragmentary curves, which means that each bootstrap sample is again a collection of partially observed functions. The proposed procedures make it possible to avoid entirely the computation of each entry of the large four-way covariance array, as well as the storage and decomposition of the whole array.

The implementation of the tests of equal means is described in Algorithm 1. To correctly reproduce the limiting distribution of the group mean estimators under the null, the resampling is done separately in each group of groupwise centred fragmentary observations. The stratification guarantees that neither the missingness patterns nor distributional characteristics of the functions beyond the means need to be equal in all groups. The $L^2$ statistic is computed directly for each bootstrap sample and the observed value is then compared with the resampled values. The direct computation of the projection test statistic from observed or resampled data would require the estimation of the covariance functions $\hat v_{jk}$ in (3), which may be memory demanding and possibly unstable in regions with few complete pairs. We avoid this by estimating the covariance matrix of the score vector from the resampled score vectors, calculating the quadratic form statistic using the observed score vector and the bootstrap estimate of its covariance matrix, and comparing it with its asymptotic chi-square distribution.

Algorithm 1 Bootstrap approximation for tests of equal mean functions
1: Calculate $\hat\mu_j$ from the observed samples of fragments $X_{j1}, \dots, X_{jn_j}$, $j = 1, \dots, K$, and $\hat\mu$
2: Calculate the test statistic $T_{L^2}$ and the score vector $Q$
3: Set $X_{ji0} = X_{ji} - \hat\mu_j + \hat\mu$
4: For $b = 1, \dots, B$
5:   For each $j = 1, \dots, K$, sample with replacement from the fragments $X_{j10}, \dots, X_{jn_j0}$ to obtain fragments $X^*_{j10}, \dots, X^*_{jn_j0}$
6:   Calculate the statistic $T^{*(b)}_{L^2}$ and the score vector $Q^{*(b)}$ from $X^*_{j10}, \dots, X^*_{jn_j0}$, $j = 1, \dots, K$
7: Approximate the p-value of the $L^2$ test using $T_{L^2}$ and $T^{*(1)}_{L^2}, \dots, T^{*(B)}_{L^2}$
8: Calculate the empirical covariance matrix $\hat V^*$ of $Q^{*(1)}, \dots, Q^{*(B)}$ and the statistic $T_d = Q^\top \hat V^{*-} Q$
9: Approximate the p-value of the projection test using $T_d$ and the $\chi^2_{(K-1)d}$ distribution
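As a concrete illustration of Algorithm 1, the following R sketch implements the stratified bootstrap for the $L^2$ statistic only. It is a simplified illustration rather than the code accompanying the paper: fragments of group $j$ are assumed to be stored as a matrix with one row per curve and `NA` at unobserved grid points, the statistic omits the standardization by the scale factors $\hat r_j$ used in the definition of $T_{L^2}$, and all function names are ours.

```r
# Hedged sketch of Algorithm 1 (L2 statistic only); X_list is a list of K matrices,
# h is the grid spacing. Assumes every grid point is observed by at least one curve per group.
mean_fragments <- function(X) colMeans(X, na.rm = TRUE)

l2_stat <- function(X_list, h) {
  mus <- lapply(X_list, mean_fragments)
  Ns  <- lapply(X_list, function(X) colSums(!is.na(X)))      # N_j(t): curves observed at t
  mu0 <- Reduce(`+`, Map(`*`, mus, Ns)) / Reduce(`+`, Ns)    # pooled (null) mean
  sum(unlist(Map(function(m, N) N * (m - mu0)^2, mus, Ns))) * h
}

boot_mean_test <- function(X_list, h, B = 500) {
  T_obs <- l2_stat(X_list, h)
  mus <- lapply(X_list, mean_fragments)
  Ns  <- lapply(X_list, function(X) colSums(!is.na(X)))
  mu0 <- Reduce(`+`, Map(`*`, mus, Ns)) / Reduce(`+`, Ns)
  # step 3 of Algorithm 1: centre groupwise, then shift to the pooled mean
  X0 <- Map(function(X, m) sweep(sweep(X, 2, m), 2, mu0, FUN = "+"), X_list, mus)
  T_boot <- replicate(B, {
    Xb <- lapply(X0, function(X) X[sample(nrow(X), replace = TRUE), , drop = FALSE])
    l2_stat(Xb, h)                          # stratified resampling of fragments (steps 5-6)
  })
  mean(T_boot >= T_obs)                     # bootstrap p-value of the L2 test (step 7)
}
```

The resampled matrices keep their `NA` patterns, so each bootstrap sample is again a collection of partially observed functions, as required by the algorithm.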
Algorithm 2 describes the bootstrap implementation of confidence intervals for the eigenelements. Resampling is applied to the fragments and the eigenelements are computed. The resampled eigenfunction is reflected about zero if necessary so that its sign agrees with that of the empirical eigenfunction of the observed data. Standard methods for constructing confidence intervals can then be used. Since we again wish to avoid the calculation of variance estimates of the eigenelements (see Theorem 4), we use the normal or basic bootstrap method [12, Chapter 5]. Intervals for eigenvalues are constructed on the logarithmic scale and then transformed back to the original scale. This is appropriate in general because in the case of completely observed Gaussian curves the asymptotic variance of $n^{1/2}(\hat\lambda_m - \lambda_m)$ is $2\lambda_m^2$, and thus the log-transformation approximately stabilizes the variance.

Algorithm 2 Bootstrap confidence intervals for eigenvalues and eigenfunctions
1: Calculate $\hat R$ from the observed fragmentary functional data $X_1, \dots, X_n$
2: Calculate the eigenvalues $\hat\lambda_m$ and eigenfunctions $\hat\phi_m$ of $\hat R$
3: For $b = 1, \dots, B$
4:   Sample with replacement from the fragments $X_1, \dots, X_n$ to obtain fragments $X^*_1, \dots, X^*_n$
5:   Calculate $\hat R^*$ from $X^*_1, \dots, X^*_n$ and its eigenvalues $\hat\lambda^{*(b)}_m$ and eigenfunctions $\hat\phi^{*(b)}_m$
6:   Replace $\hat\phi^{*(b)}_m$ by $\operatorname{sign}\langle \hat\phi^{*(b)}_m, \hat\phi_m \rangle\, \hat\phi^{*(b)}_m$
7: Based on $\hat\lambda^{*(b)}_m$, $\hat\phi^{*(b)}_m$, $b = 1, \dots, B$, calculate bootstrap confidence intervals for $\lambda_m$ using the log-transformation and pointwise bootstrap confidence intervals for $\phi_m(t)$

Bootstrap covariance testing is described in Algorithm 3. Unlike in the case of mean testing, it is not possible to transform the data to the common null covariance structure and use stratified resampling. Bootstrap samples are instead drawn from the pooled sample of groupwise centred fragments, similarly to Paparoditis and Sapatinas [49, Subsection 2.2] for complete curves. Then, under the null hypothesis, if the characteristics of the observation patterns ($\theta_j$) and the fourth order moments ($\zeta_j$) are the same in all groups, the pooled resampling asymptotically replicates the limiting distributions of interest. The Hilbert–Schmidt norm and square root covariance statistics are computed directly, and significance is decided by comparing the observed statistics with the resampled ones. As in the case of mean testing, dimension reduction is impossible due to partial observation, and thus the computation of the covariance matrix of the score vector would require computing large four-way arrays. Instead, the bootstrap is used to estimate the covariance matrix of the score, and the quadratic statistic with this matrix is used.

Algorithm 3 Bootstrap approximation for tests of equal covariance operators
1: Calculate $\hat\mu_j$ and $\hat R_j$ from the observed samples of fragments $X_{j1}, \dots, X_{jn_j}$, $j = 1, \dots, K$, and $\hat R$
2: Perform the eigendecomposition of $\hat R$, determine $d$ and calculate $\hat U_{lm}$, $1 \le l \le m \le d$
3: Calculate the test statistics $S_{\mathrm{HS}}$ and $S_{\mathrm{sqrt}}$ and the score vector $\mathcal{R}$ with respect to $\hat U_{lm}$
4: Set $X_{ji0} = X_{ji} - \hat\mu_j$
5: For $b = 1, \dots, B$
6:   For each $j = 1, \dots, K$, sample with replacement from the pooled collection of fragments $X_{ji0}$, $j = 1, \dots, K$, $i = 1, \dots, n_j$, to obtain fragments $X^*_{j10}, \dots, X^*_{jn_j0}$
7:   Calculate the statistics $S^{*(b)}_{\mathrm{HS}}$ and $S^{*(b)}_{\mathrm{sqrt}}$ and the score vector $\mathcal{R}^{*(b)}$ with respect to $\hat U_{lm}$ from $X^*_{j10}, \dots, X^*_{jn_j0}$, $j = 1, \dots, K$
8: Approximate the p-value of the Hilbert–Schmidt norm test using $S_{\mathrm{HS}}$ and $S^{*(1)}_{\mathrm{HS}}, \dots, S^{*(B)}_{\mathrm{HS}}$, and the p-value of the square root covariance test using $S_{\mathrm{sqrt}}$ and $S^{*(1)}_{\mathrm{sqrt}}, \dots, S^{*(B)}_{\mathrm{sqrt}}$
9: Calculate the empirical covariance matrix $\hat W^*$ of $\mathcal{R}^{*(1)}, \dots, \mathcal{R}^{*(B)}$ and the statistic $S_d = \mathcal{R}^\top \hat W^{*-} \mathcal{R}$
10: Approximate the p-value of the projection test using $S_d$ and the $\chi^2_{(K-1)d(d+1)/2}$ distribution

While we do not provide formal proofs of the validity of the bootstrap approximations, these could be obtained along the lines of the proofs in Paparoditis and Sapatinas [48] and Paparoditis and Sapatinas [49] using our asymptotic results (Theorems 1–5). Note that in our setting the observation sets might be non-identically distributed (e.g., in the case of designed experiments), and hence the bootstrap is applied to possibly non-identically distributed observed fragments. Their average characteristics, however, converge under Definitions 1 and 2. It is possible to use the bootstrap even with mildly non-identically distributed data, as discussed in a general context by Liu [41], who shows that if the average moment characteristics of possibly non-identically distributed variables converge, the bootstrap is still applicable. The use of the bootstrap for the square root covariance test is based on empirical evidence from simulation studies (Section 5 and the Supplementary Material). Its theoretical justification would first require establishing the asymptotic distribution of the estimated square root covariance operator, which is not available even in the case of completely observed curves [50].
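A hedged sketch of Algorithm 3 restricted to the statistics $S_{\mathrm{HS}}$ and $S_{\mathrm{sqrt}}$ is given below; it reuses the hypothetical helpers `cov_fragments` and `cov_test_stats` introduced earlier and is not the paper's implementation (in particular, the projection statistic and the choice of $d$ are omitted).

```r
# Hedged sketch of Algorithm 3 (Hilbert-Schmidt and square root statistics only).
# X_list: list of K matrices of fragments (NA = unobserved); grid: common evaluation grid.
boot_cov_test <- function(X_list, grid, B = 500) {
  h <- mean(diff(grid))
  fits <- lapply(X_list, function(X) { f <- cov_fragments(X, grid); f$n <- nrow(X); f })
  S_obs <- cov_test_stats(fits, h)
  # step 4: centre each group at its own mean, then pool the centred fragments
  X0 <- do.call(rbind, Map(function(X, f) sweep(X, 2, f$mu), X_list, fits))
  n  <- sapply(X_list, nrow)
  S_boot <- replicate(B, {
    Xb <- lapply(n, function(nj) X0[sample(nrow(X0), nj, replace = TRUE), , drop = FALSE])
    fb <- lapply(Xb, function(X) { f <- cov_fragments(X, grid); f$n <- nrow(X); f })
    cov_test_stats(fb, h)                    # resampled statistics (step 7)
  })                                         # 2 x B matrix
  rowMeans(S_boot >= S_obs)                  # bootstrap p-values for S_HS and S_sqrt (step 8)
}
```

The pooled resampling in step 6 is reproduced by sampling each group's bootstrap fragments from the single matrix of centred fragments of all groups.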
5. Simulation results

The main goal of the study is to investigate the impact of partial observation on the performance of the different mean and covariance tests and to compare the proposed tests using complete and incomplete curves with the simple approach using complete curves only.

We repeatedly generate three samples of curves of sizes $n_1 = 80$, $n_2 = 100$, $n_3 = 120$. Curves in the $j$th sample take the form
\[
X(t) = \mu_j(t) + \lambda_{j0}^{1/2}\beta_{j0}h_j(t) + \sum_{k=1}^{20} \lambda_{jk}^{1/2}\beta_{jk}\,2^{1/2}\cos(k\pi t), \qquad t \in [0,1],
\]
where $\beta_{jk}$, $j \in \{1,2,3\}$, $k \in \{0,\dots,20\}$, are mutually independent standard normal variables. Additional simulations with $t_5$ distributed coefficients are reported in the Supplementary Material. In all simulations we use 1000 repetitions of the test procedures, each based on 500 bootstrap samples. All tests are performed at the nominal level of 5%. All results have been computed in R 3.4.

The tests are applied to complete trajectories, observation pattern (1), and to fragments obtained by deleting missing periods following several random or nonrandom patterns. Observation patterns (2) and (3) are nonrandom: under pattern (2), the period $[0, 0.5]$ is removed from 50% of the curves in the first sample, 50% in the second sample and 60% in the third sample; pattern (3) is symmetric about 0.5, i.e., the period $[0.5, 1]$ instead of $[0, 0.5]$ is missing in the same subset of curves. Under patterns (4)–(7), a random missing period is generated independently for each curve and removed from the trajectory. First, we consider random missing periods of the form $M = [C - E, C + E] \cap [0,1]$ with $C = dU_1^{1/2}$ and $E = fU_2$, where $U_1, U_2$ are independent variables uniformly distributed on $[0,1]$ and $d, f$ are parameters. For missingness pattern (4) we set $d = 1.4$ and $f = 0.2$; this gives 39% of completely observed curves, and the cross-sectional percentage of observed values decreases from 99% at time 0 to 79% at time 1. Pattern (5) is symmetric about 0.5. For pattern (6) we use the same model as for (4) and set $d = 1.2$ and $f = 0.5$; this leads to 7% of complete curves, and the cross-sectional probability of observation is 94% at 0 and decreases to about 45% near 1. Pattern (7) is again obtained by reflecting pattern (6) about 0.5. Pattern (8) consists of observation periods generated independently for each curve in the form $O = [U_1, U_2] \cap [0,1]$, where $U_1, U_2$ are independent variables uniformly distributed on $[a, C]$ and $[C, 1-a]$, respectively, $a = -0.3$ and $C$ is uniformly distributed on $[0,1]$; the percentage of complete curves in this case is 16%, and the cross-sectional observation probability is 77% at 0.5 and decreases to 44% towards both endpoints of the domain. Finally, for pattern (9) curves are observed on random intervals generated as $[C - 0.2, C + 0.2] \cap [0,1]$, where $C$ is uniformly distributed on $[0,1]$. This corresponds to fragments of curves of length at most 0.4, hence the datasets contain no complete curves; the median length of the observed fragments is 0.3, and the cross-sectional probability of observation is 0.3 in the middle of the domain and decreases towards the endpoints, where it is 0.15.
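For concreteness, the following hedged R sketch generates one sample from the model above together with random missing periods of the type used in patterns (4)–(7); the function name `simulate_sample` and its argument defaults (configuration A, pattern (4)) are our own choices, not part of the paper.

```r
# Hedged sketch of the simulation model with random missing periods M = [C-E, C+E] n [0,1].
simulate_sample <- function(n, grid, mu = function(t) 0 * t,
                            lambda0 = 0.5, hfun = function(t) rep(1, length(t)),
                            d = 1.4, f = 0.2) {
  q <- length(grid)
  X <- matrix(NA_real_, n, q)
  for (i in 1:n) {
    beta <- rnorm(21)                                  # beta_{j0}, ..., beta_{j,20}
    x <- mu(grid) + sqrt(lambda0) * beta[1] * hfun(grid)
    for (k in 1:20)
      x <- x + sqrt(3^(-k)) * beta[k + 1] * sqrt(2) * cos(k * pi * grid)
    C <- d * sqrt(runif(1)); E <- f * runif(1)         # centre and half-length of the gap
    keep <- grid < C - E | grid > C + E                # values inside the gap stay NA
    X[i, keep] <- x[keep]
  }
  X
}
```

For instance, `X1 <- simulate_sample(80, seq(0, 1, length.out = 101))` produces the first sample under configuration A and missingness pattern (4).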
In the study of the mean tests, four configurations of the mean functions are considered. Under configuration A the null hypothesis is satisfied: all mean functions are zero. Under configuration B the mean functions differ by a constant vertical shift: $\mu_1(t) = 0$, $\mu_2(t) = 0.18$, $\mu_3(t) = -0.1$. Under configuration C there are monotonic differences between the means: $\mu_1(t) = 0$, $\mu_2(t) = 0.35\exp(-4t)$, $\mu_3(t) = -0.25\exp(-3t)$. Under configuration D the means differ in a more complex, nonmonotonic way and they cross: $\mu_1(t) = 0$, $\mu_2(t) = 2t\exp(-3t)$, $\mu_3(t) = 0.1 - 8t^2\exp(-5t)$. We set $\lambda_{j0} = 0.5$, $\lambda_{jk} = 3^{-k}$ and $h_j(t) = 1$, that is, the covariance structure is the same in all three groups. Additional simulations with unequal covariance structures lead to similar results and are included in the Supplementary Material.

Table 1
Empirical rejection probability (in %) of the $L^2$ test, $T_{L^2}$, and the projection test, $T_d$, of equal means. A dash indicates the same value as on the preceding row. The observation patterns (1)–(9) and mean configurations A–D are described in the text.

                                A            B            C            D
Observation pattern          TL2   Td     TL2   Td     TL2   Td     TL2   Td
Tests using complete and incomplete curves (proposed approach)
(1)                          5.6   6.2    69    60     49    56     52    63
(2)                          5.4   6.7    59    52     28    29     38    50
(3)                          –     –      –     –      50    56     44    62
(4)                          4.4   6.5    66    58     51    57     51    62
(5)                          –     –      –     –      44    49     50    58
(6)                          5.4   7.1    58    51     50    55     42    49
(7)                          –     –      –     –      28    34     37    42
(8)                          5.4   5.8    55    47     34    37     42    48
(9)                          5.4   7.8    37    40     20    23     26    34
Tests using complete curves only (simple approach)
(2), (3)                     5.7   7.4    40    34     26    32     27    35
(4), (5)                     3.6   7.4    28    27     18    26     19    28
(6), (7)                     4.9   26.8    7    31      6    29      6    31
(8)                          4.0   11.5   13    22      8    20     10    21

We report in the first part of Table 1 the size and power of the $L^2$ test based on $T_{L^2}$ given in (1) and of the projection test based on $T_d$ given in (2) using $d = 3$ Legendre polynomials of order zero, one and two. Dashes in the table indicate situations where the true rejection probability is the same as in the entry above; such situations arise when the observation pattern is obtained by reflecting the preceding pattern and the processes $\{X(t) : t \in [0,1]\}$ and the time-reversed processes $\{X(1-t) : t \in [0,1]\}$ have the same distribution. We see in the first part of Table 1 that under the null hypothesis, configuration A, the rejection probability of the $L^2$ test is close to the nominal level. The size of the projection test seems to be somewhat above the nominal level due to the sample size, especially under observation pattern (9), where the missingness rate is the highest.
Our simulation study of power provides raw rejection probabilities in Table 1 and size-adjusted powers (using the method from Subsection 3.2 of Lloyd [42]) in Table S2 in the Supplementary Material. The possibility of size issues should be kept in mind in applications: especially in marginal cases, users should not simply compare p-values with a single threshold but should rather report them carefully.

Under scenario B the $L^2$ test is more powerful than the projection method. The reason is that the projection method uses, in addition to the constant basis function, two other terms (linear and quadratic) that do not contribute to the detection of the constant difference between the means but, on the other hand, increase the degrees of freedom and hence decrease the power. The $L^2$ method uses infinitely many directions in the space of alternatives, but these redundant features are downweighted by the decreasing eigenvalues (the constant difference of means agrees with the constant leading eigenfunction, which receives the highest weight in the $L^2$ statistic). Most partial observation patterns lead to a relatively small decrease of power because under this scenario the mean functions differ by a constant vertical shift, which is a very simple, global feature that is easily detected even with reduced, fragmented data. The loss of power is largest under pattern (9), where the reduction of observed data is also considerably larger than under the other patterns.

Both tests have comparable power under scenario C. Both tests lose power under observation pattern (2) because a large portion of data is missing on the interval $[0, 0.5]$, where the difference between the means is the largest; on the other hand, the reflected pattern (3) does not lead to a loss of power because curves are missing only in $[0.5, 1]$, where the means do not differ much. A similar effect is seen under observation patterns (6) and (7). Under scenario D the projection test seems to be slightly more powerful than the $L^2$ test (even after the size adjustment in Table S2 in the Supplementary Material) because the nonmonotonic differences between the mean functions are well captured by both the first three Legendre polynomials and the first three eigenfunctions, but the contribution of the latter is downweighted in the $L^2$ statistic, whereas the projection statistic treats all three components equally.

The second part of Table 1 shows, for each missingness pattern and mean configuration, the performance of the tests applied to the subset of complete curves only. The complete-curve approach would be the only possibility if the tests developed in this paper were not available. Results for the pairs of patterns (2) and (3), (4) and (5), (6) and (7) are presented on the same rows of the second part of the table because the subsets of complete curves are the same under both patterns in each pair. Pattern (9) is omitted because it contains no complete curves and hence inference is impossible without our methods.
Under patterns (2) (or (3)) and (4) (or (5)), the use of complete curves only, which form 46% and 39% of the whole sample, respectively, leads to a considerable loss of power in most situations. Configuration C under pattern (2) is an exception: here removing the incomplete curves does not decrease the power because they are observed on the subdomain $[0.5, 1]$, where the means do not differ much. Under patterns (6) (or (7)) and (8) there are only 7% and 16% complete curves, respectively. With such small sample sizes the projection test becomes unreliable in terms of level and the $L^2$ test loses almost all power.

Next, we study the behaviour of the tests for comparing covariance operators. Under all scenarios we generate mean zero trajectories. Configuration A satisfies the null hypothesis with $\lambda_{j0} = 0.5$, $\lambda_{jk} = 3^{-k}$ and $h_j(t) = 1$, $j \in \{1,2,3\}$. Under configuration B the same parameters are used except for the third sample, where the overall scale is larger, namely $\lambda_{3,0} = 1.5 \times 0.5$ and $\lambda_{3,k} = 1.5 \times 3^{-k}$. Under scenario C the first two eigenvalues in the third sample are interchanged, i.e., $\lambda_{3,0} = 3^{-1}$, $\lambda_{3,1} = 0.5$ and $\lambda_{3,k} = 3^{-k}$, $k \in \{2,\dots,20\}$; otherwise the parameters are the same as in A. Scenario D differs from A in that we set $h_3(t) = 1$ for $t \in [0, 0.5]$ and $h_3(t) = 2.2^{1/2}$ for $t \in (0.5, 1]$.

Table 2
Empirical rejection probability (in %) of the Hilbert–Schmidt norm test, $S_{\mathrm{HS}}$, projection test, $S_d$, and square root covariance test, $S_{\mathrm{sqrt}}$, of equal covariance operators. A dash indicates the same value as on the preceding row. The observation patterns (1)–(5) and covariance configurations A–D are described in the text.

                                  A                  B                  C                  D
Observation pattern         SHS  Sd   Ssqrt    SHS  Sd   Ssqrt    SHS  Sd   Ssqrt    SHS  Sd   Ssqrt
Tests using complete and incomplete curves (proposed approach)
(1)                         5.4  5.8  4.8      69   82   80       69   58   69       78   62   81
(2)                         4.6  6.4  4.9      54   63   41       37   32   38       76   64   54
(3)                         –    –    –        –    –    –        –    –    –        46   30   48
(4)                         5.0  5.1  5.8      64   74   72       61   53   62       72   56   73
(5)                         –    –    –        –    –    –        –    –    –        77   60   77
Tests using complete curves only (simple approach)
(2), (3)                    4.1  7.3  4.6      32   38   41       33   28   34       45   30   47
(4), (5)                    4.3  5.5  4.2      26   32   33       25   24   28       34   23   36

Table 2 shows the size and power of the Hilbert–Schmidt norm test based on $S_{\mathrm{HS}}$ in (4), the projection test based on $S_d$ in (6) with $d$ selected to explain at least 85% of the total variability of the null covariance estimate, and the square root covariance test based on $S_{\mathrm{sqrt}}$ in (9). As before, entries where the true rejection probability equals the one above are marked by a dash. We use only observation patterns (1)–(5); under the other patterns the amount of missing information is too large for second order inference.

Under the null hypothesis, configuration A, the first part of Table 2 shows that the rejection probability of all tests is close to the nominal level under all missingness patterns, with the projection test being slightly above the level in some cases. It is interesting to notice the different impact of missingness on the power in different situations. We report raw power in Table 2 and size-adjusted power in Table S4 in the Supplementary Material. While in many situations the loss of power due to missingness is similar for all three tests, in some situations the square root test appears to be more sensitive to missingness. For example, under scenario B and missingness pattern (2), the square root covariance test loses almost half of its power relative to no missingness, much more than the other two tests. This can be explained by the fact that the square root covariance estimator depends on the estimator of the covariance kernel at all arguments, which means that uncertainty due to missingness localized in a certain region of the domain, as under pattern (2), propagates. Similarly, under scenario D and pattern (2) the Hilbert–Schmidt and projection tests do not lose much power whereas the square root test does, because the difference between the covariances is due to the differences of $h_j(t)$ for $t \in [0.5, 1]$ while missingness occurs for $t \in [0, 0.5]$. For these reasons, under the same scenario, pattern (3) leads to a larger loss of power than pattern (2) for the Hilbert–Schmidt and projection tests, whereas the loss of power of the square root covariance test is not much higher than under pattern (2), where it was already high.
The second part of Table 2 shows results for the tests applied to the subset of complete curves only. As before, patterns (3) and (5) are shown on the same rows as patterns (2) and (4), respectively, because the subsets of complete curves are the same. We observe a large decrease of power in comparison with the power of the proposed tests in cases where the neglected incomplete curves carry information on the difference between the covariance operators. When the difference is mostly in the frequently missing region (e.g., configuration D, pattern (3)), removing the incomplete curves affects the power much less.

These results highlight the usefulness of the proposed methods as an efficient, and often the only viable, approach to testing with incomplete functions. In no situation did the proposed methods behave worse than the simple approach using complete curves only, and in many cases they behaved dramatically better. Additional results for non-Gaussian curves can be found in the Supplementary Material.

6. Application to partially observed heart rate temporal profiles

We illustrate our methods on curves describing the evolution of heart rate in 427 male participants in the period from 8 PM to 2 AM, corresponding to the domain $[20, 26]$. The data come from the Swiss Kidney Project on Genes in Hypertension. There are three groups of persons according to their age: younger than 40 years (164 persons), between 40 and 65 (180), and older than 65 (83). The curves and their first derivatives are plotted in Fig. 1. Although the percentage of observed values at each time or at each pair of time points is relatively high (Fig. 2), only 58% of the curves are complete. Plots of the estimated mean functions in Fig. 1 indicate differences between the age groups both in terms of the temporal profiles and in terms of their first derivative.

We first compare the group means of the heart rate profiles. The p-values of the $L^2$ test and of the projection test using three Legendre polynomials are 0.006 and less than 0.001, respectively, confirming the clearly visible differences. To compare the dynamics of heart rate during the transition between day and night, we test whether the means of the first derivative differ. The $L^2$ and projection tests have nearly zero p-values, meaning that the mean heart rate profiles differ between the age groups by more than a vertical shift. The plots suggest that it may be interesting to compare some pairs of groups. For example, while the mean profiles of the middle and oldest group differ significantly ($p < 0.01$ for both tests), they appear to be approximately parallel. The difference between the derivatives is indeed insignificant ($p = 0.07$ for the $L^2$ test, $p = 0.09$ for the projection test).

Without the methods developed in this paper one would have to use complete curves only. There are 249 complete functions (43, 110 and 96 in the three age groups). The projection test still detects the differences between the three groups ($p = 0.008$) but the $L^2$ test loses significance ($p = 0.066$). When comparing the second and third group, the projection test now fails to detect the difference ($p = 0.13$) and the $L^2$ test gives a marginally significant result ($p = 0.048$). This can be explained by the loss of power seen in the simulations, because the removed incomplete curves are more often observed at earlier times, where the difference between the two mean curves is also more pronounced.

Estimates of the covariance function, eigenvalues and eigenfunctions of the heart rate profiles and of their derivatives for each age group are plotted in Fig. 3 and Fig. 4.
Further plots can be found in the supplementary document. The plots suggest some differences between the groups. The variance and covariance appear to be higher in younger participants, especially earlier in the time interval (during the day). We assess the significance of these differences using the proposed tests. For the projection test we consider up to three principal components (plotted in the supplementary document), which corresponds to the projection onto a subspace of dimension six in the space of covariance operators. Table 3 reports the p-values. None of the tests rejects the null hypothesis at usual significance levels. Similarly, pairwise comparisons provided no overwhelming evidence of differences. It is of course possible that there are differences between the groups that may be detected with larger samples. To gain further insight into the structure of possible differences, one can inspect the values of the standardized score components $\mathcal{R}_{jlm}/\hat W^{1/2}_{jlm,jlm}$ (see (5) and (7)), whose graphical representation is provided in the supplement.

Fig. 1. Individual heart rate profiles and their first derivatives (left panels) and the corresponding group-specific and null estimates of the mean (right panels).

Fig. 2. Cross-sectional percentage of observed values (left) and percentage of pairwise complete observations (right).

Fig. 3. Estimated covariance functions of heart rate profiles (top row) and of their derivatives (bottom row) in the age groups.

Fig. 4. Estimated eigenvalues and eigenfunctions of heart rate profiles (top row) and of their derivatives (bottom row) in the age groups with pointwise 95% bootstrap confidence intervals.

Table 3
p-values of the Hilbert–Schmidt norm test, $S_{\mathrm{HS}}$, the square root covariance test, $S_{\mathrm{sqrt}}$, and the projection tests, $S_d$, with $d = 1, 2, 3$, for comparing the covariance structures of heart rate profiles and of their first derivatives in three age groups. The fraction of variance explained by the first $d$ principal components of the null covariance estimate is indicated in parentheses.

                     SHS      Ssqrt    S1               S2               S3
Curves               0.338    0.118    0.317 (88.2%)    0.439 (97.3%)    0.275 (99.1%)
First derivative     0.226    0.114    0.322 (62.6%)    0.131 (94.4%)    0.094 (98.7%)

Acknowledgments

We are grateful to all reviewers for their valuable comments and suggestions. This work was supported by the Czech Science Foundation under Grant GJ17-22950Y. Access to computing and storage facilities owned by parties and projects contributing to the MetaCentrum National Grid Infrastructure, provided under the programme ‘‘Projects of Large Research, Development, and Innovations Infrastructures’’ (CESNET LM2015042), is greatly appreciated.

Appendix A. A central limit theorem

We provide a general central limit theorem for independent but not necessarily identically distributed random elements of a separable Hilbert space. It is needed in the proofs, where non-identical distributions arise due to partial observation, but it is of more general interest. It extends the standard result for independent identically distributed functional variables [5, Theorem 2.7] by relaxing the assumption of identical distributions and by considering triangular arrays. The notation ∥ · ∥∞ below denotes the operator norm. Theorem 6. Let Yni, n ∈ {1, 2, . . . }, i ∈ {1, . . .
, n} be random elements of a separable Hilbert space H with mean zero, E ∥Yni∥2 < ∞ and covariance operators Cni. Let Yn1, . . . , Ynn be mutually independent for each n ∈ {1, 2, . . . }. Denote Sn = n−1/2 ∑n i=1 Yni and Gn = n−1 ∑n i=1 Cni. Assume that (i) ∥Gn − G ∥∞ → 0 as n → ∞ for some covariance operator G , (ii) for all ε > 0, n−1 n ∑ i=1 E(∥Yni∥2 1[∥Yni∥>n1/2∥Gn∥∞ε]) → 0 as n → ∞, (iii) tr Gn → tr G as n → ∞. Then Sn converges in distribution to a Gaussian random element with mean zero and covariance operator G . Appendix B. Proofs Proof of Theorem 1. We rewrite N1/2 ( ˆµ − µ) = ˆπ1/2 n1/2 ( ˆµ − µ). The main task is to establish the weak convergence of the process n1/2 ( ˆµ − µ) = 1 π Sn + ( J ˆπ − 1 π ) Sn + n1/2 (J − 1)µ, (B.1) where Sn = n−1/2 ∑n i=1 Oi(Xi − µ). We show that the first term on the right side of (B.1) converges in distribution to a mean zero Gaussian process with covariance operator with kernel π(s)−1 π(t)−1 ν(s, t)ρ(s, t) that can be consistently estimated by ˆπ(s)−1 ˆπ(t)−1 ˆν(s, t) ˆρ(s, t), and that the norms of the other two terms converge in probability to 0. The proof of the weak convergence of N1/2 ( ˆµ − µ) then follows from the convergence of ˆπ to π, the consistency of the estimator of its covariance kernel can be shown analogously. The weak convergence of Sn is shown with the help of Theorem 6, a central limit theorem for independent nonidentically distributed Hilbert space variables given in the Appendix. We apply the theorem with Yni = Oi(Xi − µ). The covariance operator Gn of Sn is given by the kernel ¯ν(s, t)ρ(s, t). Denote by G the covariance operator with kernel ν(s, t)ρ(s, t). Conditions of the central limit theorem Theorem 6 can be shown using Definition 1(b) as follows. Condition (i) of Theorem 6 is satisfied because ∥Gn − G ∥2 ∞ ≤ ∥Gn − G ∥2 2 = ∫ [0,1]2 {¯ν(s, t) − ν(s, t)}2 ρ(s, t)2 dsdt → 0 as n → ∞ by the dominated convergence theorem. Condition (ii) of Theorem 6 holds because n−1 n ∑ i=1 E(∥Yni∥2 1[∥Yni∥>n1/2∥Gn∥∞ε]) ≤ n−1 n ∑ i=1 E(∥Xi − µ∥2 1[∥Xi−µ∥>n1/2∥Gn∥∞ε]) = E(∥X1 − µ∥2 1[∥X1−µ∥>n1/2∥Gn∥∞ε]), which converges to 0 by the dominated convergence theorem. Finally, ∫ 1 0 ¯ν(t, t)ρ(t, t)dt → ∫ 1 0 ν(t, t)ρ(t, t)dt by the dominated convergence theorem again, and thus condition (iii) of Theorem 6 is satisfied. Hence the process Sn is asymptotically Gaussian with covariance kernel ν(s, t)ρ(s, t). The expectation of the squared norm of the second term on the right side of (B.1) can be rewritten as ∫ 1 0 E [{ J(t) ˆπ(t) − 1 π(t) }2 Sn(t)2 1[ ˆπ(t)≥π0/2] ] dt + ∫ 1 0 E [{ J(t) ˆπ(t) − 1 π(t) }2 Sn(t)2 1[ ˆπ(t)<π0/2] ] dt. (B.2) The first summand above is dominated by ∫ 1 0 E [ {π(t) − ˆπ(t)}2 π4 0 /4 Sn(t)2 ] dt ≤ ∫ 1 0 E [ {π(t) − ˆπ(t)}2 π4 0 /4 ] ρ(t, t)dt D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 599 which converges to zero by the dominated convergence theorem since E[{π(t)− ˆπ(t)}2 ] = {π(t)− ¯π(t)}2 +n−2 ∑n i=1 πi(t) {1 − πi(t)} → 0 for n → ∞. Next, we first compute { J(t) ˆπ(t) − 1 π(t) }2 1[ ˆπ(t)<π0/2] = [ J(t) { π(t) − ˆπ(t) ˆπ(t)π(t) }2 + {1 − J(t)} 1 π(t)2 ] 1[ ˆπ(t)<π0/2] ≤ [J(t)n2 /π2 0 + {1 − J(t)}/π2 0 ]1[ ˆπ(t)<π0/2] ≤ n2 /π2 0 1[ ˆπ(t)<π0/2]. 
Then the second summand in (B.2) is smaller than or equal to ∫ 1 0 E{n2 /π2 0 1[ ˆπ(t)<π0/2]Sn(t)2 }dt ≤ ∫ 1 0 n2 /π2 0 Pr{ ˆπ(t) < π0/2}ρ(t, t)dt ≤ n2 sup t∈[0,1] Pr{ ˆπ(t) < π0/2}/π2 0 tr R, which converges to 0 because, in light of Hoeffding’s inequality and Definition 1(a), for all t ∈ [0, 1], Pr{ ˆπ(t) < π0/2} ≤ exp[−2n{ ¯π(t) − π0/2}2 ] ≤ exp [ −2n { π0/2 − sup t∈[0,1] | ¯π(t) − π(t)| }2] → 0. This completes the proof of the convergence in probability of the norm of the second term on the right hand side of (B.1) to zero. The last term in (B.1) can be shown to converge to zero using similar arguments based on Hoeffding’s inequality. We now turn to the proof of the consistency of the estimator of the covariance kernel. To show that E ∫ [0,1]2 { ˆν(s, t) ˆρ(s, t) ˆπ(s) ˆπ(t) − ν(s, t)ρ(s, t) π(s)π(t) }2 dsdt → 0, we can split the integral into the integrals over A0 = {(s, t) ∈ [0, 1]2 : ν(s, t) = 0} and A1 = {(s, t) ∈ [0, 1]2 : ν(s, t) ≥ ν0} because Definition 1(c) implies that A0 ∪ A1 = [0, 1]2 . On A0 we obtain E ∫ A0 { ˆν(s, t) ˆρ(s, t) ˆπ(s) ˆπ(t) }2 {1[min( ˆπ(s), ˆπ(t))≥π0/2] + 1[min{ ˆπ(s), ˆπ(t)}<π0/2]}dsdt ≤ ∫ A0 E{ˆν(s, t)2 } E{ˆρ(s, t)2 }dsdt ( (π0/2)−4 + n4 sup (s,t)∈[0,1]2 Pr[min{ ˆπ(s), ˆπ(t)} < π0/2] ) . Here the integral converges to zero by the dominated convergence theorem as the integrand can be shown to go to 0 and the second term in the brackets asymptotically vanishes due to an exponential rate of decrease of the supremum that can be established with the help of Hoeffding’s inequality as before, hence the whole quantity above converges to 0. We now focus on A1. We rewrite ˆν(s, t) ˆρ(s, t) ˆπ(s) ˆπ(t) − ν(s, t)ρ(s, t) π(s)π(t) = ˆν(s, t) ˆπ(s) ˆπ(t) {ˆρ(s, t) − ρ(s, t)} + { ˆν(s, t) ˆπ(s) ˆπ(t) − ν(s, t) π(s)π(t) } ρ(s, t) (B.3) and show that the integral over A1 of the expectation of the square of each summand converges to zero. For the first summand we compute ∫ A1 E ([ ˆν(s, t) ˆπ(s) ˆπ(t) {ˆρ(s, t) − ρ(s, t)} ]2 {1[min( ˆπ(s), ˆπ(t))≥π0/2] + 1[min{ ˆπ(s), ˆπ(t)}<π0/2]} ) dsdt ≤ E ∫ A1 {ˆρ(s, t) − ρ(s, t)}2 dsdt [ (π0/2)−4 + n4 sup (s,t)∈[0,1]2 Pr(min{ ˆπ(s), ˆπ(t)} < π0/2) ] , where the integral term converges to 0 by similar arguments to those in the proof of Proposition 1 in Kraus [35] with the help of Definition 1(c) and the second term goes to 0 by Hoeffding’s inequality again. For the second summand on the right in (B.3) we can write ∫ A1 E [ I(s, t) { π(s)π(t)ˆν(s, t) − ˆπ(s) ˆπ(t)ν(s, t) ˆπ(s) ˆπ(t)π(s)π(t) }2] ρ(s, t)2 dsdt + ∫ A1 E [ {1 − I(s, t)} { ν(s, t) π(s)π(t) }2] ρ(s, t)2 dsdt. (B.4) Like before, we split the first term in (B.4) into two summands by writing ∫ A1 E [ I(s, t) { π(s)π(t)ˆν(s, t) − ˆπ(s) ˆπ(t)ν(s, t) ˆπ(s) ˆπ(t)π(s)π(t) }2 {1[min( ˆπ(s), ˆπ(t))≥π0/2] + 1[min{ ˆπ(s), ˆπ(t)}<π0/2]} ] ρ(s, t)2 dsdt. The first summand is bounded by 16π−8 0 ∫ A1 E[{π(s)π(t)ˆν(s, t) − ˆπ(s) ˆπ(t)ν(s, t)}2 ]ρ(s, t)2 dsdt, which converges to 0 by the dominated convergence theorem since the expectation in the integrand can be shown to converge to 0; the 600 D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 second summand in the displayed expression above is dominated by n4 π−4 0 ∥R∥2 2 sup(s,t)∈[0,1]2 Pr(min{ ˆπ(s), ˆπ(t)} < π0/2), which converges to 0 by Hoeffding’s inequality. Finally, the second term in (B.4) is dominated by sup(s,t)∈A1 Pr(ˆν(s, t) < ν0/2)π−4 0 ∥R∥2 2, which converges to 0 again by Hoeffding’s inequality. Proof of Theorem 2. Denote Zj(·) = Nj(·)1/2 { ˆµj(·) − ˆµ(·)}/ˆrj and Z = (Z1, . . . , ZK )⊤ . 
Under the null hypothesis we can write Z = ˆDH, where H = (H1, . . . , HK )⊤ with Hj = N 1/2 j ( ˆµj − µ) and ˆD is a bounded linear operator from {L2 ([0, 1])}K to {L2 ([0, 1])}K that maps an element f to an element g whose jth component is given by gj(t) = ∑K l=1( ˆDjlfl)(t) = ∑K l=1 ˆr−1 j {δjl − Nj(t)1/2 ˆwl(t)Jl(t)Nl(t)−1/2 }fl(t) (here δjl is the Kronecker delta and Jl(t)Nl(t)−1/2 is zero if Jl(t) = 1[Nl(t)>0] is zero). From Theorem 1 we see that H converges in distribution to the random element H∞ = (H∞ 1 , . . . , H∞ K )⊤ whose components are mutually independent Gaussian processes with mean zero and covariance operators Kj, j = 1 . . . , K analogous to the operator K in Theorem 1. The operator ˆD converges in probability to the operator D whose elements are defined by (Djlfl)(t) = r−1 j {δjl − πj(t)1/2 a 1/2 j wl(t)πl(t)−1/2 a −1/2 l }fl(t) with wl(t) = alπl(t)/r2 l /( ∑K k=1 akπk(t)/r2 k ) (the convergence is in the operator norm, i.e., ∥ ˆD − D∥∞ P −→ 0). Therefore, it follows from Slutsky’s and continuous mapping theorem that Z = ˆDH converges weakly to Z∞ = DH∞ . This is a K-dimensional mean zero Gaussian random process with cross-covariance operator between Z∞ j and Z∞ k equal to Vjk = ∑K l=1 DjlKlD∗ kl, j = 1, . . . , K, k = 1, . . . , K. These can be consistently estimated by plugging-in the estimators ˆDjl and ˆKl. The kernel of the estimator ˆVjk takes the form ˆvjk(s, t) = ∑K l=1 ˆr−1 j {δjl − Nj(s)1/2 ˆwl(s)Nl(s)−1/2 }ˆκl(s, t){δkl − Nk(t)1/2 ˆwl(t)Nl(t)−1/2 }ˆr−1 k . For (i), the continuous mapping theorem gives that the statistic TL2 = ∥Z∥2 converges weakly to the random variable ∥Z∞ ∥2 . The process Z∞ is a Gaussian random element of the separable Hilbert space {L2 ([0, 1])}K . Therefore, it can be expanded in a Karhunen–Loève series with Gaussian coefficients. Consequently, the distribution of its squared norm is that of the series given in the theorem. The consistency of ˆV implies the consistency of the estimated eigenvalues. To prove (ii), notice that the components of the score vector satisfy Qjl = ⟨ ˆπ 1/2 j Zj, ˆψl⟩. The continuous mapping theorem and Slutsky’s theorem in conjunction with the convergence of ˆψl imply that Q is asymptotically distributed as a Gaussian vector with mean zero and covariance matrix with entries Vjl,km = ⟨π 1/2 j ψl, Vjk(π 1/2 k ψm)⟩. The consistency of ˆVjl,km follows from the consistency of ˆVjk and ˆπj and convergence of ˆψl. The process ( ˆπ 1/2 1 Z1, . . . , ˆπ 1/2 K ZK ) lies in a (K − 1)-dimensional subspace of the K-dimensional product space {L2 ([0, 1])}K and the same holds for its limit. Therefore, the score vector lies in a (K − 1)d-dimensional subspace of RKd , leading to (K − 1)d degrees of freedom of the chi-square distribution. Proof of Theorem 3. The kernel of n1/2 ( ˆR − R) is n1/2 {ˆρ(s, t) − ρ(s, t)} = n1/2 {ˆρ(s, t) − ˇρ(s, t)} + 1 ν(s, t) σ(s, t) + { I(s, t) ˆν(s, t) − 1 ν(s, t) } σ(s, t) + n1/2 {I(s, t) − 1}ρ(s, t), (B.5) where ˇρ is defined like ˆρ with the true mean in place of the estimated mean and σ(s, t) = n−1/2 ∑n i=1 Ui(s, t)[{Xi(s) − µ(s)}{Xi(t) − µ(t)} − ρ(s, t)]. Let us focus on the second summand on the right side of (B.5). All the other terms are negligible in the appropriate sense as we explain later. The kernel σ(s, t) corresponds to the operator Sn = n−1/2 ∑n i=1 Yni, where Yni are the integral operators with kernels yni(s, t) = Ui(s, t)[{Xi(s) − µ(s)}{Xi(t) − µ(t)} − ρ(s, t)]. 
We will apply Theorem 6 to Yni, which is a triangular array of row-wise independent non-identically distributed zero-mean random elements of the separable Hilbert space of the Hilbert–Schmidt operators on L2 ([0, 1]). The covariance operator of Yni is the Hilbert–Schmidt operator Cni on Hilbert–Schmidt operators given by ⟨A1, CniA2⟩ = cov(⟨Yni, A2⟩, ⟨Yni, A1⟩) = ∫ [0,1]4 α1(s, t) cov{yni(s, t), yni(u, v)}α2(u, v)dsdtdudv, where A1, A2 are Hilbert–Schmidt operators with kernels α1, α2, respectively. The kernel of Cni is cni(s, t, u, v) = cov{yni(s, t), yni(u, v)} = θi(s, t, u, v){ζ(s, t, u, v)−ρ(s, t)ρ(u, v)}. The covariance operator of Sn is Gn = n−1 ∑n i=1 Cni with kernel ¯θ(s, t, u, v){ζ(s, t, u, v) − ρ(s, t)ρ(u, v)}. Like in the proof of Theorem 1, one can use the dominated convergence theorem to show that ∥Gn − G∥2 → 0, where G has kernel θ(s, t, u, v){ζ(s, t, u, v) − ρ(s, t)ρ(u, v)}. Thus condition (i) of Theorem 6 is verified. Condition (ii) can be verified like in the proof of Theorem 1. Next, condition (iii) is satisfied because tr Gn = ∫ [0,1]2 ¯θ(s, t, s, t){ζ(s, t, s, t)−ρ(s, t)2 }dsdt converges to tr G = ∫ [0,1]2 θ(s, t, s, t){ζ(s, t, s, t)−ρ(s, t)2 }dsdt. Therefore, Sn is asymptotically distributed as a Gaussian random operator with mean zero and covariance operator G and, consequently, by the continuous mapping theorem the second term on the right-hand side of (B.5) weakly converges to the mean zero Gaussian operator with covariance operator H′ given in Theorem 3. The operators corresponding to the first and fourth summand on the right side in (B.5) were shown to converge to zero in the proof of Proposition 1 in Kraus [35] in the sense that the expectation of their squared Hilbert–Schmidt norm converges to zero. Also, the Hilbert–Schmidt norm of the third term on the right in (B.5) converges to zero in mean square which can be shown by arguments analogous to those used for the second term on the right in (B.1) in the proof of Theorem 1. Therefore, in view of Slutsky’s lemma these terms are negligible for the weak convergence. D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 601 The weak convergence of the operator with kernel M(s, t)1/2 {ˆρ(s, t) − ρ(s, t)} follows from the convergence of ˆν(s, t) to ν(s, t). The consistency of the estimators of H′ and H can be proved along the lines of the proof for K ′ and K in Theorem 1. Proof of Theorem 4. The proof uses perturbation theory in which ˆR is regarded as a perturbed version of R, i.e., ˆR = R + ( ˆR − R). Recall that the perturbation satisfies E ∥ ˆR − R∥2 2 = O(n−1 ) [35, Proposition 1], and, therefore, ∥ ˆR − R∥∞ = OP (n−1/2 ). Similarly to the proof of Theorem 3.1 in [10], we rewrite n1/2 (ˆλm −λm) = n1/2 (ˆλm −λm)1Ωn +n1/2 (ˆλm −λm)1ΩC n , where Ωn = {ω : ∥ ˆR − R∥∞ < εn} for a numerical sequence εn satisfying n−1/2 ≪ εn ≪ n−1/4 . Since Pr(Ωn) → 1 as n → ∞, the term n1/2 (ˆλm − λm)1ΩC n converges to 0 in probability. For ∥ ˆR − R∥∞ sufficiently small, i.e., on Ωn for n large enough, we have by Corollary 3.4 of [22] that n1/2 (ˆλm − λm)1Ωn = n1/2 ⟨( ˆR − R)ϕm, ϕm⟩1Ωn + n1/2 O(∥ ˆR − R∥2 ∞)1Ωn . Here the last term converges to 0 in probability because εn ≪ n−1/4 and the first term on the right side converges in distribution to the limit given in part (i) of the theorem. Hence the result follows from Slutsky’s theorem. The expression for the limiting variance is obtained by rewriting var⟨H ′∞ ϕm, ϕm⟩ = var⟨H ′∞ , ϕm ⊗ ϕm⟩ = ⟨ϕm ⊗ ϕm, H′ (ϕm ⊗ ϕm)⟩. 
Next, we can write n1/2 (ˆsm ˆϕm −ϕm) = n1/2 (ˆsm ˆϕm −ϕm)1Ωn +n1/2 (ˆsm ˆϕm −ϕm)1ΩC n . For n sufficiently large, Corollary 3.3 of [22] gives n1/2 (ˆsm ˆϕm − ϕm)1Ωn = n1/2 Qm( ˆR − R)ϕm1Ωn + n1/2 O(∥ ˆR − R∥2 ∞)1Ωn . The first term on the right converges in distribution to the limiting distribution as claimed in part (ii) and the other terms converge in probability to 0. The limiting covariance operator is obtained by inspecting the cross-covariance operator for each pair of summands in the series QmH ′∞ ϕm. The cross-covariance between (ϕk⊗ϕk)H ′∞ ϕm = ⟨ϕk, H ′∞ ϕm⟩ϕk and (ϕl⊗ϕl)H ′∞ ϕm = ⟨ϕl, H ′∞ ϕm⟩ϕl is cov(⟨ϕk, H ′∞ ϕm⟩, ⟨ϕl, H ′∞ ϕm⟩)(ϕk ⊗ ϕl) = cov{⟨(ϕm ⊗ ϕk), H ′∞ ⟩, ⟨(ϕm ⊗ ϕl), H ′∞ ⟩}(ϕk ⊗ ϕl) = ⟨(ϕm ⊗ ϕk), H′ (ϕm ⊗ ϕl)⟩(ϕk ⊗ ϕl). The inner product in the last expression above equals the integral in part (ii) of the theorem. Proof of Theorem 5. Let ˆD be the linear operator on the product space HS(L2 ([0, 1]))K that maps F = (F1, . . . , FK )⊤ , where Fj are Hilbert–Schmidt operators on L2 ([0, 1]) with kernels fj(s, t), to G = (G1, . . . , GK )⊤ where Gj has kernel gj(s, t) = ∑K l=1{δjl − Mj(s, t)1/2 ˆwl(s, t)Il(s, t)Ml(s, t)−1/2 }fl(s, t). The mapping ˆD is a random linear operator on HS(L2 ([0, 1]))K that acts by pointwise multiplication and linear combination of integral kernels; ˆD itself is not an integral operator but it is bounded because the functions in the braces above are bounded. It converges in probability to the nonrandom bounded linear operator D that maps F to G with Gj with kernel ∑K l=1{δjl − νj(s, t)1/2 a 1/2 j wl(s, t)νl(s, t)−1/2 a −1/2 l } fl(s, t). The convergence is in the sense of the operator norm on linear operators on HS(L2 ([0, 1]))K , that is, ∥ ˆD−D∥∞ P −→ 0, where ∥D∥∞ = sup{∥DF∥2/∥F∥2 : F ∈ HS(L2 ([0, 1]))K } with ∥ · ∥2 being the Hilbert–Schmidt norm on HS(L2 ([0, 1]))K . Now consider the standardized contrasts Z = (Z1, . . . , ZK )⊤ with kernels zj(s, t) = Mj(s, t)1/2 {ˆρj(s, t) − ˆρ(s, t)}. They are obtained as Z = ˆDH , where H = (H1, . . . , HK )⊤ with Hj with kernel hj(s, t) = Mj(s, t)1/2 {ˆρ(s, t) − ρ(s, t)}. Under the null hypothesis Theorem 3 yields that H converges in distribution to H ∞ , a vector of K independent mean zero Gaussian random operators with covariance operators Hj. Therefore, Z = ˆDH converges in distribution to Z ∞ = DH ∞ by Slutsky’s and continuous mapping theorem. The covariance operator B of Z ∞ is given by the cross-covariance operators Bjk between the components Zj and Zk whose estimator ˆBjk has kernel ˆβjk(s, t, u, v) = K ∑ l=1 {δjl − Mj(s, t)1/2 ˆwl(s, t)Ml(s, t)−1/2 }ˆηl(s, t, u, v){δkl − Mk(u, v)1/2 ˆwl(u, v)Ml(u, v)−1/2 }. The test statistic SHS = ∥Z ∥2 2 is asymptotically distributed as ∥Z ∞ ∥2 2. The random variable Z ∞ is a Gaussian element of the separable Hilbert space HS(L2 ([0, 1]))K , therefore it can be expanded in a Karhunen–Loève series with independent Gaussian coefficients. Therefore, its squared norm is distributed as the series of independent chi-square variables weighted by the eigenvalues of the covariance operator and part (i) of the theorem follows. The components of the score vector satisfy Rjlm = ⟨ˆνj(·, ·)1/2 zj(·, ·), ˆUlm⟩. Due to the consistency of the estimated eigenfunctions [35, Proposition 2], the operator ˆUlm (up to the sign ambiguity for l ̸= m) converges to Ulm defined by the true eigenfunctions, with kernel ulm(s, t). 
Therefore, the score vector weakly converges to the mean zero Gaussian vector with components R∞ jlm = ⟨νj(·, ·)1/2 z∞ j (·, ·), Ulm⟩ = ⟨z∞ j (·, ·), νj(·, ·)1/2 ulm(·, ·)⟩ whose covariance matrix has entries Wjlm,kpq = ⟨νj(·, ·)1/2 ulm(·, ·), Bjk{νk(·, ·)1/2 upq(·, ·)}⟩, j, k ∈ {1, . . . , K}, 1 ≤ l ≤ m ≤ d, 1 ≤ p ≤ q ≤ d. The vector of operators with kernels νj(s, t)1/2 z∞ j (s, t) lies in a hyperplane in HS(L2 ([0, 1]))K , thus the matrix W has rank (K − 1)d(d + 1)/2. The consistency of ˆW follows from the convergence of all quantities involved. Hence the limiting distribution is the chi-square distribution as claimed in part (ii). Proof of Theorem 6. First, we prove the convergence in distribution of one-dimensional projections using Lindeberg’s central limit theorem. It follows from assumption (i) that for f ∈ H such that G f ̸= 0, var⟨Sn, f ⟩ = ⟨f , Gnf ⟩ → ⟨f , G f ⟩ as 602 D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 n → ∞. To verify Lindeberg’s condition, we compute n−1 n ∑ i=1 E(⟨Yni, f ⟩2 1[|⟨Yni,f ⟩|>n1/2⟨f ,Gnf ⟩1/2ε]) ≤ n−1 n ∑ i=1 E(∥Yni∥2 ∥f ∥2 1[∥Yni∥>n1/2⟨f ,Gnf ⟩1/2∥f ∥−1ε]). Now in light of assumption (i), there is a positive constant c such that for sufficiently large n, ⟨f , Gnf ⟩1/2 /∥Gn∥∞ > c, and the above expression is further dominated by n−1 ∑n i=1 E(∥Yni∥2 ∥f ∥2 1[∥Yni∥>n1/2∥Gn∥∞c∥f ∥−1ε]), which converges to 0 by assumption (ii). Hence one-dimensional projections converge, and due to Theorem 2.3 of Bosq [5], all finite-dimensional projections converge. To complete the proof, let us prove the tightness of the sequence Sn, n = 1, 2, . . . The idea of the proof is similar to that of Bosq [5, Theorem 2.7] but in the present situation the variables Yn1, . . . , Ynn are possibly non-identically distributed. Let vj and δj, j = 1, 2, . . . be the eigenfunctions and eigenvalues of the limiting operator G . Consider a sequence lk, k = 1, 2, . . . such that lk → ∞ for k → ∞. For ε > 0, let Nk, k = 1, 2, . . . be an increasing sequence of integers such that ∑∞ k=1 lkr2 Nk < ε, where r2 N = ∑∞ j=N δj. Define Bk = {x ∈ H : ∑∞ j=Nk ⟨x, vj⟩2 ≤ l−1 k }. It follows from assumptions (i) and (iii) that Pr(Sn ∈ BC k ) = P ( ∞∑ j=Nk ⟨Sn, vj⟩2 > l−1 k ) ≤ lk E ( ∞∑ j=Nk ⟨Sn, vj⟩2 ) = lk E ( ∥Sn∥2 − Nk−1 ∑ j=1 ⟨Sn, vj⟩2 ) = lk ( tr Gn − Nk−1 ∑ j=1 ⟨vj, Gnvj⟩ ) → lk ( tr G − Nk−1 ∑ j=1 ⟨vj, G vj⟩ ) = lk ∞∑ j=Nk ⟨vj, G vj⟩ = lkr2 Nk . Consider the compact set Kε = ∩∞ k=1Bk and compute lim sup n→∞ Pr(Sn ∈ KC ε ) ≤ lim sup n→∞ ∞∑ k=1 Pr(Sn ∈ BC k ) ≤ ∞∑ k=1 lim sup n→∞ Pr(Sn ∈ BC k ) ≤ ∞∑ k=1 lkr2 Nk < ε, where the second inequality is due to Fatou’s lemma. This proves the tightness. Appendix C. Supplementary data Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jmva.2019.05.002. The supplementary document available online contains further simulation results and additional graphs for the data application. R code is available online. References [1] J.A.D. Aston, D. Pigoli, S. Tavakoli, Tests for separability in nonparametric covariance operators of random surfaces, Ann. Statist. 45 (4) (2017) 1431–1461. [2] A. Aue, R. Gabrys, L. Horváth, P. Kokoszka, Estimation of a change-point in the mean function of functional data, J. Multivariate Anal. 100 (10) (2009) 2254–2269. [3] M. Benko, W. Härdle, A. Kneip, Common functional principal components, Ann. Statist. 37 (1) (2009) 1–34. [4] G. Boente, D. Rodriguez, M. Sued, Testing equality between several populations covariance operators, Ann. Inst. Statist. Math. (2017) 1–32. [5] D. 
Bosq, Linear Processes in Function Spaces, Springer, New York, 2000. [6] F.A. Bugni, Specification test for missing functional data, Econom. Theory 28 (5) (2012) 959–1002. [7] A. Cabassi, D. Pigoli, P. Secchi, P.A. Carter, Permutation tests for the equality of covariance operators of functional data with applications to evolutionary biology, Electron. J. Stat. 11 (2) (2017) 3815–3840. [8] G. Cao, L. Yang, D. Todem, Simultaneous inference for the mean function based on dense functional data, J. Nonparametr. Stat. 24 (2) (2012) 359–377. [9] A. Cuevas, M. Febrero, R. Fraiman, An anova test for functional data, Comput. Statist. Data Anal. 47 (1) (2004) 111–122. [10] J. Cupidon, D. Gilliam, R. Eubank, F. Ruymgaart, The delta method for analytic functions of random operators with application to functional data, Bernoulli 13 (4) (2007) 1179–1194. [11] J. Dauxois, A. Pousse, Y. Romain, Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference, J. Multivariate Anal. 12 (1) (1982) 136–154. [12] A.C. Davison, D.V. Hinkley, Bootstrap methods and their application, Cambridge University Press, Cambridge, 1997, p. x+582. [13] M. Dawson, H.-G. Müller, Dynamic modeling of conditional quantile trajectories, with application to longitudinal snippet data, J. Amer. Statist. Assoc. 113 (524) (2018) 1612–1624. [14] A. Delaigle, P. Hall, Classification using censored functional data, J. Amer. Statist. Assoc. 108 (504) (2013) 1269–1283. [15] A. Delaigle, P. Hall, Approximating fragmented functional data by segments of Markov chains, Biometrika 103 (4) (2016) 779–799. [16] M.-H. Descary, V.M. Panaretos, Recovering covariance from functional fragments, Biometrika 106 (1) (2019) 145–160. [17] F. Ferraty, Y. Romain (Eds.), The Oxford Handbook of Functional Data Analysis, Oxford University Press, Oxford, 2011, p. xviii+494. [18] C.B. Fogarty, D.S. Small, Equivalence testing for functional data with an application to comparing pulmonary function devices, Ann. Appl. Stat. 8 (4) (2014) 2002–2026. [19] S. Fremdt, L. Horváth, P. Kokoszka, J.G. Steinebach, Functional data analysis with increasing number of projections, J. Multivariate Anal. 124 (2014) 313–332. D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 603 [20] S. Fremdt, J.G. Steinebach, L. Horváth, P. Kokoszka, Testing the equality of covariance operators in functional samples, Scand. J. Stat. 40 (1) (2013) 138–152. [21] J.E. Gellar, E. Colantuoni, D.M. Needham, C.M. Crainiceanu, Variable-domain functional regression for modeling ICU data, J. Amer. Statist. Assoc. 109 (508) (2014) 1425–1439. [22] D.S. Gilliam, T. Hohage, X. Ji, F. Ruymgaart, The Fréchet derivative of an analytic function of a bounded operator with some applications, Int. J. Math. Math. Sci. 2009 (2009). [23] Y. Goldberg, Y. Ritov, A. Mandelbaum, Predicting the continuation of a function with applications to call center data, J. Statist. Plann. Inference 147 (2014) 53–65. [24] O. Gromenko, P. Kokoszka, J. Sojka, Evaluation of the cooling trend in the ionosphere using functional regression with incomplete curves, Ann. Appl. Stat. 11 (2) (2017) 898–918. [25] J. Guo, B. Zhou, J.-T. Zhang, New tests for equality of several covariance functions for functional data, J. Amer. Statist. Assoc. (2018) To appear. [26] J. Guo, B. Zhou, J.-T. Zhang, Testing the equality of several covariance functions for functional data: a supremum-norm based test, Comput. Statist. Data Anal. 124 (2018) 15–26. [27] L. Horváth, M. Hušková, P. 
[27] L. Horváth, M. Hušková, P. Kokoszka, Testing the stability of the functional autoregressive process, J. Multivariate Anal. 101 (2) (2010) 352–367.
[28] L. Horváth, P. Kokoszka, Inference for Functional Data with Applications, Springer, New York, 2012, p. xiv+422.
[29] L. Horváth, P. Kokoszka, R. Reeder, Estimation of the mean of functional time series and a two-sample problem, J. R. Stat. Soc. Ser. B Stat. Methodol. 75 (1) (2013) 103–122.
[30] D. Jarušková, Testing for a change in covariance operator, J. Statist. Plann. Inference 143 (9) (2013) 1500–1511.
[31] A. Kashlak, J. Aston, R. Nickl, Inference on covariance operators via concentration inequalities: k-sample tests, classification, and clustering via Rademacher complexities, Sankhya A (2018).
[32] A. Kneip, D. Liebl, On the optimal reconstruction of partially observed functional data, Ann. Statist. (2019) to appear.
[33] P. Kokoszka, M. Reimherr, Asymptotic normality of the principal components of functional time series, Stochastic Process. Appl. 123 (5) (2013) 1546–1562.
[34] P. Kokoszka, M. Reimherr, Introduction to Functional Data Analysis, CRC Press, 2017.
[35] D. Kraus, Components and completion of partially observed functional data, J. R. Stat. Soc. Ser. B Stat. Methodol. 77 (4) (2015) 777–801.
[36] D. Kraus, V.M. Panaretos, Dispersion operators and resistant second-order functional data analysis, Biometrika 99 (4) (2012) 813–832.
[37] D. Kraus, M. Stefanucci, Classification of functional fragments by regularized linear classifiers with domain selection, Biometrika 106 (1) (2019) 161–180.
[38] D. Liebl, Modeling and forecasting electricity spot prices: a functional data perspective, Ann. Appl. Stat. 7 (3) (2013) 1562–1592.
[39] D. Liebl, Nonparametric testing for differences in electricity prices: the case of the Fukushima nuclear accident, Ann. Appl. Stat. (2019) to appear.
[40] D. Liebl, S. Rameseder, Partially observed functional data: the case of systematically missing parts, Comput. Statist. Data Anal. 131 (2019) 104–115.
[41] R.Y. Liu, Bootstrap procedures under some non-i.i.d. models, Ann. Statist. 16 (4) (1988) 1696–1708.
[42] C.J. Lloyd, Estimating test power adjusted for size, J. Stat. Comput. Simul. 75 (11) (2005) 921–933.
[43] A. Mas, Testing for the mean of random curves: a penalization approach, Stat. Inference Stoch. Process. 10 (2) (2007) 147–163.
[44] V. Masarotto, Procrustes Metric and Optimal Transport for Covariance Operators, Ph.D. thesis, École Polytechnique Fédérale de Lausanne, 2019.
[45] M. Mojirsheibani, C. Shaw, Classification with incomplete functional covariates, Statist. Probab. Lett. 139 (2018) 40–46.
[46] V.M. Panaretos, D. Kraus, J.H. Maddocks, Second-order comparison of Gaussian random functions and the geometry of DNA minicircles, J. Amer. Statist. Assoc. 105 (490) (2010) 670–682.
[47] V.M. Panaretos, D. Kraus, J.H. Maddocks, Second-order inference for functional data with application to DNA minicircles, in: Recent Advances in Functional Data Analysis and Related Topics, Springer, 2011, pp. 245–250.
[48] E. Paparoditis, T. Sapatinas, Bootstrap-based K-sample testing for functional data, arXiv:1409.4317v4, 2016.
[49] E. Paparoditis, T. Sapatinas, Bootstrap-based testing of equality of mean functions or equality of covariance operators for functional data, Biometrika 103 (3) (2016) 727–733.
[50] D. Pigoli, J.A. Aston, I.L. Dryden, P. Secchi, Distances and inference for covariance operators, Biometrika 101 (2) (2014) 409–422.
[51] A. Pini, L. Spreafico, S. Vantini, A. Vietti, Multi-aspect local inference for functional data: analysis of ultrasound tongue profiles, J. Multivariate Anal. 170 (2019) 162–185.
[52] A. Pini, A. Stamm, S. Vantini, Hotelling's $T^2$ in separable Hilbert spaces, J. Multivariate Anal. 167 (2018) 284–305.
[53] A. Pini, S. Vantini, The interval testing procedure: a general framework for inference in functional data analysis, Biometrics 72 (3) (2016) 835–845.
[54] J.O. Ramsay, B.W. Silverman, Functional Data Analysis, Springer, New York, 2005.
[55] M. Stefanucci, L.M. Sangalli, P. Brutti, PCA-based discrimination of partially observed functional data, with an application to AneuRisk65 data set, Stat. Neerl. 72 (3) (2018) 246–264.
[56] O. Vsevolozhskaya, M. Greenwood, D. Holodov, Pairwise comparison of treatment levels in functional analysis of variance with application to erythrocyte hemolysis, Ann. Appl. Stat. 8 (2) (2014) 905–925.
[57] J.-T. Zhang, Analysis of Variance for Functional Data, Chapman and Hall/CRC, 2013.
[58] J.-T. Zhang, X. Liang, One-way ANOVA for functional data via globalizing the pointwise F-test, Scand. J. Stat. 41 (1) (2014) 51–71.
[59] C. Zhang, H. Peng, J.-T. Zhang, Two samples tests for functional data, Commun. Statist. – Theory Methods 39 (4) (2010) 559–578.

Supplementary material for “Inferential procedures for partially observed functional data”

David Kraus
Department of Mathematics and Statistics, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic; david.kraus@mail.muni.cz

Abstract: This supplementary document contains additional simulation results and further results of the data analysis.

Key words and phrases: Bootstrap; covariance operator; functional data; K-sample test; partial observation; principal components.

S1 Extended simulation results

Table S1 is an extended version of Table 1 presented in the main body of the paper. It includes additional simulation results for tests of equal means for non-Gaussian curves and for groups with unequal covariance operators. The same model as in the paper is used, except that in the non-Gaussian case independent $t_5$-distributed coefficients are generated and in the case of unequal covariance operators we set $\lambda_{3,0} = 0.2$. Since the empirical size deviates from the nominal level in some cases, Table S2 additionally reports size-adjusted powers for the same settings, computed using the method described by Lloyd (2005, Subsection 3.2). Table S3 reports results for tests of equal covariance operators. In addition to the results presented in Table 2 in the main body of the paper, it contains results for $t_5$-distributed coefficients in the model for random curves. Table S4 reports size-adjusted powers for the same settings.
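The idea behind size adjustment can be conveyed by a brief R sketch. Lloyd (2005, Subsection 3.2) describes a more refined estimator of size-adjusted power than what is shown below; the simple recalibration of the critical value to the empirical null quantile, as well as all distributions and parameter values in the sketch, are assumptions made purely for illustration and are not the procedure used to produce Tables S2 and S4.

    ## Minimal sketch of size adjustment (illustration only, not Lloyd's estimator):
    ## replace the nominal critical value by the empirical null quantile, then
    ## recompute the rejection rate under the alternative at that critical value.
    set.seed(2)
    alpha <- 0.05
    R <- 10000

    ## Hypothetical test statistics with a chi-square reference distribution whose
    ## finite-sample null law deviates slightly from the nominal one.
    df <- 3
    stat.null <- rchisq(R, df) * 1.1           # simulations under the null (oversized test)
    stat.alt  <- rchisq(R, df, ncp = 4) * 1.1  # simulations under an alternative

    crit.nominal  <- qchisq(1 - alpha, df)            # nominal critical value
    crit.adjusted <- quantile(stat.null, 1 - alpha)   # size-adjusted (empirical) critical value

    c(empirical.size      = mean(stat.null > crit.nominal),
      raw.power           = mean(stat.alt  > crit.nominal),
      size.adjusted.power = mean(stat.alt  > crit.adjusted))

The raw power and the size-adjusted power differ precisely because the empirical size exceeds the nominal level, which is the situation that motivates reporting Tables S2 and S4 alongside Tables S1 and S3.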
S2 Additional results for the data analysis

Fig. S1 contains additional plots of the covariance function estimates of the heart rate data shown in the main body of the paper. Fig. S2 shows the null estimates of the covariance functions and their leading eigenfunctions that the projection covariance test uses. Components of the score vector standardized by their estimated standard deviations are plotted in Fig. S3.

Acknowledgements

We are grateful to all reviewers for their valuable comments and suggestions. This work was supported by the Czech Science Foundation under Grant GJ17-22950Y. Access to computing and storage facilities owned by parties and projects contributing to the MetaCentrum National Grid Infrastructure, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Table S1
Empirical rejection probability (in %) of the $L^2$ test, $T_{L^2}$, and projection test, $T_d$, of equal means. A dash indicates the same value as on the preceding row. The observation patterns (1)–(9) and mean configurations A–D are described in Section 5 of the paper.

Distrib.  Covar.   Observ.     A           B           C           D
          oper.    pattern   TL2   Td    TL2   Td    TL2   Td    TL2   Td
Gaussian  Equal    (1)       5.6   6.2   69    60    49    56    52    63
                   (2)       5.4   6.7   59    52    28    29    38    50
                   (3)       —     —     —     —     50    56    44    62
                   (4)       4.4   6.5   66    58    51    57    51    62
                   (5)       —     —     —     —     44    49    50    58
                   (6)       5.4   7.1   58    51    50    55    42    49
                   (7)       —     —     —     —     28    34    37    42
                   (8)       5.4   5.8   55    47    34    37    42    48
                   (9)       5.4   7.8   37    40    20    23    26    34
Gaussian  Unequal  (1)       4.2   5.2   79    75    58    63    57    67
                   (2)       4.0   5.6   66    62    28    32    37    52
                   (3)       —     —     —     —     56    62    47    66
                   (4)       4.0   5.7   77    72    58    62    55    64
                   (5)       —     —     —     —     50    55    53    63
                   (6)       3.9   4.9   64    60    55    57    43    52
                   (7)       —     —     —     —     29    36    38    46
                   (8)       4.5   7.0   64    62    39    42    47    54
                   (9)       4.0   6.5   42    48    23    25    27    38
t5        Equal    (1)       5.4   7.3   72    61    51    58    54    63
                   (2)       4.7   7.6   58    53    27    30    38    52
                   (3)       —     —     —     —     50    60    44    63
                   (4)       5.1   6.4   70    60    52    57    51    60
                   (5)       —     —     —     —     46    52    50    60
                   (6)       3.7   6.1   56    50    50    54    41    50
                   (7)       —     —     —     —     27    32    37    43
                   (8)       5.1   7.1   58    52    33    36    44    51
                   (9)       5.4   6.6   38    42    21    24    26    34
t5        Unequal  (1)       5.8   7.4   82    77    59    65    60    68
                   (2)       4.7   6.9   68    64    32    35    44    57
                   (3)       —     —     —     —     60    66    50    68
                   (4)       5.2   6.7   80    76    62    65    59    66
                   (5)       —     —     —     —     53    60    56    65
                   (6)       3.9   6.1   65    63    57    61    47    57
                   (7)       —     —     —     —     32    37    42    50
                   (8)       4.8   7.5   65    64    39    42    50    56
                   (9)       5.5   6.2   44    50    24    28    30    40

Table S2
Size-adjusted empirical power (in %) for the same settings as in Table S1.

Distrib.  Covar.   Observ.     B           C           D
          oper.    pattern   TL2   Td    TL2   Td    TL2   Td
Gaussian  Equal    (1)       66    56    47    52    49    59
                   (2)       56    43    25    23    34    41
                   (3)       —     —     47    48    40    54
                   (4)       68    52    52    48    52    54
                   (5)       —     —     45    43    51    51
                   (6)       58    46    50    49    42    45
                   (7)       —     —     28    29    37    37
                   (8)       54    45    34    34    41    45
                   (9)       36    33    20    17    26    27
Gaussian  Unequal  (1)       83    73    63    62    62    66
                   (2)       72    59    35    29    44    49
                   (3)       —     —     62    59    56    64
                   (4)       81    72    63    62    61    63
                   (5)       —     —     56    55    59    62
                   (6)       68    60    60    57    47    53
                   (7)       —     —     34    36    43    46
                   (8)       67    54    42    36    49    48
                   (9)       45    45    25    23    31    35
t5        Equal    (1)       71    55    50    51    52    57
                   (2)       60    44    28    23    39    42
                   (3)       —     —     51    47    46    53
                   (4)       69    53    51    53    50    56
                   (5)       —     —     44    45    49    55
                   (6)       60    48    53    52    45    48
                   (7)       —     —     31    30    40    40
                   (8)       57    44    32    30    43    44
                   (9)       38    38    21    20    26    31
t5        Unequal  (1)       80    71    58    59    58    62
                   (2)       68    56    32    27    44    48
                   (3)       —     —     61    56    50    60
                   (4)       80    71    62    60    59    62
                   (5)       —     —     53    54    56    61
                   (6)       70    61    61    57    51    54
                   (7)       —     —     37    35    46    47
                   (8)       66    56    40    35    51    49
                   (9)       43    45    23    24    28    36

Table S3
Empirical rejection probability (in %) of the Hilbert–Schmidt norm test, $S_{\mathrm{HS}}$, projection test, $S_d$, and square root covariance test, $S_{\mathrm{sqrt}}$, of equal covariance operators. A dash indicates the same value as on the preceding row. The observation patterns (1)–(5) and covariance configurations A–D are described in Section 5 of the paper.

Distrib.  Observ.     A                  B                  C                  D
          pattern   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt
Gaussian  (1)       5.4   5.8  4.8     69    82   80      69    58   69      78    62   81
          (2)       4.6   6.4  4.9     54    63   41      37    32   38      76    64   54
          (3)       —     —    —       —     —    —       —     —    —       46    30   48
          (4)       5.0   5.1  5.8     64    74   72      61    53   62      72    56   73
          (5)       —     —    —       —     —    —       —     —    —       77    60   77
t5        (1)       3.6   5.7  4.2     26    32   35      30    26   35      38    41   44
          (2)       3.3   6.5  3.4     22    31   18      14    17   16      38    41   23
          (3)       —     —    —       —     —    —       —     —    —       16    16   20
          (4)       4.0   6.4  4.8     23    32   30      25    25   31      30    33   34
          (5)       —     —    —       —     —    —       —     —    —       36    38   40
Table S4
Size-adjusted empirical power (in %) for the same settings as in Table S3.

Distrib.  Observ.     B                  C                  D
          pattern   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt
Gaussian  (1)       66    79   81      67    56   69      78    60   81
          (2)       54    59   42      38    29   38      78    60   55
          (3)       —     —    —       —     —    —       47    26   49
          (4)       64    73   69      61    52   59      72    56   71
          (5)       —     —    —       —     —    —       77    59   74
t5        (1)       32    29   39      36    23   38      44    38   48
          (2)       23    24   20      18    14   18      40    33   26
          (3)       —     —    —       —     —    —       20    12   23
          (4)       29    26   31      31    19   32      37    27   35
          (5)       —     —    —       —     —    —       43    32   41

Fig. S1. Estimated covariance functions of heart rate profiles (top row) and of their derivatives (bottom row) in age groups. [Panels: ≤40, (40,65], >65; both axes labelled Time.]

Fig. S2. The null estimate of the covariance function (left column) and its three leading principal components (right column) for heart rate profiles (top row) and for their first derivative (bottom row). [Legend: PC1 (88.2%), PC2 (9.1%), PC3 (1.8%) for the profiles; PC1 (62.6%), PC2 (31.9%), PC3 (4.3%) for the derivatives.]

Fig. S3. Standardized components of the score vector for testing equal covariances contrasting age groups against the null for heart rate profiles (top row) and for their derivatives (bottom row). [Panels: ≤40, (40,65], >65.]

References

Lloyd, C. J. (2005). Estimating test power adjusted for size. Journal of Statistical Computation and Simulation, 75(11):921–933.