Topics in Functional Data Analysis

Habilitation Thesis

David Kraus

August 2021

Masaryk University
Faculty of Science
Department of Mathematics and Statistics

Contents

1. Introduction and summary
   1.1. Introduction
   1.2. Summary of Paper A
   1.3. Summary of Paper B
   1.4. Summary of Paper C
   1.5. Summary of Paper D
   1.6. Summary of Paper E
References
A. Second-order comparison of Gaussian random functions and the geometry of DNA minicircles
B. Dispersion operators and resistant second-order functional data analysis
C. Components and completion of partially observed functional data
D. Classification of functional fragments by regularized linear classifiers with domain selection
E. Inferential procedures for partially observed functional data

1. Introduction and summary

1.1. Introduction

Functional data analysis is an active area of statistics that deals with data that can be seen as mathematical functions. These could be curves, surfaces, images etc. Due to the development of modern technology, contemporary data sets indeed often consist of data units that are complex objects. A functional data set is a collection of observations of such functions (mathematically regarded as realizations of random processes, i.e., random variables in a function space), whereas more traditional data sets consist of observations of numbers or vectors. For a general background, see, e.g., Bosq (2000), Ramsay and Silverman (2005), Ferraty and Vieu (2006), Ferraty and Romain (2011), Horváth and Kokoszka (2012), Hsing and Eubank (2015) or Kokoszka and Reimherr (2017).

My research concentrates on the development of statistical methodology driven by applications. This text comprises five research articles containing my and my co-authors' contributions to the field of functional data analysis, accompanied by this introductory section, which summarizes the contents of the papers. The presentation is simplified to provide only the basic ideas and results of each paper. Thus, for example, references to preceding and subsequent relevant publications are not included and results are described in a stylized way rather than as rigorous formal statements.

The papers included in the appendix are:

(A) Panaretos, V. M., Kraus, D., and Maddocks, J. H. (2010). Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. Journal of the American Statistical Association, 105(490):670–682.

(B) Kraus, D. and Panaretos, V. M. (2012). Dispersion operators and resistant second-order functional data analysis. Biometrika, 99(4):813–832.

(C) Kraus, D. (2015). Components and completion of partially observed functional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):777–801.

(D) Kraus, D. and Stefanucci, M. (2019). Classification of functional fragments by regularized linear classifiers with domain selection. Biometrika, 106(1):161–180.

(E) Kraus, D. (2019). Inferential procedures for partially observed functional data.
Journal of Multivariate Analysis, 173:583–603.

Four papers (A, B, C, D) have been published in the Journal of the American Statistical Association, Biometrika and the Journal of the Royal Statistical Society: Series B (Statistical Methodology), which are regarded by the scientific community as being among the leading 5–7 journals in the field of methodological statistics. Paper E has been published in the Journal of Multivariate Analysis, which is a standard, respected journal in the field. Two papers (C, E) are single-authored; the other three are collaborative with equal contributions of the co-authors. The papers have been published with peer-reviewed supplements, which are included as well.

1.2. Summary of Paper A

Paper A (Panaretos et al., 2010) studies methods of statistical inference on the covariance structure of random functions. Although its main focus is the development of statistical methodology and related theory, the motivation for this work comes from another field, namely molecular biology. The understanding of the mechanical properties of the DNA molecule constitutes a fundamental biophysical task, as important biological processes can be affected by properties such as stiffness and shape. In addition to holding the genetic code, the DNA base-pair sequence may influence the geometric properties of the molecule. However, empirical detection of this effect on stereological data acquired through the electron microscope had previously been elusive.

The data set of interest consists of closed curves (DNA minicircles obtained from short strands of DNA) in $\mathbb{R}^3$ of two types: both types have identical base-pair sequences, except for a short base-pair window, where two different sequences are present (one of them, a TATA box, is of special interest). Biophysical considerations suggest this will have a significant effect on the geometry of the minicircle, and the goal is to compare these two groups to probe for such an effect.

Motivated by the need for a two-sample comparison of loops, as exemplified in DNA minicircle experiments, this article considers the problem of second-order comparison of two samples of random functions, within a functional data analysis framework. In particular, given realisations of $n_1$ and $n_2$ independent copies of two continuous zero-mean Gaussian processes $X$ and $Y$ on a compact set, we consider the problem of testing the hypothesis that their covariance operators $R_X$, $R_Y$ are equal against the alternative that they are different. Although this problem is now well studied, at the time of writing of this paper it had received relatively little attention. Our paper proposes a test based on the approximation of the Hilbert–Schmidt distance of the empirical covariance operators of the two samples of functions based on the Karhunen–Loève expansion. The asymptotic distribution of the test statistic is determined and its performance is investigated computationally. The application of our methodology to the data set of two groups of minicircles characterized by the presence or absence of a TATA box suggests the potential existence of significant differences between the two groups, which eluded previous analyses, as these focused on the mean (the shape of the minicircle), whereas we detect the differences in the covariance structure (the flexibility/stiffness).

Let us give a more detailed description of the contents of Paper A.
Since this work is data-driven, the paper first explains the scientific background and questions in molecular biology and the source, properties and pre-processing of the available data. To perform a functional data analysis of the minicircles it is required to register the data. Each curve has thus been centered and scaled, so that the center of mass is at zero and the length of the curve is one. Since the data were obtained by electron microscopy of minicircles embedded in a liquid, the reconstructed curves are not aligned (they are subject to a random unobservable orthogonal transformation). We describe a procedure that rigidly aligns curves by their intrinsic characteristics: each curve was individually aligned using the coordinate system induced by its moment of inertia tensor. We thus arrive at a functional data set consisting of smooth curves indexed by arc length and taking values in $\mathbb{R}^3$ (corresponding to the coordinates on the three principal axes of inertia).

We assume that we have two independent collections $X_1,\dots,X_{n_1}$ and $Y_1,\dots,Y_{n_2}$ of iid Gaussian processes on $[0,1]$, considered as random elements of the Hilbert space $L^2[0,1]$ of coordinate-wise square-integrable $\mathbb{R}^3$-valued functions with the inner product $\langle f,g\rangle=\int_0^1 f(t)^{\mathsf T}g(t)\,dt$ (but everything readily extends to more general cases). Assuming, without loss of generality, that the mean functions are zero, the processes are characterized by their respective covariance kernels $R_X(s,t)=\operatorname{cov}\{X_i(s),X_i(t)\}=\mathrm{E}\{X_i(s)X_i(t)^{\mathsf T}\}$ and $R_Y(s,t)$. Associated with the covariance kernel is the covariance operator $R_X\colon L^2[0,1]\to L^2[0,1]$ defined by $R_X(f)(t)=\operatorname{cov}\{\langle X_i,f\rangle,X_i(t)\}=\int_0^1 R_X(t,s)f(s)\,ds$. The Karhunen–Loève theorem allows for a representation of the process by a stochastic Fourier series with respect to the orthonormal eigenfunctions $\{\varphi_X^{(j)}\}_{j=1}^{\infty}$ of the operator $R_X$,
$$X_i(t)=\sum_{j=1}^{\infty}\sqrt{\lambda_X^{(j)}}\,\xi_{ij}\,\varphi_X^{(j)}(t),$$
where $\{\lambda_X^{(j)}\}_{j=1}^{\infty}$ is the nonincreasing sequence of corresponding eigenvalues and $\{\xi_{ij}\}$ is an iid array of standard Gaussian random variables.

The empirical covariance operator may be used to "optimally" reduce infinite-dimensional inferential problems to multivariate ones. Letting $\hat R_X$ stand for the empirical covariance operator, we denote its eigenvalues by $\hat\lambda_X^{(k,n_1)}$ and its eigenfunctions by $\hat\varphi_X^{(k,n_1)}$. The finite-dimensional reduction is then achieved by retaining a finite number of principal components $\langle X_i-\bar X,\hat\varphi_X^{(k,n_1)}\rangle$, $k=1,\dots,K$, in lieu of each $X_i$, and similarly for the second sample.

The dimension reduction afforded by the Karhunen–Loève expansion is the tool we employ to construct our test. We wish to test the null hypothesis $R_X=R_Y$ against the alternative $R_X\ne R_Y$. We propose the use of a test statistic based on the norm of the difference of the two empirical covariance operators. The Hilbert–Schmidt norm of a trace-class operator $R$ is defined as $\|R\|_{\mathrm{HS}}=[\int_0^1\int_0^1\operatorname{trace}\{R(s,t)^{\mathsf T}R(s,t)\}\,ds\,dt]^{1/2}$. A test may be based on the squared Hilbert–Schmidt distance $\|\hat R_X-\hat R_Y\|_{\mathrm{HS}}^2$. The sampling distribution of this quantity will depend on the unknown covariance operators even asymptotically. To be able to "normalize" the test statistic, we employ the property that for any orthonormal system $\{e_i\}$ of $L^2[0,1]$ we have $\|R\|_{\mathrm{HS}}^2=\sum_{i=1}^{\infty}\|Re_i\|^2=\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\langle Re_i,e_j\rangle^2$, applied here with $R=\hat R_X-\hat R_Y$. In practice, we need to truncate the series to obtain a finite-dimensional reduction and choose the contrasts $\{e_i\}$ so that the truncation retains the bulk of the norm.
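The following minimal numerical sketch (illustrative names, not the authors' code) computes such a truncated squared Hilbert–Schmidt distance for curves discretized on a common grid, using as contrasts the leading eigenfunctions of the pooled empirical covariance operator, the choice discussed next; the rescaling of the individual terms by their estimated asymptotic covariances, which yields the chi-squared calibration of the actual test statistic, is omitted.

```python
import numpy as np

def empirical_covariance(curves):
    """Empirical covariance kernel of curves sampled on a common grid;
    `curves` is an (n, T) array with one discretized curve per row."""
    centred = curves - curves.mean(axis=0)
    return centred.T @ centred / curves.shape[0]

def truncated_hs_statistic(x_curves, y_curves, grid_step, K=3):
    """Squared Hilbert-Schmidt distance of the two empirical covariance
    operators, truncated to the span of the first K eigenfunctions of the
    pooled empirical covariance operator."""
    n1, n2 = x_curves.shape[0], y_curves.shape[0]
    Rx = empirical_covariance(x_curves)
    Ry = empirical_covariance(y_curves)
    pooled = (n1 * Rx + n2 * Ry) / (n1 + n2)
    # eigenvectors of the discretized pooled operator, largest eigenvalues first,
    # rescaled to be orthonormal in L2 rather than in the Euclidean sense
    _, evecs = np.linalg.eigh(pooled * grid_step)
    basis = evecs[:, ::-1][:, :K] / np.sqrt(grid_step)
    D = Rx - Ry
    coeffs = grid_step**2 * (basis.T @ D @ basis)   # <(Rx - Ry) e_i, e_j>
    return np.sum(coeffs**2)

# toy usage with simulated zero-mean Gaussian curves on [0, 1]
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 101)
X = np.array([rng.normal(0, 1.0) * np.sin(2 * np.pi * t)
              + rng.normal(0, 0.5) * np.cos(2 * np.pi * t) for _ in range(50)])
Y = np.array([rng.normal(0, 1.5) * np.sin(2 * np.pi * t)
              + rng.normal(0, 0.5) * np.cos(2 * np.pi * t) for _ in range(50)])
print(truncated_hs_statistic(X, Y, grid_step=t[1] - t[0], K=2))
```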
For each of the two empirical operators, the optimal contrasts would coincide with their own eigenfunctions, but we need to use a common basis. We thus choose the eigenfunctions $\hat\varphi_{XY}^{(k,N)}$ of the empirical covariance operator of the pooled sample as a compromise for the common coordinate system. Our proposed test statistic is a linear combination of the terms $\langle(\hat R_X-\hat R_Y)\hat\varphi_{XY}^{(i,N)},\hat\varphi_{XY}^{(j,N)}\rangle^2$, $i,j=1,\dots,K$, with weights corresponding to their asymptotic covariance structure. Theorem 1 in the paper shows that under the null hypothesis and certain assumptions this test statistic is asymptotically chi-squared distributed with $K(K+1)/2$ degrees of freedom, which is the basis of a hypothesis test.

The paper then introduces a modified test statistic that can be useful when one knows a priori that the eigenfunctions of both covariance operators are equal. Then one can focus only on the diagonal terms (those with $i=j$), which leads to a test statistic with an asymptotic chi-squared distribution with $K$ degrees of freedom. Furthermore, we consider variance-stabilized variants of these statistics, where we apply a log transformation to the diagonal terms and Fisher's z-transformation to the off-diagonal terms. We then discuss methods to choose the truncation level.

To assess the behaviour of the proposed tests under the null hypothesis and under various alternatives we carry out a number of simulations. We consider one situation with equal covariance functions and several alternative configurations. The general and diagonal test statistics are considered under various fixed choices of $K$ and with automatically selected $K$. The study provides useful insight into the performance and capabilities of the tests depending on the type of deviation from the null hypothesis.

Next, we present an analysis of the data set of DNA minicircles. First, we show both graphically and numerically that there is no important difference between the means of the two types of curves. Then we focus on the comparison of their second-order properties. The analysis shows a significant difference on the third (most important) principal axis of inertia and also jointly in the plane given by the third and second axes.

A proof of Theorem 1 is provided in the Appendix. Additional plots and tables are available in a supplementary file. In addition, the supplementary file contains a more detailed study of the problem of comparing the complete spectrum.

1.3. Summary of Paper B

Paper B (Kraus and Panaretos, 2012) focuses on the second-order structure of a random function, which is key to understanding the nature of the functional observations that it induces, as it is inextricably linked with the smoothness properties of the stochastic fluctuations of the function. These second-order properties are encapsulated in the covariance operator. The link with the smoothness properties of the random function is then given by the Karhunen–Loève expansion, which provides an optimal Fourier representation of the random function, using a basis comprised of the eigenfunctions of this operator. A natural inference problem is that of comparing the covariance structures of two samples of functional data, in order to decide whether they share the same fluctuation properties. We focus on situations where the data are not Gaussian and indeed may be characterized by the presence of influential observations.
The infinite-dimensional nature of the data means that an observation can be atypical in many ways, deviation from the mean being only one of them; observations close to the mean may contain unusual frequency components. Detection of such observations via exploratory techniques may be non-trivial. Such influential observations might significantly influence the estimation of the covariance and, even more profoundly, the quality of the estimators of its spectrum. The sensitivity of the empirical covariance operator and its spectrum to the presence of influential observations can in turn affect testing procedures for the covariance operator.

To cope with these issues, this paper introduces a class of operators that we term dispersion operators, which are implicitly defined through a variational problem motivated by M-estimators of location applied to the tensor product of the centred functional observations. It is then proposed that these operators be used as proxies for the covariance operator when inferences on the second-order structure are to be drawn for non-Gaussian and potentially contaminated functional samples. The implicit definition of a dispersion operator gives rise to a score equation, as the dispersion operator is a zero of the Fréchet derivative of the variational problem with respect to the operator argument. This functional score equation is then used as a basis to construct a test for the second-order comparison of two functional samples. The test is based on the distance of the functional score equation under the null hypothesis from zero, measured by an appropriately renormalized Hilbert–Schmidt distance. This work is motivated by and illustrated on a data set of DNA strands, which indeed is contaminated by atypical curves.

We now recapitulate the contributions of Paper B in more detail. First, the paper introduces the notion of a dispersion operator as a substitute for the usual covariance operator that is more suitable for contaminated data while still characterizing the second-order structure of the random function. To describe the second-order properties of a random element $X$ in a separable Hilbert space $H$ (without loss of generality $L^2[0,1]$), one typically considers the covariance operator $C=\mathrm{E}\{(X-\mu)\otimes(X-\mu)\}$, where $\otimes$ stands for the tensor product and $\mu=\mathrm{E}(X)$ is the mean. The covariance operator can be seen as the Hilbert–Schmidt operator that solves the variational problem
$$\min_{R\in\mathrm{HS}(H,H)}\mathrm{E}\{\|(X-\mu)\otimes(X-\mu)-R\|^2\}$$
($\mathrm{HS}(H,H)$ denotes the space of Hilbert–Schmidt operators from $H$ to $H$). The empirical covariance operator can be represented as the solution to the above optimization problem with the expectation computed with respect to the empirical distribution of the data. This being essentially a least squares problem, both the empirical covariance operator and methods based on it will be sensitive to the presence of atypical observations in the data set.

We obtain procedures pertaining to the second-order structure of $X$ that are more resistant to departures from normality and to the presence of influential observations by replacing the squared norm in the variational problem defining the covariance by a less sensitive loss function. This gives rise to a new class of second-order characteristics, which we call dispersion operators. Within this class, the most useful new choice of the loss function leads to what we call the spatial dispersion operator.
It is defined via M-estimation of the location of $(X-\mu)\otimes(X-\mu)$ as
$$\operatorname*{arg\,min}_{R\in\mathrm{HS}(H,H)}\mathrm{E}\{\|(X-\mu)\otimes(X-\mu)-R\|-\|(X-\mu)\otimes(X-\mu)\|\},$$
where $\mu$ is a suitable element of $H$ with the interpretation of a location parameter (the spatial median is a natural choice). The empirical spatial dispersion operator minimizes the sample version of the objective. By taking the Fréchet derivative we arrive at an equivalent definition of the dispersion operator as a Z-estimator solving a score equation. Proposition 1 in the paper establishes the existence and uniqueness of the (population) dispersion operator under non-restrictive assumptions on the data-generating distribution. In Corollary 1 we show that the sample dispersion operator exists and is unique under weak assumptions on the observed data, and that it is consistent for the true dispersion operator. We continue our theoretical analysis by showing an interesting link between the spectra of the dispersion and covariance operators. Although the operators are in general different, they both carry useful information on second-order properties. Proposition 2 shows that the dispersion operator has the same set of eigenfunctions as the covariance operator.

Having defined the notion of a dispersion operator, we then construct a two-sample second-order test based upon it. Let there be two independent random samples of functions, whose location parameters are $\mu_1,\mu_2$ and dispersion operators are $R_1,R_2$. The goal is to test the null hypothesis $H_0\colon R_1=R_2$ against the general alternative $H_1\colon R_1\ne R_2$. We propose to employ the general idea of score tests, that is, to base the test on the estimating score for the general model, without assuming $H_0$, evaluated at the null estimate of the parameter. As the centres $\mu_1,\mu_2$ are not restricted under the null hypothesis, they can be estimated separately. On the other hand, the common dispersion under the null is estimated by $\hat R$, which minimizes a combination of the objectives for the two samples under the restriction induced by the null. Equivalently, $\hat R$ solves a score equation under $H_0$. After a reparametrization, we arrive at a score operator whose component corresponding to the difference between the two dispersion operators reflects the validity of $H_0$. When the null hypothesis holds, the score operator is expected to be close to the zero operator; otherwise it should be far from it. To perform the test, we need to measure its distance from the zero operator and assess the significance of the resulting test statistic. We develop one particular way of doing this. It is based on spectral truncation of the score operator, which is an infinite-dimensional object (a Hilbert–Schmidt operator on $H$). We use a projection of this operator on a finite-dimensional subspace, in particular the one spanned by tensor products of the eigenfunctions of the dispersion operator. The test statistic is then obtained by combining the projection coefficients in a quadratic form. Theorem 1 establishes the weak convergence of the score operator to a mean-zero Gaussian random operator under the null hypothesis and provides a consistent estimator of its covariance operator (which is an operator on operators). It then provides the asymptotic null distribution of the score test statistic.
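As an illustration, here is a minimal sketch (illustrative names, not the authors' implementation) of the empirical spatial dispersion operator for discretized curves: it is the spatial median, in the Hilbert–Schmidt (here Frobenius) geometry and computed by a Weiszfeld-type iteration, of the rank-one tensors $(X_i-\hat\mu)\otimes(X_i-\hat\mu)$, with the centre $\hat\mu$ taken to be the spatial median of the curves; the spectral truncation and the score test built on top of it are not shown.

```python
import numpy as np

def geometric_median(points, n_iter=200, tol=1e-8):
    """Weiszfeld iteration for the spatial median of points in a Euclidean
    (here: discretized Hilbert) space; `points` has one element per row."""
    m = points.mean(axis=0)                       # starting value
    for _ in range(n_iter):
        dist = np.linalg.norm(points - m, axis=1)
        dist = np.maximum(dist, tol)              # avoid division by zero
        w = 1.0 / dist
        m_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol * (1 + np.linalg.norm(m)):
            return m_new
        m = m_new
    return m

def spatial_dispersion(curves):
    """Empirical spatial dispersion operator of discretized curves (n, T):
    the spatial median of the rank-one operators (X_i - mu)(X_i - mu)^T,
    with mu the spatial median of the curves themselves."""
    n, T = curves.shape
    mu = geometric_median(curves)
    tensors = np.array([np.outer(x - mu, x - mu).ravel() for x in curves])
    return geometric_median(tensors).reshape(T, T)

# toy usage: a Gaussian sample contaminated by one atypical curve
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)
curves = np.array([rng.normal() * np.sin(2 * np.pi * t) for _ in range(40)])
curves[0] += 20 * np.cos(4 * np.pi * t)
D = spatial_dispersion(curves)
print(D.shape)
```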
Next, the paper presents empirical results. In a simulation study, we investigate the behaviour of the test based on the spatial dispersion and of the non-resistant $L^2$ test under the null hypothesis, with and without contamination, and the impact of contamination on the power of these tests under various alternative and contamination scenarios. We also apply the proposed methodology to the data set of DNA minicircles studied in Paper A. The supplementary document contains proofs of the theoretical results.

1.4. Summary of Paper C

It is standard in the field of functional data analysis to assume that all functions are observed on the same domain. In Paper C (Kraus, 2015), we develop methods of analysis for functional data that are observed incompletely in the sense that each function might be observed only on a subset of the domain, whereas no information about the curve is available on the complement of this subset.

Our work is motivated by an ambulatory blood pressure monitoring data set that is part of the "Swiss Kidney Project on Genes in Hypertension." The data set consists of automatically recorded temporal heart rate profiles of several hundred participants. Due to either the failure of the recording device or the participant's discomfort, some values have not been measured, and the time points corresponding to unobserved values form series (intervals) of non-negligible length. The resulting data set thus consists of partially observed curves (functional fragments). Since there is only a relatively small fraction of complete curves, removing incomplete curves would considerably reduce (and possibly destroy) the accuracy of the statistical analysis. Therefore, this type of functional data necessitates the development of special methodological approaches, which is the subject of this paper. Before the appearance of this paper, relatively little work had been published on missing data in the functional context.

In this paper we introduce a formal framework for analysing incompletely observed functional data and develop basic nonparametric, fully functional (infinite-dimensional) inferential procedures. We first focus on the main building blocks of the analysis of the second-order properties: estimation of the covariance operator and principal component analysis. We propose an estimator of the covariance operator and of its eigenvalues and eigenfunctions for partially observed functions and derive their properties. We deal with the estimation of projections (principal scores) of individual incomplete functions, which is especially challenging. We develop a procedure that makes it possible to predict the value of a principal score of a function when only a fragment of the function is available and direct computation is thus impossible. Next, we propose a method that can recover the unobserved part of a function from the observed part, using information about the distribution of the data that it learns from the sample. We develop an automatic procedure, based on generalised cross-validation adapted to incompletely observed functions, for the selection of the tuning parameter of the method. We quantify the uncertainty of the predictions of unobserved quantities and provide approximate prediction regions (intervals and bands) covering the unobserved random quantity with high probability. Simulations confirm the usefulness and good performance of the proposed methodology.

We now describe the main methodological, theoretical and numerical contributions of Paper C. First, the paper formalizes the framework of partially observed functional data.
Functional data $X_1,\dots,X_n$ are seen as independent identically distributed random variables in the separable Hilbert space of square-integrable functions on a bounded domain. Without loss of generality, we consider the space $L^2([0,1])$ with inner product $\langle f,g\rangle=\int_0^1 f(t)g(t)\,dt$, $f,g\in L^2([0,1])$, and norm $\|f\|=\langle f,f\rangle^{1/2}$. In traditional functional data analysis, it is assumed that the functions $X_1,\dots,X_n$ are observed on the whole interval $[0,1]$. We consider situations where each curve $X_i$ is observed only on a subset of $[0,1]$. Specifically, let the observation periods be $O_i\subset[0,1]$, $i=1,\dots,n$. Then the observed data for the $i$th curve are $X_i(t)$, $t\in O_i$. We collectively denote the observed part of the curve as $X_{iO_i}$, which can be seen as a random element of the space $L^2(O_i)$. The values of $X_i$ on the complement of $O_i$, $M_i=[0,1]\setminus O_i$, are not observed; the missing part of the trajectory is denoted by $X_{iM_i}$. The observation periods $O_i$, $i=1,\dots,n$, are modelled as random subsets of $[0,1]$. We assume that the observation periods are independent of the functions $X_1,\dots,X_n$, that is, the data are missing completely at random.

Next, the paper focuses on the estimation of the main characteristics of the distribution that generates the data, that is, the mean function and the covariance operator. Let the mean function be $\mu=\mathrm{E}X_1$. The covariance operator $R\colon L^2([0,1])\to L^2([0,1])$ is defined as $Rf=\mathrm{E}\{\langle f,X_1-\mu\rangle(X_1-\mu)\}=\int_0^1\rho(\cdot,t)f(t)\,dt$, where $\rho(s,t)=\operatorname{cov}\{X_1(s),X_1(t)\}$ is the covariance kernel of the stochastic process $X_1$. As in the multivariate case, the mean function $\mu$ at a point $t\in[0,1]$ can be estimated by the sample mean of the values observed at this point. The estimator $\hat R$ of the covariance operator $R$ is defined through an estimator of its covariance kernel $\rho$. We estimate $\rho(s,t)$ by the sample covariance computed from all complete pairs of functional values at $s$ and $t$. It is seen that $\hat\mu(t)$ is an unbiased estimator of $\mu(t)$. Similarly, if we subtract 1 in the denominator of $\hat\rho(s,t)$, the estimator becomes unbiased for $\rho(s,t)$. For the estimators $\hat\mu$ and $\hat R$ to be consistent, we need to assume that the observation pattern asymptotically provides enough information. The exact formulation of such assumptions is provided in equations (2) and (3) in the paper. Under these weak assumptions, we obtain a consistency result in Proposition 1 of the paper. In particular, we show that the $L^2$ distance between $\hat\mu$ and $\mu$ and the Hilbert–Schmidt distance between $\hat R$ and $R$ converge to zero in quadratic mean (and hence in probability). Interestingly, the properties of the estimators are unaffected by the fact that the functions are observed only partially: the full (dense) observation regime, albeit only on subsets of the domain, preserves the convergence rates known for complete functional data.

The paper then focuses on principal component analysis, which is probably the most fundamental method for functional data since it provides insight into the complex covariance structure of functional data and can be used to identify the main sources of variability, quantify their importance and reduce the dimension of the data. The theoretical foundation of functional principal component analysis is the Karhunen–Loève theorem stating that there exist random variables $\beta_{ij}$ and nonrandom functions $\varphi_j$ such that the stochastic process $X_i$ admits the decomposition $X_i(t)=\mu(t)+\sum_{j=1}^{\infty}\beta_{ij}\varphi_j(t)$, $t\in[0,1]$, where the series converges in mean square, uniformly in $t$. Here $\varphi_j$, $j=1,2,\dots$,
are the orthonormal eigenfunctions of the operator $R$ and $\beta_{ij}$, $j=1,2,\dots$, are uncorrelated mean-zero variables with variances $\lambda_j$, where $\lambda_1\ge\lambda_2\ge\dots>0$ are the eigenvalues of $R$. Functional principal component analysis is the empirical version of the Karhunen–Loève expansion that aims to estimate the elements involved in the expansion. In the case of completely observed functional data, to estimate the eigenvalues $\lambda_j$ and eigenfunctions $\varphi_j$, one performs the eigendecomposition of the usual sample covariance operator. When the functions are observed partially, one can proceed similarly and define the estimators $\hat\lambda_j$ and $\hat\varphi_j$ as the eigenvalues and eigenfunctions of the operator $\hat R$ given by the kernel $\hat\rho$. The paper shows that the asymptotic properties of the empirical eigenvalues and eigenfunctions remain unchanged by the incompleteness of the observed functions. Proposition 2 in the paper establishes that, first, the empirical eigenvalues are consistent estimators of the true eigenvalues and this consistency is uniform over all indices, and, second, the empirical eigenfunctions are consistent estimators of the true eigenfunctions, up to the usual sign ambiguity. The rates of convergence are parametric due to the full observation regime on subsets.

The paper then moves to the most challenging contributions, which are methods of inference for individual curves based on their incomplete observation. These are prediction rather than estimation problems since they aim to provide information on random targets: the principal scores $\beta_{ij}$ and the missing part of the curve $X_{iM_i}$. In the standard situation of complete functional data, the scores are easily estimated by $\hat\beta_{ij}=\langle X_i-\hat\mu,\hat\varphi_j\rangle$. When the functional observations are incomplete, the direct computation of $\langle X_i-\hat\mu,\hat\varphi_j\rangle$ is impossible because the last term in the expression $\langle X_i-\hat\mu,\hat\varphi_j\rangle=\langle X_{iO_i}-\hat\mu_{O_i},\hat\varphi_{jO_i}\rangle+\langle X_{iM_i}-\hat\mu_{M_i},\hat\varphi_{jM_i}\rangle$ is not available. In Section 3.2 of the paper we develop a procedure to estimate (or rather predict) the missing quantity $\langle X_{iM_i}-\hat\mu_{M_i},\hat\varphi_{jM_i}\rangle$ from the observed data and establish its theoretical properties (Theorem 1, Proposition 3). We skip the description of this part in this summary and instead describe the results on the prediction of the missing part of an incomplete curve.

This task of function reconstruction (completion) is studied in Section 4 of the paper. In the population version of the problem, the best prediction of $X_M$ by a function of $X_O$ in the sense of the mean integrated squared prediction error is the conditional expectation $\mathrm{E}(X_M\mid X_O)$. It is in general a nonlinear operator from $L^2(O)$ to $L^2(M)$ and, similarly to the case of principal scores, we consider its best continuous linear approximation. Assuming for simplicity that the functional variable has mean zero, the minimisation problem to be solved is $\min_{\mathcal{A}\colon\|\mathcal{A}\|_\infty<\infty}\mathrm{E}\|X_M-\mathcal{A}X_O\|^2$, where the solution is sought in the class of continuous (bounded) linear operators from $L^2(O)$ to $L^2(M)$ (by $\|\cdot\|_\infty$ we denote the operator norm). We see (by Fréchet differentiation or direct computation) that solving this minimisation is equivalent to solving the (normal) equation $\mathcal{A}R_{OO}=R_{MO}$. This suggests the solution $\tilde{\mathcal{A}}=R_{MO}R_{OO}^{-1}$ and the best linear prediction of $X_M$ in the form $\tilde X_M=\tilde{\mathcal{A}}X_O$. From now on, we assume the existence of a bounded solution, that is, we assume that $\|R_{MO}R_{OO}^{-1}\|_\infty<\infty$.
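For concreteness, a minimal numerical sketch follows (illustrative names; curves discretized on a common grid with NaN marking unobserved points): the mean is the pointwise average of available values, the covariance kernel is the pairwise-complete empirical covariance, and a missing segment of one curve is filled in with the ridge-regularized operator $\hat R_{MO}(\hat R_{OO}+\alpha I)^{-1}$ introduced in the next paragraph; the grid step is absorbed into $\alpha$, and the generalized cross-validation choice of $\alpha$ is omitted.

```python
import numpy as np

def fragment_mean_cov(curves):
    """Pointwise mean and pairwise-complete covariance kernel of partially
    observed curves; `curves` is an (n, T) array with NaN where unobserved.
    Assumes every pair of grid points is jointly observed by some curve."""
    obs = ~np.isnan(curves)
    mu = np.where(obs, curves, 0.0).sum(axis=0) / obs.sum(axis=0)
    centred = np.where(obs, curves - mu, 0.0)
    pair_counts = obs.T.astype(float) @ obs.astype(float)
    rho = (centred.T @ centred) / pair_counts     # complete pairs at (s, t)
    return mu, rho

def ridge_complete(x, mu, rho, alpha=1e-2):
    """Ridge-regularized completion of one partially observed curve x
    (length T, NaN on its missing part M), given estimates mu and rho."""
    O = ~np.isnan(x)
    M = ~O
    R_OO = rho[np.ix_(O, O)]
    R_MO = rho[np.ix_(M, O)]
    # A = R_MO (R_OO + alpha I)^{-1}, computed via a linear solve
    A = np.linalg.solve(R_OO + alpha * np.eye(O.sum()), R_MO.T).T
    completed = x.copy()
    completed[M] = mu[M] + A @ (x[O] - mu[O])
    return completed

# toy usage: reconstruct the missing right half of the first curve
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 60)
data = np.array([rng.normal() * np.sin(2 * np.pi * t)
                 + rng.normal() * np.cos(2 * np.pi * t) for _ in range(100)])
data[0, 30:] = np.nan
mu_hat, rho_hat = fragment_mean_cov(data)
x_completed = ridge_complete(data[0], mu_hat, rho_hat, alpha=1e-2)
```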
Similarly to the case of principal scores, the inverse problem $\mathcal{A}R_{OO}=R_{MO}$ to be solved is ill-posed, that is, small perturbations of the right-hand side $R_{MO}$ can lead to large perturbations of the solution (recall that $R_{OO}$ is compact, hence its inverse is unbounded); perturbations of the right-hand side indeed need to be considered since $R_{MO}$ will only be estimated from the data in the sample version of the problem. Regularization (i.e., modification of an ill-posed inverse problem into a well-posed one) is necessary for a stable solution. Using ridge regularisation we obtain the solution $\tilde{\mathcal{A}}^{(\alpha)}=R_{MO}(R_{OO}+\alpha I_O)^{-1}$ ($\alpha>0$ is a regularization parameter, $I_O$ is the identity operator on $L^2(O)$). The regularised best linear prediction equals $\tilde X_M^{(\alpha)}=\tilde{\mathcal{A}}^{(\alpha)}X_O$. Practically, when the sample $X_{1O_1},\dots,X_{nO_n}$ is observed on the subsets $O_1,\dots,O_n$, we replace the covariance operator by its estimate and set $\hat{\mathcal{A}}_i^{(\alpha)}=\hat R_{M_iO_i}(\hat R_{O_iO_i}+\alpha I_{O_i})^{-1}$. The mean function needs to be estimated as well. For the $i$th curve, the best linear prediction of $X_{iM_i}$ is estimated by $\hat X_{iM_i}^{(\alpha)}=\hat\mu_{M_i}+\hat{\mathcal{A}}_i^{(\alpha)}(X_{iO_i}-\hat\mu_{O_i})$.

Under the assumption that the optimal reconstruction operator $\tilde{\mathcal{A}}$ is Hilbert–Schmidt, Theorem 1 of the paper proves the consistency of the estimated best linear reconstruction. That is, we show that, as the size of the training sample increases and the amount of regularization decreases, the $L^2$-distance between the theoretical best reconstruction and its regularized estimate converges to zero in quadratic mean, and we provide the rate of this convergence. It was later pointed out in the literature that our results are obtained under unnecessarily strong assumptions. Therefore, in a follow-up paper (Kraus and Stefanucci, 2020, not included here), we generalize the consistency result by relaxing the assumption that the true optimal linear reconstruction operator is Hilbert–Schmidt. It turns out that it is not even necessary to assume that the optimal reconstruction operator is bounded, and the ridge regularization method (whose regularized operator is Hilbert–Schmidt) still performs optimally in the limit. The follow-up paper explains this in the context of reproducing kernel Hilbert space theory.

The paper provides an estimator of the asymptotic covariance operator of the predictive distribution (the error between the prediction and the target random process) and proves its consistency (Proposition 5). This enables the construction of prediction intervals. To address the problem of the selection of the regularization parameter $\alpha$, the paper develops a generalized cross-validation procedure for partially observed data. A simulation study is carried out to address the following goals: to investigate the performance of generalized cross-validation as a selector of the regularization parameter, to verify the validity and accuracy of the prediction intervals and bands, and to explore the effect of the observation pattern. Finally, the performance of the proposed methodology is illustrated on the motivating data set of incomplete heart rate temporal profiles. Proofs of all formal statements are provided in the appendix and in a supplement.

1.5. Summary of Paper D

In Paper D (Kraus and Stefanucci, 2019), we consider the classification of a functional observation into one of two groups. We formulate the theoretical (population) problem of determining the best classifier as a quadratic optimization problem on a function space, or, equivalently, as a linear inverse problem.
These problems are ill-posed but, unlike in most inverse problems, this is not a complication but rather an advantage, in the sense that the more ill-posed the problem is, the better the optimal misclassification probability. We use regularization techniques, such as the method of conjugate gradients with early stopping and ridge regularization, to solve the optimization problem, yielding a class of regularized linear classifiers. The optimal misclassification rate is the limit along the regularization path of solutions, which themselves may not converge. We study the empirical (sample) version of the problem, where the objective function in the constrained minimization must be estimated from finite training data. We show that it is possible to construct an empirical regularization path towards the possibly non-existent unconstrained solution so that the classification error converges to its best value, possibly zero. We do this for conjugate gradient, principal component and ridge classification, in a truly infinite-dimensional manner, in the sense that the convergence takes place along a path with decreasing regularization and holds without restrictions on the mean difference between classes.

All our methodology and theory is developed in the setting of partially observed functional data, where trajectories are observed only on subsets of the domain. The principal difficulty for inference with fragments is that temporal averaging is precluded by the incompleteness of the observed functions. Our formulation as an optimization problem enables us to overcome this issue under certain assumptions because only averaging across individuals in the training data is needed, not averaging along individual curves. We propose a domain selection strategy that looks for the best classifier with the domain ranging from a minimum common domain of the training sample to the entire domain of the function to be classified. Our simulation study confirms that domain selection can considerably reduce the misclassification rate. Further simulations compare the performance of the three types of regularization. Among other findings, this study shows that the principal component and conjugate gradient classifiers often achieve comparable error rates but the latter usually needs a lower dimension of the regularization subspace, in agreement with a theoretical result we provide. Application to a data set on the geometric features of the internal carotid artery in patients with and without aneurysm demonstrates the utility of the proposed methodology.

A more detailed overview of the results of Paper D follows. We consider the classification of a Gaussian random function, $X$, into one of two groups of Gaussian random functions. Group 0 has mean $\mu_0$, group 1 has mean $\mu_1$, and both groups have covariance operator $R$. We first assume that $\mu_0$, $\mu_1$ and $R$ are known, which corresponds to the asymptotic situation with an infinite training sample. We consider the class of centroid classifiers that are based on one-dimensional projections of the form $\langle X,\psi\rangle$, where $\psi$ is a function in $L^2(I)$. Given $\psi$, the optimal classifier based on $\langle X,\psi\rangle$ assigns $X$ to the class $C_\psi(X)$ given by $C_\psi(X)=\mathbb{1}_{\{T_\psi(X)>0\}}$, where $T_\psi(X)=\langle X-\bar\mu,\psi\rangle\langle\mu,\psi\rangle$ with $\bar\mu=(\mu_0+\mu_1)/2$ and $\mu=\mu_1-\mu_0$. The misclassification probability of this classifier is
$$1-\Phi\left(\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\right).$$
The task of finding the best function $\psi\in L^2(I)$ leads to the maximization of the argument of $\Phi$ above.
We discuss when this problem can be solved within $L^2$ (i.e., there is an $L^2$ function $\psi$ that achieves the best error rate), when it cannot be solved within $L^2$ (i.e., the best error rate is achieved by a linear functional that is unbounded, hence not of the form $\langle X,\psi\rangle$) and what value the optimal error rate can take (remarkably, it may be zero, corresponding to perfect classification). This discussion connects the Hájek–Feldman dichotomy between Gaussian measures, the theory of reproducing kernel Hilbert spaces and constrained convex optimization. The optimization to be solved corresponds to the task of maximizing $\langle\mu,\psi\rangle$ subject to $\langle\psi,R\psi\rangle=1$, which translates into the unconstrained quadratic optimization problem of minimizing $\langle\psi,R\psi\rangle/2-\langle\mu,\psi\rangle$, i.e., into the linear inverse problem $R\psi=\mu$.

This formulation is the starting point for the definition of regularized classifiers. Regardless of whether there is a solution (i.e., whether $\psi=R^{-1}\mu$ exists in $L^2(I)$), one can consider an approximating, regularized problem that can be solved. Regularization is typically used to solve, in a stable way, ill-posed inverse problems whose solution exists. There, the path of regularized solutions converges to the solution of the problem of interest. Here no solution may exist, but paths of regularized solutions towards the possibly non-existent solution still turn out to be useful, since the misclassification probability converges to the optimal value along these paths. We consider three regularization methods: the principal component method (which solves the optimization in a subspace spanned by leading principal components), the conjugate gradient method (which uses the numerical method of conjugate gradients with early stopping) and the ridge method (which solves the optimization in a ball). In Propositions 1 and 3 in the paper we provide an asymptotic analysis of these methods which shows that, as the amount of regularization decreases, the misclassification rate along the regularization path converges to the optimal value. This is true even when there is no bounded solution to the problem (i.e., $R^{-1}\mu\notin L^2(I)$) and also in the "even more ill-posed" case of perfect classification (i.e., $R^{-1/2}\mu\notin L^2(I)$). Proposition 2 compares the two methods that use a subspace for regularization, i.e., principal components and conjugate gradients, and shows that the error rate of the former is always higher than or equal to that of the latter when the same dimension is used.

We then present the empirical version with a finite training data set. Motivated by a medical data set, we do so in the case of incomplete curves. Incompleteness can occur in the training data, with each curve possibly observed on a different domain, and in the new curve we wish to classify. A simple approach would be to consider all curves on the intersection of their observation domains, if it is non-empty, or to discard incomplete curves. However, such restrictions may be too severe and can be avoided. For group $j$ let there be a training sample consisting of $n_j$ independent curves $X_{j1},\dots,X_{jn_j}$ that may be observed incompletely, with values known only on a subset $O_{ji}$ of the domain. Then, similarly to Paper C, the mean $\mu_j$ of group $j$ can be estimated by the cross-sectional average and the covariance kernel $\rho(s,t)$ can be estimated by the empirical covariance using pairwise complete observations of groupwise centred curves. Let the new, independent curve to be classified, $X_{\mathrm{new}}$, be observed on the domain $O_{\mathrm{new}}$.
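Before the formal definition in the next paragraph, here is a minimal sketch of the ridge variant under these conventions (illustrative names; curves discretized on a common grid with NaN on unobserved points): group means and a pooled pairwise-complete covariance kernel are estimated from the possibly incomplete training curves, the projection direction $\hat\psi=(\hat R+\alpha I)^{-1}\hat\mu$ is computed on the observation domain of the new curve, and the curve is assigned according to the sign of $T_{\hat\psi}$; cross-validation of $\alpha$, the other two regularization schemes and the domain-selection step are omitted.

```python
import numpy as np

def group_estimates(group0, group1):
    """Group means and pooled pairwise-complete covariance kernel from two
    (n_j, T) arrays of partially observed curves (NaN = unobserved)."""
    means, centred, masks = [], [], []
    for g in (group0, group1):
        obs = ~np.isnan(g)
        mu = np.where(obs, g, 0.0).sum(axis=0) / obs.sum(axis=0)
        means.append(mu)
        centred.append(np.where(obs, g - mu, 0.0))
        masks.append(obs.astype(float))
    Xc, Mk = np.vstack(centred), np.vstack(masks)
    rho = (Xc.T @ Xc) / (Mk.T @ Mk)        # pooled pairwise-complete covariance
    return means[0], means[1], rho

def ridge_centroid_classify(x_new, mu0, mu1, rho, alpha=1e-2):
    """Assign x_new (NaN on its unobserved part) to group 0 or 1 with the
    ridge-regularized centroid classifier restricted to its observation domain;
    the grid step cancels from the sign of the decision statistic."""
    O = ~np.isnan(x_new)
    mu = (mu1 - mu0)[O]                              # mean difference on O
    psi = np.linalg.solve(rho[np.ix_(O, O)] + alpha * np.eye(O.sum()), mu)
    T = np.dot(x_new[O] - (mu0 + mu1)[O] / 2, psi) * np.dot(mu, psi)
    return int(T > 0)
```

In practice $\alpha$ would be chosen by cross-validation of the training misclassification rate, and the same functions can be rerun on candidate subdomains to mimic the domain-selection step discussed below.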
The empirical classifier $\hat C_{\hat\psi}$ trained on partially observed curves is defined like the theoretical one but with the unknown quantities replaced by their estimators. The projection direction $\hat\psi$ is constructed by conjugate gradient, principal component or ridge regularization applied to the estimates $\hat\mu$ and $\hat R$ (defined through the estimated kernel $\hat\rho(s,t)$), restricted to the domain of the new curve to be classified (or, possibly, a subset of that domain). In the theoretical analysis, we study the behaviour of the classifiers for incomplete training samples of increasing size with a decreasing amount of regularization. We study the conjugate gradient method with an increasing number of steps, the principal component method with an increasing number of eigenfunctions and the ridge method with a decreasing ridge parameter in Theorems 1, 2 and 3, respectively. The theorems show that under specific regularity conditions they all asymptotically achieve the optimal (Bayes) misclassification probability along the empirical regularization path, as if there were infinite training data. This holds regardless of whether the theoretical best projection classifier exists as a bounded linear functional and whether the best error rate is positive or zero. Similarly to the problem of function reconstruction in Paper C and the follow-up paper Kraus and Stefanucci (2020), classification is also a prediction rather than an estimation task, and we observe a similar interesting phenomenon involving a possibly non-convergent regularization path along which the predictive performance converges to its optimum.

Further, we propose a domain selection procedure that aims to find the best domain on which the classification is performed. The method searches for the best domain between two extremes, the common domain of all training curves and the domain of the curve to be classified, in order to capture the region of the domain where the two classes are best discriminated.

The numerical part of the paper presents a simulation study, which compares the behaviour of the different regularization methods, investigates the performance of cross-validation for the selection of the regularization parameters, studies the impact of partial observation and demonstrates the usefulness of the domain selection procedure. In a data example, we analyze a set of curves describing blood vessel morphology in persons with and without aneurysm. The analysis shows an improvement of classification accuracy in comparison with existing methods due to the use of incomplete data and domain selection. Further generalizations and numerical results are contained in the supplementary document.

1.6. Summary of Paper E

Inspired by the data set of heart rate profiles, Paper E (Kraus, 2019) deals with another aspect of partially observed functional data. Although some advanced procedures, such as goodness-of-fit tests, regression, classification and reconstruction methods, had been developed for functional fragments, basic methods of inference about the fundamental characteristics of functional variables were still missing at the time of writing. In particular, the asymptotic distribution of estimators of the mean function and covariance operator, $K$-sample tests of equal means or covariances, and confidence intervals for eigenvalues and eigenfunctions had not yet been studied in the setting of incomplete functions.
Users wishing to perform these basic tasks had only one option: to omit the partially observed functions and apply existing procedures to the complete data only. This approach is not only clearly sub-optimal due to a possibly large loss of information and the resulting loss of power and accuracy, but also hardly or not at all applicable in situations where the data contain few or no complete curves. In this paper, we address this deficiency of existing methodology and develop essential methods of inference about the mean and covariance structure of incomplete functional data. We find appropriate assumptions on the observation pattern that enable us to establish the asymptotic distribution of the estimators of $\mu$ and $R$. We develop tests for comparing the mean functions in $K$ populations of functional data based on samples of fragments. Next, we propose several tests of equal covariance operators in $K$ samples. We also construct confidence intervals for the eigenvalues and eigenfunctions estimated from incomplete data.

The practical implementation of methods for functional fragments is more complicated than for complete curves. The main difficulty is that temporal averaging (e.g., in inner products for dimension reduction) is impossible due to missing values. This leads to asymptotic distributions whose parameters follow rather complicated formulas. More importantly, since dimension reduction is not possible, the asymptotic distributions are, upon discretization, characterized by large objects (matrices or arrays) that are difficult or even impossible to store and manipulate in computer memory. The bootstrap turns out to be a solution to this problem. We provide specific algorithms for resampling functional fragments for mean and covariance testing and for confidence intervals for eigenelements. Our simulation study shows that the proposed methods are superior to the only previously available approach, based on omitting incomplete curves.

Let us now describe the contributions of Paper E more specifically. First, we focus on inference about the mean of functional data. We consider estimation of the mean function $\mu$ by the cross-sectional average of the available observations, as before. In Kraus (2015, Proposition 1) (Paper C) it was shown that under non-restrictive assumptions on the observation pattern such an estimator, $\hat\mu$, is consistent. Paper E goes further and provides the asymptotic distribution of the estimator, which is essential in the derivation of the limiting distributions of test statistics. The paper introduces sets of conditions on the observation pattern. Then it is shown in Theorem 1 that the estimator $\hat\mu$ is asymptotically distributed as a Gaussian process, and a consistent estimator of the limiting covariance operator is provided.

Next, we consider $K$ independent samples of incompletely observed functional data. Our aim is to test the null hypothesis that all $K$ mean functions are equal against the general alternative that the null does not hold. In the literature on complete functional samples there exist two main approaches to comparing mean functions: one is based on the $L^2$ distance between the means and one uses projections on finite-dimensional subspaces. We explore both approaches in the fragmentary setting. Test statistics are constructed and their null asymptotic distributions are obtained under appropriate assumptions.

Next, we develop methods of second-order inference for functional fragments.
The covariance function $\rho(s,t)$ can be estimated by the empirical covariance using pairwise complete observations. We previously showed that under certain assumptions on the observation pattern the operator $\hat R$ with kernel $\hat\rho(s,t)$ consistently estimates $R$. Paper E provides a deeper asymptotic study. We determine conditions on the pattern of missingness that guarantee the weak convergence of the properly normalized difference between $\hat R$ and $R$ to a Gaussian random operator (Theorem 3). These conditions in particular do not require the existence of any completely observed curves in the data. An estimator of the limiting covariance structure is provided. Then we study the estimators $\hat\lambda_m$ and $\hat\varphi_m$ of the eigenvalues and eigenfunctions of $R$. The estimators are obtained by the eigendecomposition of $\hat R$. Theorem 4 establishes their asymptotic distributions with the help of perturbation theory. The theorem generalizes the classic results for completely observed functions. Next, we study tests for the equality of covariance operators of several populations. Tests of this null hypothesis can be based on the differences between the estimators $\hat R_j$ and a null estimator $\hat R$. We propose two types of tests measuring the importance of these contrasts: one approach is based on the Hilbert–Schmidt norm of the contrasts and one is based on their projections on a subspace. We give the asymptotic distributions of the Hilbert–Schmidt and projection statistics in Theorem 5. As an alternative, we explore an approach (previously proposed by other authors for complete curves) that takes into account the fact that covariance operators do not form a linear subspace of the Hilbert space of Hilbert–Schmidt operators and uses the square-root distance instead of the difference of covariances.

Section 4 of the paper deals with practical issues that arise due to partial observation. Functional data procedures are implemented in practice by discretization. Functions then correspond to vectors (possibly with missing values), operators on the function space correspond to matrices and operators on operators correspond to four-way arrays. The direct implementation of the confidence sets and tests using the asymptotic distributions may be excessively demanding in terms of computer memory, especially in the case of covariance inference. Projection covariance tests for complete functions can avoid the computation, storage and manipulation of large arrays by computing the principal scores of each function with respect to the required low number $d$ of eigenfunctions (Paper A, for example, takes this approach). This dimension reduction approach is not applicable in the case of incomplete functions because the principal scores cannot be computed (temporal averaging is precluded by the incompleteness of the curves). Similar problems arise with Hilbert–Schmidt norm tests, which involve a large eigenproblem that cannot be reduced due to missingness. To overcome these difficulties we use the bootstrap. We propose algorithms for mean and covariance testing and for the construction of confidence intervals that are based on the resampling of functional fragments.
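To indicate what such resampling may look like, here is a minimal sketch (illustrative names, not the paper's algorithm verbatim) of a two-sample bootstrap test of equal mean functions for fragments: curves are resampled together with their observation sets after within-group centring, and the $L^2$ distance between the fragment-based group means serves as the test statistic.

```python
import numpy as np

def fragment_mean(curves):
    """Pointwise mean of partially observed curves ((n, T) array, NaN = unobserved)."""
    obs = ~np.isnan(curves)
    return np.where(obs, curves, 0.0).sum(axis=0) / obs.sum(axis=0)

def l2_mean_distance(a, b, grid_step):
    # grid points unobserved in a resample are skipped via nansum
    return grid_step * np.nansum((fragment_mean(a) - fragment_mean(b)) ** 2)

def bootstrap_mean_test(a, b, grid_step, n_boot=999, seed=0):
    """Bootstrap p-value for equality of the two mean functions; each curve is
    resampled together with its observation set, after within-group centring,
    so that resampling approximates the null distribution of the statistic."""
    rng = np.random.default_rng(seed)
    stat = l2_mean_distance(a, b, grid_step)
    a0 = a - fragment_mean(a)          # centred fragments keep their NaN pattern
    b0 = b - fragment_mean(b)
    count = 0
    for _ in range(n_boot):
        a_star = a0[rng.integers(0, a0.shape[0], a0.shape[0])]
        b_star = b0[rng.integers(0, b0.shape[0], b0.shape[0])]
        if l2_mean_distance(a_star, b_star, grid_step) >= stat:
            count += 1
    return (count + 1) / (n_boot + 1)
```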
In the numerical part of the paper, we perform a simulation study whose main goal is to investigate the impact of partial observation on the performance of the different mean and covariance tests and to compare the proposed tests using complete and incomplete curves with the simple approach using complete curves only. We also analyze the data set of incomplete heart rate curves. All technical proofs are collected in the appendix. A supplement provides further numerical results.

References

Bosq, D. (2000). Linear Processes in Function Spaces. Springer, New York.

Ferraty, F. and Romain, Y., editors (2011). The Oxford Handbook of Functional Data Analysis. Oxford University Press, Oxford.

Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis. Springer, New York.

Horváth, L. and Kokoszka, P. (2012). Inference for Functional Data with Applications. Springer, New York.

Hsing, T. and Eubank, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley.

Kokoszka, P. and Reimherr, M. (2017). Introduction to Functional Data Analysis. CRC Press.

Kraus, D. (2015). Components and completion of partially observed functional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):777–801.

Kraus, D. (2019). Inferential procedures for partially observed functional data. Journal of Multivariate Analysis, 173:583–603.

Kraus, D. and Panaretos, V. M. (2012). Dispersion operators and resistant second-order functional data analysis. Biometrika, 99(4):813–832.

Kraus, D. and Stefanucci, M. (2019). Classification of functional fragments by regularized linear classifiers with domain selection. Biometrika, 106(1):161–180.

Kraus, D. and Stefanucci, M. (2020). Ridge reconstruction of partially observed functional data is asymptotically optimal. Statistics & Probability Letters, 165:108813.

Panaretos, V. M., Kraus, D., and Maddocks, J. H. (2010). Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. Journal of the American Statistical Association, 105(490):670–682.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer, New York.

A. Second-order comparison of Gaussian random functions and the geometry of DNA minicircles

By Victor M. Panaretos, David Kraus, and John H. Maddocks
Journal of the American Statistical Association, 105(490):670–682, 2010
DOI: 10.1198/jasa.2010.tm09239

Second-Order Comparison of Gaussian Random Functions and the Geometry of DNA Minicircles

Victor M. PANARETOS, David KRAUS, and John H. MADDOCKS

Given two samples of continuous zero-mean iid Gaussian processes on [0,1], we consider the problem of testing whether they share the same covariance structure. Our study is motivated by the problem of determining whether the mechanical properties of short strands of DNA are significantly affected by their base-pair sequence; though expected to be true, this had so far not been observed in three-dimensional electron microscopy data. The testing problem is seen to involve aspects of ill-posed inverse problems and a test based on a Karhunen–Loève approximation of the Hilbert–Schmidt distance of the empirical covariance operators is proposed and investigated. When applied to a dataset of DNA minicircles obtained through the electron microscope, our test seems to suggest potential sequence effects on DNA shape. Supplemental material available online.

KEY WORDS: Covariance operator; DNA shape; Functional data analysis; Hilbert–Schmidt norm; Karhunen–Loève expansion; Regularization; Spectral truncation; Two-sample testing.
1. INTRODUCTION

The understanding of the mechanical properties of the DNA molecule constitutes a fundamental biophysical task, as important biological processes, such as the packing of DNA in the nucleus or the regulation of genes, can be affected by properties such as stiffness and shape (Vilar and Leibler 2003; Tolstorukov et al. 2005). The study of these properties can focus on different scales, and accordingly involves a variety of mathematical models and techniques. At a coarse-grained level, the behavior of short (of the order of 150 base pairs) strands of DNA is likened to that of a continuous elastic rod. By means of a reaction called cyclization, two ends of this elastic rod bend and twist and bind together to form a loop called a DNA minicircle. These three-dimensional cyclic structures are an excellent specimen for examining the elastic properties of DNA since a minicircle is in a naturally stressed state without the application of external forces. Furthermore, the short length of these strands will amplify the dependence of the mechanistic behavior on intrinsic factors such as the specific base pair sequence.

Such sequence-dependent shape characteristics are of special interest as they potentially reveal a dual purpose of the DNA base-pair sequence: in addition to holding the genetic code, the sequence may influence the geometric properties of the molecule. While in principle certain particular subsequences are expected to have a strong effect on the mechanical properties of DNA, empirical detection of this effect on stereological data acquired through the electron microscope has been elusive (Hagerman 1988; Amzallag et al. 2006). A specific example is that of a subsequence called the TATA box, which promotes gene transcription. It is thought that the mechanical properties of this subsequence are intimately related with its function, and that its presence in a DNA minicircle will enhance its flexibility. Nevertheless, exploratory comparisons between reconstructed minicircles from microscope images containing TATA boxes with reconstructed minicircles with no TATA box did not reveal any effects due to the presence of the sequence (Amzallag et al. 2006).

[Author note: Victor M. Panaretos is Assistant Professor (E-mail: victor.panaretos@epfl.ch), David Kraus is Postdoctoral Researcher, and John H. Maddocks is Professor, Section de Mathématiques, Ecole Polytechnique Fédérale de Lausanne, Lausanne 1015, Switzerland. The authors thank the editor, associate editor, and two referees for providing detailed and constructive comments, and for their fruitful suggestions. The last author wishes to acknowledge support from FN grant 205320-112178.]

Motivated by the need of two-sample comparison of loops, as exemplified in DNA minicircle experiments, this article considers the problem of second-order comparison of two samples of random functions, within a functional data analysis framework. In particular, given realisations of $n_1$ and $n_2$ independent copies of two continuous zero mean Gaussian processes $X$ and $Y$ on a compact set, we consider the problem of testing the hypothesis $H_0\colon R_X=R_Y$ against the alternative $H_A\colon R_X\ne R_Y$, where the covariance operators $R_X$, $R_Y$ are not necessarily stationary.

The literature on hypothesis testing for functional data is mostly concentrated on tests pertaining to the mean function (Fan and Lin 1998), as encountered, for instance, in functional linear models (Cardot et al. 2003; Cuevas, Febrero, and Fraiman 2004; Shen and Faraway 2004) or functional change detection (Berkes et al.
2009). Hall and Van Keilegom (2007) studied the important issue of the effect that the data smoothing step may have on two-sample testing. Second-order tests for functional data analysis pertaining to serial correlation were also investigated (e.g., Gabrys and Kokoszka 2007; Horváth, Hušková, and Kokoszka 2010). Although the seeds of functional two-sample covariance tests can be found in Grenander (1981), the problem of second-order comparison of functional data has—interestingly—so far received relatively little attention. A related recent article by Benko, Härdle, and Kneip (2009) proposed two-sample bootstrap tests for specific aspects of the spectrum of functional data, such as the equality of a subset of the eigenfunctions, or—assuming that the eigenfunctions are shared—equality of a subset of eigenvalues. In this article, we consider the difficulties associated with this testing problem, and it is seen that the extension of finite-dimensional procedures can lead to complications, as the infinite-dimensional version of the problem constitutes an ill-posed inverse problem. As an alternative solution, we propose a test based on the approximation of the Hilbert–Schmidt distance of the empirical covariance operators of the two samples of functions based on the Karhunen–Loève expansion. The asymptotic distribution of the test statistic is determined and its performance is investigated computationally. The application of our methodology to an electron microscope dataset of two groups of minicircles characterized by the presence or absence of a TATA box suggests the potential existence of significant differences in the two groups, which eluded previous analyses as these focused on the mean (the shape of the minicircle), whereas we detect the differences in the covariance structure (the flexibility/stiffness).

The article is organized as follows. The next section describes the three-dimensional functional dataset of DNA minicircles, from acquisition to registration, and includes a preliminary exploratory analysis. The first part of the third section then provides some functional data analysis background. Section 3.2 introduces our spectral test statistic and develops its asymptotic distribution, while Section 3.3 treats the problem of tuning the amount of regularization. In Section 4 the power and level of the test under various scenarios are investigated by means of simulation. Section 5 presents the results of a two-sample analysis of the DNA minicircles through the spectral test statistics, and the article concludes with a short discussion.

2. DNA MINICIRCLE DATA

The dataset of interest was reconstructed from electron micrographs imaged by Jan Bednar at the Laboratory of Ultrastructural Analysis of the University of Lausanne, Switzerland. A total of 99 DNA minicircles of 158 base-pair length were vitrified and imaged under two different angles, yielding two projected images of the same specimen, which were then used to reconstruct three-dimensional structural models (Jacob et al. 2006).
The reconstructed data consist of 99 closed curves (DNA minicircles) in R3 of two types: both types have identical base pair sequences, except for a 14 base-pair window where 65 curves contain the TATA sequence, while the remaining 34 contain a different sequence, called a CAP sequence. Biophysical considerations suggest that the presence of a TATA box will have a significant effect on the geometry of the minicircle, and the goal is to compare these two groups to probe for such an effect. In its reconstructed form, each curve is represented as a combination of periodic B-spline basis functions taking values in R3. To perform a functional data analysis of the minicircles it is required to register the data. Each curve has thus been centered and scaled, so that the center of mass is at zero and the length of the curve is one. The nature of the experimental setup in single-particle electron microscopy requires that the minicircles be imbedded unconstrained in the aqueous solution, so that the reconstructed curves are not aligned: the original (x,y,z)coordinates for the different curves are not directly comparable as each curve was subjected to a random unobservable orthogonal transformation. It is thus necessary to align the curves. Landmark alignment methods (e.g., Gasser and Kneip 1995) are not applicable as the exact DNA sequence is not detectable from an electron micrograph. On the other hand, more flexible methods such as warping (e.g., Gervini and Gasser 2004; Tang and Müller 2008) are inappropriate since nonrigid alignment will alter the second-order properties that are of principal interest. As an alternative, we rigidly align curves by their intrinsic characteristics: each curve was individually aligned using the coordinate system induced by its moments of inertia tensor (e.g., Arnold 1989), which is described as follows. Consider an object in three dimensions described by a mass distribution μ—for example, for a DNA minicircle, μ will be the uniform measure supported on the curve. Suppose that the object is rotating around an axis, which without loss of generality, is given by span(u) := {λu:λ ∈ R} for some u ∈ S2. Let r(u,x) := (I − uu )x denote the distance of a point x from the subspace span(u). The moment of inertia of the object around the axis u is given by J (u) := R3 r2 (u,x)μ(dx) = R3 (I − uu )x 2 μ(dx). Given a coordinate system defined by an orthonormal basis, say the canonical basis (e1,e2,e3), we can use only these basis vectors to compactly represent the moment of inertia with respect to any other axis passing by the origin. Define the inertia matrix as J := R3 x (ei ejI − eiej )xμ(dx) i,j . Notice that the diagonal elements of the above matrix are the moments of inertia with respect to the axes of the coordinate system. The moment of inertia around any unit vector u can now be recovered as J (u) = u Ju. Since the tensor is symmetric, it possesses real eigenvalues and orthonormal eigenvectors forming a basis, which admit the following interpretation: the first eigenvector, say w1, determines the axis (first principal axis of inertia, PAI1) around which the curve is most difficult to rotate, in the sense that the corresponding angular moment is maximized: w1 Jw1 ≥ u Ju for any other u ∈ S2. The projection on the plane orthogonal to w1 is “most spread” in this sense. The second eigenvector determines the axis within the first principal plane around which the projected curve is most difficult to rotate. 
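As a concrete illustration of this alignment step, the following minimal sketch (Python with NumPy; not the authors' code, and the function names inertia_matrix and align_curve are hypothetical) computes the inertia matrix of a discretised curve and expresses the curve in the coordinate system of its principal axes of inertia; the choice of starting point and orientation described below is omitted.

```python
import numpy as np

def inertia_matrix(points):
    """Inertia matrix J of a discretised curve in R^3.

    `points` is an (m, 3) array of curve samples, assumed centred and
    carrying equal mass (a uniform measure on the curve), so the integral
    is approximated by an average over the sampled points.
    """
    sq = np.sum(points**2, axis=1)                      # ||x||^2 at each point
    return (np.eye(3) * sq.sum() - points.T @ points) / len(points)

def align_curve(points):
    """Centre and scale a closed curve, then express it in the coordinate
    system of its principal axes of inertia (columns: PAI1, PAI2, PAI3)."""
    x = points - points.mean(axis=0)                    # centre of mass at zero
    seg = np.linalg.norm(np.diff(np.vstack([x, x[:1]]), axis=0), axis=1)
    x = x / seg.sum()                                   # total curve length one
    eigval, eigvec = np.linalg.eigh(inertia_matrix(x))  # ascending eigenvalues
    W = eigvec[:, ::-1]                                 # PAI1 = largest moment of inertia
    return x @ W                                        # coordinates on PAI1, PAI2, PAI3

# toy example: a noisy planar ellipse subjected to a random orthogonal transform
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
ellipse = np.c_[2 * np.cos(t), np.sin(t), 0.1 * rng.standard_normal(t.size)]
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))        # unobservable transform
aligned = align_curve(ellipse @ Q.T)
```

For the toy ellipse, the first aligned coordinate (PAI1) essentially carries only the small out-of-plane noise, while the in-plane structure appears in the PAI2 and PAI3 coordinates, matching the interpretation of the principal plane of inertia given in the text.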
That is, within the first principal plane, the projection on the line orthogonal to PAI2 is most spread. Hence, PAI3 carries the most spatial information, whereas PAI1 contains the smallest amount of information. Then, for each curve, the starting point was determined as the point where the projection on the first principal plane intersects the horizontal (PAI2) positive semi-axis and the orientation was chosen as counterclockwise in this plane (i.e., at the beginning the PAI3 coordinate increases from zero and PAI2 is positive). The projections onto the principal axes of the minicircle curves are depicted in Figures 1 and 2. The data appear to be well aligned, and seem to be elliptical on average within the principal plane of inertia. Deviations from this principal plane, on the other hand, seem to be lacking systematic structure. The effectiveness of this alignment method is of crucial importance, as we will not be able to otherwise proceed with the testing problem (procrustean alignment of the curves will require us to optimize a sum of squares criterion with respect to 99 orthogonal transformations). A visual inspection reveals five curves (plotted with dashed lines) that appear to be “standing out” of the rest—outliers in a broad sense. Judging whether or not a curve (an infinite dimensional object) is an outlier or not can be far trickier than in the vector case. In particular, it can be that there are further “outlying curves” that do not appear to stick out of the crowd, but are nevertheless intrinsically different from the rest. For this reason, we pursue a robust analysis for the mean curve 672 Journal of the American Statistical Association, June 2010 Figure 1. Projection of DNA curves on the first principal plane. Five removed outlying observations plotted in dashed lines. The mean curves (in white) are computed without outlying observations. using a functional median introduced in Gervini (2008). The idea is simple: an iterative robust procedure will assign weights to each curve, and we can then detect outlying curves by looking at small weights. The method confirms our visual intuition, and reveals no further outliers. The outlying observations are removed, and after this preprocessing stage we are left with 94 aligned smooth curves. 3. METHODS 3.1 Background: FDA and Karhunen–Loève Expansions We adopt a functional data analysis perspective (Ramsay and Silverman 2005; Ferraty and Vieu 2006) and model each curve as the realization of a stochastic process indexed by the closed interval [0,1] and taking values in R3 (but everyFigure 2. Coordinates of DNA curves on the principal axes of inertia. Five removed outlying observations plotted with dashed lines. Mean curves (in white) are computed without outlying observations. Panaretos, Kraus, and Maddocks: Second-Order Functional Comparisons and DNA Geometry 673 thing readily extends to the case of Rd). In particular, we assume that we have two independent collections X1,...,Xn1 and Y1,...,Yn2 , of iid Gaussian processes on [0,1], considered as random elements of the Hilbert space L2[0,1] of coordinate-wise square-integrable R3-valued functions with the inner product f,g = 1 0 f(t) g(t)dt. Here, f(t) represents the transpose of the vector-valued function f(t) ∈ R3. Assuming, without loss of generality, that the mean functions are zero, the processes are characterized by their respective covariance kernels RX(s,t) = cov(Xi(s),Xi(t)) = E{Xi(s)Xi (t)}, and RY(s,t), respectively. 
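To connect this setup with computation, the following sketch (Python with NumPy; purely illustrative, not the authors' code, with hypothetical names) forms the empirical covariance of R3-valued curves sampled on a regular grid and extracts its eigenvalues, eigenfunctions and scores, anticipating the Karhunen–Loève machinery recalled next; integrals are approximated by Riemann sums with spacing dt.

```python
import numpy as np

def fpca(curves, dt):
    """Empirical covariance operator and its eigendecomposition for curves
    sampled on a regular grid.

    `curves` has shape (n, m, 3): n curves, m grid points, values in R^3.
    The inner product <f, g> = int f(t)' g(t) dt is approximated by
    dt * sum_t f(t)' g(t).
    """
    n, m, d = curves.shape
    flat = curves.reshape(n, m * d)              # each curve as one long vector
    centred = flat - flat.mean(axis=0)
    cov = centred.T @ centred / n                # empirical covariance kernel on the grid
    eigval, eigvec = np.linalg.eigh(cov * dt)    # discretised operator includes the dt factor
    order = np.argsort(eigval)[::-1]             # nonincreasing eigenvalues
    eigval, eigvec = eigval[order], eigvec[:, order]
    eigfun = eigvec / np.sqrt(dt)                # normalised so that int ||phi_k||^2 dt = 1
    scores = centred @ eigfun * dt               # Fourier coefficients <X_i - Xbar, phi_k>
    return eigval, eigfun.reshape(m, d, -1), scores   # eigenfunctions as (grid, 3, component)

# toy usage with synthetic curves (n = 30 curves on a grid of 100 points)
rng = np.random.default_rng(1)
m, dt = 100, 1.0 / 100
t = np.arange(m) * dt
basis = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), np.sin(4 * np.pi * t)])
X = (rng.standard_normal((30, 3)) * [1.0, 0.5, 0.2]) @ basis
curves = np.stack([X, 0.5 * X, 0.1 * rng.standard_normal((30, m))], axis=2)
eigval, eigfun, scores = fpca(curves, dt)
print(eigval[:5])                                # scree of the leading eigenvalues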
Associated with the covariance kernel is the covariance operator RX :L2[0,1] → L2[0,1] defined as RX(f)(t) = cov( Xi,f ,Xi(t)) = 1 0 RX(t,s)f(s)ds. Throughout the article, we will be assuming RX to be continuous, so that RX is bounded and the X process is continuous (resp. the Y process). Inference for iid collections of infinite-dimensional random elements is often carried out in practice by an “optimal” reduction to a finite-dimensional setting, using finitely many appropriately chosen contrasts in a functional principal component analysis (e.g., Ramsay and Silverman 2002, 2005; Hall and Hosseini-Nasab 2006; also see Dauxois, Pousse, and Romain 1982 for distributional asymptotics). This procedure exploits the Karhunen–Loève theorem (e.g., Adler 1990), which allows for a representation of the process by a stochastic Fourier series with respect to the orthonormal eigenfunctions {ϕ (j) X }∞ j=1 of the operator RX, Xi(t) = ∞ j=1 λ (j) X ξijϕ (j) X (t), where {λ (j) X }∞ j=1 is the nonincreasing sequence of corresponding eigenvalues and {ξij} is an iid array of standard Gaussian random variables. Convergence of the series is in mean square, uniformly in t ∈ [0,1]. Thus, in a practical setting, the empirical covariance kernel may be used to “optimally” reduce infinite-dimensional inferential problems to multivariate ones. Letting RX stand for the empirical covariance kernel, RX(s,t) := 1 n1 n1 i=1(Xi(s) − X(s))(Xi(t) − X(t)) , we denote its eigenvalues (or principal scores) by {λk,n1 X }n1 k=1 and its eigenfunctions (or principal components) by {ϕk,n1 X }n1 k=1. The finite-dimensional reduction is then achieved by retaining a finite number of principal components { Xi − X,ϕk,n1 X }K k=1 in lieu of each Xi. These are zero mean and uncorrelated random variables, with corresponding sample variances λk,n1 X . Similarly, for the second sample, the analogous quantities are RY , RY , λ (j) Y , ϕ (j) Y (and their empirical “hat” counterparts). The dimension reduction afforded by the Karhunen– Loève expansion is the tool we will next employ to construct our test. 3.2 Second-Order Comparison of Gaussian Processes Let {Xi}n1 i=1 and {Yi}n2 i=1 constitute two iid random samples of Gaussian processes indexed by the interval [0,1] and taking values in R3 (or indeed Rd). As mentioned in the previous section, these are regarded as random elements of the Hilbert space L2[0,1] of square-integrable R3-valued functions (where integration is to be understood coordinate-wise). Assuming that the covariance operators RX and RY associated with the processes are continuous, we wish to test the hypothesis pair H0 : RX = RY, HA : RX = RY. (1) A natural first approach to developing a test for the hypothesis pair in Equation (1) is to attempt to extend tests developed for the finite-dimensional version of the problem, which was extensively studied. The majority of test statistics for the equality of covariance matrices of Gaussian vectors are based on the determinant, trace, or maximum/minimum eigenvalues of matrices such as: S1S2S−1 , S1S−1 2 , S2(S1 + S2)−1 (Roy 1953; Pillai 1955; Kiefer and Schwartz 1965; Giri 1968); here, S1 and S2 are the empirical covariance matrices corresponding to each sample, and S is the pooled empirical covariance matrix. Evidently, such tests cannot immediately be carried over to the case of Gaussian processes: inversion of an empirical covariance operator will be required, which transforms the construction of the test statistic into an ill-posed inverse problem. The operator Rn1 X (resp. 
Rn2 Y ) will be of rank at most n1 (resp. n2) as its image is the subspace spanned by {Xi}n1 i=1 (resp. {Yi}n2 i=1). Therefore, we cannot talk of its inverse, except if we restrict the operator on span{Xi}n1 i=1 (resp. span{Yi}n1 i=1), but the two spans will not coincide in general and the two empirical operators will not be diagonalized by the same basis. Furthermore, since the processes are assumed to be second order, the operators RX and RY are necessarily bounded (in fact compact), and it must be the case that λ (k) X ,λ (k) Y k→∞ −→ 0, the rate of convergence depending on the degree of smoothness of the Gaussian processes (the smoother the process, the faster the rate). Thus, for any finite n1 and n2, however large, a test statistic employing an “inverse” of RX composed with RY will be unstable to perturbations of the Y-data. In the infinite-dimesional case, we propose the use of a test statistic based on the norm of the difference of the two empirical covariance operators. Recall that for trace-class operators, one may define the Hilbert–Schmidt norm. Consider an integral operator R :f → 1 0 R(·,s)f(s)ds such that 1 0 1 0 trace{R(s,t) R(s,t)}dsdt < ∞. The Hilbert–Schmidt norm of the operator R is defined as R HS := 1 0 1 0 trace{R(s,t) R(s,t)}dsdt. Assuming that the covariance operators in question are Hilbert– Schmidt, a test may be based on the squared Hilbert–Schmidt distance RN X − RN Y 2 HS of their empirical counterparts. Of course, the sampling distribution of this latter quantity will depend on the unknown covariance operators even asymptotically. To be able to “normalize” the test statistic, we employ a very useful property of the Hilbert–Schmidt norm: for any orthonormal system {ei}∞ i=1 of L2[0,1], we have R 2 HS = ∞ i=1 Rei 2 L2 . (2) Therefore, we may use a basis to obtain a countable expression for RN X − RN Y 2 HS. In practice, one will need to truncate a series such as the above to obtain an “optimal” finite-dimensional 674 Journal of the American Statistical Association, June 2010 reduction, that is, the choice of contrasts {ei} should be such that the truncated version of Equation (2) retains the bulk of the norm. For each of the two empirical operators, the optimal contrasts will coincide with their eigenfunctions, as dictated by the Karhunen–Loève expansion, but to use the relation in Equation (2) we need to use a common basis. As a compromise, we thus choose the eigenfunctions {ϕk,N XY } corresponding to the empirical covariance operator of the pooled sample of N = n1 +n2 curves and base our test on K k=1 (RN X − RN Y )ϕk,N XY 2 L2 , which by Parseval’s theorem, may be further approximated by K i=1 K j=1 (RN X − RN Y )ϕi,N XY ,ϕ j,N XY 2 . (3) With this quantity in mind, the following theorem, whose proof may be found in the Appendix, provides the basis for our test: Theorem 1. Let {Xn}n1 n=1 and {Yn}n2 n=1 be two collections of zero mean iid continuous Gaussian random functions indexed by the interval [0,1] and taking values in Rd, possessing covariance operators RX and RY with distinct eigenvalues. Let Rn1 X and Rn2 Y denote the empirical covariance operators based on {Xn}n1 n=1 and {Yn}n2 n=1. For N = n1 + n2, let RN XY denote the empirical covariance operator of the pooled collection, and {ϕk,N XY }N k=1 the corresponding eigenfunctions. Finally, let λk,n1 X,XY , λk,n2 Y,XY denote the empirical variance of the kth Fourier coefficient of {Xn}n1 n=1 and {Yn}n2 n=1, respectively, with respect to the eigenfunctions {ϕn,K XY }N n=1. 
Assuming that E[ X1 4 L2 ] < ∞, E[ Y1 4 L2 ] < ∞, and n1/N → θ ∈ (0,1) as N = n1 + n2 → ∞, it follows that, under the hypothesis H0 :RX = RY , TN(K) := n1n2 2N K i=1 K j=1 (Rn1 X − Rn2 Y )ϕi,N XY ,ϕ j,N XY 2 n1 N λi,n1 X,XY + n2 N λi,n2 Y,XY × n1 N λ j,n1 X,XY + n2 N λ j,n2 Y,XY w −→ χ2 K(K+1)/2 as N → ∞, for any finite K ≤ rank(RX) = rank(RY) ≤ ∞. Under the alternative hypothesis, the test statistic will converge to a sum of K(K + 1)/2 dependent shifted chi square random variables. Our proposed test procedure is thus to reject the hypothesis H0 :RX = RY at level α, whenever the test statistic exceeds the corresponding critical value, TN(K) ≥ χ2 K(K+1)/2,1−α. Of course, conducting the test requires the selection of a spectral truncation level, K. This choice must be made judiciously, as it has a direct bearing on the power of the test: 1. Conservative choices of K [i.e., choosing K rank(RX) ∧ rank(RY)] may result in Type II error due to differences in the higher frequency covariance structure, especially in situations where the two covariances share the same eigenfunctions, but have different eigenvalues at higher frequencies. 2. Greedy choices of K [choosing K > rank(RX) ∧ rank(RY)] will inflate the variance of the test statistic since an element of ill-posedness will enter when dividing with the empirical eigenvalues of higher order terms. In the latter sense, the test can also be thought of as an L2regularized test. These aspects are further considered quantitatively in Section 4. It should be noted that the problem of choosing K is directly analogous to the choice of a cutoff point in principal component analysis and the choice of a bandwidth in a nonparametric problem; thus we deal with it using empirical eigenvalue scree-plots as well as penalized goodness-of-fit criteria (see Sections 3.3 and 5.1). A more user-friendly expression for the test statistic T can be given if we introduce some additional notation. Let λ ij,N X,XY := Rn1 X ϕi,N XY ,ϕ j,N XY = n−1 1 i Xi − X,ϕi,N XY Xi − X,ϕ j,N XY be the empirical covariance of the ith and jth Fourier coefficients of the X-curves, with respect to the basis {ϕk,N XY }k≥1 (resp. λ ij,N Y,XY ). For simplicity, we also write λ jj,N X,XY ≡ λ j,N X,XY (resp. λ jj,N Y,XY ). Then we may re-express the test statistic as TN(K) := n1n2 2N K i=1 K j=1 ((λ ij,N X,XY − λ ij,N Y,XY)2 ) n1 N λi,n1 X,XY + n2 N λi,n2 Y,XY × n1 N λ j,n1 X,XY + n2 N λ j,n2 Y,XY . If for some reason, we a priori know the eigenfunctions of RX and RY to be equal, then the following test statistic may be used instead of T: T1 = K k=1 n1n2 N (λk,N X,XY − λk,N Y,XY)2 2((n1/N)λk,N X + (n2/N)λk,N Y )2 . The motivation for this statistic is that when the eigenfunctions coincide, then K k=1 (Rn1 X − Rn2 Y )ϕk,N XY 2 L2 ≈ K k=1 (λk,N X,XY − λk,N Y,XY)2 . It follows as an immediate corollary to Theorem 1 that, under H0, the statistic T1 is asymptotically chi-square distributed with K degrees of freedom [assuming n1/N → θ ∈ (0,1)]. One may also wish to consider modified versions of the test statistics T and T1, obtained via suitable variance-stabilizing transformations. 
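Before turning to the variance-stabilised versions discussed next, the following minimal sketch (Python with NumPy and SciPy; not the authors' implementation, with hypothetical names and one reasonable choice of centring and pooling conventions) indicates how the truncated statistic TN(K) and its chi-square reference distribution might be computed for scalar curves observed on a common regular grid.

```python
import numpy as np
from scipy import stats

def covariance_test(X, Y, K, dt):
    """Spectrally truncated two-sample test for equality of covariance operators.

    X, Y: arrays of shape (n1, m) and (n2, m) of curves on a common grid with
    spacing dt; K: truncation level. Returns T_N(K), the degrees of freedom
    K(K+1)/2 and the corresponding chi-square p-value.
    """
    n1, n2 = X.shape[0], Y.shape[0]
    N = n1 + n2
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    pooled = np.vstack([Xc, Yc])                 # pooled sample (group-wise centring here)
    _, eigvec = np.linalg.eigh(pooled.T @ pooled / N * dt)
    phi = eigvec[:, ::-1][:, :K] / np.sqrt(dt)   # leading K pooled eigenfunctions, unit L2 norm
    a = Xc @ phi * dt                            # Fourier coefficients of the X curves
    b = Yc @ phi * dt                            # Fourier coefficients of the Y curves
    LX = a.T @ a / n1                            # empirical covariances of the coefficients
    LY = b.T @ b / n2
    lam = n1 / N * np.diag(LX) + n2 / N * np.diag(LY)
    T = n1 * n2 / (2 * N) * np.sum((LX - LY) ** 2 / np.outer(lam, lam))
    df = K * (K + 1) // 2
    return T, df, stats.chi2.sf(T, df)

# toy usage: two Gaussian samples generated with the same covariance structure
rng = np.random.default_rng(2)
m, dt = 100, 1.0 / 100
t = np.arange(m) * dt
phi1, phi2 = np.sqrt(2) * np.sin(2 * np.pi * t), np.sqrt(2) * np.cos(2 * np.pi * t)
X = np.outer(2.0 * rng.standard_normal(40), phi1) + np.outer(rng.standard_normal(40), phi2)
Y = np.outer(2.0 * rng.standard_normal(40), phi1) + np.outer(rng.standard_normal(40), phi2)
print(covariance_test(X, Y, K=2, dt=dt))         # should usually not reject
```

The denominator uses the empirical variances of the Fourier coefficients with respect to the pooled eigenbasis, so the statistic is compared with the chi-square distribution with K(K+1)/2 degrees of freedom, as in the theorem above.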
In the case of the test statistic T, we apply a log transformation to the diagonal terms of the sum in Equation (3), and Fisher’s z-transformation to the off-diagonal terms to obtain a Panaretos, Kraus, and Maddocks: Second-Order Functional Comparisons and DNA Geometry 675 test statistic with the same asymptotic distribution as T (an immediate corollary to Theorem 1), T∗ = K k=1 n1n2 N (logλk,N X,XY − logλk,N Y,XY)2 2 + 1≤j 0. The asymptotic approximations for the distributions of the test statistics investigated hold for Gaussian processes. Departures from this assumption will affect the limiting law of the statistics. In simulations we observed that the test derived under the Gaussian assumption used in a non-Gaussian case becomes conservative when the scores have lighter tails than the normal distribution and anticonservative in the opposite case. Our tests are based on sums of squares of components which are asymptotically normal independent variables. When the data are not Gaussian, these components have asymptotically a multivariate normal distribution with unknown covariance structure. The limiting covariance matrix can be estimated and a chi-square test statistic can be based on the corresponding quadratic form (see also Horváth, Hušková, and Kokoszka 2010 for a similar approach in a different context). Some simulations showed that the convergence to the limiting distribution might be slow and one has to use only a small value of K, especially for the offdiagonal test. Of course, testing whether a process is Gaussian is a research project in itself, but informal qq-plots constructed for the Karhunen–Loève coefficients of the minicircle data did not reveal any noteworthy departures from normality. For the benefit of the doubt, however, we also employed permutation tests based on our test statistics, with similar results—but with slightly more inflated p-values (Panaretos and Kraus 2009). APPENDIX Proof of Theorem 1 Introduce the notation Xif := Xi,f Xi and Yif = Yi,f Yi, so that Rn X = n−1 i Xi and Rn Y = n−1 i Yi. These are viewed as random elements of the Hilbert space of Hilbert–Schmidt operators acting on L2[0,1]. Under the hypothesis H0 :RX = RY , the collections {Xi} and {Yi} are iid random operators with mean RX = RY and common covariance S := E[Xi ⊗Xi]−RX ⊗RX = E[Yi ⊗Yi]−RY ⊗RY , where ⊗ denotes the tensor product, (u ⊗ v)w = v,w Hu for any elements v,w,u of a Hilbert space (H, ·,· H). In addition, our moment assumptions imply that E Xi 2 HS < ∞. We may, therefore, apply the Hilbert space central limit theorem (e.g., Bosq 2000, theorem 2.7) to conclude that √ n1(R n1 X − RX) w −→ Z1 and √ n2(R n2 Y − RY) w −→ Z2 as n1,n2 → ∞, where Z1 and Z2 are independent Gaussian random operators with mean 0 and covariance operator S. Now, given i,j, consider the sequence of random variables W i,j N = n1n2/N(R n1 X − R n2 Y )sgn[ ϕi,N XY ,ϕi ]ϕi,N XY , sgn[ ϕ j,N XY ,ϕj ]ϕ j,N XY . On the one hand, the strong law in Hilbert space implies that RN XY − RX HS a.s. −→ 0 under the hypothesis H0. Consequently, convergence also occurs with probability 1 in the strong operator topology, so that by Bosq (2000, lemma 4.3) sgn[ ϕk,N XY ,ϕk ]ϕk,N XY − ϕk L2 a.s. −→ 0 ∀k ≥ 1. (A.1) On the other hand, as N → ∞ with n1/N → θ ∈ (0,1) we will have n2 N √ n1R n1 X − n1 N √ n2R n2 Y w −→ √ 1 − θZ1 − √ θZ2 = Z , (A.2) with Z a zero-mean Gaussian random operator with covariance S. 
Combining Equations (A.1) and (A.2) with the Hilbert space Slutsky lemma establishes that, for all i,j ∈ {1,...,K}, W i,j N w −→ Z ϕi,ϕj . For the next step, we note that Z , being a Gaussian process itself, also admits a Karhunen–Loève decomposition, with respect to the eigenfunctions of S. These eigenfunctions can be retrieved directly from the definition of S and the Karhunen–Loève expansion of the typical X process, X = i √ λiξiϕi. Defining the operator ijf := ϕi,f ϕj, we immediately see that X = i,j λiλjξiξj ij and RX = j λj jj. Hence, upon recalling that the {ξi} are an iid standard Gaussian array we may write S = E[X ⊗ X ] − RX ⊗ RX = i,j,q,p λiλjλpλqE[ξiξjξpξq] ij ⊗ qp − i,j λiλj ii ⊗ jj = i=j λiλj ii ⊗ jj + i=j λiλj ij ⊗ ji + i=j λiλj ij ⊗ ij + i 3λ2 i ii ⊗ ii − i λ2 i ii ⊗ ii − i=j λiλj ii ⊗ jj = 2 i λ2 i ii ⊗ ii + i=j λiλj( ij ⊗ ji + ij ⊗ ij), since E[ξiξjξpξq] is 1 whenever pairs of indices are equal but not all indices are totally coincident, 3 when all indices are equal, and zero Panaretos, Kraus, and Maddocks: Second-Order Functional Comparisons and DNA Geometry 681 otherwise. Regrouping the summation by adding the terms that are symmetric with respect to their indices, we further obtain S = 2 i λ2 i ii ⊗ ii + i n. It follows that Z ϕk,ϕk iid ∼ N(0,2λ2 k) independently of Z ϕm, ϕn iid ∼ N(0,λmλn), m = n. Consequently, we have 1 2 Z ϕk,ϕk 2 λ2 k iid ∼ χ2 1 , independently of 1 2 Z ϕm,ϕn 2 + Z ϕn,ϕm 2 λmλn = Z ϕm,ϕn 2 λmλn ∼ χ2 1 . The continuous mapping theorem now implies that 1 2 (W ij N)2 + (W ji N)2 λiλj = n1n2 2N K i=1 K j=1 (R n1 X − R n2 Y )ϕi,N XY ,ϕ j,N XY 2 λiλj w −→ χ2 K(K+1)/2. To complete the proof, we note that n1 N λ k,n1 X,XY + n2 N λ k,n2 Y,XY p −→ θλk + (1 − θ)λk = λk ∀k ∈ {1,...,K}, so that the result follows from the application of Slutsky’s lemma. SUPPLEMENTAL MATERIALS Additional plots and tables and detailed study: Additional plots and tables are available in a supplementary file. In addition, the supplementary file contains a more detailed study of the problem of comparing the complete spectrum, extending the discussion in the last part of Section 3.2. (Supplement.pdf) [Received April 2009. Revised December 2009.] REFERENCES Adler, R. J. (1990), An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. Lecture Notes and Monographs Series, Hayward: Institute of Mathematical Statistics. [673] Amzallag, A., Vaillant, C., Jacob, M., Unser, M., Bednar, M., Kahn, J. D., Dubochet, J., Stasiak, A., and Maddocks, J. H. (2006), “3D Reconstruction and Comparison of Shapes of DNA Minicircles Observed by Cryo-Electron Microscopy,” Nucleic Acids Research, 34 (18), e125. [670,678] Arnold, V. I. (1989), Mathematical Methods of Classical Mechanics, New York: Springer. [671] Benko, M., Härdle, W., and Kneip, A. (2009), “Common Functional Principal Components,” The Annals of Statistics, 37, 1–34. [670] Berkes, I., Gabrys, R., Horváth, L., and Kokoszka, P. (2009), “Detecting Changes in the Mean of Functional Observations,” Journal of the Royal Statistical Society, Ser. B, 71, 927–946. [670,678] Bosq, D. (2000), Linear Processes in Function Spaces, New York: Springer. [675,680] Cardot, H., Ferraty, F., Mas, A., and Sarda, P. (2003), “Testing Hypotheses in the Functional Linear Model,” Scandinavian Journal of Statistics, 30 (1), 241–255. [670] Cuevas, A., Febrero, M., and Fraiman, R. (2004), “An ANOVA Test for Functional Data,” Computational Statistics and Data Analysis, 47, 111–122. [670] Dauxois, J., Pousse, A., and Romain, Y. 
(1982), “Asymptotic Theory for the Principal Component Analysis of a Random Vector Function: Some Applications to Statistical Inference,” Journal of Multivariate Analysis, 12, 136– 154. [673,678] Fan, J., and Lin, S.-K. (1998), “Tests of Significance When the Data Are Curves,” Journal of the American Statistical Association, 93, 1007–1021. [670] Ferraty, F., and Vieu, P. (2006), Nonparametric Functional Data Analysis, New York: Springer. [672] Gabrys, R., and Kokoszka, P. (2007), “Portmanteau Test of Independence for Functional Observations,” Journal of the American Statistical Association, 102, 1338–1348. [670] Gasser, T., and Kneip, A. (1995), “Searching for Structure in Curve Samples,” Journal of the American Statistical Association, 90, 1179–1188. [671] Gervini, D. (2008), “Robust Functional Estimation Using the Median and Spherical Principal Components,” Biometrika, 95 (3), 587–600. [672] Gervini, D., and Gasser, T. (2004), “Self-Modelling Warping Functions,” Journal of the Royal Statistical Society, Ser. B, 66 (4), 959–971. [671] Giri, N. (1968), “On Tests of the Equality of Two Covariance Matrices,” The Annals of Mathematical Statistics, 39, 275–277. [673] Grenander, U. (1981), Abstract Inference, New York: Wiley. [670,678] Hagerman, P. J. (1988), “Flexibility of DNA,” Annual Review Biophysics and Biophysical Chemistry, 17, 265–286. [670] Hall, P., and Hosseini-Nassab, M. (2006), “On Properties of Functional Principal Components Analysis,” Journal of the Royal Statistical Society, Ser. B, 68 (1), 109–126. [673] 682 Journal of the American Statistical Association, June 2010 Hall, P., and Van Keilegom, I. (2007), “Two Sample Tests in Functional Data Analysis Starting From Discrete Data,” Statistica Sinica, 17, 1511–1531. [670] Horváth, L., Hušková, M., and Kokoszka, P. (2010), “Testing the Stability of the Functional Autoregressive Process,” Journal of Multivariate Analysis, 101 (2), 352–367. [670,680] Jacob, M., Blu, T., Vaillaint, C., Maddocks, J. H., and Unser, M. (2006), “3-D Shape Estimation of DNA Molecules From Stereo Cryo-Electron Micrographs Using a Projection Steerable Snake,” IEEE Transations on Image Processing, 15 (1), 214–227. [671] Kiefer, J., and Schwartz, R. (1965), “Admissible Bayes Character of T2-Test, R2-Test, and Other Fully Invariant Tests for Classical Multivariate Normal Problems,” The Annals of Mathematical Statistics, 36, 747–770. [673] Ledwina, T. (1994), “Data-Driven Version of Neyman’s Smooth Test of Fit,” Journal of the American Statistical Association, 89, 1000–1005. [677] Panaretos, V. M., and Kraus, D. (2009), “Second Order Comparison of Gaussian Processes With Applications to DNA Shape Analysis,” Technical Report 01-09, Chair of Mathematical Statistics, EPFL. [680] Pillai, K. C. S. (1955), “Some New Test Criteria in Multivariate Analysis,” The Annals of Mathematical Statistics, 26, 117–121. [673] Ramsay, J. O., and Silverman, B. W. (2002), Applied Functional Data Analysis: Methods and Case Studies, New York: Springer. [673] (2005), Functional Data Analysis, New York: Springer. [672,673,675] Rice, J., and Silverman, B. W. (1991), “Estimating the Mean and Covariance Structure Nonparametrically When the Data Are Curves,” Journal of the Royal Statistical Society, Ser. B, 53, 233–243. [676] Roy, S. N. (1953), “On a Heuristic Method of Test Construction and Its Use in Multivariate Analysis,” The Annals of Mathematical Statistics, 24, 220– 238. [673] Shen, Q., and Faraway, J. 
(2004), "An F Test for Linear Models With Functional Responses," Statistica Sinica, 14, 1239–1257. [670]
Tang, R., and Müller, H. G. (2008), "Pairwise Curve Synchronization for Functional Data," Biometrika, 95 (4), 875–889. [671]
Tolstorukov, M. Y., Virnik, K. M., Adhya, S., and Zhurkin, V. B. (2005), "A-Tract Clusters May Facilitate DNA Packaging in Bacterial Nucleoid," Nucleic Acids Research, 33 (12), 3907–3918. [670]
Vilar, J. M. G., and Leibler, S. (2003), "DNA Looping and Physical Constraints on Transcription Regulation," Journal of Molecular Biology, 331 (5), 981–989. [670]
Yao, F., Müller, H. G., and Wang, J. L. (2005a), "Functional Data Analysis of Sparse Longitudinal Data," Journal of the American Statistical Association, 100, 577–590. [676]
(2005b), "Functional Linear Regression Analysis for Longitudinal Data," The Annals of Statistics, 33, 2873–2903. [676]

Supplemental File: Second-Order Comparison of Gaussian Random Functions and the Geometry of DNA Minicircles

This supplementary note contains additional plots and tables in Section 1. In addition, Section 2 contains a more detailed study of the problem of comparing the complete spectrum, extending the discussion in the last part of Section 3.2 in the main body of the paper.

1 Supplementary Figures and Tables

This section contains figures and a table not presented in the main body of the paper. The first two figures contain plots of the projected aligned curves onto their principal axes of inertia, including their superimposition. The third figure contains scree plots with respect to the mixed eigenbasis for the two groups separately, as well as jointly. The last figure depicts the Normal QQ plots of the Karhunen–Loève residuals, as described in the discussion section of the paper. Finally, a complete table containing the results of the simulations for level and power corresponding to Section 4 is also given. In addition to the main test statistic proposed in the paper, the complete table also presents simulations for the diagonal form of the statistic (which compares only the eigenvalues). It is observed that when the difference lies only in the eigenvalues, this test statistic performs more powerfully, as would be expected. However, in the cases where differences also lie in the eigenfunctions, it is outperformed by the full version of the test statistic.

Figure 1: Projection of DNA curves on the first principal plane (panels: TATA, all, CAP; axes PAI2 versus PAI3). Five removed outlying observations plotted in green. Mean curves (yellow and cyan) computed without outlying observations.

Figure 2: Coordinates of DNA curves on the principal axes of inertia (panels: Principal Axes 1–3 against arclength, for TATA, all and CAP).
Five removed outlying observations plotted in green. Mean curves (yellow and cyan) computed without outlying observations.

Figure 3: Empirical variances (scree plot), proportions and cumulative proportions of variance explained by components (panels for PAI3, PAI2, PAI1 and PAI2,3) for the TATA (blue lines with circles) and CAP (red with diamonds) group and for both groups together (black with squares).

Figure 4: QQ plots corresponding to the centred Fourier coefficients when projecting onto the first four empirical eigenfunctions for each sample of curves, respectively. The exact distribution of these quantities will not be Gaussian, even if the processes are Gaussian. However, asymptotically, their distribution will be Gaussian. There do not appear to be systematic deviations, except for the plot corresponding to the third Fourier coefficient in the TATA group, which seems to suggest lighter upper tails as compared to the Gaussian.

Table 1: Empirical rejection probabilities at the nominal level 5 %, sample size n1 = n2 = 50, number of replications 5000 for A, 1000 for B–I. Here, uX = (vX, wX) (resp. uY) and K∗ is the automatic truncation choice given by the penalised fit criterion.
   Parameters                           Test   K = 1   K = 2   K = 3   K = 4     K∗
A  uX = (12, 7, 0.5, 9, 5, 0.3)         T      0.045   0.049   0.044   0.044   0.047
   uY = (12, 7, 0.5, 9, 5, 0.3)         T∗     0.051   0.056   0.057   0.056   0.059
                                        T1     0.045   0.046   0.045   0.047   0.047
                                        T∗1    0.051   0.054   0.056   0.061   0.061
B  uX = (14, 7, 0.5, 6, 5, 0.3)         T      0.422   0.264   0.185   0.150   0.148
   uY = (8, 7, 0.5, 6, 5, 0.3)          T∗     0.443   0.315   0.223   0.174   0.175
                                        T1     0.422   0.317   0.265   0.219   0.222
                                        T∗1    0.443   0.350   0.306   0.267   0.267
C  uX = (15, 10, 0.5, 4, 3, 0.3)        T      0.186   0.331   0.218   0.169   0.167
   uY = (11, 6, 0.5, 4, 3, 0.3)         T∗     0.201   0.366   0.269   0.207   0.208
                                        T1     0.186   0.380   0.312   0.279   0.273
                                        T∗1    0.201   0.420   0.358   0.317   0.314
D  uX = (12, 7, 0.5, 9, 3, 0.3)         T      0.040   0.204   0.836   0.973   0.962
   uY = (12, 7, 0.5, 2, 5, 0.3)         T∗     0.047   0.221   0.848   0.984   0.980
                                        T1     0.040   0.202   0.766   0.803   0.799
                                        T∗1    0.047   0.217   0.783   0.822   0.820
E  uX = (12, 7, 0.5, 9, 3, 0.3)         T      0.047   0.246   0.644   0.964   0.962
   uY = (12, 7, 0.5, 3, 9, 0.3)         T∗     0.055   0.267   0.686   0.976   0.975
                                        T1     0.047   0.227   0.477   0.597   0.594
                                        T∗1    0.055   0.250   0.509   0.620   0.617
F  uX = uY = (12, 7, 4, 0.5, 0.3, 0.1)  T      0.257   0.693   0.909   1.000   1.000
   δX = (0.15, 0.15, 0.15)              T∗     0.273   0.706   0.916   1.000   1.000
                                        T1     0.257   0.474   0.521   0.567   0.637
                                        T∗1    0.273   0.496   0.544   0.594   0.655
G  uX = (12, 7, 0.5, 8, 6, 0.3)         T      0.042   0.040   0.054   1.000   1.000
   uY = (12, 7, 0.5, 8, 0, 0.3)         T∗     0.047   0.048   0.068   1.000   1.000
                                        T1     0.042   0.047   0.051   1.000   1.000
                                        T∗1    0.047   0.061   0.062   1.000   1.000
H  uX = (12, 7, 0.5, 9, 5, 0.3)         T      0.044   0.140   0.500   1.000   1.000
   uY = (12, 7, 0.5, 0, 5, 0.3)         T∗     0.049   0.154   0.520   1.000   1.000
                                        T1     0.044   0.139   0.478   0.992   0.992
                                        T∗1    0.049   0.155   0.497   0.993   0.993
I  Brownian motion versus               T      0.719   0.608   0.483   0.377   0.493
   Ornstein–Uhlenbeck process           T∗     0.731   0.644   0.532   0.443   0.546
                                        T1     0.719   0.627   0.547   0.476   0.551
                                        T∗1    0.731   0.666   0.596   0.542   0.595

2 Comparing the Full Spectrum

The test procedure developed in the paper employs an optimal finite dimensional reduction in order to regularise the problem of testing. This is motivated by a Parseval decomposition of the Hilbert–Schmidt distance between the two operators,

‖R_X − R_Y‖²_HS = Σ_{k=1}^{K} ‖(R_X − R_Y) φ_k^{XY}‖²_{L²} + ε,

where ε can be made arbitrarily small by appropriate choice of K. By making such a choice, the statistic will be (eventually) able to detect departures from the null hypothesis unless one operator is contained within a ball of small radius centred at the other operator; in this latter case, the test will still be able to detect the difference (eventually), except if this small difference lies completely at the high frequency end of the spectrum (in which case, for all practical purposes, the difference is irrelevant). We are willing to tolerate this small level of "bias", in order to control the overall type II error of the problem. Comparison of the higher order terms of the operator spectrum on the basis of a finite sample is an ill-defined estimation problem: the fast decay of the spectrum means that we are attempting to compare extremely small quantities that have variance roughly proportional to their magnitude. In addition, the estimators of higher order eigenfunctions will be characterised by very large integrated mean squared errors (available bounds grow for fixed N depending inversely on the rate of decay of the spectrum). Therefore, by trying to increase K in order to eliminate the small type II error introduced by the truncation, we are in effect causing an overall blow-up of the type II error.
If one nevertheless wishes to compare even the finest differences in the spectrum, then one needs to let K grow to infinity along with N, K = KN and modify the test statistic so as to obtain a Gaussian limit. Regularisation now manifests itself by the imposition of an allowed rate of growth of KN . That is, a rate of growth of K relative to N that does not 7 allow overwhelming instabilities due to the growing K. As one might expect, this growth will depend inversely on the rate of decay of the true eigenvalues (a lot of data is required to compare the finest details of the two procsses). Inevitably, in fact, this rate will be rather slow due to the following: (a) Although the truncation level will grow as KN , the number of terms being compared is K2 N . (b) While these K2 summation terms do become independent as N grows (allowing for a CLT phenomenon) no mixing concept applies. In effect, this means that one has to look at the convergence in distribution to independence of a random vector of increasing dimension (= K2 N ). For any fixed dimension, the weak convergence will be at a rate of N−1/2 . Therefore, if one wishes to use Lp norms in order to use the Hilbert structure of the problem, KN must grow slow enough to allow the N−1/2 rate to compensate for the K2 N rate of increase in dimension. (c) This required “global convergence” to independence is regulated by the convergence of the empirical eigenfunctions to the true ones; this in turn depends on the spacings between the true eigenvalues: the rate of convergence of the Kth empirical eigenfunction behaves like N−1/2 λ−1 K . Therefore, when we let K grow, it has to be at rate slow enough, to allow N−1/2 to annihilate the blow-up of the inverse eigenvalues. The above heuristics are made precise in the proof of the next theorem, which provides a sufficient regularisation rate for asymptotically comparing the whole spectrum of infinite rank processes. Theorem 1. Let {Xn}n1 n=1 and {Yn}n2 n=1 be two collections of zero mean iid continuous Gaussian random functions indexed by the interval [0, 1] and taking values in Rd , possessing covariance operators RX and RY . Suppose that both operators are of infinite rank and have distinct eigenvalues. Let Rn1 X and Rn2 Y denote the empirical covariance operators based on 8 {Xn}n1 n=1 and {Yn}n2 n=1. For N = n1 + n2, let RN XY denote the empirical covariance operator of the pooled collection, and { ˆϕk,N XY }N k=1 the corresponding eigenfunctions. Finally, let ˆλk,n1 X,XY , ˆλk,n2 Y,XY denote the empirical variance of the kth Fourier coefficient of {Xn}n1 n=1 and {Yn}n2 n=1, respectively, with respect to the eigenfunctions { ˆϕn,K XY }N n=1. Assuming that E[ X1 4 L2 ] < ∞, E[ Y1 4 L2 ] < ∞, and n1/N → θ ∈ (0, 1) as N = n1 + n2 → ∞, it follows that, under the hypothesis H0 : RX = RY , SN := n1n2 2N KN (KN + 1)/2 KN i=1 KN j=1 (Rn1 X − Rn2 Y ) ˇϕi,N XY , ˇϕj,N XY 2 − KN (KN + 1) 2 w −→ N(0, 1), as N → ∞, for any KN ↑ ∞ such that K7 N λ −3/2 3KN /2 = o( √ N), where ˇϕk,N XY = ˆϕk,N XY n1 N ˆλk,n1 X,XY + n2 N ˆλk,n2 Y,XY . Proof of Theorem 2. Let {ZNk} denote the triangular array of random variables defined as ZNk := 1 KN (KN + 1)/2 n1n2 N (Rn1 X − Rn2 Y ) ˇϕ i(k),N XY , ˇϕ j(k),N XY 2 − 1 , i(k) = j(k) and ZNk := 1 KN (KN + 1)/2 n1n2 2N (Rn1 X − Rn2 Y ) ˇϕ i(k),N XY , ˇϕ i(k),N XY 2 − 1 , otherwise, where (i(k), j(k)) is the the kth element of the index array {(i, j) : i ≤ j ≤ KN }, when enumerating row-wise. Clearly, for κN = KN (KN + 1)/2, SN = κN k=1 ZNk. 
9 Write ZN := (n1n2/N)1/2 (Rn1 X − Rn2 Y ) and define ZNk := n1n2 N (Rn1 X − Rn2 Y )sgn[ ˇϕ i(k),N XY , ˇϕi(k) ] ˇϕ i(k),N XY , sgn[ ˇϕ j(k),N XY , ˇϕj(k) ] ˇϕ j(k),N XY , i(k) = j(k) and ZNk := n1n2 2N (Rn1 X − Rn2 Y )sgn[ ˇϕ i(k),N XY , ˇϕi(k) ] ˇϕ i(k),N XY , sgn[ ˇϕ i(k),N XY , ˇϕi(k) ] ˇϕ i(k),N XY , otherwise, where we use the notation ˇϕk := λ −1 2 k ϕk. The corresponding natural filtration is denoted by FN,k := σ(ZNm; 1 ≤ m ≤ k), and notice that {ZNk} is also adapted to the filtration {FN,k}. Finally, we will write ZNj := (ZN1, . . . , ZNj) (resp. ZNj). We will show that (A) κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 P −→ 0. (B) κN k=1 Var ZNk1{|ZNk|≤1}|FN,k−1 P −→ 1. (C) κN k=1 P[|ZNk| > |FN,k−1] P −→ 0, ∀ > 0. The conclusion will then follow from an “almost-martingale” central limit theorem for triangular arrays, Shorack (5, Thm. 12.2). Fix some N, let d = κN , and let ζ ∼ Nd(0, I). Letting d∞ denote the Kolmogorov metric, we obtain d∞ ZNd, ζ ≤ d∞ ZNd, n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 + d∞ n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 , ζ First we concentrate on the second term of the right hand side. From the proof of Theorem 1 and P´olya’s theorem we know that this term converges to zero. In fact, recalling that Rn1 X = n−1 1 ni i=1 Xi (resp. Rn2 Y ) and that the ϕk are the eigenfunctions of the common covariance operator, the convergence can be seen to be due to the standard multidimensional 10 central limit theorem. We therefore have the following Berry-Esseen upper bound (e.g. DasGupta (2, Cor. 11.1)), d∞ n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 , ζ ≤ Cd 1 4 √ N . Turning our attention to the first term in our triangle inequality, and letting νi(k) := sgn[ ˇϕ i(k),N XY , ˇϕi(k) ], we note that E ZNd − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 1 = = d k=1 E ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) where, for every 1 ≤ k ≤ d we have ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) = ZN νi(k) ˇϕ i(k),N XY , νj(k) ˇϕ j(k),N XY − ZN ˇϕi(k), ˇϕj(k) = ZN νi(k) ˇϕ i(k),N XY , νj(k) ˇϕ j(k),N XY − ZN νi(k) ˇϕ i(k),N XY , ˇϕj(k) + ZN νi(k) ˇϕ i(k),N XY , ˇϕj(k) − ZN ˇϕi(k), ˇϕj(k) = ZN νi(k) ˇϕ i(k),N XY , νj(k) ˇϕ j(k),N XY − ˇϕj(k) + ZN νi(k) ˇϕ i(k),N XY − ˇϕi(k) , ˇϕj(k) = ZN νi(k) ˇϕ i(k),N XY , νj(k) ˇϕ j(k),N XY − ˇϕj(k) + ZN ˇϕj(k), νi(k) ˇϕ i(k),N XY − ˇϕi(k) ≤ ZN νi(k) ˇϕ i(k),N XY L2 νj(k) ˇϕ j(k),N XY − ˇϕj(k) L2 + ZN ˇϕj(k) L2 νi(k) ˇϕ i(k),N XY − ˇϕi(k) L2 ≤ ZN HS νi(k) ˇϕ i(k),N XY L2 νj(k) ˇϕ j(k),N XY − ˇϕj(k) L2 + ZN HS ˇϕj(k) L2 νi(k) ˇϕ i(k),N XY − ˇϕi(k) L2 = ZN HS νj(k) ˇϕ j(k),N XY − ˇϕj(k) L2 + νi(k) ˇϕ i(k),N XY − ˇϕi(k) L2 Here we have used the Cauchy-Schwartz inequality and the fact that ZN is a bounded 11 operator. By the triangle inequality we now obtain ZN HS νj(k) ˇϕ j(k),N XY − ˇϕj(k) L2 + νi(k) ˇϕ i(k),N XY − ˇϕi(k) L2 ≤ ZN HS νj(k) ˇϕ j(k),N XY − νj(k)λ −1/2 j(k) ˆϕ j(k),N XY L2 + νj(k)λ −1/2 j(k) ˆϕ j(k),N XY − ˇϕj(k) L2 + νi(k) ˇϕ i(k),N XY − νi(k)λ −1/2 i(k) ˆϕ i(k),N XY L2 + νi(k)λ −1/2 i(k) ˆϕ i(k),N XY − ˇϕi(k) L2 = ZN HS (ˆλ −1/2 j(k) − λ −1/2 j(k) ) + λ −1/2 j(k) νj(k) ˆϕ j(k),N XY − ϕj(k) L2 +(ˆλ −1/2 i(k) − λ −1/2 i(k) ) + λ −1/2 i(k) νi(k) ˆϕ i(k),N XY − ϕi(k) L2 where we have used the simplified notation ˆλi(k) = n1 N ˆλ i(k),n1 X,XY + n2 N ˆλ i(k),n2 Y,XY . We now apply the inequality given in Bosq (1, Lem. 
4.3) and obtain ZN HS (ˆλ −1/2 j(k) − λ −1/2 j(k) ) + (ˆλ −1/2 i(k) − λ −1/2 i(k) ) + λ −1/2 j(k) νj(k) ˆϕ j(k),N XY − ϕj(k) L2 + λ −1/2 i(k) νi(k) ˆϕ i(k),N XY − ϕi(k) L2 ≤ ZN HS (ˆλ −1/2 j(k) − λ −1/2 j(k) ) + (ˆλ −1/2 i(k) − λ −1/2 i(k) ) λ −1/2 j(k) 2 √ 2 max (λj(k)−1 − λj(k))−1 , (λj(k) − λj(k)+1)−1 RN XY − RX HS + λ −1/2 i(k) 2 √ 2 max (λi(k)−1 − λi(k))−1 , (λi(k) − λi(k)+1)−1 RN XY − RX HS Recapitulating, we have obtained ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) ≤ ZN HS (ˆλ −1/2 j(k) − λ −1/2 j(k) ) + (ˆλ −1/2 i(k) − λ −1/2 i(k) ) λ −1/2 j(k) 2 √ 2 max (λj(k)−1 − λj(k))−1 , (λj(k) − λj(k)+1)−1 RN XY − RX HS 12 + λ −1/2 i(k) 2 √ 2 max (λi(k)−1 − λi(k))−1 , (λi(k) − λi(k)+1)−1 RN XY − RX HS Now we take expectations on both sides, expand the right hand side, and repeatedly apply the Cauchy-Schwartz inequality (with respect to the mean-square norm) to obtain E ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) ≤ E ZN 2 HS E(ˆλ −1/2 j(k) − λ −1/2 j(k) )2 + E ZN 2 HS E(ˆλ −1/2 i(k) − λ −1/2 i(k) )2 +λ −1/2 j(k) 2 √ 2 max (λj(k)−1 − λj(k))−1 , (λj(k) − λj(k)+1)−1 E ZN 2 HS E RN XY − RX 2 HS +λ −1/2 i(k) 2 √ 2 max (λi(k)−1 − λi(k))−1 , (λi(k) − λi(k)+1)−1 E ZN 2 HS E RN XY − RX 2 HS We note first that, by Minkowski’s inequality, E ZN 2 HS is bounded above for all N, by definition of the random operator ZN . Next, E(ˆλ−1 i(k) − λ−1 i(k))2 and E(ˆλ−1 i(k) − λ−1 i(k))2 are, asymptotically in N, of the order of O(λ −1/2 i(k) N−1/2 ) and so are also of the order of O(λ −1/2 i(d) N−1/2 ), when k ≤ d. This can be seen by applying the Delta method to the CLT given in Dauxois et. al (3, Prop. 8). Finally, E RN XY − RX 2 HS is asymptotically of the order of O(N−1/2 ) by the CLT in Hilbert Space (Bosq (1, Thm 2.7)). Now by definition of i(k) and j(k), we have that i(d)[i(d) + 1]/2 = j(d)[j(d) + 1]/2 = d, so that it holds that λi(k) = λ√ 8d+1−1 2 ≥ λ3 √ d 2 . Combining all the above, we arrive at E ZNk − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(k), ˇϕj(k) = O λ −3/2 3 √ d/2 N−1/2 . so that E ZNd − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 1 = O λ −3/2 3 √ d/2 N−1/2 d . 13 Letting dW denote the L1-Wasserstein distance between two probability measures, we have (e.g. Gibbs & Su (4)), d∞(GN,d, HN,d) ≤ (1 + hN,d ∞) dW (GN,d, HN,d) ≤ (1 + hN,d ∞) E ZNd − n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 1 = (1 + hN,d ∞)O λ −3/4 3 √ d/2 N−1/4 d1/2 . where HN,d is the distribution function of n1n2 2N (Rn1 X − Rn2 Y ) ˇϕi(m), ˇϕj(m) d m=1 , GN,d is the distribution function of ZNd, and hN,d is the density function of HN,d. But hN,d is the density of a difference of two independent random vectors, each of which is in turn the sum of n1 and n2 iid random vectors, respectively. Thus, letting h [1] d and h [2] d be the respective densities, and by symmetry, we have, hN,d ∞ = h [1] d,n1 ∗ . . . ∗ h [1] d,n1 n1 times ∗ h [2] d,n2 ∗ . . . ∗ h [2] d,n2 n2 times ∞ ≤ h [1] d,n1 ∗ . . . ∗ h [1] d,n1 n1 times 1 h [2] d,n2 ∗ . . . ∗ h [2] d,n2 n2 times ∞ = h [2] d,n2 ∗ . . . ∗ h [2] d,n2 n2 times ∞ Now it is immediate that h [2] d,n2 ∗ . . . ∗ h [2] d,n2 ∞ ≤ h[2] n2 ∗ . . . ∗ h[2] n2 ∞, where h [2] n2 is the marginal density of n1n2 2N ( 1 n2 X1) ˇϕi(1), ˇϕj(1) . But it must the case that h [2] n2 ∗ . . . ∗ h [2] n2 ∞ be bounded above, since n2 i=1 n1n2 2N ( 1 n2 Xi) ˇϕi(1), ˇϕj(1) is a sequence of variables with diffuse laws converging weakly to a non-degenerate Gaussian. We are thus in a position to conclude that d∞ ZNd, ζ = O λ −3/4 3 √ d/2 N−1/4 d1/2 . 
(1) 14 Now recall that, with probability one, E ZNk1{|ZNk|≤1}|FN,k−1 = +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN }FeZNk| eZN,k−1 (dx|ZN,k−1) where he have used standard notation for conditional distribution functions. It follows that, given ζ a standard Gaussian random variable, E ZNk1{|ZNk|≤1}|FN,k−1 − E 1 √ κN (ζ2 − 1)1{|ζ2−1|≤ √ κN } = +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN }FeZNk| eZN,k−1 (dx|ZN,k−1) − +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN }Fζ(dx) = +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN } F eZN,k−1 eZNk| eZN,k−1 − Fζ (dx) with the alternative notation F eZN,k−1 eZNk| eZN,k−1 (x) ≡ FeZNk| eZN,k−1 (x|ZN,k−1). From (1) we have that for ζ ∼ Nk(0, I), d∞(ZNk, ζ) = O λ −1/3 3 √ d/2 N−1/4 k1/2 , so by Lemma 1 (see below), given any z ∈ Rk−1 , sup x∈R Fz eZNk| eZN,k−1 (x) − Fζ(x) = O λ −3/4 3 √ d/2 N−1/4 k1/2 and so given z ∈ Rk−1 +∞ −∞ 1 √ κN x2 − 1 1{|x2−1|≤ √ 2κN } Fz eZNk| eZN,k−1 − Fζ (dx) = O λ −3/4 3 √ κN /2N−1/4 k1/2 κ 1/4 N . Consequently, for {ζk} an iid sequence of standard Gaussian variables, and for all ω ∈ Ω, κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 − E 1 √ κN (ζ2 k − 1)1{|ζk|≤ √ κN } = O   κ 7/4 N N1/4λ 3/4 3 √ κN /2   = O   K 7/2 N N1/4λ 3/4 3 √ κN /2   15 And, since K7 N λ −3/2 3 √ 2KN (KN +1) 2 ≤ K7 N λ −3/2 3KN 2 = o √ N , it follows from our assumptions that the quantity above converges to zero almost certainly. But, on the other hand, κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 ≤ κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 − E 1 √ κN (ζ2 k − 1)1{|ζk|≤ √ κN } + κN k=1 E 1 √ κN (ζ2 k − 1)1{|ζk|≤ √ κN } with the last term obviously converging to zero as N → ∞ so that condition (A) is fulfilled. We now turn our attention to condition (B). By definition: κN k=1 Var ZNk1{|ZNk|≤1}|FN,k−1 = κN k=1 E Z2 Nk1{|ZNk|≤1}|FN,k−1 − κN k=1 E2 ZNk1{|ZNk|≤1}|FN,k−1 That the second term converges to zero almost surely follows from our proof of condition (A). Hence, it suffices to concentrate on the first term. Following the same steps as with (A), we may write +∞ −∞ (x2 − 1) 2 2κN 1{|x2−1|≤ √ 2κN } Fz eZNk| eZN,k−1 − Fζ (dx) = O   K 3/2 N N1/4λ 3/4 3 √ κN /2   This in turn imples that, with probability one, κN k=1 E ZNk1{|ZNk|≤1}|FN,k−1 − E 1 √ κN (ζ2 k − 1)1{|ζk|≤ √ κN } N→∞ −→ 0. 16 Finally, we see that κN k=1 E Z2 Nk1{|ZNk|≤1}|FN,k−1 = κN k=1 E Z2 Nk1{|ZNk|≤1}|FN,k−1 − E 1 2κN (ζ2 k − 1)2 1{|ζk|≤ √ κN } + κN k=1 E 1 2κN (ζ2 k − 1)2 1{|ζk|≤ √ κN } with the last term clearly converging to 1 almost certainly. This establishes condition (B). Finally, we concentrate on condition (C). By definition, P[|ZNk| > |FN,k−1] = 1 − E [1 {|ZNk| < } |FN,k−1] = 1 + E 1 |ζ2 − 1| < √ κN − E [1 {|ZNk| < } |FN,k−1] −E 1 |ζ2 − 1| < √ κN = E 1 |ζ2 − 1| < √ κN − E [1 {|ZNk| < } |FN,k−1] + P[|ζ2 − 1| > √ κN ] It is clear from our analysis of (A) and (B) that κN k=1 E 1 |ζ2 − 1| < √ κN − E [1 {|ZNk| < } |FN,k−1] a.s. −→ 0. Finally, we have κN k=1 P[|ζ2 − 1| > √ κN ] = κN P[|ζ2 − 1| > √ κN ] = O  κN e−(1+ √ κN ) 1/2 1 + √ κN 1/4   N→∞ −→ 0 by the tail decay properties of the Gaussian distribution. This completes the proof. Lemma 1. Assume that Fn is a sequence of distribution functions on Rd converging weakly 17 to a standard Gaussian distribution function Φd , at a rate n in the Kolmogorov distance, sup x∈Rd |Fn(x) − Φd (x)| = O( n). Letting d = p + q, and given y ∈ Rq , we have sup x∈Rp |Fn(x|y) − Φq (x)| = O( n). Proof. By definition, and by our uniform bound, given any y ∈ Rq we have that sup x∈Rp |Fn(x|y)Fn(y) − Φp (x)Φq (y)| = sup x∈Rp |Fn(x, y) − Φd (x, y)| = O( n). 
Now divide across by Φq (y), and obtain sup x∈Rp Fn(x|y) Fn(y) Φq(y) − Φp (x) = O( n) (2) By assumption of the theorem, it must also be that |Fn(y) − Φq (y)| = O( n). In turn, this implies that Fn(y) Φq(y) − 1 = O( n), (3) for if this were not the case, for every α > 0 and M ≥ 1, there would exist and m ≥ M such that Fm(y) Φq(y) − 1 > α Φq(y) | m|, or equivalently, for every α > 0 and M ≥ 1, there would exist and m ≥ M such that |Fm(y) − Φq (y)| > α| m|, 18 which would contradict the fact that supu |Fn(u) − Φq (u)| ∈ O( n). Now conditions (2) and (3) allow us to complete the proof by applying the triangle inequality: d∞ (Fn(·|y), Φp) ≤ d∞ Fn(·|y), Fn(y) Φq(y) Fn(·|y) + d∞ Fn(y) Φq(y) Fn(·|y), Φp since d∞ Fn(·|y), Fn(y) Φq(y) Fn(·|y) = sup x∈Rp Fn(x|y) − Fn(y) Φq(y) Fn(x|y) = 1 − Fn(y) Φq(y) sup x∈Rp |Fn(x|y)| = 1 − Fn(y) Φq(y) = O( n) References [1] Bosq, D.(2000). Linear processes in function spaces. Springer. [2] DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer. [3] Dauxois, J. Pousse, A. & Romain, Y. (1982). Asymptotic theory for the principal component analysis of a random vector function: some applications to statistical inference. Journal of Multivariate Analysis, 12: 136–154. [4] Gibbs, A.L. & Su, F.E. (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3): 419–435. [5] Shorack, G. R. (2000). Probability for Statisticians. Springer. 19 B. Dispersion operators and resistant second-order functional data analysis By David Kraus and Victor M. Panaretos Biometrika, 99(4):813–832, 2012 DOI: 10.1093/biomet/ass037 52 Biometrika (2012), 99, 4, pp. 813–832 doi: 10.1093/biomet/ass037 C 2012 Biometrika Trust Advance Access publication 26 August 2012 Printed in Great Britain Dispersion operators and resistant second-order functional data analysis BY DAVID KRAUS AND VICTOR M. PANARETOS Institute of Mathematics, Ecole Polytechnique F´ed´erale de Lausanne, 1015 Lausanne, Switzerland david.kraus@epfl.ch victor.panaretos@epfl.ch SUMMARY Inferences related to the second-order properties of functional data, as expressed by covariance structure, can become unreliable when the data are non-Gaussian or contain unusual observations. In the functional setting, it is often difficult to identify atypical observations, as their distinguishing characteristics can be manifold but subtle. In this paper, we introduce the notion of a dispersion operator, investigate its use in probing the second-order structure of functional data, and develop a test for comparing the second-order characteristics of two functional samples that is resistant to atypical observations and departures from normality. The proposed test is a regularized M-test based on a spectrally truncated version of the Hilbert–Schmidt norm of a score operator defined via the dispersion operator. We derive the asymptotic distribution of the test statistic, investigate the behaviour of the test in a simulation study and illustrate the method on a structural biology dataset. Some key words: Covariance operator; Karhunen–Lo`eve expansion; M-estimation; Resistant test; Spectral truncation; Two-sample testing. 1. INTRODUCTION The second-order structure of a random function is key to understanding the nature of the functional observations that it induces, as it is inextricably linked with the smoothness properties of the stochastic fluctuations of the function. 
Given a suitable random function in a separable Hilbert space, e.g., L2[0, 1], these second-order properties are encapsulated in the covariance operator. The link with the smoothness properties of the random function is then given by the Karhunen–Lo`eve expansion (e.g., Adler, 1990), which provides an optimal Fourier representation of the random function, using a basis comprised by the eigenfunctions of this operator. Consequently, a significant part of functional data analysis has concentrated on estimating the covariance operator, and employing its spectral decomposition in order to probe the smoothness properties of the functional data; see Bosq (2000), Dauxois et al. (1982), Hall & Hosseini-Nasab (2006), Ramsay & Silverman (2005), Gervini (2006), Hall et al. (2006) and Yao & Lee (2006), to name but a few. A natural inference problem is that of comparing the covariance structures of two samples of functional data, in order to decide whether they share the same fluctuation properties. Aspects of this problem were considered in Benko et al. (2009), who employed a bootstrap procedure to compare subsets of eigenfunctions or eigenvalues of the two samples in a financial context. The more global problem of testing whether two samples share the same covariance operator was investigated in the Gaussian case by Panaretos et al. (2010), motivated by the study of mechanical properties of DNA, and subsequently by Boente et al. (2011) through atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 814 DAVID KRAUS AND VICTOR M. PANARETOS a simulation-based approach. In a slightly different setting, Gabrys & Kokoszka (2007) and Horv´ath et al. (2010) investigated second-order tests to detect the presence or change of serial correlation in functional data. The goal of this paper is to study the problem of second-order inference in a more general setting. We focus on situations where the data are not Gaussian, and indeed may be characterized by the presence of influential observations. That we do not use the word outlier is deliberate: in the functional case, observations can significantly impact the empirical covariance operator, though they may not be outlying. The infinite-dimensional nature of the data means that an observation can be atypical in many ways, the deviation from the mean being only one; observations close to the mean may contain unusual frequency components. Detection of such observations via exploratory techniques may be nontrivial (Sun & Genton, 2011). Such influential observations might significantly influence the estimation of the covariance, and, even more profoundly, the quality of the estimators of its spectrum. For these reasons, robustified estimates of the spectrum have been proposed, based on the spectra of robust estimators of the covariance operator. Locantore et al. (1999) proposed the use of the spectrum of the socalled spherical covariance operator in a discretized setting (Boente & Fraiman, 1999). Gervini (2008) introduced the functional median and further studied the properties of the spherical covariance spectrum for functional data concentrated on an unknown finite-dimensional hyperplane. Bali et al. (2012) adapted the projection-pursuit method of Li & Chen (1985) in the functional case. The sensitivity of the empirical covariance operator and its spectrum to the presence of influential observations can have an impact on testing procedures for the covariance operator. 
This is already observed in the finite-dimensional case (Layard, 1974; Olson, 1974), where deviations from a Gaussian assumption, or the presence of influential observations, can completely ruin a testing procedure even in one dimension (Box, 1953; Hampel et al., 1986). Finitedimensional robust or resistant tests for covariance matrices cannot be directly extended to the functional case, as they often depend on the assumption of an invertible empirical covariance, which will by default be violated in the functional case for all sample sizes (Tiku & Balakrishnan, 1985; O’Brien, 1992; Zhang et al., 1991; Anderson, 2006). Even if a pseudo-inverse operator is employed, one immediately runs into the problem of ill-posedness. To cope with these issues, this paper introduces a class of operators that we term dispersion operators that are implicitly defined through a variational problem, motivated by M-estimators of location for the tensor product of the centred functional observations. It is then proposed that these operators be used as proxies for the covariance operator, when inferences on the second-order structure are to be drawn for non-Gaussian and potentially contaminated functional samples. The implicit definition of a dispersion operator gives rise to a score equation, as the dispersion operator is a zero of the Fr´echet derivative of the variational problem with respect to the operator argument. This functional score equation is then used as a basis to construct a test for the second-order comparison of two functional samples. The test is based on the distance of the functional score equation under the null hypothesis from zero, measured by an appropriately renormalized Hilbert–Schmidt distance. 2. SECOND-ORDER INFERENCE BASED ON THE DISPERSION OPERATOR 2·1. Covariance operators To describe the second-order properties of a random element X in a separable Hilbert space of functions H, often taken to be L2[0, 1], with norm · and inner product ·, · , one typically considers the covariance operator of X, C : H → H, defined as C ( f ) = E{ f, X − μ (X − μ)}; atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 815 here μ = E(X) represents the mean of the function X. For example, in the case H ≡ L2[0, 1], with inner product f, g = 1 0 f (t)g(t) dt, the covariance operator is represented as an integral operator C ( f ) = 1 0 r(·, s) f (s) ds, where r(s, t) = E[{X(s) − μ(s)}{X(t) − μ(t)}] stands for the covariance kernel of the process X. For the purposes of this paper, it will be more fruitful to think of the covariance operator as an operator related to tensor products on H, rather than through the sample path perspective based on the covariance kernel. In particular, we will think of the covariance operator as C = E{(X − μ) ⊗ (X − μ)}, where ⊗ stands for the tensor product on H: for f, g ∈ H, f ⊗ g defines an operator on H through ( f ⊗ g)(h) = g, h f , where h ∈ H. In this setting, and provided that E( X 2) < ∞, the covariance operator C can itself be thought of as an element of a Hilbert space, the space HS(H, H) of Hilbert–Schmidt operators acting on H. This is the space of linear operators R on H such that R HS = ∞ k=1 Rek 2 1/2 < ∞, where {ek} is any orthonormal basis of H. Here, · HS defines a norm on HS(H, H), corresponding to the inner product R1, R2 HS = ∞ k=1 R1ek, R2ek . 
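To fix ideas, the tensor-product and Hilbert–Schmidt objects just introduced take a simple coordinate form once curves are represented by coefficient vectors in an orthonormal basis, which is the representation used later in the computational appendix. The following is a minimal sketch under that assumption; all object names are illustrative and not taken from any accompanying code.

```r
# Minimal coordinate sketch (assumption: rows of X are coefficient vectors of the
# curves in a common orthonormal basis of H, so inner products become Euclidean).
set.seed(1)
n <- 20; p <- 5
X <- matrix(rnorm(n * p), n, p)            # illustrative coefficient data

# the tensor product (f tensor g)(h) = <g, h> f becomes the rank-one matrix f g^T
f <- rnorm(p); g <- rnorm(p); h <- rnorm(p)
all.equal(drop(outer(f, g) %*% h), sum(g * h) * f)   # TRUE

# empirical covariance operator as the average of centred tensor products
Xc   <- sweep(X, 2, colMeans(X))
Chat <- crossprod(Xc) / n                  # (1/n) sum_i (X_i - Xbar)(X_i - Xbar)^T

# Hilbert-Schmidt norm and inner product reduce to the Frobenius norm and
# the entrywise sum of products of the coefficient matrices
hs_norm  <- sqrt(sum(Chat^2))
hs_inner <- function(A, B) sum(A * B)
```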
In what follows, we will usually omit the subscript HS, as the nature of the norm or inner product employed, whether it is an operator or an element norm, will be clearly implied from the space where its argument belongs. In this Hilbert–Schmidt setting, the covariance operator can be seen as the operator C ∈ HS(H, H) that solves the variational problem min R∈HS(H,H) E{ (X − μ) ⊗ (X − μ) − R 2 }. The sample counterpart of the covariance operator, the empirical covariance operator, ˆCn = 1 n n i=1 (Xi − ¯X) ⊗ (Xi − ¯X), can be represented as the solution to the problem min R∈HS(H,H) 1 n n i=1 (Xi − ¯X) ⊗ (Xi − ¯X) − R 2 , where X1, . . . , Xn is a collection of independent and identically distributed copies of X, and ¯X = n−1 n i=1 Xi stands for their empirical mean. This being essentially a least squares problem, both the empirical covariance operator and methods based on it will be sensitive to the presence of atypical observations in the dataset X1, . . . , Xn. In fact, it can also be seen that the empirical covariance operator admits a Gaussian maximum likelihood estimator interpretation, in a Cram´er–Wold sense: if X is assumed Gaussian, then ˆCn is the unique element of HS(H, H) atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 816 DAVID KRAUS AND VICTOR M. PANARETOS such that, for every f ∈ H, f, ˆCn f is the unique maximum likelihood estimator of the variance of f, X . The law of X is completely determined by the laws of the collection { f, X : f ∈ H}, and of course f, X is Gaussian with mean f, μ and variance f, C f . The basic strategy of this paper will be to obtain procedures pertaining to the second-order structure of X that are more resistant to departures from normality and to the presence of influential observations by replacing the squared norm in the variational problem defining the covariance by a less sensitive loss function. This gives rise to a new class of second-order characteristics, which we call dispersion operators. 2·2. Dispersion operators Let P be a distribution on the separable Hilbert space H and let X be a random element with this distribution. The usual covariance is the integral of the operator P(x; μ) = (x − μ) ⊗ (x − μ), x ∈ H, with respect to P. This suggests that a dispersion operator could be defined as an M-estimator of the location of P(X; μ). Let ρ be a nonnegative, differentiable, strictly increasing and convex function on R+ 0 with ρ(0) = 0. We define the ρ-dispersion operator of the distribution P as R(P) = arg min R∈HS(H,H) M(P; R, μ), (1) where M(P; R, μ) = EP[ρ{ P(X; μ) − R } − ρ{ P(X; μ) }] = [ρ{ P(x; μ) − R } − ρ{ P(x; μ) }] dP(x). (2) In the definition of the dispersion operator, μ is chosen to be some suitable element of H with the interpretation of a location parameter. It is natural to use μ equal to the ρ-centre μ(P) = arg min μ∈H L(P; μ), where L(P; μ) = EP{ρ( X − μ ) − ρ( X )} = {ρ( x − μ ) − ρ( x )} dP(x). Equivalently, one may define μ(P) and R(P) as solutions to score equations. The objective functionals L(P; μ) and M(P; R, μ) are real-valued functionals defined on the Hilbert spaces H and HS(H, H), respectively. The corresponding scores are their Fr´echet derivatives, that is, linear functionals on the corresponding Hilbert space that can be uniquely identified with an element of that Hilbert space. 
Specifically, the centre μ(P) is the solution to the functional equation G(P; μ) = 0, where the element G(P; μ) = ∂ ∂μ L(P; μ) = EP ρ ( X − μ ) X − μ (μ − X) = ρ ( x − μ ) x − μ (μ − x) dP(x) atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 817 of H determines the Fr´echet derivative of L with respect to μ. The dispersion operator is defined as the solution to the operator equation G (P; R, μ) = O, (3) where O is the zero operator on H and the operator G (P; R, μ) = ∂ ∂R M(P; R, μ) = EP ρ { P(X; μ) − R } P(X; μ) − R {R − P(X; μ)} = ρ { P(x; μ) − R } P(x; μ) − R {R − P(x; μ)} dP(x) determines the Fr´echet derivative of M with respect to R. The empirical dispersion operator based on the sample X1, . . . , Xn is the dispersion operator of the empirical distribution ˆP of the sample, that is, R( ˆP). The empirical dispersion operator can be in general computed around any element μ ∈ H; in practice, one naturally uses the empirical centre μ( ˆP), i.e., the centre of the empirical distribution. PROPOSITION 1. Let P be a distribution on the separable Hilbert space H that is not concentrated on a line in H or on four points of H. Assume that ρ is nonnegative, strictly increasing on [0, ∞) and convex. Then, the objective function M(P; R, μ) as a functional of R is strictly convex for any μ ∈ H and thus the ρ-dispersion operator around μ exists and is unique. Proposition 1 holds without any moment assumptions because the subtraction of ρ{ P(X; μ) } and ρ( X ) in the definition of M(P; R, μ) and L(P; μ), respectively, guarantees the existence and finiteness of the objective functions. Under fairly weak further assumptions, we may also deduce that the empirical dispersion operator is well defined and consistent. COROLLARY 1. Let X1, . . . , Xn be independent random elements with law P that has no discrete component and is such that the probability that X1, . . . , Xn be collinear is zero (n 3). Then, for n 5, the empirical ρ-dispersion operator corresponding to X1, . . . , Xn exists and is almost surely unique. Moreover, if ˆμ is consistent for a location parameter μ, then the empirical dispersion operator around ˆμ is itself consistent for the dispersion operator around μ. We remark, for example, that the empirical functional median, i.e., the empirical centre corresponding to ρ(u) = u, was proven to be consistent for its theoretical counterpart in Gervini (2008). In fact, in the setting of Corollary 1, this result can be extended to location parameters corresponding to strictly increasing convex ρ-functions. It is seen from (1) or (3) that the ρ-dispersion operator is self-adjoint. Moreover, from the spectral decomposition found in Proposition 2, it will follow that the ρ-dispersion operator is positive semidefinite. Although many results derived in this paper are valid for a wide class of functions ρ, the choice ρ(u) = uq for some q > 0 is especially attractive as the resulting centre is scale invariant and the dispersion is scale equivariant. For general ρ, it would be more appropriate to use a suitably studentized version of the objective functions; to this end, one can insert a preliminary estimator of the trace into the objective function. We now provide explicit formulae for two main choices of the ρ-function. atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 818 DAVID KRAUS AND VICTOR M. 
PANARETOS When choosing ρ(u) = u2, the score determining the ρ-dispersion operator equals G (P; R, μ) = EP[2{R − P(X; μ)}]. Thus, R(P) can be found explicitly as R(P) = EP{P(X; μ)}. As the score for the ρ-centre is G(P; μ) = EP{2(μ − X)}, the solution is μ(P) = EP(X). Hence, the dispersion operator is the usual covariance operator. The choice ρ(u) = u is expected to place less emphasis on influential observations and result in more resistant procedures. The corresponding score operators for the dispersion and centre are G (P; R, μ) = EP R − P(X; μ) R − P(X; μ) , G(P; μ) = EP μ − X μ − X . The parameter μ(P) has been studied by a number of authors under different names in the multivariate as well as functional settings. In the multivariate context Chaudhuri (1996) calls μ(P) the geometric median; other authors (Serfling, 2004; Sirki¨a et al., 2009) use the name spatial median and some authors (Huber & Ronchetti, 2009; Fritz et al., 2012) use the term L1-centre or L1-median. In the functional setting, μ(P) was studied by Locantore et al. (1999) and by Gervini (2008), who calls it the functional or spatial median. We use the term spatial median for μ(P) and, similarly, we call R(P) the spatial dispersion operator. To clarify the terminology, we recall that S (P) = EP (X − μ) ⊗ (X − μ) X − μ 2 is called the spherical covariance operator (Locantore et al., 1999). Unlike the parameters under the L2-type loss function, the spatial median and spatial dispersion are not available explicitly. Their empirical counterparts ˆμ = μ( ˆP) and ˆR = R( ˆP) can, however, be obtained numerically, employing a Newton–Raphson algorithm, as explained in the Appendix. The score function ρ (u) = quq−1 corresponding to ρ(u) = uq is unbounded unless q = 1. Therefore, the estimator of the spatial dispersion operator, q = 1, is resistant, whereas other choices are nonresistant due to the effect of outliers, q > 1, or inliers, q < 1. Although the dispersion operator is in general different from the covariance operator unless ρ(u) = u2, it carries useful information on second-order properties of the distribution. There is an interesting link between the spectra of the dispersion and covariance operator. Let X admit the Karhunen–Lo`eve expansion X = μ + ∞ k=1 λ 1/2 k βkϕk, where β1, β2, . . . are zeromean unit-variance uncorrelated random variables, {λk : k 1} are the nonincreasing nonnegative eigenvalues and {ϕk : k 1} are the complete orthonormal eigenfunctions of the covariance operator C (P) = EP{(X − μ) ⊗ (X − μ)} = ∞ k=1 λkϕk ⊗ ϕk. We now investigate the eigendecomposition of the theoretical ρ-dispersion operator R(P) defined via M-estimation as the solution to (3). The main result is as follows. PROPOSITION 2. Assume that the Fourier coefficient sequence {βk}∞ k=1 has a joint distribution that is invariant under the change of the sign of any component. Then, the dispersion operator R(P) has the same eigenfunctions as the covariance operator C (P), i.e., there exists a nonnegative sequence {δk}∞ k=1 such that R(P) = ∞ k=1 δkϕk ⊗ ϕk. Furthermore, the eigenvalues δ1, δ2, . . . satisfy the conditions δk = λk E ρ [{ i (δi −λi β2 i )2+ i |=l λi λlβ2 i β2 l }1/2] { i (δi −λi β2 i )2+ i |=l λi λlβ2 i β2 l }1/2 β2 k E ρ [{ i (δi −λi β2 i )2+ i |=l λi λlβ2 i β2 l }1/2] { i (δi −λi β2 i )2+ i |=l λi λlβ2 i β2 l }1/2 (k = 1, 2, . . .). 
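Proposition 2 can be probed numerically. The sketch below, which is an illustration rather than the authors' code, generates coefficient data with known eigenstructure and compares the eigenvectors of the empirical covariance with those of the empirical spatial dispersion; for brevity the dispersion is computed by a simple Weiszfeld-type reweighting iteration (the geometric median of the tensor products in the Hilbert–Schmidt norm), not by the quasi-Newton method described later in the Appendix, and the centre is simplified to the sample mean.

```r
# Numerical illustration of the shared-eigenfunction property (a sketch):
# coefficients of X in an orthonormal basis, covariance vs. spatial dispersion.
set.seed(2)
n <- 300; p <- 4
Q <- qr.Q(qr(matrix(rnorm(p * p), p)))              # orthonormal "eigenfunctions"
lambda <- c(4, 2, 1, 0.5)
B <- matrix(rt(n * p, df = 5) / sqrt(5 / 3), n, p)  # sign-symmetric, unit-variance scores
X <- B %*% diag(sqrt(lambda)) %*% t(Q)

m  <- colMeans(X)                                   # centre simplified to the mean here
Z  <- lapply(seq_len(n), function(i) tcrossprod(X[i, ] - m))
Ch <- Reduce(`+`, Z) / n                            # empirical covariance operator

# spatial dispersion = geometric median of the Z_i in the Hilbert-Schmidt norm,
# computed by a Weiszfeld-type reweighting (a fixed number of iterations suffices here)
w    <- function(R) sapply(Z, function(Zi) 1 / sqrt(sum((Zi - R)^2)))
Rhat <- Ch
for (it in 1:200) {
  wi   <- w(Rhat)
  Rhat <- Reduce(`+`, Map(`*`, Z, wi)) / sum(wi)
}

# absolute inner products between the two sets of eigenvectors
round(abs(crossprod(eigen(Ch,   symmetric = TRUE)$vectors,
                    eigen(Rhat, symmetric = TRUE)$vectors)), 2)
```

If the proposition, and the ordering conjecture discussed below, hold in this setting, the printed matrix should be close to the identity.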
atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 819 A similar result relating the covariance operator and the spherical covariance operator S (P) was obtained by Gervini (2008, Theorem 3) who showed that, under the assumption of exchangeability of the coefficient sequence, both operators have the same eigenfunctions in the same order; see also Marden (1999) and Boente & Fraiman (1999). Our proposition shows that the ρ-dispersion operator also has the same set of eigenfunctions. We conjecture that, potentially under further assumptions, the order of the eigenfunctions is also the same; computational experiments back this conjecture. Gervini (2008) assumed that the Karhunen–Lo`eve expansion has only finitely many terms, i.e., that the distribution is concentrated on a finite-dimensional subspace, whereas our results hold even for processes with infinite series expansions. On the other hand, Gervini (2008) needed no moment assumptions, whereas we need to assume finite second moments: without moment assumptions the convergence of an infinite Karhunen–Lo`eve series is not guaranteed, while a finite sum is always well defined regardless of the properties of the random summands. 2·3. The two-sample test Having defined the notion of a dispersion operator, we now construct a two-sample secondorder test based upon it. Let X1, . . . , Xn1 and Y1, . . . , Yn2 be two independent random samples from distributions P1, P2 on H, whose ρ-centres are μ(P1), μ(P2) and ρ-dispersion operators are R(P1), R(P2). The goal is to test the null hypothesis H0: R(P1) = R(P2) against the general alternative H1: R(P1) |= R(P2). Note that μ(P1), μ(P2) can be equal or different, as neither H0 nor H1 specifies their relation. We propose to employ the general idea of score tests, that is, to base the test on the estimating score for the general model, without assuming H0, evaluated at the null estimate of the parameter. As the centres μ(P1), μ(P2) are not restricted under the null hypothesis, they can be estimated separately by minimizing L( ˆP1; μ1), L( ˆP2; μ2), i.e., by solving G( ˆP1; μ1) = 0, G( ˆP2; μ2) = 0, respectively. Denote μ( ˆPj ) by ˆμj ( j = 1, 2). On the other hand, the null estimator of the dispersion is based on both samples. As we now have two samples, we need to extend our notation to cover situations with two distributions, empirical or theoretical, mixed at proportions a and 1 − a for a ∈ (0, 1). We denote M(P1, P2, a; R1, R2, μ1, μ2) = aM(P1; R1, μ1) + (1 − a)M(P2; R2, μ2). The common null value R of the dispersion operator is estimated by ˆR, which minimizes M( ˆP1, ˆP2, an; R, R, ˆμ1, ˆμ2) where an = n1/n with n = n1 + n2. Equivalently, ˆR solves G ( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2) = O, the null estimating equation, where G (P1, P2, a; R, μ1, μ2) = aG (P1; R, μ1) + (1 − a)G (P2; R, μ2). Using the reparameterization R = (R1 + R2)/2, T = (R1 − R2)/2, we have R1 = R + T , R2 = R − T and we need to test H0: T = O against H1: T |= O. For the test, we need the score in the general model ∂ ∂(R, T )T M( ˆP1, ˆP2, an; R + T , R − T , ˆμ1, ˆμ2) = G ( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2) B( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2) where B(P1, P2, a; R, μ1, μ2) = aG (P1; R, μ1) − (1 − a)G (P2; R, μ2). The score test is based on this general score at the null estimator. When evaluated at (R, T ) = ( ˆR, O), the score is zero in the first component. Thus, the test can be based on the second component B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2). 
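For completeness, the chain-rule step behind the two-block score displayed above can be written out as follows; this is a routine verification in the notation of this section, with the dispersion score operator written here as \mathcal{G} and M denoting M( ˆP1, ˆP2, an; R + T, R − T, ˆμ1, ˆμ2).

```latex
% chain-rule step behind the two-block score
\frac{\partial}{\partial R}\,M = a_n\,\mathcal{G}(\hat P_1;\,R+T,\,\hat\mu_1)
                               + (1-a_n)\,\mathcal{G}(\hat P_2;\,R-T,\,\hat\mu_2),
\qquad
\frac{\partial}{\partial T}\,M = a_n\,\mathcal{G}(\hat P_1;\,R+T,\,\hat\mu_1)
                               - (1-a_n)\,\mathcal{G}(\hat P_2;\,R-T,\,\hat\mu_2).
```

Setting T = O turns these into G( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2) and B( ˆP1, ˆP2, an; R, ˆμ1, ˆμ2), and the former vanishes at R = ˆR by the null estimating equation, which is why only the second component is informative.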
atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 820 DAVID KRAUS AND VICTOR M. PANARETOS When the null hypothesis holds, the score operator B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) is expected to be close to the zero operator, otherwise it should be far from the zero operator. To perform the test, we need to measure the distance of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) from the zero operator and assess the significance of the resulting test statistic. One way to measure the distance of the score operator from zero is to use its Hilbert–Schmidt norm. A drawback of this approach is that the resulting statistic does not have a tractable asymptotic distribution. The score operator turns out to be asymptotically Gaussian, but its Hilbert– Schmidt norm is not asymptotically distribution-free. In the context of comparison of covariance operators, Boente et al. (2011) use a simulation procedure to approximate the distribution of the statistic. Another idea is to mimic the standard procedure from settings where the parameter of interest is Euclidean. In such settings, the difference of the score vector from zero is measured with the help of a quadratic form involving the score vector and the inverse of its covariance matrix. The quadratic statistic is usually asymptotically chi-square distributed and the null hypothesis is then rejected when the value of the statistic is significantly large. In the functional context, the score B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) is infinite dimensional. Due to the noninvertibility of its covariance operator, one cannot construct a quadratic statistic. We overcome this problem by regularizing the score operator using spectral truncation. The test object B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) is an element of the space of operators HS(H, H). Recall that HS(H, H) is a Hilbert space with inner product defined as A1, A2 = ∞ k=1 A1ek, A2ek = ∞ j=1 ∞ k=1 ej , A1ek ej , A2ek , A1, A2 ∈ HS(H, H), where {ek : k = 1, 2, . . . } is an arbitrary complete orthonormal basis of H. For any complete orthonormal basis {Ek : k = 1, 2, . . . } of HS(H, H), an operator A ∈ HS(H, H) and the square of its Hilbert–Schmidt norm can be written as A = ∞ k=1 A , Ek Ek, A 2 = ∞ k=1 A , Ek 2 . Instead of this infinite series, one can use a truncated version. If U ⊂ HS(H, H) is a suitably chosen finite-dimensional linear subspace with an orthonormal basis {U1, . . . , UL}, then instead of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) 2 one can use πU B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) 2 = B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)πU 2 = L l=1 B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2), Ul 2 , where πU is the projection onto the subspace U . That is, the test can be based on a score vector with components Sl = B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2), Ul (l = 1, . . . , L). (4) One particular way of choosing the basis elements Ul is to derive them from a basis of the Hilbert space H. If U is a K-dimensional linear subspace of H with an orthonormal basis {u1, . . . , uK }, atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 821 then one may use the L = K(K + 1)/2 orthonormal operators of the form Ujk = u j ⊗ u j ( j = k), (u j ⊗ uk + uk ⊗ u j )/21/2 ( j < k). (5) There is yet another way of motivating the above truncation. Instead of measuring the difference of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) from zero on the entire Hilbert space H, we can measure how it differs from the zero operator when attention is restricted to the linear subspace U. 
More precisely, instead of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2), we use the operator πU B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)πU , where πU is the projection operator on U. Its squared Hilbert– Schmidt norm πU B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)πU 2 = K j=1 K k=1 u j , B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)uk 2 is a truncated version of B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) 2 = ∞ j=1 ∞ k=1 ej , B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)ek 2 , where {ej : j = 1, 2, . . . } is any complete orthonormal basis of H. The resulting scores Sjk = u j , B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)uk (1 j k K) are equivalent to (4) with Ul of the form (5). It is natural to use the basis operators of the form (5) with u1, . . . , uK being the first K eigenfunctions of the dispersion operator R because, in light of Mercer’s theorem, they carry the main portion of information about the dispersion operator. In practice, the eigenfunctions of R are not known, so one uses the eigenfunctions of the pooled sample estimator ˆR. The number of components K can be selected as the minimal number for the cumulative proportion of dispersion explained by the subspace to exceed a certain threshold, e.g., 80% of the trace of the corresponding pooled sample dispersion operator. The proportion of dispersion, corresponding to the eigenvalues of the dispersion operator, is in general not equivalent to the proportion of variability, corresponding to the eigenvalues of the covariance operator. To construct the test statistic, instead of simply summing squares of the terms Sl of the form (4), one combines them in a quadratic form reflecting their covariance structure. The formal test will be based on the asymptotic distribution of the test statistic. Let n1, n2 be such that n1 → ∞, n2 → ∞ and an = n1/n → a ∈ (0, 1). Assume that G(Pj ; μ) 2, G (Pj ; R, μ) 2 ( j = 1, 2) are finite. Let the function ρ: R+ 0 → R+ 0 be twice differentiable, strictly increasing, and convex with ρ(0) = 0. Assume that the laws P1, P2 satisfy the conditions of Corollary 1 and the expectations EPj {ρ ( X − μ )2}, EPj [ρ { P(X; μ) − R }2], EPj {ρ ( X − μ )}, EPj [ρ { P(X; μ) − R }] and EPj ρ ( X − μ ) X − μ , EPj ρ { P(X; μ) − R } P(X; μ) − R ( j = 1, 2) are finite. Assume that the derivatives D(Pj ; μ), D(Pj ; R, μ), D(Pj ; R, μ) given in (A1)– (A3) in the Appendix exist for j = 1, 2. Let S be a score vector of length L of the form (4) for some linearly independent operators Ul = U (n) l . Let the operators Ul be either nonrandom, independent of n, or convergent in probability to some nonrandom limits, up to a possible sign ambiguity in the sense that there atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 822 DAVID KRAUS AND VICTOR M. PANARETOS exist some operators U ∞ l such that | U (n) l , U ∞ l | converges to 1. In this set-up, we have the following theorem. THEOREM 1. Under the null hypothesis H0 : R(P1) = R(P2), the score n1/2B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) converges weakly to a mean zero Gaussian random operator with covariance operator, which can be consistently estimated by W( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) given in (A5) in the Appendix. The asymptotic distribution of the score vector n1/2S is L-variate zero-mean Gaussian with a covariance matrix that is consistently estimated by a matrix W with entries Wj,l = Uj , W( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2)Ul ( j,l = 1, . . . , L). The test statistic T = nST W−1S asymptotically follows a χ2 distribution with L degrees of freedom. We now deal with the two main cases, spatial and L2-type, explicitly. 
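Before turning to those two cases in detail, the following minimal sketch shows how the truncated score vector, its estimated covariance matrix W and the chi-square statistic of Theorem 1 fit together, specialized for concreteness to the L2 case ρ(u) = u², whose explicit score and covariance expressions appear just below. It assumes curves given by coefficients in a common orthonormal basis, uses the 80% trace threshold mentioned above, and the function name and interface are illustrative, not the authors' code.

```r
# L2-type score test from coefficient matrices X (n1 x p) and Y (n2 x p);
# rows are coefficient vectors in a common orthonormal basis (illustrative sketch).
l2_score_test <- function(X, Y, threshold = 0.8) {
  n1 <- nrow(X); n2 <- nrow(Y); n <- n1 + n2; a <- n1 / n
  Xc <- sweep(X, 2, colMeans(X)); Yc <- sweep(Y, 2, colMeans(Y))
  R1 <- crossprod(Xc) / n1              # empirical covariance operators
  R2 <- crossprod(Yc) / n2
  R0 <- a * R1 + (1 - a) * R2           # pooled null estimator
  B  <- 4 * a * (1 - a) * (R2 - R1)     # score operator in the L2 case

  # basis operators U_{jk} built from the leading eigenvectors of the pooled estimator
  eg <- eigen(R0, symmetric = TRUE)
  K  <- which(cumsum(eg$values) / sum(eg$values) >= threshold)[1]
  u  <- eg$vectors[, 1:K, drop = FALSE]
  Ul <- list()
  for (j in 1:K) for (k in j:K) {
    Ul[[length(Ul) + 1]] <- if (j == k) tcrossprod(u[, j]) else
      (tcrossprod(u[, j], u[, k]) + tcrossprod(u[, k], u[, j])) / sqrt(2)
  }
  L <- length(Ul)
  S <- sapply(Ul, function(U) sum(U * B))   # score vector <U_l, B>

  # residual operators P(X_i; mu1) - R1 and P(Y_i; mu2) - R2
  Z1 <- lapply(seq_len(n1), function(i) tcrossprod(Xc[i, ]) - R1)
  Z2 <- lapply(seq_len(n2), function(i) tcrossprod(Yc[i, ]) - R2)
  proj <- function(Zs, U) sapply(Zs, function(Z) sum(U * Z))
  W <- matrix(0, L, L)
  for (j in 1:L) for (l in j:L) {
    w <- 16 * a * (1 - a) *
      ((1 - a) * mean(proj(Z1, Ul[[j]]) * proj(Z1, Ul[[l]])) +
       a       * mean(proj(Z2, Ul[[j]]) * proj(Z2, Ul[[l]])))
    W[j, l] <- W[l, j] <- w
  }
  stat <- n * sum(S * solve(W, S))          # T = n S' W^{-1} S, chi-square with L df
  list(statistic = stat, df = L, p.value = pchisq(stat, df = L, lower.tail = FALSE), K = K)
}
```

For the spatial case the score operator and its covariance estimator do not have such closed forms, so the corresponding quantities are assembled from the expressions that follow.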
In the spatial case, ρ(u) = u, we test the null hypothesis that the spatial dispersion operators are equal in both samples. The score operator takes the form B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) = 1 n n1 i=1 ˆR − P(Xi ; ˆμ1) ˆR − P(Xi ; ˆμ1) − 1 n n2 i=1 ˆR − P(Yi ; ˆμ2) ˆR − P(Yi ; ˆμ2) . The Fr´echet derivatives D(P; μ), D(P; R, μ) involved in the covariance operator of the score are D(P; μ) = EP 1 X − μ I − (X − μ) ⊗ (X − μ) X − μ 2 , D(P; R, μ) = EP 1 P(X; μ) − R I − {P(X; μ) − R} ⊗ {P(X; μ) − R} P(X; μ) − R 2 , and the derivative D(P; R, μ) evaluated at f ∈ H is D(P; R, μ) f = EP −Q(X; μ) f P(X; μ) − R + P(X; μ) − R, Q(X; μ) f P(X; μ) − R 3 {P(X; μ) − R} . When the L2 approach, ρ(u) = u2, is employed, the hypothesis to be tested states that the covariance operators in both samples are equal. The null estimator of R takes the form ˆR = an ˆR1 + (1 − an) ˆR2, that is, the pooled covariance estimator. The test score operator equals B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) = an2( ˆR − ˆR1) − (1 − an)2( ˆR − ˆR2) = 4an(1 − an)( ˆR2 − ˆR1), which is a multiple of the difference of the empirical covariance operators. So, the test is equivalent to a Wald-type test proposed by Panaretos et al. (2010). This is different from the spatial test for which the score does not simplify to the difference of the spatial dispersions, so the score test differs from the Wald test. To compute the covariance operator of the test score, we first notice that D(P; R, μ) = −2 EP{Q(X; μ)} equals zero at μ = μ(P) = EP(X); see (A4) in the Appendix. Consequently, the fact that the centres of the two distributions must be estimated does not affect the asymptotic distribution, as could be expected. Also, D(P; R, μ) = 2I. Hence, after straightforward calculations, the estimator of the covariance operator of the test operator is W( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) = 4an(1 − an){(1 − an)J( ˆP1; ˆR, ˆμ1) + anJ( ˆP2; ˆR, ˆμ2)} atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 823 = 16an(1 − an) × (1 − an) 1 n1 n1 i=1 {P(Xi ; ˆμ1) − ˆR1} ⊗ {P(Xi ; ˆμ1) − ˆR1} + an 1 n2 n2 i=1 {P(Yi ; ˆμ2) − ˆR2} ⊗ {P(Yi ; ˆμ2) − ˆR2} . In Panaretos et al. (2010), the limiting covariance of the L2 score for the Wald-type test was investigated in the special case of Gaussian data and a simpler formula was found. 3. A SIMULATION STUDY In order to investigate the performance of the testing procedure introduced in § 2·3, we generate random samples of size n1, n2 of curves of the form X(t) = μ1(t) + 10 k=1 λ 1/2 1k a1k21/2 sin{2πk(t + γ1k)} + 10 k=1 ν 1/2 1k b1k21/2 cos{2πk(t + δ1k)}, Y(t) = μ2(t) + 10 k=1 λ 1/2 2k a2k21/2 sin{2πk(t + γ2k)} + 10 k=1 ν 1/2 2k b2k21/2 cos{2πk(t + δ2k)}, where the coefficients ajk, bjk are mutually independent random variables with zero-mean and unit variance. Three symmetric coefficient distributions are considered: normal, uniform and t5, all scaled to have unit variance. As the test procedures are invariant with respect to the location shift of one or both samples, we set μ1(t) = μ2(t) = 0. Unless stated otherwise, we set γjk = δjk = 0 in all situations. We perform the nonresistant L2 test and the proposed spatial dispersion test at the nominal level α = 0·05. The sample sizes are n1 = n2 = 50. 
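For orientation, here is a minimal sketch of the curve-generating model just described; the function name, grid and the example spectra in the last two lines are illustrative rather than the exact settings of the study, and coefficient distributions other than the normal can be substituted by replacing rnorm with another standardized generator.

```r
# Sketch of the simulation model: sums of sine and cosine components with random
# zero-mean, unit-variance scores (names and example parameter values illustrative).
simulate_sample <- function(n, lambda, nu, gamma = 0, delta = 0,
                            t = seq(0, 1, length.out = 101), rgen = rnorm) {
  K <- length(lambda)
  gamma <- rep(gamma, length.out = K); delta <- rep(delta, length.out = K)
  X <- matrix(0, n, length(t))
  for (i in seq_len(n)) {
    a <- rgen(K); b <- rgen(K)
    for (k in seq_len(K)) {
      X[i, ] <- X[i, ] +
        sqrt(lambda[k]) * a[k] * sqrt(2) * sin(2 * pi * k * (t + gamma[k])) +
        sqrt(nu[k])     * b[k] * sqrt(2) * cos(2 * pi * k * (t + delta[k]))
    }
  }
  X
}
# e.g. two samples of 50 curves with illustrative eigenvalue sequences
X <- simulate_sample(50, lambda = (1:10)^(-3), nu = 0.4^(1:10))
Y <- simulate_sample(50, lambda = (1:10)^(-3), nu = 0.4^(1:10))
```

The remaining design choices, dimension reduction and contamination, are described next.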
The basis of the subspace for dimension reduction consists of several leading eigenfunctions of the pooled sample estimator of the dispersion operator; that is, the pooled sample empirical covariance for the L2 test and the pooled sample empirical spatial dispersion for the spatial test. The number of components K included in the basis is selected as the minimal number needed to explain at least 80% of the dispersion. We first study the behaviour of the test procedures under the null hypothesis. We set λ1k = λ2k = k−3 and ν1k = ν2k = (1/3)k. We begin with uncontaminated samples to verify that the tests maintain the prescribed nominal level. The first row of Table 1 shows that, in general, the asymptotic distribution approximates the distribution of both test statistics reasonably well. The asymptotic approximation for the L2 method is slightly less accurate and tends to be liberal for distributions with light tails, i.e., normal and uniform. Next we simulate datasets contaminated by atypical observations. Mean contamination, i.e., observations whose mean is different from the mean of the central distribution, usually impacts the level more seriously than pure covariance contamination, i.e., observations with the same mean but different covariance structure. Thus, we focus on mean contamination, i.e., outliers, in the study of the resistance of the level. In one or both samples, mj out of nj observations were replaced by observations that have mean function μcont j instead of μj and the same covariance structure as the original distribution. We consider various distances of the contamination distribution from the central distribution and various contamination proportions, as indicated in Tables 1 atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 824 DAVID KRAUS AND VICTOR M. PANARETOS Table 1. Empirical rejection probabilities (%) at the nominal level α = 5% under the null hypothesis. Samples of size n1 = n2 = 50 are contaminated by m1, m2 observations with mean functions μcont 1 , μcont 2 , respectively, and the same covariance structure as the central distribution. Estimates are based on 2000 simulation runs Normal t5 Uniform m1 μcont 1 (t) m2 μcont 2 (t) L2 Spatial L2 Spatial L2 Spatial 0 0 7·1 5·0 5·4 5·3 7·8 4·6 5 1 5 1·5 − 3 sin(πt) 9·2 6·6 8·2 6·4 10·0 4·6 5 1·5 5 1·5 − 3 sin(πt) 14·4 6·4 14·6 6·8 14·6 4·6 5 2·5 5 1·5 − 3 sin(πt) 22·9 6·0 23·0 7·2 23·0 5·1 5 1 5 2 − 4 sin(πt) 11·2 7·2 10·3 7·7 11·7 5·2 5 1·5 5 2 − 4 sin(πt) 18·8 7·2 19·8 7·8 20·0 5·4 5 2·5 5 2 − 4 sin(πt) 30·4 7·2 32·4 8·2 30·8 6·4 5 1 5 2·5 − 5 sin(πt) 14·1 8·2 14·0 8·0 15·0 6·4 5 1·5 5 2·5 − 5 sin(πt) 25·9 8·2 25·4 8·4 27·8 6·5 5 2·5 5 2·5 − 5 sin(πt) 41·8 8·3 46·4 9·0 42·4 7·2 5 1 0 7·4 6·0 6·4 5·4 8·6 5·0 5 1·5 0 12·6 5·9 11·2 5·7 13·4 4·6 5 2·5 0 19·0 6·1 17·8 6·0 17·8 4·7 0 5 1·5 − 3 sin(πt) 9·0 6·0 7·2 6·6 9·8 5·6 0 5 2 − 4 sin(πt) 12·3 6·8 10·8 7·7 13·0 6·6 0 5 2·5 − 5 sin(πt) 16·4 7·6 14·4 8·7 16·8 7·6 Table 2. Empirical rejection probabilities (%) at the nominal level α = 5% under the null hypothesis. Samples of size n1 = n2 = 50 are contaminated by m1, m2 observations with mean functions μcont 1 (t) = 1·5, μcont 2 (t) = 2 − 4 sin(πt), respectively, and the same covariance structure as the central distribution. 
Estimates are based on 2000 simulation runs

          m1 = m, m2 = 0     m1 = 0, m2 = m     m1 = m2 = m
 m        L2      Spatial    L2      Spatial    L2       Spatial
 0        7·1     5·0        7·1     5·0        7·1      5·0
 1        7·0     5·4        6·7     5·1        7·2      5·6
 2        6·8     5·0        7·5     5·4        7·8      5·6
 3        6·9     5·3        8·7     5·6        8·4      6·2
 4        8·4     6·2        10·7    6·2        11·2     6·4
 5        12·6    5·9        12·3    6·8        18·8     7·2
 6        24·8    6·5        14·8    7·5        39·2     8·1
 7        57·8    7·4        17·2    8·6        71·6     10·2
 8        89·2    7·9        20·8    9·2        93·0     17·6
 9        99·0    11·9       24·7    11·4       99·0     28·2
10        99·8    18·4       28·2    13·6       100·0    42·7

and 2. We consider only atypical observations that are not very far from the central distribution. These are the most insidious because they are often hidden in the main, apparently typical part of the dataset, do not stand out and thus are not easily identified visually, yet they often have a devastating impact on the behaviour of the nonresistant test. To illustrate this, we plot in Fig. 1 typical simulated samples with m1 = 5, μcont 1 (t) = 1·5 and m2 = 5, μcont 2 (t) = 2 − 4 sin(πt). When looking at the plots, one would be unable to identify the atypical observations if they were not highlighted. Visually, many of them do not seem to be very different from most curves, whereas some curves from the central distribution could be considered unusual.

Fig. 1. Simulated contaminated samples. (a) Samples with m1 = 5 atypical observations with μcont 1 (t) = 1·5; (b) samples with m2 = 5 atypical observations with μcont 2 (t) = 2 − 4 sin(πt). Atypical observations plotted in bold.

Table 1 shows that the proposed spatial test is much more resistant to contamination than the L2-type test. For instance, notice that for m1 = m2 = 5, i.e., 10% contamination of both samples, the level of the spatial test in all situations considered is only slightly inflated, while the actual level of the L2-type test exceeds 40%. Similarly, if one of the samples contains five atypical observations and the other is not contaminated, i.e., 10% contamination of one sample and 5% contamination overall, the spatial test rejects with probability close to the nominal level, while the level of the L2-type test is as high as 19%. As the magnitude of atypical observations increases, the true level of the L2 test, unlike that of the spatial one, increases dramatically. Comparing the behaviour of the tests across the various coefficient distributions, we observe no important differences. The higher resistance of the spatial method is also documented in Table 2, where the dependence of the level on the amount of contamination is studied for Gaussian data. The spatial procedure can tolerate much more contamination than can the L2-type method. Now we focus on the behaviour of the tests under alternatives. We consider five alternative scenarios. Under all of them, the parameters of the distribution of the first sample are λ1k = k−3 and ν1k = (2/5)k. The parameters of the second sample are as follows. Under scenario I, we have λ2k = 1·6λ1k and ν2k = 1·6ν1k (k = 1, . . . , 10), so the samples differ only in scale; their covariance structure is otherwise the same. Under scenario II, we use λ21 = 1·5, ν21 = 0·8, and λ2k = λ1k and ν2k = ν1k (k = 2, . . . , 10), so the covariance operators differ in the two leading eigenvalues, which however correspond to the same eigenfunctions. Scenario III has λ2k = λ1k (k = 1, . . . , 10) and ν21 = 0·2, ν22 = 0·35 and ν2k = ν1k (k = 3, . . .
, 10); here the difference is on the second and third eigenvalues whose corresponding eigenfunctions are the same but in the opposite order. Under scenario IV, we set λ22 = λ13, λ23 = λ12, ν22 = ν13, ν23 = ν12 and λ2k = λ1k, ν2k = ν1k (k /∈ {2, 3}), so the difference occurs further down in the spectrum; eigenfunctions with indices 3, 4, 5, 6 are permuted, the leading two eigen-elements do not differ. Under scenario V, we use λ2k = λ1k, ν2k = ν1k and γ2k = δ2k = 0·15 (k = 1, . . . , 10); in this case, the whole eigenbases are different but the eigenvalues remain the same in both samples. First, we compare the power of the proposed spatial method with the L2-type method for samples without contamination. Table 3 shows that in most cases the power of the spatial test is lower than the power of the L2-type test for distributions with light tails. The lower efficiency of the spatial method is the price we pay for its increased resistance. Both methods have comparable power in the heavy tailed case under most scenarios. Under scenario IV the spatial method outperforms the L2-type method. This is due to the automatic selection of K: for instance in the normal case, for the L2-type test K equals 3 in 91 percent of cases while, for the spatial test, K equals 4 in 96 percent of cases; as the covariance operators differ on the third to sixth eigen-elements, K equal to 4 captures more of the difference. atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 826 DAVID KRAUS AND VICTOR M. PANARETOS Table 3. Empirical rejection probabilities (%) at the nominal level α = 5% under various alternative scenarios for samples of size n1 = n2 = 50 without contamination. Estimates are based on 1000 simulation runs Normal t5 Uniform L2 Spatial L2 Spatial L2 Spatial I 55 40 28 30 93 62 II 53 29 28 22 92 48 III 74 53 36 38 99 85 IV 38 61 24 53 49 73 V 76 58 53 51 96 72 Table 4. Empirical rejection probabilities (%) of the spatial test at the nominal level α = 5% under various alternative scenarios for samples of size n1 = n2 = 50 contaminated by m1, m2 atypical observations. Estimates are based on 1000 simulation runs Contamination m1 m2 I II III IV V configuration 0 0 40 29 53 61 58 A 5 5 12 16 57 64 59 5 0 34 25 54 62 58 0 5 15 16 56 63 61 B 5 5 29 22 36 39 55 5 0 33 28 46 74 55 0 5 40 28 49 34 57 C 5 5 24 18 34 39 52 5 0 32 22 43 50 62 0 5 31 24 43 49 48 Next, we investigate the impact of contamination on the power of the spatial test; we do not study the L2-type test as we have seen before that its level is unreliable for contaminated data. The goal is to study if and how contamination can decrease the power. Similarly to the null scenario, here we also observed that mean contamination usually increases the rejection probability. Therefore, it is more interesting to contaminate data with curves with atypical covariance structure. We experimented with many configurations of atypical observations such that it is difficult to identify them visually and found that often even covariance contamination increases the rejection probability. Nevertheless, we were able to find some configurations for which we observed a decrease of the power in some situations. The central distributions follow the same scenarios I–V as before with normally distributed coefficients. Contamination configurations are as follows. Under configuration A, the contamination distribution has λcont 1k = 1·4λ1k, νcont 1k = 1·4ν1k, λcont 2k = 0·25λ2k and νcont 2k = 0·25ν2k (k = 1, . . . 
, 10), other parameters of the contamination distribution are the same as for the central distribution. Under configuration B, we set λcont 1k = 0·3λ1k and λcont 2k = 0·3λ2k (k = 1, . . . , 10), νcont 1k = 0·3ν1k and νcont 2k = 0·3ν2k (k = 3, . . . , 10), and νcont 11 = νcont 21 = 1 and νcont 12 = νcont 22 = 0·9, while other parameters remain unchanged. Under configuration C, atypical observations in the first sample follow the central distribution of the second sample and atypical observations in the second sample follow the central distribution of the first sample. The simulation results are presented in Table 4. We report only configurations with some detrimental effect on the power, while many configurations not reported here do not have such an effect. Under configuration A, we can see a decrease of the rejection probability for scenarios I and II. Configuration A was specifically designed to decrease the power under scenario I: atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 827 TATA −0·2 −0·1 0·0 0·1 0·2 −0·2 −0·2 −0·1 0 0·1 0·2 −0·2 −0·1 0 0·1 0·2 −0·1 0·0 0·1 0·2 CAP Principal axis of inertia 2 Principal axis of inertia 2 Principalaxisofinertia3 Principalaxisofinertia3 Fig. 2. Projection of DNA minicircle curves on the first principal plane spanned by the second and third principal axis of inertia. Atypical observations plotted in bold. atypical observations deviate from the central distribution against the direction of the alternative; specifically, both the central and contamination distributions have proportional covariance operators but in the opposite direction. A similar phenomenon is seen for scenario II, where the directions of the alternative and of the contamination distribution are in a similar relationship. On the other hand, we observe no important effect of contamination of type A under scenarios III–V because in these cases atypical observations do not go against the alternative. Under configuration B, the power decreases mainly for scenarios III and IV. Configuration B downweights components other than the first and second cosine component, where it puts higher weight equal for both samples. As these are components carrying an important part of the difference between the covariances, one expects some decrease of the rejection probability, especially under scenarios III and IV. Under configuration C, the two samples are partly mixed, i.e., one sample contaminates the other sample and vice versa. This blurs the difference and somewhat decreases the power under some of the scenarios. 4. AN ILLUSTRATION: DNA MINICIRCLE DATA We illustrate the proposed methods on a dataset consisting of reconstructed three-dimensional electron microscope images of loops called minicircles obtained from short strands of DNA (Amzallag et al., 2006). The dataset contains 99 DNA minicircles of two types, TATA, 65 observations, and CAP, 34 observations, with identical base-pair sequences, except for a short subsequence where they differ. The main question is whether this difference affects the flexibility properties of the DNA minicircles. One way to formalize the flexibility properties is through the fluctuation pattern around the mean minicircle shape. This naturally leads one to consider twosample second-order functional comparisons. DNA minicircles are closed curves in R3. In the original dataset, each curve was randomly rotated and shifted in R3 and had no starting point and no orientation. In Panaretos et al. 
(2010), an alignment procedure based on the moment of inertia tensor was used as a means of alignment of the curves in a common coordinate system. Figure 2 shows projections of aligned curves on the plane spanned by the two principal axes of inertia. Using inverse weights induced by Gervini’s (2008) spatial median, Panaretos et al. (2010) identified five unusual curves, possible outliers, and removed them from the analysis of the covariance structure. These atypical curves, plotted in thick lines in Fig. 2, are visibly different from the remaining curves. Panaretos et al. (2010) analysed the data without the atypical observations using a test comparing empirical covariance operators under the assumption that atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 828 DAVID KRAUS AND VICTOR M. PANARETOS the curves are Gaussian. Under this assumption, they observed significant differences at the 5% level. These differences were highly significant with a numerically zero p-value, when the comparison was restricted to the eigenvalues of the covariance operators; the corresponding empirical eigenfunctions suggested that the eigenfunction structure of the two operators was very similar. Taking advantage of the results in the present paper, we may run an L2-type test without assuming normality. When doing so, with the atypical observations still removed, the p-value of the L2-type score test of the equality of covariance operators equals 0·023 with the dimension of the subspace on which the test operator is projected equal to K = 6, suggesting persistence of the effect, independently of a Gaussian assumption. Instead of removing apparently atypical observations manually, one might also wish to run an analysis on the complete dataset. However, the performance of L2-type procedures was seen to be highly unstable in the presence of atypical observations, such as the ones in the present dataset, see Tables 1 and 2. By contrast, the spatial dispersion test was seen to maintain a level close to nominal in our simulations, especially in outlier scenarios similar to the one in the minicircle data. There may be further influential observations lurking in the sample. For this reason, we applied the score test based on the spatial dispersion operator, using the full minicircle dataset. In contrast to the other procedures, this yielded the p-value 0·353 indicative of a lack of significant differences in the spatial dispersions. The value of K was selected as the minimal number of components needed to explain 80% of the trace of the underlying null dispersion estimator. No further outliers were detected by the resistant test. The discordance between the L2 and spatial tests is probably due to the reduced efficiency of the resistant procedure when the two samples share common eigenfunctions, as seems to be the case in the minicircle dataset; recall that the dispersion operator shares the same eigenfunctions with the covariance operator, possibly up to order. It was seen in our simulations that, in general, though the level of the spatial test was conserved, in the presence of influential observations its power was appreciably reduced when differences were only in the eigenvalues, i.e., under scenarios I and II in Table 4, as compared to scenarios where differences exist between the eigenfunctions, too, i.e., scenarios III–V in Table 4. 
Moreover the present framework does not immediately yield a special version of the test that would concentrate only on the eigenvalue structure; the complete structure of the operator is taken into account. ACKNOWLEDGEMENT We thank the editor, associate editor, and two anonymous referees for their extensive, constructive, and in-depth comments and suggestions. This research was supported in part by the European Research Council. SUPPLEMENTARY MATERIAL Supplementary material available at Biometrika online includes proofs of Proposition 1, Corollary 1, Proposition 2, Theorem 1 and a technical lemma needed in the proof of Theorem 1. APPENDIX Computation Assume that the observations Xi ∈ H are represented as linear combinations of some known fixed basis elements ψj , that is, Xi = p j=1 ξi j ψj . This representation is usually obtained by a least squares procedure, possibly with smoothing, from some form of discrete original observations of Xi . The exact form of the original data depends on the particular application. For instance, when H is a functional, L2 , space indexed by one-dimensional time, the original data usually consist of observations Xi (tk) (k = 1, . . . , m) for a grid atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 829 of points t1 < · · · < tm. Now suppose that the original data are observed discretely but exactly, i.e., without noise; later we explain how to handle noisy discrete observations. The methods proposed in this paper have the advantage that all required quantities and operations can be expressed in terms of basis coefficients; thus, from the computational point of view the task is multivariate. To estimate the centre, it is enough to find the vector of coefficients mj in its basis expansion μ = p j=1 mj ψj . Similarly, for the dispersion operator, we need to find the matrix of coefficients Rj j in the expansion R = p j=1 p j =1 Rj j ψj ⊗ ψj . For simplicity, we first assume that the basis ψ1, . . . , ψp is orthonormal. Then, the norm in the objective function for μ is simply the norm of the coefficient vector, i.e., Xi − μ 2 = ξi − m 2 = p j=1(ξi j − mj )2 , and the score operator G( ˆP; μ) is equivalent to the p-vector 1 n n i=1 ρ ( ξi − m ) ξi − m (m − ξi ). The Hilbert–Schmidt norm in the objective function for R is the Frobenius norm of the coefficient matrix, i.e., P(Xi ; μ) − R 2 = (ξi − m)(ξi − m)T − R 2 = p j=1 p j =1 {(ξi j − mj )(ξi j − m j ) − Rj j }2 , and the score operator G ( ˆP; R, μ) is equivalent to the p × p matrix 1 n n i=1 ρ { (ξi − m)(ξi − m)T − R } (ξi − m)(ξi − m)T − R {R − (ξi − m)(ξi − m)T }. For the two-sample test, the operator B( ˆP1, ˆP2, an; ˆR, ˆμ1, ˆμ2) and the basis elements Ul for dimension reduction are equivalent to matrices, and the score components Sl are computed as their inner products. Similarly, all quantities involved in the covariance matrix of the score vector are computed in a multivariate setting. When the basis ψ1, . . . , ψp is not orthonormal, one simply multiplies each coefficient vector ξi by the matrix A1/2 where A has entries aj j = ψj , ψj , and performs all computations, i.e., estimation of the centre and dispersion, eigen-decomposition and the two-sample test, with these transformed multivariate inputs. This corresponds to switching from the original basis to the orthonormal basis A−1/2 (ψ1, . . . , ψp)T . 
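A minimal sketch of this change of basis follows; the Gram matrix square root is computed here by an eigendecomposition, and all names are illustrative.

```r
# xi: n x p matrix of coefficients in a (possibly non-orthonormal) basis psi_1,...,psi_p
# A:  p x p Gram matrix with entries <psi_j, psi_j'>
to_orthonormal <- function(xi, A) {
  e <- eigen(A, symmetric = TRUE)
  Ahalf    <- e$vectors %*% diag(sqrt(e$values), nrow(A)) %*% t(e$vectors)      # A^{1/2}
  Ainvhalf <- e$vectors %*% diag(1 / sqrt(e$values), nrow(A)) %*% t(e$vectors)  # A^{-1/2}
  list(coef = xi %*% Ahalf,   # coefficients with respect to the orthonormal basis
       back = Ainvhalf)       # used to map estimated centres and eigenfunctions back
}
```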
If needed, the centre and the eigenfunctions can then be obtained in the original basis by multiplying their coefficient vectors by A−1/2 and in the dispersion by multiplying its coefficient matrix by A−1/2 from both sides. We refer to Ramsay & Silverman (2005, § 8.4.2) for a detailed explanation of a similar problem of computing functional principal components from coefficients with respect to a general non-orthonormal basis. To estimate the centre and dispersion one solves the corresponding multivariate optimization problem. If ρ(u) = u2 , the solutions are the sample mean and covariance matrix of the coefficient vectors; otherwise an iterative procedure is used. We use the Broyden–Fletcher–Goldfarb–Shanno quasi-Newton method implemented in the R package (R Development Core Team, 2012) in the function optim, initialized by the componentwise median of ξi for the centre and the componentwise median of (ξi − m)(ξi − m)T for the dispersion. This numerical procedure was reliable and reasonably fast in our experiments. This is in agreement with a detailed study of the numerical performance of various algorithms for the spatial median presented by Fritz et al. (2012). In functional settings one can directly use the functional values on a grid of points instead of computing with basis coefficients. The basis approach is slightly more general than the discretization approach because it can be used for any separable Hilbert space, not only a functional space, and in the functional atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 830 DAVID KRAUS AND VICTOR M. PANARETOS case it does not require a common grid for all functions. Standard software for functional data analysis, such as the fda package in R, uses basis representations of data. In many applications, the original functional values on a grid of points are observed with noise. In such situations, some degree of smoothing is necessary for the reconstruction of the underlying functional data. Ramsay & Silverman (2005, Chapter 5) describe how roughness penalties can be used to compute the basis coefficients of the functions. After this preliminary step, our methods can be applied to the reconstructed curve, i.e., their basis coefficients, as described above. In the case of the spatial median, Gervini (2008, pp. 589–590) proposes an alternative method to deal with noise in discretely observed functions. Rather than on denoising and reconstructing the curves, his procedure is based on removing the bias, which is due to the errors, in the norm in the objective function with the help of a consistent estimate of the variance of the errors. He uses this idea in connection with numerical integration on a grid, but it can be adapted to the basis approach as well. However, this method is less practical for second-order problems, as one would also need to estimate higher order moments of the errors and use convoluted formulae to remove the bias from the norm in the objective functional. Technical material We now derive several key expressions pertaining to the assumptions, statement and discussion of Theorem 1. We use the script font, e.g., D, J , I , for linear operators on H, i.e., linear mappings H → H, the fraktur font, e.g., D, J, I, H, W, for linear operators on Hilbert–Schmidt operators on H, i.e., linear mappings HS(H, H) → HS(H, H), and the blackboard bold font, e.g., D, J, H, Q, for linear operators from H to Hilbert–Schmidt operators on H, i.e., linear mappings H → HS(H, H). 
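Returning briefly to the computation described above, the following is a minimal sketch of the quasi-Newton estimation of the spatial centre and spatial dispersion from coefficient vectors in an orthonormal basis, initialized at componentwise medians as described; the constant terms of the objective functions are dropped since they do not affect the minimizers, and the function names and interfaces are illustrative, not the authors' code.

```r
# xi: n x p matrix of basis coefficients (orthonormal basis assumed)
spatial_centre <- function(xi) {
  obj <- function(m) sum(sqrt(rowSums(sweep(xi, 2, m)^2)))   # sum_i ||xi_i - m||
  optim(apply(xi, 2, median), obj, method = "BFGS")$par      # componentwise-median start
}

spatial_dispersion <- function(xi, m = spatial_centre(xi)) {
  p  <- ncol(xi); up <- upper.tri(diag(p), diag = TRUE)
  Z  <- lapply(seq_len(nrow(xi)), function(i) tcrossprod(xi[i, ] - m))
  tomat <- function(theta) { R <- matrix(0, p, p); R[up] <- theta
                             R[lower.tri(R)] <- t(R)[lower.tri(R)]; R }
  obj <- function(theta) { R <- tomat(theta)                 # sum_i ||(xi_i-m)(xi_i-m)^T - R||
                           sum(sapply(Z, function(Zi) sqrt(sum((Zi - R)^2)))) }
  start <- apply(simplify2array(Z), c(1, 2), median)[up]     # componentwise-median start
  tomat(optim(start, obj, method = "BFGS")$par)
}
```

The derivative operators needed for the covariance estimator of Theorem 1 are collected next.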
First, we introduce certain derivatives in the Fr´echet sense as follows. Denote by I and I the identity operators on H and HS(H, H), respectively. The derivative D(P; μ) = ∂ ∂μ G(P; μ) = EP ρ ( X − μ ) X − μ I + ρ ( X − μ ) X − μ 2 − ρ ( X − μ ) X − μ 3 P(X; μ) (A1) is a linear mapping from H to H. The derivative D(P; R, μ) = ∂ ∂R G (P; R, μ) = EP ρ { P(X; μ) − R } P(X; μ) − R I + ρ { P(X; μ) − R } P(X; μ) − R 2 − ρ { P(X; μ) − R } P(X; μ) − R 3 P(X; R, μ) , (A2) where we denote P(x; R, μ) = {P(x; μ) − R} ⊗ {P(x; μ) − R}, is a linear mapping from HS(H, H) to HS(H, H). We define D(P; R, μ) = ∂ ∂μ G (P; R, μ), (A3) which is a linear mapping from H to HS(H, H). To compute it, we first compute Q(x; μ) = ∂ ∂μ P(x; μ). We consider its value at some f ∈ H, i.e., we investigate the operator Q(x; μ) f ∈ HS(H, H). This is done through its coordinate representation as follows. For any g1, g2 ∈ H, we have g1, {Q(x; μ) f }g2 = g1, ∂ ∂μ P(x; μ) f g2 = ∂ ∂μ g1, P(x; μ)g2 f = ∂ ∂μ ( x − μ, g1 x − μ, g2 ) f = −( x − μ, g2 g1 + x − μ, g1 g2) f = − x − μ, g2 g1, f − x − μ, g1 g2, f . (A4) atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom Resistant functional data analysis 831 Then, the derivative of G (P; R, μ) with respect to μ evaluated at f ∈ H is D(P; R, μ) f = − EP ρ { P(X; μ) − R } P(X; μ) − R Q(X; μ) f − EP ρ { P(X; μ) − R } P(X; μ) − R 2 − ρ { P(X; μ) − R } P(X; μ) − R 3 × P(X; μ) − R, Q(X; μ) f {P(X; μ) − R} . We set D0(P1, P2, a; R, μ1, μ2) = aD(P1; R, μ1) + (1 − a)D(P2; R, μ2), D1(P1, P2, a; R, μ1, μ2) = aD(P1; R, μ1) − (1 − a)D(P2; R, μ2). Next, using the notation f ⊗2 = f ⊗ f for f ∈ H and A ⊗2 = A ⊗ A for A ∈ HS(H, H), we define J (P; μ) = EP ρ ( X − μ ) X − μ (μ − X) − G(P; μ) ⊗2 J(P; R, μ) = EP ρ { P(X; μ) − R } P(X; μ) − R {R − P(X; μ)} − G (P; R, μ) ⊗2 and J(P; R, μ) = EP ρ { P(X; μ) − R } P(X; μ) − R {R − P(X; μ)} − G (P; R, μ) ⊗ ρ ( X − μ ) X − μ (μ − X) − G(P; μ) . Next, we denote H1(P1, P2, a; R, μ1, μ2) = I − D1(P1, P2, a; R, μ1, μ2)D0(P1, P2, a; R, μ1, μ2)−1 , H1(P1, P2, a; R, μ1, μ2) = H1(P1, P2, a; R, μ1, μ2)D(P1; R, μ1)D(P1; μ1)−1 , H2(P1, P2, a; R, μ1, μ2) = I + D1(P1, P2, a; R, μ1, μ2)D0(P1, P2, a; R, μ1, μ2)−1 , H2(P1, P2, a; R, μ1, μ2) = H2(P1, P2, a; R, μ1, μ2)D(P2; R, μ2)D(P2; μ2)−1 , where I stands for the identity operator on HS(H, H). Finally, we set W(P1, P2, a; R, μ1, μ2) = aW1(P1, P2, a; R, μ1, μ2) + (1 − a)W2(P1, P2, a; R, μ1, μ2), (A5) where W1(P1, P2, a; R, μ1, μ2) = H1(P1, P2, a; R, μ1, μ2)J(P1; R, μ1)H1(P1, P2, a; R, μ1, μ2)∗ − H1(P1, P2, a; R, μ1, μ2)J(P1; R, μ1)H1(P1, P2, a; R, μ1, μ2)∗ − H1(P1, P2, a; R, μ1, μ2)J(P1; R, μ1)∗ H1(P1, P2, a; R, μ1, μ2)∗ + H1(P1, P2, a; R, μ1, μ2)J (P1; R, μ1)H1(P1, P2, a; R, μ1, μ2)∗ with ∗ denoting adjoint operators, and W2(P1, P2, a; R, μ1, μ2) is defined analogously with H2, H2 in place of H1, H1, respectively, and P2 instead of P1 in J, J, J . atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 832 DAVID KRAUS AND VICTOR M. PANARETOS REFERENCES ADLER, R. J. (1990). An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 12. Hayward: Institute of Mathematical Statistics. AMZALLAG, A., VAILLANT, C., JACOB, M., UNSER, M., BEDNAR, J., KAHN, J. D., DUBOCHET, J., STASIAK, A. & MADDOCKS, J. H. (2006). 3D reconstruction and comparison of shapes of DNA minicircles observed by cryoelectron microscopy. Nucleic Acids Res. 34, e125. ANDERSON, M. J. (2006). 
Distance-based tests for homogeneity of multivariate dispersions. Biometrics 62, 245–53. BALI, L., BOENTE, G., TYLER, D. E. & WANG, J.-L. (2012). Robust functional principal components: A projectionpursuit approach. Ann. Statist. 39, 2852–82. BENKO, M., H¨ARDLE, W. & KNEIP, A. (2009). Common functional principal components. Ann. Statist. 37, 1–34. BOENTE, G. & FRAIMAN, R. (1999). Comment on a paper by Locantore et al. Test 8, 28–35. BOENTE, G., RODRIGUEZ, D. & SUED, M. (2011). Testing the equality of covariance operators. In Recent Advances in Functional Data Analysis and Related Topics, Ed. F. Ferraty, pp. 49–53. Heidelberg: Physica-Verlag. BOSQ, D. (2000). Linear Processes in Function Spaces: Theory and Applications. New York: Springer. BOX, G. E. P. (1953). Non-normality and tests on variances. Biometrika 40, 318–35. CHAUDHURI, P. (1996). On a geometric notion of quantiles for multivariate data. J. Am. Statist. Assoc. 91, 862–72. DAUXOIS, J., POUSSE, A. & ROMAIN, Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. J. Mult. Anal. 12, 136–54. FRITZ, H., FILZMOSER, P. & CROUX, C. (2012). A comparison of algorithms for the multivariate L1-median. Comp. Statist., to appear. doi: 10.1007/s00180-011-0262-4. GABRYS, R. & KOKOSZKA, P. (2007). Portmanteau test of independence for functional observations. J. Am. Statist. Assoc. 102, 1338–48. GERVINI, D. (2006). Free-knot spline smoothing for functional data. J. R. Statist. Soc. B 68, 671–87. GERVINI, D. (2008). Robust functional estimation using the median and spherical principal components. Biometrika 95, 587–600. HALL, P. & HOSSEINI-NASAB, M. (2006). On properties of functional principal components analysis. J. R. Statist. Soc. B 68, 109–26. HALL, P., M¨ULLER, H.-G. & WANG, J.-L. (2006). Properties of principal component methods for functional and longitudinal data analysis. Ann. Statist. 34, 1493–517. HAMPEL, F. R., RONCHETTI, E. M., ROUSSEEUW, P. J. & STAHEL, W. A. (1986). Robust Statistics. New York: Wiley. HORV ´ATH, L., HUˇSKOV ´A, M. & KOKOSZKA, P. (2010). Testing the stability of the functional autoregressive process. J. Mult. Anal. 101, 352–67. HUBER, P. J. & RONCHETTI, E. M. (2009). Robust Statistics. Hoboken: Wiley. LAYARD, M. W. J. (1974). A Monte Carlo comparison of tests for equality of convariance matrices. Biometrika 61, 461–5. LI, G. & CHEN, Z. (1985). Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo. J. Am. Statist. Assoc. 80, 759–66. LOCANTORE, N., MARRON, J. S., SIMPSON, D. G., TRIPOLI, N., ZHANG, J. T. & COHEN, K. L. (1999). Robust principal component analysis for functional data. Test 8, 1–73. MARDEN, J. I. (1999). Some robust estimates of principal components. Statist. Prob. Lett. 43, 349–59. O’BRIEN, P. C. (1992). Robust procedures for testing equality of covariance matrices. Biometrics 48, 819–27. OLSON, C. L. (1974). Comparative robustness of six tests in multivariate analysis of variance. J. Am. Statist. Assoc. 69, 894–08. PANARETOS, V. M., KRAUS, D. & MADDOCKS, J. H. (2010). Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. J. Am. Statist. Assoc. 105, 670–82. R DEVELOPMENT CORE TEAM (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org. RAMSAY, J. & SILVERMAN, B. W. (2005). Functional Data Analysis. New York: Springer. 
SERFLING, R. (2004). Nonparametric multivariate descriptive measures based on spatial quantiles. J. Statist. Plan. Infer. 123, 259–78. SIRKI ¨A, S., TASKINEN, S., OJA, H. & TYLER, D. E. (2009). Tests and estimates of shape based on spatial signs and ranks. J. Nonparam. Statist. 21, 155–76. SUN, Y. & GENTON, M. G. (2011). Functional boxplots. J. Comp. Graph. Statist. 20, 316–334. TIKU, M. L. & BALAKRISHNAN, N. (1985). Testing the equality of variance-covariance matrices the robust way. Commun. Statist. A 14, 3033–51. YAO, F. & LEE, T. C. M. (2006). Penalized spline models for functional principal component analysis. J. R. Statist. Soc. B 68, 3–25. ZHANG, J., PANTULA, S. G. & BOOS, D. D. (1991). Robust methods for testing the pattern of a single covariance matrix. Biometrika 78, 787–95. [Received April 2011. Revised May 2012] atUniversité&EPFLLausanneonFebruary17,2013http://biomet.oxfordjournals.org/Downloadedfrom 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Biometrika (2012), xx, x, pp. 1–8 C 2007 Biometrika Trust Printed in Great Britain Supplementary file: Dispersion operators and resistant second-order functional data analysis BY DAVID KRAUS AND VICTOR M. PANARETOS Section de Math´ematiques, Ecole Polytechnique F´ed´erale de Lausanne, EPFL Station 8, 1015 Lausanne, Switzerland david.kraus@epfl.ch victor.panaretos@epfl.ch SUMMARY This supplementary file contains proofs of Proposition 1, Corollary 1, Proposition 2, Theorem 1 and a technical lemma needed in the proof of Theorem 1. Equations in this supplement are numbered (S1), (S2), . ..; equation numbers such as (1), (2), . .. or (A1), (A2), . .. refer to the main body of the paper. PROOF OF PROPOSITION 1 It suffices to prove that the finitely-valued objective functional M(P; R, µ) given in equation (2) in the paper admits a unique minimizer on the space of Hilbert–Schmidt operators acting on H. By the triangle inequality, monotonicity and convexity of ρ we have that EP(ρ[ P(X; µ) − {λR + (1 − λ)R } ] − ρ{ P(X; µ) }) ≤ EP[ρ{λ P(X; µ) − R + (1 − λ) P(X; µ) − R } − ρ{ P(X; µ) }] ≤ λ EP[ρ{ P(X; µ) − R } − ρ{ P(X; µ) }] + (1 − λ) EP[ρ{ P(X; µ) − R } − ρ{ P(X; µ) }] for any λ ∈ [0, 1] and arbitrary Hilbert–Schmidt operators R, R . Notice that since ρ is strictly increasing, the first inequality is strict unless P(X; µ) − R and P(X; µ) − R are collinear almost surely. Equivalently, the inequality is strict whenever the distribution of P(X; µ) is not concentrated on the line {tR + (1 − t)R : t ∈ R}. We now investigate what this condition means geometrically in the space H. First, notice that as the rank of P(X; µ) is 1, the rank of tR + (1 − t)R has to be 1 also. Now we distinguish two cases. First, if R, R are collinear, then the line is of the form {αR : α ∈ R}, which by the condition on the rank is {αu ⊗ u : α ∈ R} for some u ∈ H. Since P(X; µ) is positive semidefinite, we in fact have {αu ⊗ u : α ≥ 0}. Thus, the operator P(X; µ) lying on this line is equivalent to X lying on the line {µ + βu : β ∈ R}. Second, if R, R are not collinear, then operators of the form tR + (1 − t)R have rank 1 for at most two values of t. To see this, notice that the rank condition implies that for all i < j, det t Rii Rij Rji Rjj + (1 − t) Rii Rij Rji Rjj = 0, where Rij = ei, Rej , Rij = ei, R ej . This system of quadratic equations has at most two solutions. 
Thus, the set {tR + (1 − t)R : t ∈ R} reduces at most to the set {α1u1 ⊗ u1, α2u2 ⊗ 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 2 KRAUS, D., PANARETOS, V. M. u2} for some nonnegative α1, α2 and some u1, u2 ∈ H. Hence, the operator P(X; µ) belonging to this set is equivalent to X belonging to the set of at most four points {µ ± β1u1, µ ± β2u2}. Therefore, if the distribution P is not concentrated on a line or on four points, the objective function to be minimized is strictly convex. It follows that the minimum of the functional exists and is unique. PROOF OF COROLLARY 1 The empirical version of the functional defining the dispersion operator is the expectation with respect to the empirical distribution ˆP. Under our assumptions on P, the empirical distribution ˆP is almost surely not concentrated on a line or on four points. Therefore, strict convexity, and thus existence and uniqueness, follows with probability 1 by applying Proposition 1 to the empirical distribution ˆP. Consistency then follows from strict convexity and the consistency of ˆµ, using standard arguments. PROOF OF PROPOSITION 2 Consider R of the form ∞ k=1 δkϕk ⊗ ϕk for some sequence δ1, δ2, . . . We will prove that such an operator solves the estimating equation (5) showing that R and C have the same set of eigenfunctions, and that the sequence δ1, δ2, . . . satisfies the condition (6). We investigate the coordinates of the left-hand side of (5), with the aim of showing that the values ϕj, EP ρ { R − P(X; µ) } R − P(X; µ) {R − P(X; µ)} ϕk (S1) are zero for all j, k. By the orthonormality of ϕ1, ϕ2, . . . , we have that R − P(X; µ) 2 = ∞ k=1 δkϕk ⊗ ϕk − ∞ j=1 ∞ k=1 λ 1/2 j λ 1/2 k βjβkϕj ⊗ ϕk 2 = k (δk − λkβ2 k)2 + k=j λjλkβ2 j β2 k. First, we compute the off-diagonal coordinates with j = k. The first summand in (S1) is zero because ϕj, Rϕk = 0. To show that the second summand in (S1) is zero, we use the fact that, by assumption, the sequence {siβi}∞ i=1 with si = (−1)1{i=j} has the same joint distribution as {βi}∞ i=1. Compute Ajk = ϕj, EP ρ { R − P(X; µ) } R − P(X; µ) P(X; µ) ϕk = E ρ [{ i(δi − λiβ2 i )2 + i=l λiλlβ2 i β2 l }1/2] { i(δi − λiβ2 i )2 + i=l λiλlβ2 i β2 l }1/2 λ 1/2 j λ 1/2 k βjβk = E ρ ([ i{δi − λi(siβi)2}2 + i=l λiλl(siβi)2(slβl)2]1/2) [ i{δi − λi(siβi)2}2 + i=l λiλl(siβi)2(slβl)2]1/2 λ 1/2 j λ 1/2 k sjβjskβk = −Ajk. Thus, Ajk = 0. Therefore, the operator R is diagonalized by the same functions ϕ1, ϕ2, . . . as C . By computing the diagonal coordinates with j = k in (5) we obtain (6). 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 Supplementary file: Dispersion operators and resistant functional data analysis 3 A TECHNICAL LEMMA LEMMA 1. Under the assumptions of Theorem 1, (a) the linear operator D(P; µ) defined in equation (A1) is a bijection of H onto itself, it is bounded and has bounded inverse, (b) the linear operator D(P; R, µ) defined in equation (A2) is a bijection of HS(H, H) onto itself, it is bounded and has bounded inverse. Proof. We prove part (a); the proof of part (b) is similar. The proof uses and extends the steps of the proof of Lemma 1 (iii) of Gervini (2008) modified for the present context of general ρ and generalized to the case of infinitely many components in the Karhunen–Lo`eve expansion. 
Recall that D(P; µ) = EP ρ ( X − µ ) X − µ I + ρ ( X − µ ) X − µ 2 − ρ ( X − µ ) X − µ 3 P(X; µ) ; see the appendix of the main body of the paper. To show that D(P; µ) is a bijection, we need to find for any h ∈ H a unique element f ∈ H such that D(P; µ)f = h. The set of orthonormal eigenfunctions {ϕk}∞ k=1 of C can be extended to an orthonormal basis of H by possibly adding some functions {ψk}q k=1 with q finite or infinite or zero. It is then enough to verify the relation D(P; µ)f = h in terms of the Fourier coefficients of both sides with respect to the basis {ϕk}∞ k=1 ∪ {ψk}q k=1, i.e., to show that D(P; µ)f, ϕk = h, ϕk for all k = 1, 2, . . . and D(P; µ)f, ψk = h, ψk for all k = 1, . . . , q. As D(P; µ)f, ϕk = f, D(P; µ)ϕk and D(P; µ)f, ψk = f, D(P; µ)ψk , we first investigate D(P; µ)ϕk and D(P; µ)ψk. We begin by exploring the structure of the operator D(P; µ). We can rewrite EP ρ ( X − µ ) X − µ 3 P(X; µ) = EP(˜ε ⊗ ˜ε), where ˜ε = ρ ( X − µ )1/2 X − µ 3/2 (X − µ) = ∞ k=1 λ 1/2 k ρ ( X − µ )1/2 X − µ 3/2 βkϕk = ∞ k=1 ˜λ 1/2 k ˜βkϕk (S2) with ˜λk = λk EP ρ ( X − µ ) X − µ 3 β2 k , ˜βk = ρ ( X − µ )1/2 X − µ 3/2 βk EP ρ ( X − µ ) X − µ 3 β2 k 1/2 . Thus, we need to find the covariance operator of ˜ε. The series expansion (S2) of ˜ε is a Karhunen– Lo`eve expansion because the coefficients ˜βk have zero mean and unit variance and are uncorrelated (which follows from the fact that the distribution of {βk} is invariant under the change of the sign of any component). Therefore, since EP( ˜ε 2) < ∞, which follows immediately from the assumption that EP ρ ( X − µ ) X − µ < ∞, 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 4 KRAUS, D., PANARETOS, V. M. the operator of interest, as the covariance operator of ˜ε, takes the form EP ρ ( X − µ ) X − µ 3 P(X; µ) = ∞ k=1 ˜λkϕk ⊗ ϕk = ∞ k=1 EP ρ ( X − µ ) X − µ 3 λkβ2 k ϕk ⊗ ϕk. Using analogous arguments for ˙ε = ρ ( X − µ )1/2 X − µ (X − µ), we can show that EP ρ ( X − µ ) X − µ 2 P(X; µ) = ∞ k=1 EP ρ ( X − µ ) X − µ 2 λkβ2 k ϕk ⊗ ϕk. Hence, we finally obtain D(P; µ) in the form D(P; µ) = EP ρ ( X − µ ) X − µ I + ∞ k=1 EP ρ ( X − µ ) X − µ 2 − ρ ( X − µ ) X − µ 3 λkβ2 k ϕk ⊗ ϕk. Therefore, for k = 1, 2, . . . we have D(P; µ)ϕk = EP ρ ( X − µ ) X − µ ϕk + EP ρ ( X − µ ) X − µ 2 − ρ ( X − µ ) X − µ 3 λkβ2 k ϕk and, for k = 1, . . . , q, we have D(P; µ)ψk = EP ρ ( X − µ ) X − µ ψk. Thus, we obtain D(P; µ)f, ϕk = νk f, ϕk (k = 1, 2, . . . ), D(P; µ)f, ψk = η f, ψk (k = 1, . . . , q), where νk = EP ρ ( X − µ ) X − µ + λk EP ρ ( X − µ ) X − µ 2 − ρ ( X − µ ) X − µ 3 β2 k (k = 1, 2, . . . ) and η = EP ρ ( X − µ ) X − µ . So f, the candidate for D(P; µ)−1h, should have Fourier coefficients f, ϕk , f, ψk satisfying the system of equations νk f, ϕk = h, ϕk (k = 1, 2, . . . ), η f, ψk = h, ψk (k = 1, . . . , q). To be able to write f, ϕk = h, ϕk /νk, we need to show that νk (k = 1, 2, . . . ) and η are nonzero and finite. Then, f will be uniquely determined by the formula f = ∞ k=1 h, ϕk νk ϕk + q k=1 h, ψk η ψk 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 Supplementary file: Dispersion operators and resistant functional data analysis 5 provided that f is a well-defined element of H, that is, f 2 = ∞ k=1 h, ϕk 2 ν2 k + q k=1 h, ψk 2 η2 < ∞. 
(S3) We assumed that η < ∞ and we immediately see that η > 0 because ρ is strictly increasing. We now deal with νk (k = 1, 2, . . . ). We will show that there exist 0 < a ≤ b < ∞ such that νk ∈ [a, b] for all k = 1, 2, . . . First we establish the lower bound a. Using the Karhunen–Lo`eve expansion (S2) we can rewrite EP ρ ( X − µ ) X − µ = EP( ˜ε 2 ) = ∞ k=1 ˜λk = ∞ k=1 λk EP ρ ( X − µ ) X − µ 3 β2 k . (S4) Each term in the series on the right hand side of (S4) is obviously positive and by finiteness of the left hand side it is finite, and thus the differences EP ρ ( X − µ ) X − µ − λk EP ρ ( X − µ ) X − µ 3 β2 k , (S5) which appear in the expression for νk, are positive and bounded away from zero by a constant a. The remaining term λk EP ρ ( X − µ ) X − µ 2 β2 k (S6) appearing in νk is nonnegative as ρ ≥ 0 because ρ is convex. It follows that νk ≥ a for all k = 1, 2, . . . Now we find the upper bound b. By applying the same idea as in (S4) to ˙ε, we obtain EP{ρ ( X − µ )} = ∞ k=1 λk EP ρ ( X − µ ) X − µ 2 β2 k . (S7) In view of (S7), the terms (S6) are smaller than or equal to EP{ρ ( X − µ )}. The differences (S5) are smaller than EP ρ ( X − µ ) X − µ . Therefore, we have that νk ≤ b for all k = 1, 2, . . . with b = EP ρ ( X − µ ) X − µ + EP{ρ ( X − µ )}. Finally, it remains to show (S3), which is now straightforward because f 2 = ∞ k=1 h, ϕk 2 ν2 k + q k=1 h, ψk 2 η2 ≤ ∞ k=1 h, ϕk 2 + q k=1 h, ψk 2 min(a, η) = h 2 min(a, η) < ∞. This shows that f is a well defined element of H and thus the linear operator D(P; µ) is a bijection of H onto itself. It also shows that the inverse D(P; µ)−1 is a bounded operator. Hence also the operator D(P; µ) is bounded by the bounded inverse theorem or by direct verification. 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 6 KRAUS, D., PANARETOS, V. M. Remark: As νk are bounded away from zero and bounded from above, the operator D(P; µ) is only a small perturbation of a multiple of the identity. This gives an intuitive explanation why it inherits its bijectivity and boundedness. PROOF OF THEOREM 1 It is enough to prove the weak convergence of n1/2B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2). The weak convergence of the vector with components Sl will then follow directly from Slutsky’s theorem. The continuous mapping theorem and Slutsky’s theorem will then establish the weak convergence of the statistic T. Applying a Taylor expansion (Nelson, 1969, Theorem 6, p. 12) of B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) around the true values of the parameters yields n1/2 B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) = n1/2 B(ˆP1, ˆP2, an; R, µ1, µ2) + D1(ˆP1, ˆP2, an; R , µ1, µ2)n1/2 ( ˆR − R) + a1/2 n D(ˆP1; R , µ1)n 1/2 1 (ˆµ1 − µ1) − (1 − an)1/2 D(ˆP2; R , µ2)n 1/2 2 (ˆµ2 − µ2), (S8) where D1(P1, P2, a; R, µ1, µ2) = ∂ ∂R B(P1, P2, a; R, µ1, µ2) = aD(P1; R, µ1) − (1 − a)D(P2; R, µ2) and D(P; R, µ) = ∂ ∂R G (P; R, µ), D(P; R, µ) = ∂ ∂µ G (P; R, µ). See the Appendix in the main body of the paper for explicit formulae. We now turn to develop certain asymptotic representations for ˆµ1, ˆµ2 and ˆR. Using the Taylor expansion, law of large numbers and consistency of ˆµ1 we get 0 = n 1/2 1 G(ˆP1; ˆµ1) = n 1/2 1 G(ˆP1; µ1) + D(ˆP1; µ† 1)n 1/2 1 (ˆµ1 − µ1) = n 1/2 1 G(ˆP1; µ1) + D(P1; µ1)n 1/2 1 (ˆµ1 − µ1) + oP (1), where the term oP (1) is due to the fact that we replace D(ˆP1; µ1) by its limit D(P1; µ1). 
From this and an analogous expansion for µ2 we obtain n 1/2 1 (ˆµ1 − µ1) = −D(P1; µ1)−1 n 1/2 1 G(ˆP1; µ1) + oP (1), n 1/2 2 (ˆµ2 − µ2) = −D(P2; µ2)−1 n 1/2 2 G(ˆP2; µ2) + oP (1). (S9) The existence of the bounded inverse operators in the above equations, as well as of other inverse operators appearing later in the proof, is shown in Lemma 1. The Taylor expansion of the estimating score for R around the true values is O = n1/2 G (ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) = n1/2 G (ˆP1, ˆP2, an; R, µ1, µ2) + D0(ˆP1, ˆP2, an; R‡ , µ‡ 1, µ‡ 2)n1/2 ( ˆR − R) + a1/2 n D(ˆP1; R‡ , µ‡ 1)n 1/2 1 (ˆµ1 − µ1) + (1 − an)1/2 D(ˆP2; R‡ , µ‡ 2)n 1/2 2 (ˆµ2 − µ2), 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 Supplementary file: Dispersion operators and resistant functional data analysis 7 where D0(P1, P2, a; R, µ1, µ2) = aD(P1; R, µ1) + (1 − a)D(P2; R, µ2). This yields n1/2 ( ˆR − R) = −D0(P1, P2, a; R, µ1, µ2)−1 {n1/2 G (ˆP1, ˆP2, an; R, µ1, µ2) + a1/2 n D(P1; R‡ , µ‡ 1)n 1/2 1 (ˆµ1 − µ1) + (1 − an)1/2 D(P2; R‡ , µ‡ 2)n 1/2 2 (ˆµ2 − µ2)} + oP (1); (S10) here again the term oP (1) is present because we replace the empirical distributions by their theoretical counterparts in D0 and D. The different Taylor expansions we have used contain various elements denoted by , †, ‡ which lie on the line segments between the true and estimated corresponding parameters. We will replace all of these elements by the true values of the parameters. Due to the consistency of the estimators, the difference between a quantity at the true value of the parameters and at a value on the line segment between the true value and the estimator converges in probability to zero. Moreover, the quantities involving elements marked with , † or ‡ are always multiplied by a term that is bounded in probability (by its convergence in distribution which will be seen later). Hence, the change we make by replacing the elements marked with , † or ‡ by their true values is asymptotically negligible. The reason for doing this is that we obtain simpler formulas. Denote H1(P1, P2, a; R, µ1, µ2) = I − D1(P1, P2, a; R, µ1, µ2)D0(P1, P2, a; R, µ1, µ2)−1 , H1(P1, P2, a; R, µ1, µ2) = H1(P1, P2, a; R, µ1, µ2)D(P1; R, µ1)D(P1; µ1)−1 , H2(P1, P2, a; R, µ1, µ2) = I + D1(P1, P2, a; R, µ1, µ2)D0(P1, P2, a; R, µ1, µ2)−1 , H2(P1, P2, a; R, µ1, µ2) = H2(P1, P2, a; R, µ1, µ2)D(P2; R, µ2)D(P2; µ2)−1 , where I stands for the identity operator on HS(H, H). Inserting (S9) and (S10) into (S8), we obtain n1/2 B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) = a1/2 n H1(P1, P2, a; R, µ1, µ2)n 1/2 1 G (ˆP1; R, µ1) − a1/2 n H1(P1, P2, a; R, µ1, µ2)n 1/2 1 G(ˆP1; µ1) − (1 − an)1/2 H2(P1, P2, a; R, µ1, µ2)n 1/2 2 G (ˆP2; R, µ2) + (1 − an)1/2 H2(P1, P2, a; R, µ1, µ2)n 1/2 2 G(ˆP2; µ2) + oP (1). The term oP (1) is due to the fact that we have replaced the quantities marked with , †, ‡ by their true counterparts. By the central limit theorem for Hilbert spaces (Bosq, 2000, Theorem 2.7), the operators n 1/2 1 G (ˆP1; R, µ1), n 1/2 1 G(ˆP1; µ1) jointly converge in distribution to a zero-mean Gaussian random variable in HS(H, H) × H. 
The asymptotic covariance operator of n 1/2 1 G (ˆP1; R, µ1), i.e., an operator on operators on H, can be estimated by the empirical covariance J(ˆP1; ˆR, ˆµ1), where J(P; R, µ) = EP ρ { P(X; µ) − R } P(X; µ) − R {R − P(X; µ)} − G (P; R, µ) ⊗2 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 8 KRAUS, D., PANARETOS, V. M. with the notation A ⊗2 = A ⊗ A for A ∈ HS(H, H), the asymptotic covariance operator of n 1/2 1 G(ˆP1; µ1), i.e., an operator on H, can be estimated by J (ˆP1; ˆµ1), where J (P; µ) = EP ρ ( X − µ ) X − µ (µ − X) − G(P; µ) ⊗2 with f⊗2 = f ⊗ f for f ∈ H, and the asymptotic cross-covariance operator of n 1/2 1 G (ˆP1; R, µ1) and n 1/2 1 G(ˆP1; µ1), i.e., an operator from H to operators on H, can be estimated by J(ˆP1; ˆR, ˆµ1), where J(P; R, µ) = EP ρ { P(X; µ) − R } P(X; µ) − R {R − P(X; µ)} − G (P; R, µ) ⊗ ρ ( X − µ ) X − µ (µ − X) − G(P; µ) . Similarly, n 1/2 2 G (ˆP2; R, µ2), n 1/2 2 G(ˆP2; µ2) jointly converge in distribution to a zero-mean Gaussian random element with covariance estimators analogous to those mentioned above for the sample from P1. As the samples are independent, all four random variables jointly converge in distribution. Finally, it follows by Slutsky’s theorem that the test operator n1/2B(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) is asymptotically distributed as a zero-mean Gaussian operator whose covariance operator can be consistently estimated by W(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) = anW1(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2) + (1 − an)W2(ˆP1, ˆP2, an; ˆR, ˆµ1, ˆµ2), where W1(P1, P2, a; R, µ1, µ2) = H1(P1, P2, a; R, µ1, µ2)J(P1; R, µ1)H1(P1, P2, a; R, µ1, µ2)∗ − H1(P1, P2, a; R, µ1, µ2)J(P1; R, µ1)H1(P1, P2, a; R, µ1, µ2)∗ − H1(P1, P2, a; R, µ1, µ2)J(P1; R, µ1)∗ H1(P1, P2, a; R, µ1, µ2)∗ + H1(P1, P2, a; R, µ1, µ2)J (P1; R, µ1)H1(P1, P2, a; R, µ1, µ2)∗ with ∗ denoting adjoint operators, and W2(P1, P2, a; R, µ1, µ2) is defined analogously with H2, H2 in place of H1, H1, respectively, and P2 instead of P1 in J, J, J . REFERENCES BOSQ, D. (2000). Linear Processes in Function Spaces: Theory and Applications. New York: Springer. GERVINI, D. (2008). Robust functional estimation using the median and spherical principal components. Biometrika 95, 587–600. NELSON, E. (1969). Topics in Dynamics. I: Flows. Princeton: Princeton University Press. C. Components and completion of partially observed functional data By David Kraus Journal of the Royal Statistical Society. Series B. Statistical Methodology, 77(4):777–801, 2015 DOI: 10.1111/rssb.12087 81 © 2014 Royal Statistical Society 1369–7412/15/77777 J. R. Statist. Soc. B (2015) 77, Part 4, pp. 777–801 Components and completion of partially observed functional data David Kraus University Hospital Lausanne, Switzerland [Received June 2013. Final revision July 2014] Summary. Functional data are traditionally assumed to be observed on the same domain. Motivated by a data set of heart rate temporal profiles, we develop methodology for the analysis of incomplete functional samples where each curve may be observed on a subset of the domain and unobserved elsewhere.We formalize this observation regime and develop the fundamental procedures of functional data analysis for this framework: estimation of parameters (mean and covariance operator) and principal component analysis. 
Principal scores of a partially observed function cannot be computed directly and we solve this challenging issue by estimating their best predictions as linear functionals of the observed part of the trajectory. Next, we propose a functional completion procedure that recovers the missing part by using the observed part of the curve. We construct prediction intervals for principal scores and bands for missing parts of trajectories. The prediction problems are seen to be ill-posed inverse problems; regularization techniques are used to obtain a stable solution. A simulation study shows the good performance of our methods. We illustrate the methods on the heart rate data and provide practical computational algorithms and theoretical arguments and proofs of all results.

Keywords: Functional data analysis; Incomplete observation; Inverse problem; Prediction; Principal component analysis; Regularization

Address for correspondence: David Kraus, Institute of Social and Preventive Medicine, University Hospital Lausanne, Route de la Corniche 10, Lausanne 1010, Switzerland. E-mail: kraus.stat@gmail.com

1. Introduction
Contemporary data sets often consist of data units that are complex objects, such as functions, curves or images; see, for example, Ramsay and Silverman (2005), Ferraty and Vieu (2006), Ferraty and Romain (2011) and Horváth and Kokoszka (2012). It is standard in the field of functional data analysis to assume that all functions are observed on the same domain. In this paper, we develop methods of analysis for functional data that are observed incompletely in the sense that each function might be observed only on a subset of the domain, whereas no information about the curve is available on the complement of this subset.

Our work is motivated by an ambulatory blood pressure monitoring data set that is part of the 'Swiss kidney project on genes in hypertension' (Pruijm et al., 2013), which is a multicentre cross-sectional study focusing on the role of kidney function and genes in blood pressure regulation and hypertension. In ambulatory blood pressure monitoring, participants wear a calibrated automatic device that is programmed to record systolic and diastolic blood pressure and heart rate at frequent intervals during 24 h (every 15 min during the day and every 30 min during the night). Ideally, this design should provide enough information for each continuous temporal profile to be reconstructed by standard smoothing techniques; the resulting sample of curves would then be analysed by traditional methods of functional data analysis. In reality, however, some values have not been measured and the time points corresponding to unobserved values form series (intervals) of non-negligible length. There are two main reasons why no measurements are available for certain periods: first is the participant's discomfort (the participants can remove the device when they feel uncomfortable) and second is the failure of the device to take measurements. However, there are series of frequent, properly recorded measurements. It is therefore possible to reconstruct the underlying profiles in continuous time on these periods.

[Fig. 1. (a) Subset of the sample of heart rate profiles and (b) several curves in detail; horizontal axes: time in hours (20–26), vertical axes: heart rate in beats per minute.]

Fig. 1(a) displays a subset of 685 heart rate profiles (values in beats per minute); we focus on the time interval [20, 26] (i.e. from 8 p.m. of one day to 2 a.m.
next day) that is of particular medical interest because it is the transition period between the day and night regime. In Fig. 1(b), we plot separately four profiles to illustrate the type of available data: whereas some curves (dotted and chain curves) are observed completely (on the entire domain [20, 26]), other curves (the two broken curves) have unobserved periods. The percentage of incomplete functions is 31% for blood pressure profiles and 44% for heart rate profiles. This is a considerable fraction of the data, and we therefore wish to avoid removing the incomplete curves from the analysis. The partial observation regime that we encounter in this data set is of general interest in applications as often, despite the failure to observe the curves in some regions, there is enough observed information in the rest of the domain. The mechanism that causes the absence of data can be random, like in our data, but the curves may also be partially observed by design. Moreover, data need not necessarily be curves indexed by time; methods that we develop can be extended to more general object data subject to incomplete observation, such as partially observed images, spatial curves or surfaces. Hence this kind of functional data is worth systematic investigation. Interestingly and surprisingly, this observation pattern, however natural and likely to occur in many applications it is, has received relatively little attention in the literature. James et al. (2000) and James and Hastie (2001) used parametric mixed effects models for principal components analysis and classification of partially observed curves. Bugni (2012) developed a goodness-of-fit test under circumstances that were similar to those of our paper. Delaigle and Hall (2013) dealt with classification of functional data when only fragments of curves are available. Liebl (2013) studied low rank extensions of curves observed on subdomains. Goldberg et al. (2014) propose a prediction procedure for the continuation of a low rank functional observation. In this paper we introduce a formal framework for analysing incompletely observed functional data and develop basic non-parametric, fully functional (infinite dimensional) inferential Partially Observed Functional Data 779 procedures. When exploring functional data, one often finds interesting information in their covariance structure; see Ramsay and Silverman (2005) for some examples and, for example, Benko et al. (2009), Sangalli et al. (2009) or Panaretos et al. (2010) for other illustrations. Therefore, we first focus on the main building blocks of the analysis of the second-order properties: estimation of the covariance operator and principal component analysis. We propose an estimator of the covariance operator and its eigenvalues and eigenfunctions for partially observed functions and derive their properties. We deal with the estimation of projections (principal scores) of individual incomplete functions which is especially challenging. We develop a procedure that enables us to predict the value of a principal score of a function when only a fragment of the function is available and direct computation is thus impossible. Next, we propose a method that can recover the unobserved part of the function from the observed part, using the information about the distribution of the data that it learns from the sample. We develop automatic procedures for the selection of the tuning parameter of the method that is based on generalized cross-validation for incompletely observed functions. 
We quantify the uncertainty of the predictions of unobserved quantities and provide approximate prediction regions (intervals and bands) covering the unobserved random quantity with high probability. Simulations confirm the usefulness and good performance of the methodology proposed. Both the prediction of principal scores and the reconstruction of an incomplete function or its derivatives are important problems. Principal scores are key elements in the exploration of complex data and can be used as input quantities in many inferential procedures. Their usefulness in the multivariate setting is well described, for example, in Krzanowski (2000) and Jolliffe (2002). In the functional context Ramsay and Silverman (2005) provided some real data examples illustrating how principal scores help to understand the properties of the data. Further applications can be found in Ramsay and Silverman (2002) and Ramsay et al. (2009). Horv´ath and Kokoszka (2012) have given a comprehensive account of the utility of principal scores in procedures like two-sample tests, linear and non-linear regression, clustering and classification, time series analysis or change point analysis. In this paper, we shall see in Section 6 that the first three principal components of the heart rate profiles and their derivatives explain a large proportion of the total variability and are sufficiently flexible to describe interesting features of the curves. Hence the corresponding scores provide an effectively reduced representation of the complex individual heart rate profiles. To perform graphical or formal analyses of the scores, we need to be able to compute them, which is not straightforward in the partial observation regime. Also, when an individual curve, surface or image is observed incompletely, one is interested in visualizing and studying the shape of the missing part, for instance to forecast the continuation of the natural or social process that is described by the functional variable. Our paper provides solutions to these problems by developing methods that predict unobserved quantities via their conditional expectation given the observed data. In addition to their direct application to data, these methods will be an important tool in future research: for instance, advanced techniques of missing data analysis in the multivariate setting involve conditional expectations in some form, and our results will be helpful in extending them to the functional case. To our knowledge, no results of the kind that we provide here exist for functional data that are fully (densely in practice) observed on subsets of the domain. A related but different (in terms of applicability, used methods and achievable results) type of imperfectly observed functional data was studied by Yao et al. (2005a) who considered sparsely observed functions, i.e. situations where only a few observed values are available for each function, making it impossible to reconstruct each curve from these values. Our approach is novel in that it enables us, under the assumed observation regime, to investigate some genuinely functional aspects of the data. From the theoretical point of view, exploiting the continuous time nature of the observed data, we can 780 D. Kraus obtain stronger results than in the sparse regime. For example, the rates of convergence of estimators of parameters (the covariance operator and eigenelements) are parametric, unlike with sparsely observed data (see also Hall et al. (2006)). 
Also, the consistency result for our functional completion procedure is fully functional, whereas the restrictions of the sparse regime enabled Yao et al. (2005a) to achieve pointwise or finite dimensional convergence of the reconstructed trajectory. From the practical perspective, an important advantage of our method is that derivatives can be readily analysed in our setting whereas with methods for sparsely observed functions it is complicated. The method of Liu and M¨uller (2009) is a variant of that of Yao et al. (2005a) that can deal with derivatives in the sparse regime to some extent. Although the method of Liu and M¨uller (2009) can reconstruct derivatives, it does not provide insight into their covariance structure because it neither estimates the covariance operator of the derivatives nor performs principalcomponentanalysisofthederivatives(itisbasedonderivativesofeigenfunctionsrather than on eigenfunctions of derivatives). Since derivatives describe the dynamics of the underlying real world process, the analysis of derivatives, and especially of the principal sources of their variability, is often revealing in many applications, including the one we consider in this paper. Mathematically, the problem that we need to solve for the computation of unobserved quantities (prediction of principal scores or reconstruction of missing parts of trajectories) is seen to be an ill-posed inverse problem (e.g. Groetsch (1993)), and regularization techniques need to be applied. Such problems previously appeared in the literature on complete functional data mainly in the area of functional regression modelling; see, for example, Cardot et al. (1999, 2007), M¨uller and Stadtm¨uller (2005), Cai and Hall (2006), Hall and Horowitz (2007) or He et al. (2010). Inverse problems similar to those which we encounter here also arise in connection with functional canonical correlations (e.g. He et al. (2003)) or with tests of hypotheses on parameters of functional data (e.g. Mas (2007), Horv´ath et al. (2010, 2013), Aston and Kirch (2012), Kraus and Panaretos (2012) and Jaruˇskov´a (2013)). Our problem is related to the task of prediction that was previously studied in the literature on functional time series; see, for example, Bosq (2000), Antoniadis and Sapatinas (2003) or Kargin and Onatski (2008). None of these references, however, assumes the partial observation pattern that we consider in this paper. The paper is organized as follows. In Section 2 we formalize the mechanism of partial observation of functional data and deal with the estimation of the mean function and covariance operator. Section 3 develops principal component analysis for incompletely observed functions. In Section 4, a method is proposed to reconstruct the missing part of a partially observed curve. Sections 5 and 6 present a simulation study and a data example. Appendix A contains proofs of the main theoretical results (theorems 1 and 2). A supplementary document available on line contains proofs of propositions 1–4 and a detailed description of computational procedures. The programs that were used to analyse the data and some example data can be obtained from http://wileyonlinelibrary.com/journal/rss-datasets 2. Partially observed functional data Functional data X1,:::, Xn are seen as independent identically distributed random variables in the separable Hilbert space of square integrable functions on a bounded domain. 
Without loss of generality, we consider the space $L^2([0,1])$ with inner product $\langle f, g\rangle = \int_0^1 f(t) g(t)\,dt$, $f, g \in L^2([0,1])$, and norm $\|f\| = \langle f, f\rangle^{1/2}$. It is possible to extend our results to vector-valued functions or more general domains for applications with spatial curves, surfaces, images etc.

In traditional functional data analysis, it is assumed that the functions $X_1,\ldots,X_n$ are observed on the whole interval $[0, 1]$. We consider situations where each curve $X_i$ is observed only on a subset of $[0,1]$. Specifically, let the observation periods be $O_i \subset [0,1]$, $i = 1,\ldots,n$. Then the observed data for the $i$th curve are $X_i(t)$, $t \in O_i$. (In practice, the raw data are most often in the form of possibly noisy observations on a dense grid of points in $O_i$, which enables us to assume that the curves are observed fully in $O_i$, as is explained by Hall et al. (2006).) We collectively denote the observed part of the curve as $X_{iO_i}$, which can be seen as a random element of the space $L^2(O_i)$. The values of $X_i$ on the complement of $O_i$, $M_i = [0,1] \setminus O_i$, are not observed; the missing part of the trajectory is denoted as $X_{iM_i}$.

The observation periods $O_i$, $i = 1,\ldots,n$, are modelled as random subsets of $[0,1]$. We assume that each realization of $O_i$ is the union of a finite number of intervals. This assumption is not restrictive for practical applications, although some generalizations are probably possible. We assume that the observation periods are independent of the functions $X_1,\ldots,X_n$, i.e. the data are missing completely at random. (Under this assumption, the observation periods can also be seen as fixed when inference is made about the curves.)

The main characteristics of the distribution that generates the data are the mean function and the covariance operator. Let the mean function be $\mu = E(X_1)$. The covariance operator $R: L^2([0,1]) \to L^2([0,1])$ is defined as
\[
Rf = E\{\langle f, X_1 - \mu\rangle (X_1 - \mu)\} = \int_0^1 \rho(\cdot, t) f(t)\,dt,
\]
where $\rho(s,t) = \mathrm{cov}\{X_1(s), X_1(t)\}$ is the covariance kernel of the stochastic process $X_1$.

Like in the multivariate case, the mean function $\mu$ at point $t \in [0,1]$ can be estimated by the sample mean of observed values at this point. Formally, the estimator can be written as
\[
\hat\mu(t) = \frac{J(t)}{\sum_{i=1}^n O_i(t)} \sum_{i=1}^n O_i(t) X_i(t),
\]
where the notation $O_i(t)$ is used for the indicator $1_{O_i}(t)$ and $J(t) = 1[\sum_{i=1}^n O_i(t) > 0]$. The values of $X_i(t)$ are available only if $O_i(t) = 1$; otherwise, the contribution $O_i(t) X_i(t)$ in the sum above is zero. The term $J(t)$ is included to avoid division by 0: if $J(t) = 0$, the estimate of the mean is 0 (or arbitrary, as such situations vanish asymptotically).

The estimator $\hat R$ of the covariance operator $R$ is defined through an estimator of its covariance kernel $\rho$. We estimate $\rho(s,t)$ by the sample covariance computed from all complete pairs of functional values at $s$ and $t$. The estimator equals
\[
\hat\rho(s,t) = \frac{I(s,t)}{\sum_{i=1}^n U_i(s,t)} \sum_{i=1}^n U_i(s,t) \{X_i(s) - \hat\mu_{st}(s)\}\{X_i(t) - \hat\mu_{st}(t)\}, \qquad (1)
\]
where $U_i(s,t) = O_i(s) O_i(t)$ and $I(s,t) = 1[\sum_{i=1}^n U_i(s,t) > 0]$. The estimator of the mean function used here is
\[
\hat\mu_{st}(s) = \frac{I(s,t)}{\sum_{i=1}^n U_i(s,t)} \sum_{i=1}^n U_i(s,t) X_i(s),
\]
i.e., for the computation of the covariance at $(s,t)$, functional values are centred at the sample mean computed from complete pairs. (It is also possible to centre by the estimator $\hat\mu$ that was introduced before; all results remain valid when $\hat\mu$ is used in place of $\hat\mu_{st}$.)

The sample covariance operator computed from incomplete functions may be indefinite. This is similar to the multivariate setting.
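The estimators $\hat\mu$ and $\hat\rho$ just defined are easy to compute when the curves are recorded on a common grid. The following R sketch is an illustration only, not the code accompanying the paper; the function names and the encoding of the observation indicators via NA values are our assumptions.

```r
## Minimal grid-based sketch of hat{mu} and hat{rho}: X is an n x m matrix of curves
## evaluated on a common grid, with NA where a curve is unobserved, so that the
## indicator O_i(t) corresponds to !is.na(X[i, t]).
pobs_mean <- function(X) {
  mu <- colMeans(X, na.rm = TRUE)     # average of the values observed at each grid point
  mu[is.nan(mu)] <- 0                 # J(t) = 0: no curve observed at t, set estimate to 0
  mu
}

pobs_cov <- function(X) {
  O  <- 1 * !is.na(X)                 # observation indicators O_i(t)
  Xz <- X; Xz[is.na(Xz)] <- 0
  N  <- crossprod(O)                  # N[s, t] = number of complete pairs at (s, t)
  S  <- crossprod(Xz)                 # sum_i U_i(s, t) X_i(s) X_i(t)
  M  <- crossprod(O, Xz)              # M[s, t] = sum_i U_i(s, t) X_i(t)
  Np <- pmax(N, 1)                    # guard against division by zero where I(s, t) = 0
  mu_st <- M / Np                     # pairwise means hat{mu}_{st}
  rho <- S / Np - t(mu_st) * mu_st    # pairwise-complete covariance kernel, cf. (1)
  rho[N == 0] <- 0
  rho
}
```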
However, unlike with multivariate data, our experience in the functional context is that this problem is unimportant in practice because negative eigenvalues occur far in the tail of the spectrum and are small in comparison with the leading eigenvalues. The corresponding high frequency features of the data are practically never of interest. If needed, the estimate ˆR can be modified by setting negative eigenvalues equal to 0. It is seen that ˆμ.t/ is an unbiased estimator of μ.t/. Similarly, if we subtract 1 in the denominator of ˆρ.s,t/, the estimator becomes unbiased for ρ.s,t/. For the estimators ˆμ and ˆR to be consistent, we need to assume that the observation pattern asymptotically provides enough information. For the mean function, the right assumption is that there exists δ1 >0 such that sup t∈[0,1] P n−1 n i=1 Oi.t/ δ1 =O.n−2 / as n→∞: .2/ Similarly, for the covariance operator, we need the stronger assumption that there exists δ2 >0 such that sup .s,t/∈[0,1]2 P n−1 n i=1 Ui.s,t/ δ2 =O.n−2 / as n→∞: .3/ Assumption (2) is satisfied, for example, when the observation sets O1,:::,On are independent and identically distributed and π0 =inft∈[0,1]P{O1.t/=1}>0. To see this, set δ1 =π0=2 and use Hoeffding’s inequality to show that sup t∈[0,1] P n−1 n i=1 Oi.t/ δ1 exp.−π2 0n=2/: Analogously, assumption (3) is satisfied when we further assume that inf.s,t/∈[0,1]2 P{U1.s,t/= 1}>0. Under these weak assumptions, we obtain a consistency result as follows. Proposition 1. (a) LetE. X1 2/<∞andassumption(2)besatisfied.ThenE. ˆμ−μ 2/=O.n−1/forn→∞. (b) Let E. X1 4/ < ∞ and assumption (3) be satisfied. Then E. ˆR − R 2 2/ = O.n−1/ for n→∞ (here · 2 denotes the Hilbert–Schmidt norm). Note that the properties of the estimators are unaffected by the fact that the functions are observedonlypartially.Thefull(dense)observationregime,albeitonlyonsubsetsofthedomain, preserves the convergence rates that are known for complete functional data (see Bosq (2000) or Horv´ath and Kokoszka (2012) for results in the traditional setting). 3. Principal component analysis 3.1. Estimation of eigenfunctions and eigenvalues Probably the most fundamental method for functional data is functional principal component analysis. It provides insight into the complex covariance structure of functional data and is used to identify main sources of variability and to quantify their importance and to reduce the dimension of the data. The theoretical foundation of functional principal component analysis is the Karhunen– Lo`eve theorem (e.g. Bosq (2000), theorem 1.5) stating that there are random variables βij and non-random functions ϕj such that the stochastic process Xi admits the decomposition Partially Observed Functional Data 783 Xi.t/=μ.t/+ ∞ j=1 βij ϕj.t/, t ∈[0,1], where the series converges in mean square, uniformly in t. Here ϕj,j = 1,2,:::, are the orthonormal eigenfunctions of the operator R and βij, j = 1,2,:::, are uncorrelated mean 0 variables with variances λj, where λ1 λ2 :::>0 are the eigenvalues of R. Functional principal component analysis is the empirical version of the Karhunen–Lo`eve expansion that aims to estimate the elements involved in the expansion. For background information on this classical topic, we refer to Ramsay and Silverman (2005), chapter 8, for an introduction from an applied perspective, and to Dauxois et al. (1982), Bosq (2000) or Hall and Hosseini-Nasab (2006) for theoretical studies. 
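In the partial observation setting of this paper, the empirical version of the Karhunen–Loève elements can be obtained from the eigendecomposition of the estimated covariance kernel of Section 2 (this anticipates the estimators $\hat\lambda_j$, $\hat\varphi_j$ discussed in the next paragraph). The sketch below reuses the hypothetical helpers from the previous sketch and shows one grid-based way to do this; the rescaling by the grid spacing $h$ approximates the $L^2([0,1])$ inner product by a Riemann sum.

```r
## Continuing the sketch above: empirical eigenvalues and eigenfunctions of hat{R} from
## the eigendecomposition of the estimated kernel on the grid, with grid spacing h.
pobs_fpca <- function(X, grid) {
  h   <- mean(diff(grid))
  rho <- pobs_cov(X)                          # estimated covariance kernel from above
  eg  <- eigen(rho * h, symmetric = TRUE)
  list(mean      = pobs_mean(X),
       values    = pmax(eg$values, 0),        # truncate small negative eigenvalues at 0
       harmonics = eg$vectors / sqrt(h))      # columns: hat{phi}_j with unit L^2 norm
}
```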
In the case of completely observed functional data, to estimate the eigenvalues λj and eigenfunctions ϕj, one performs eigendecomposition of the usual sample covariance operator. When the functions are observed partially, we can proceed similarly and define the estimators ˆλj and ˆϕj as the eigenvalues and eigenfunctions of the operator ˆR given by the kernel ˆρ in equation (1). It turns out that the asymptotic properties of the empirical eigenvalues and eigenfunctions remain unchanged by the incompleteness of the observed functions. The following proposition shows that, first, the empirical eigenvalues are consistent estimators of the true eigenvalues and this consistency is uniform over all indices and, second, the empirical eigenfunctions are consistent estimators of the true eigenfunctions, up to the usual sign ambiguity. Proposition 2. Let E. X1 4/ < ∞ and assumption (3) be satisfied. Then E[supj∈N{| ˆλj − λj|2}] = O.n−1/. If moreover all eigenvalues of R have multiplicity 1, then E. ˆϕj − ˆsjϕj 2/ =O.n−1/ forall j ∈N, where ˆsj = sgn ˆϕj,ϕj . The rates of convergence are parametric because of the full observation regime on subsets; the situation is different from that of sparsely observed functions, where the estimators of eigenelements (constructed differently) converge at non-parametric rates (Yao et al., 2005a; Hall et al., 2006). 3.2. Estimation of principal component scores In principal component analysis, one is usually interested not only in estimating the eigenfunctions and eigenvalues but also in the estimation of the principal component scores βij = Xi −μ,ϕj , i=1,:::,n, j =1,2,:::, representing the individual co-ordinates of each curve with respect to the eigenbasis (the expression of the feature ϕj for the ith observation). The leading principal scores provide the optimal finite dimensional representation of each curve and can be further analysed by traditional techniques. In the standard situation of complete functional data, the scores are easily estimated by ˆβij = Xi − ˆμ, ˆϕj . When the functional observations are incomplete, the direct computation of Xi − ˆμ, ˆϕj is impossible because the last term in the expression Xi − ˆμ, ˆϕj = XiOi − ˆμOi , ˆϕjOi + XiMi − ˆμMi , ˆϕjMi is not available. In this equation the subscript Oi or Mi denotes the restriction of the corresponding function to the ith observed or missing period respectively. We develop a procedure to estimate the missing quantity XiMi − ˆμMi , ˆϕjMi from the observed data. 784 D. Kraus First, we consider the population version of the problem. Let the function X with mean 0 and covariance operator R be observed on the set O and missing on M. For the following considerations, the sets O and M, which are independent of X, can be regarded as non-random; equivalently, derivations can be made conditionally on them. The goal is to predict βjM = XM, ϕjM from the observed part XO. It is a standard fact that, in terms of the meansquared prediction error, the best approximation of βjM by a functional of XO is the conditional expectation E.βjM|XO/. The conditional expectation may be a non-linear functional of the condition and thus difficult to estimate. Therefore, we propose to look for the best linear prediction corresponding to a continuous linear functional of the observed curve. This is equivalent to the best linear approximation of the conditional expectation. By the Riesz representation theorem, a continuous linear functional takes the form aj,XO , where aj is an element of L2.O/. 
The best continuous linear prediction of βjM equals ˜βjM = ˜aj,XO , where ˜aj solves the infinite dimensional optimization problem min aj ∈L2.O/ E{.βjM − aj,XO /2 }: .4/ The objective functional can be rewritten as E{.βjM − aj,XO /2 }=E{ ϕjM,XM 2 −2 ϕjM,XM aj,XO + aj,XO 2 } = ϕjM, RMMϕjM −2 ϕjM, RMOaj + aj, ROOaj , where ROO is the covariance operator of XO and RMO is the cross-covariance operator of XM and XO. It is obvious that the objective functional is convex. If a minimizer exists, it can be found by setting the derivative equal to 0. The derivatives in this context are in the Fr´echet sense. In particular, we see that @ @aj E{.βjM − aj,XO /2 }=−2rj +2ROOaj, where rj = ROMϕjM with ROM = RÅ MO (the asterisk denotes the adjoint operator). Thus we need to solve the equation ROOaj =rj: .5/ We recognize that this is a linear inverse problem where we need to recover the function aj ∈ L2.O/ from its image through the linear operator ROO. Let λOOk, k = 1,2,:::, be the decreasing positive eigenvalues and ϕOOk the corresponding orthonormal eigenfunctions of the operator ROO. By comparing the coefficients of the leftand right-hand side of equation (5) with respect to the basis ϕOOk, we arrive at the system of equations λOOk aj,ϕOOk = rj,ϕOOk ,k =1,2,:::. This suggests that a candidate for the solution is ˜aj = ∞ k=1 rj,ϕOOk λOOk ϕOOk, .6/ i.e. ˜aj =R−1 OOrj. This is a valid solution, if it is an element of L2.O/, i.e. if ∞ k=1 rj,ϕOOk 2 λ2 OOk <∞: .7/ Partially Observed Functional Data 785 This condition is known in the theory of inverse problems as Picard’s condition. A solution to the inverse problem (5) exists if and only if condition (7) is satisfied. Condition (7) is equivalent to the condition ∞ k=1 corr.βjM, XO,ϕOOk /2 var. XO,ϕOOk / <∞, .8/ which has a clear interpretation. It states that the missing variable βjM must not be strongly correlated with complicated, high frequency components of the observed function. The variability of these components must be sufficiently large to provide enough information for the prediction of βjM. The precise balance between the complexity of the correlation of the unobserved score with the predictor components and the variability of the predictor components is quantified by the requirement on the series above to converge. In the Gaussian case, the conditional expectation of βjM given the principal scores XO, ϕOOk ,k=1,2,:::, is an infinite linear combination of these scores (an almost surely convergent infinite series). One can show this by conditioning on finitely many components (this multivariate conditional expectation is linear) and applying L´evy’s 0–1 law (Kallenberg (2002), theorem 7.23) to obtain the limit. The infinite sum of variances of terms in this series converges, which is equivalent to the convergence of Σ∞ k=1 rj,ϕOOk 2 λOOk or Σ∞ k=1 corr.βjM, XO, ϕOOk /2. If, moreover, condition (7) or (8) is satisfied, then the coefficients in the infinite linear combination for the conditional expectation form an l2-sequence; hence the conditional expectation is continuous in the condition. From now on, to guarantee the existence of a continuous solution to condition (5), we assume that condition (7) holds. If it is a priori known that the conditional expectation E.βjM|XO/ is a continuous linear functional of XO, then condition (7) is automatically satisfied. The operator ROO is a compact operator with infinite dimensional range; therefore, its inverse R−1 OO is not bounded (i.e. not continuous). 
Consequently, small perturbations of $r_j$ may lead to large perturbations of $\tilde a_j = R_{OO}^{-1} r_j$. It is seen from equation (6) that an overall small change of $r_j$ may result in an arbitrarily large change of $\tilde a_j$, if the change of $r_j$ occurs on a coefficient with a sufficiently high index $k$; the division by a sufficiently small eigenvalue may enormously magnify the perturbation. In other words, the solution $\tilde a_j = R_{OO}^{-1} r_j$ is extremely unstable and the inverse problem (5) is ill posed. It is important for a solution to be stable with respect to perturbations of the right-hand side $r_j$ because $r_j$ is unknown and needs to be estimated. With an estimated right-hand side, the solution to the inverse problem may be arbitrarily far from the true solution no matter how accurate the estimate is. This is true even when $R_{OO}$ is known. Moreover, the operator $R_{OO}$ is not known either; its estimate has finite rank and therefore is not invertible in $L^2(O)$.

To obtain a stable solution, one needs to use regularization, i.e. to modify the ill-posed inverse problem in such a way that it becomes well posed with a stable solution. We use ridge regularization. Instead of problem (5), we solve the problem $R_{OO}^{(\alpha)} a_j = r_j$ with $R_{OO}^{(\alpha)} = R_{OO} + \alpha I_O$, where $\alpha > 0$ and $I_O$ is the identity operator on $L^2(O)$. The inverse $R_{OO}^{(\alpha)-1}$ of the bounded operator $R_{OO}^{(\alpha)}$ is bounded and therefore the solution $\tilde a_j^{(\alpha)} = R_{OO}^{(\alpha)-1} r_j$ is stable. Denote the regularized best linear prediction of $\beta_{jM}$ by $\tilde\beta_{jM}^{(\alpha)} = \langle \tilde a_j^{(\alpha)}, X_O\rangle$. The stability of the solution increases with $\alpha$, but the bias of the solution increases also because the problem becomes more different from the original problem; conversely, with $\alpha$ decreasing, the solution becomes closer to the exact but unstable solution of the original problem.

We now turn to the practical, empirical version of the problem of computation of principal scores from partially observed functional data. We have a sample of $n$ functions $X_{1O_1},\ldots,X_{nO_n}$ observed on the sets $O_1,\ldots,O_n$. The mean function $\mu$ and the covariance operator $R$ are estimated by $\hat\mu$ and $\hat R$ introduced in Section 2. The principal score of the $i$th curve with respect to the $j$th eigenfunction is estimated by $\hat\beta_{ij}^{(\alpha)} = \hat\beta_{ijO_i} + \hat\beta_{ijM_i}^{(\alpha)}$, where $\hat\beta_{ijO_i} = \langle X_{iO_i} - \hat\mu_{O_i}, \hat\varphi_{jO_i}\rangle$ and $\hat\beta_{ijM_i}^{(\alpha)} = \langle \hat a_{ij}^{(\alpha)}, X_{iO_i} - \hat\mu_{O_i}\rangle$. Here the function $\hat a_{ij}^{(\alpha)} = \hat R_{O_iO_i}^{(\alpha)-1} \hat r_{ij}$ solves the empirical regularized inverse problem $\hat R_{O_iO_i}^{(\alpha)} a_{ij} = \hat r_{ij}$, where $\hat R_{O_iO_i}^{(\alpha)} = \hat R_{O_iO_i} + \alpha I_{O_i}$ with $\hat R_{O_iO_i}$ being an integral operator on $L^2(O_i)$ with kernel equal to the restriction of the kernel $\hat\rho$ of $\hat R$ (see equation (1)) to $O_i \times O_i$, and $\hat r_{ij} = \hat R_{O_iM_i} \hat\varphi_{jM_i}$ with $\hat R_{O_iM_i}$ defined analogously by restriction of $\hat\rho$ to $O_i \times M_i$.

We are ready to state the main convergence result that justifies this method. The difference between the regularized estimator $\hat\beta^{(\alpha)}_{ijM_i}$ and the best linear prediction $\tilde\beta_{ijM_i}$ can be decomposed into the sum of the estimation error for the regularized prediction and the approximation error due to regularization, i.e. $\hat\beta^{(\alpha)}_{ijM_i} - \tilde\beta_{ijM_i} = (\hat\beta^{(\alpha)}_{ijM_i} - \tilde\beta^{(\alpha)}_{ijM_i}) + (\tilde\beta^{(\alpha)}_{ijM_i} - \tilde\beta_{ijM_i})$. We show that, when the amount of regularization decreases at a suitable rate as the sample size increases, both terms converge to 0 in $L^2(P)$ and thus the regularized estimator of the prediction is consistent.

Theorem 1. Let $E(\|X_1\|^4) < \infty$, assumption (3) be satisfied, all eigenvalues of $R$ have multiplicity 1 and condition (7) be satisfied for $O_i$ and $M_i$ in place of $O$ and $M$ respectively. Then
\[
E\{(\hat\beta^{(\alpha)}_{ijM_i} - \tilde\beta_{ijM_i})^2\} \le O(\alpha^{-3}) O(n^{-1}) + O(\alpha)
\]
as $\alpha \to 0$ and $n \to \infty$.
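Before turning to the asymptotic behaviour of $\hat\beta^{(\alpha)}_{ij}$, the following grid-based R sketch makes the estimator concrete for a single curve and a single eigenfunction. It reuses the hypothetical helpers introduced earlier and is an illustration under our discretization assumptions, not the paper's supplied program.

```r
## Illustrative ridge-regularized score prediction hat{beta}^{(alpha)}_{ij} for one curve
## xi (values on the grid, NA where unobserved), given hat{mu} (mu), the estimated kernel
## (rho) and one eigenfunction phi_j from the earlier sketches.
pred_score <- function(xi, mu, rho, phi_j, grid, alpha) {
  h <- mean(diff(grid))
  O <- which(!is.na(xi)); M <- which(is.na(xi))     # observed / missing grid points
  xc <- xi[O] - mu[O]                               # observed part of X_i - hat{mu}
  beta_O <- h * sum(xc * phi_j[O])                  # <X_iO - mu_O, phi_jO>
  if (length(M) == 0) return(beta_O)                # complete curve: nothing to predict
  r <- h * rho[O, M, drop = FALSE] %*% phi_j[M]     # hat{r}_ij = hat{R}_OM phi_jM
  A <- h * rho[O, O] + alpha * diag(length(O))      # discretized hat{R}_OO + alpha I_O
  a <- solve(A, r)                                  # hat{a}^{(alpha)}_ij
  beta_M <- h * sum(a * xc)                         # <hat{a}^{(alpha)}_ij, X_iO - mu_O>
  beta_O + beta_M
}
```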
Hence, if $\alpha = \alpha_n$ is such that $\alpha_n \to 0$ and $\alpha_n n^{1/3} \to \infty$ as $n \to \infty$, then $\hat\beta^{(\alpha_n)}_{ijM_i}$ is a consistent estimator of the best linear prediction $\tilde\beta_{ijM_i}$ of $\beta_{ijM_i}$.

Sometimes one is interested in estimating other linear functionals than the principal score $\langle X_i - \mu, \varphi_j\rangle$. Our consistency results remain valid when $\hat\varphi_{jO_i}$ is replaced by an arbitrary random or fixed function $\hat f_{O_i} \in L^2(O_i)$ such that $E(\|\hat f_{O_i} - f_{O_i}\|^2) = O(n^{-1})$ for some deterministic $f_{O_i} \in L^2(O_i)$.

Note that theorem 1 has no strong assumptions. Picard's condition (7) is a basic assumption that is required in all inverse problems to guarantee the existence of a solution. Except for this standard requirement, no other condition on the rate of decrease of the eigenvalues $\lambda_{O_iO_ik}$ is needed. This is because we estimate the prediction $\langle \tilde a_{ij}, X_{iO_i}\rangle$ rather than the prediction functional $\tilde a_{ij}$ itself. Intuitively, the integration in $\langle \tilde a_{ij}, X_{iO_i}\rangle$ brings additional smoothness; the exact way that this happens is seen in the proof of theorem 1. In a related context of prediction in functional linear regression, it was observed by Cai and Hall (2006) and Cardot et al. (2007) that weaker assumptions are needed and stronger results can be obtained when the focus is on prediction rather than on the estimation of the regression functional.

The inverse problem is similar to that solved in the functional linear model (Cardot et al., 1999, 2007; Hall and Horowitz, 2007). However, the way that we arrive at it differs from the functional linear model because, owing to the incompleteness of observations, there is, for instance, no collection of response–covariate pairs in the present situation.

As an alternative to ridge regularization, one may consider the spectral truncation approach. Both methods have their advantages and disadvantages. For instance, it is known that the behaviour of spectral cut-off methods depends on the spacings between the eigenvalues of the operator to be inverted, which makes them less robust with respect to situations with similar or even identical eigenvalues (see Hall and Horowitz (2007)). Indeed, in a preliminary analysis of our motivating data set we observed some very similar estimated eigenvalues. There is also an important computational advantage of the ridge method. For this method, one needs to solve only a linear equation with $\hat R^{(\alpha)}_{O_iO_i}$, which is very easy and fast. In contrast, the spectral truncation approach requires computing the eigendecomposition of $\hat R_{O_iO_i}$ and projecting on the corresponding subspace. This is computationally more demanding, especially since it must be done repeatedly for each function, because different suboperators $\hat R_{O_iO_i}$ of $\hat R$ corresponding to different functions have different spectral decompositions. Yet another approach may be based on smoothing, for instance, by penalizing the roughness of the solution of the inverse problem.

3.3. Regularization parameter selection
Theorem 1 shows that, for an appropriate choice of $\alpha_n$, the estimator $\hat\beta^{(\alpha_n)}_{ijM_i}$ is consistent for the best prediction $\tilde\beta_{ijM_i}$. Theorem 1, however, does not give a practical recommendation on how to select the regularization parameter. It is desirable to have an automatic, data-driven selection procedure.

Since the parameter $\alpha$ is difficult to interpret directly, we first translate it into more comprehensible values. By analogy with ridge regression or various standard smoothing techniques, we define the number of effective degrees of freedom as the trace of the covariance operator of the predictors composed with its regularized inverse, i.e.
ˆR .α/−1 OiOi ˆROiOi /= ∞ k=1 ˆλOiOik ˆλOiOik +α , .9/ which is a decreasing function of α. Unlike in standard situations the covariance operator here is computed from partially observed data. Another way to measure the amount of regularization is the proportion of retained variability like in classical principal component analysis using, for example, tr. ˆROiOi ˆR .α/−1 OiOi ˆROiOi ˆR .α/−1 OiOi ˆROiOi / tr. ˆROiOi / = ∞ k=1 ˆλ 3 OiOik=. ˆλOiOik +α/2 ∞ k=1 ˆλOiOik .10/ or a similar quantity. One can determine α such that the effective degrees of freedom equal some value or the proportion of retained variability exceeds some threshold. These quantities, however, do not measure the predictive performance of the regularized solution. A universal recipe for situations of this type is to use generalized cross-validation. In traditional settings, the generalized cross-validation score is the residual sum of squares (a measure of goodness of fit) divided by a decreasing function of the effective degrees of freedom (a penalty included to avoid underregularization). The residual sum of squares is the sum of squared differences of the response variables and their predictions, which in our case are ˆβkjMi = XkMi − ˆμMi , ˆϕjMi and ˆβ .α/ kjMi = ˆa .α/ ij ,XkOi − ˆμOi , k =1,:::,n, respectively. In the situation of partially observed functions, the pair of the response variable ˆβkjMi and the explanatory variable XkOi is not available for all individuals k =1,:::,n. The idea is, therefore, to consider the set of completely observed functions with indices C ={k :1 k n, 1 0 Ok.t/dt =1}. If this set is reasonably large, we can compute the residual sum of squares over the complete functions rssij.α/= k∈C . ˆβkjMi − ˆβ .α/ kjMi /2 : The cross-validation score for the regularized estimation of the jth score of the ith function is gcvij.α/= rssij.α/ {1−.1=|C|/dfi.α/}2 , where |C| is the number of complete functions. One selects the value of α that minimizes this quantity. Separate values of the regularization parameter are used for each function and each score. 788 D. Kraus 3.4. Prediction uncertainty For a statistical procedure to be useful, it is important to quantify its uncertainty, i.e. to assess how far ˆβ .αn/ ijMi can be from βijMi . The following proposition answers these questions. Proposition 3. Let the assumptions of theorem 1 be satisfied and let αn →0 and αnn1=4 →∞ as n→∞. Then ˆβ .αn/ ijMi −βijMi is asymptotically distributed as ˜βijMi −βijMi , which is a zero-mean random variable with variance that can be consistently estimated by ˆv2 ij = ˆϕjMi ,. ˆRMiMi − ˆRMiOi ˆR .αn/−1 OiOi ˆROiOi ˆR .αn/−1 OiOi ˆROiMi / ˆϕjMi : If the distribution of the data is Gaussian, then the limiting variable is Gaussian. The assumptions of this proposition are similar to those of the consistency result of theorem 1, except that a slower rate of convergence of the regularization parameter to 0 is needed to estimate the limiting variance consistently. The prediction uncertainty, as expressed by the variance ˆv2 ij, does not converge to 0 as the sample size converges to ∞. This is because the situation is a prediction problem rather than an estimation problem in the sense that we try to recover a random variable rather than a non-random parameter. Thus, although increasing the sample size eventually removes the uncertainty due to unknown estimated quantities (the mean function and covariance operator) and regularization, there is a fundamental uncertainty that cannot be removed asymptotically. 
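To make Sections 3.2–3.4 concrete, the following R sketch (not the paper's code) implements the ridge-regularized score prediction together with the plug-in prediction variance of Proposition 3 for curves evaluated on an equidistant grid. The objects Rhat, muhat, phihat_j and lambdahat_j stand for the estimates of Section 2 evaluated on the grid; these names and the grid discretization are assumptions of this illustration.

```r
## Ridge-regularized prediction of a principal score from a fragment.
## x: curve values on the grid (NA on the missing part), obs: logical observation
## indicator, delta: grid spacing used to approximate integrals by sums.
predict_score <- function(x, obs, Rhat, muhat, phihat_j, lambdahat_j, alpha, delta) {
  mis  <- !obs
  zO   <- x[obs] - muhat[obs]                     # centred observed fragment X_O - mu_O
  ROO  <- Rhat[obs, obs]
  ROM  <- Rhat[obs, mis]
  RMM  <- Rhat[mis, mis]
  phiO <- phihat_j[obs]
  phiM <- phihat_j[mis]

  A <- ROO * delta + alpha * diag(sum(obs))       # matrix version of R_OO + alpha I
  r <- ROM %*% phiM * delta                       # right-hand side r_ij = R_OM phi_jM
  a <- solve(A, r)                                # regularized solution a_ij^(alpha)

  beta_obs <- sum(zO * phiO) * delta              # observed part of the score
  beta_mis <- sum(a * zO) * delta                 # predicted missing part

  # plug-in prediction variance (cf. Proposition 3)
  q  <- solve(A, ROO %*% a * delta)
  v2 <- sum(phiM * (RMM %*% phiM)) * delta^2 - sum(phiM * (t(ROM) %*% q)) * delta^2
  v2 <- max(v2, 0)

  list(score     = beta_obs + beta_mis,
       pred_sd   = sqrt(v2),
       rel_error = sqrt(v2 / lambdahat_j),        # relative error (12)
       interval  = beta_obs + beta_mis + c(-1, 1) * qnorm(0.975) * sqrt(v2))
}
```

The ridge term is added after the covariance matrix has been scaled by the grid spacing, which is one way of discretizing the operator equation consistently with the grid representation described in the on-line supplement.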
In other words, the knowledge of the principal score will never be precise, if the functional observation is incomplete, and the limits of accuracy of the prediction are given by the asymptotic variance v2 ij. We refer to Didericksen et al. (2012) for an interesting discussion of similar questions in somewhat related prediction problems in the context of functional time series. Proposition 3 immediately enables us to construct a prediction interval for the score. Assume that a Gaussian distribution is a good approximation for the distribution of the data. Then Iij;η =. ˆβ .αn/ ij −z1−η=2 ˆvij, ˆβ .αn/ ij +z1−η=2 ˆvij/, .11/ where z1−η=2 is the .1−η=2/-quantile of the standard normal distribution, is a prediction interval for βij with asymptotic coverage probability 1−η, i.e. P.βij ∈Iij;η/→1−η as n→∞. Since principal component analysis is often used as a dimension reduction procedure and the resulting principal scores are subsequently analysed by traditional techniques, it is useful to have a measure of reliability of the computed scores. The true score βij is a random variable with variance estimated by ˆλj. The predicted score ˆβ .αn/ ij can be seen as the true score contaminated by error with variance estimated by ˆv2 ij. One can define the relative error ˆvij= ˆλ 1=2 j , .12/ which is the ratio of the error variability and the natural intrinsic variability of the score. This value, lying between 0 and 1, can be used as an indicator of observations that are too uncertain, and the scores whose relative error exceeds a certain threshold (e.g. 0.2) can be excluded from the subsequent analysis. The uncertainty will be high when the association between the missing part of the score and the observed fragment is weak. The high uncertainty of predictions due to a small amount of observed information is one example of situations where we must be cautious. Another such case could be when missingness is very frequent in certain regions or the overlap of observation periods is not sufficiently frequent because then the precision of the estimation of the covariance function will be locally reduced, and consequently the prediction procedure may be less accurate. The performance Partially Observed Functional Data 789 of generalized cross-validation may also be negatively influenced. Yet another problem could arise when the data are not missing at random (e.g. when missingness is more likely to occur when functional values are high). In such cases, missing functional chunks may be indeed very insidious because important features of the data distribution may be lost. Furthermore, the presence of functional outliers can be a complication as they may be more difficult to detect when only fragments are available. 4. Functional completion 4.1. Reconstruction of incomplete functions It is natural to ask whether it is possible to recover not only the missing part of a principal score (and thus to compute the score of an incomplete function) like in Section 3 but also the whole missing part of the trajectory (and thus to reconstruct the whole functional variable). The answer is positive. In the population version of the problem, the best prediction of XM by a function of XO in the sense of the mean integrated prediction squared error is the conditional expectation E.XM|XO/. It is in general a non-linear operator from L2.O/ to L2.M/ and, similarly to the case of principal scores, we consider its best continuous linear approximation. 
Assuming for simplicity that the functional variable has mean 0, the minimization problem to be solved is min A: A ∞<∞ E. XM −AXO 2 /, where the solution is looked for in the class of continuous (bounded) linear operators from L2.O/ to L2.M/ (by · ∞ we denote the operator norm). We see (by Fr´echet differentiation or direct computation) that solving this minimization is equivalent to solving the (normal) equation AROO =RMO. This suggests the solution ˜A=RMOR−1 OO and the best linear prediction of XM in the form ˜XM = ˜AXO. From now on, we assume the existence of a bounded solution, i.e. we assume that RMOR−1 OO ∞ <∞. Similarly to the case of principal scores, the inverse problem to be solved is ill posed. Using ridge regularization we obtain the solution ˜A .α/ =RMOR .α/−1 OO . The regularized best linear prediction equals ˜X .α/ M = ˜A .α/ XO. Practically, when the sample X1O1 ,:::,XnOn is observed on the subsets O1,:::,On, we replace the covariance operator by its estimate and set ˆA .α/ i = ˆRMiOi ˆR .α/−1 OiOi . The mean function needs to be estimated as well. For the ith curve, the best linear prediction of XiMi is estimated by ˆX .α/ iMi = ˆμMi + ˆA .α/ i .XiOi − ˆμOi /: To prove the consistency, we assume not only that the solution to the inverse problem (the prediction operator) is bounded but that it is Hilbert–Schmidt. We have a result as follows. Theorem 2. Let E. X1 4/<∞, assumption (3) be satisfied and RMiOi R−1 OiOi 2 <∞. Then E. ˆX .α/ iMi − ˜XiMi 2 / O.α−3 /O.n−1 /+O.α/ as α → 0 and n → ∞. Hence, if α = αn such that αn → 0 and αnn1=3 → ∞ as n → ∞, then ˆX .αn/ iMi is a consistent estimator of the best linear prediction ˜XiMi of XiMi . Note that our consistency result is genuinely functional. It is different from theorem 3 of Yao et al. (2005a) where it was possible to obtain only a pointwise consistent estimator of the functional variable. The reason is that we assume that the functions are observed fully (or densely in practice) on subsets of the domain whereas Yao et al. (2005a) worked in a sparse 790 D. Kraus observation regime. In other words, we can achieve stronger results because our data contain more information. The assumption that the prediction operator ˜Ai =RMiOi R−1 OiOi is Hilbert–Schmidt ( ˜Ai 2 < ∞) which is needed for the proof is a strengthening of the basic assumption on the continuity of ˜Ai ( ˜Ai ∞ < ∞). Assumptions of this type were used in related contexts of, for example, prediction in functional time series (Bosq (2000), chapter 8, and Kargin and Onatski (2008)) and the functional linear model (Yao et al., 2005b; He et al., 2010). It seems possible to replace this assumption by a combination of the condition ˜Ai ∞ <∞ and a condition on the eigenvalue sequence λOiOik such that the regularization error can be controlled. The condition ˜Ai 2 < ∞ can be written explicitly in terms of the covariance structure of the principal scores of the observed and unobserved part of the function. If the eigendecompositions of ROiOi and RMiMi are ROiOi = ∞ k=1 λOiOikϕOiOik ⊗ϕOiOik, RMiMi = ∞ k=1 λMiMikϕMiMik ⊗ϕMiMik (where ‘⊗’ stands for the tensor product: .f ⊗g/u= g,u f), then we can write RMiOi = ∞ j=1 ∞ k=1 γMiOijkϕMiMij ⊗ϕOiOik, where γMiOijk = ϕMiMij, RMiOi ϕOiOik = cov. XMi −μMi ,ϕMiMij , XOi −μOi ,ϕOiOik /. Then the operator ˜Ai is Hilbert–Schmidt whenever ∞ j=1 ∞ k=1 γ2 MiOijk λ2 OiOik <∞, which is equivalent to ∞ j=1 λMiMij ∞ k=1 corr. 
XMi −μMi ,ϕMiMij , XOi −μOi ,ϕOiOik /2 λOiOik <∞: It is seen that this condition combines conditions for the prediction of XMi −μMi ,ϕMiMij , j =1, 2,::: (compare the inner series above with condition (8)). 4.2. Selection of the regularization parameter To understand the amount of regularization corresponding to α, we can use the effective degrees of freedom or the proportion of retained variability as defined in equations (9) and (10) respectively. For the selection of α automatically balancing the stability and accuracy of the prediction of XiMi , we propose a similar cross-validation procedure to that in Section 3.3 for principal scores. The residual sum of squares for the prediction of trajectories on Mi computed for the completely observed curves in the sample is rssi.α/= k∈C XkMi − ˆX .α/ kMi 2 : The value of α that is used for the prediction of a function on Mi from its observation on Oi minimizes Partially Observed Functional Data 791 gcvi.α/= rssi.α/ {1−.1=|C|/dfi.α/}2 : 4.3. Uncertainty and prediction bands Theorem 2 shows that ˆX .αn/ iMi consistently estimates the best linear prediction ˜XiMi . We are now interested in the variation of ˆX .αn/ iMi around the target quantity: the unobserved function XiMi . Proposition 4. Let the assumptions of theorem 2 be satisfied and let αn →0 and αnn1=4 →∞ as n→∞. Then ˆX .αn/ iMi −XiMi is asymptotically distributed (in the sense of weak convergence of probability measures on L2.[0,1]// as the mean 0 stochastic process ˜XiMi −XiMi . The limiting covariance operator is consistently estimated (with respect to the Hilbert–Schmidt norm) by ˆVi = ˆRMiMi − ˆRMiOi ˆR .αn/−1 OiOi ˆROiOi ˆR .αn/−1 OiOi ˆROiMi : If the data are Gaussian, then the limiting stochastic process is Gaussian. The trace of ˆVi quantifies the total amount of uncertainty of the linear prediction of XiMi . It approaches 0 as the Lebesgue measure of the missing region Mi approaches 0, i.e. as we approach a completely observed function. When the measure of the observation period Oi converges to 0, the total prediction uncertainty converges to the trace of ˆR, which corresponds to the situation of no information about the ith curve. The scale invariant ratio tr. ˆVi/1=2 =tr. ˆR/1=2 .13/ measures the relative prediction error, i.e. the amount of uncertainty about the ith curve as a proportion of the total spread of the distribution of the functional random variable. 1 minus this value corresponds to the reduction of uncertainty that is achieved by the best linear prediction and can be seen as a measure of performance of the completion procedure. Alternatively, we can use ˆRMiMi instead of ˆR in the denominator in the relative prediction error, leading to the ratio of the uncertainty about the missing trajectory when the prediction method is used versus the uncertainty that there would be about XiMi if we ignored the observed part. We use the asymptotic distribution of ˆX .αn/ iMi −XiMi for the construction of prediction bands for the unobserved part of the trajectory, i.e. regions containing the curve XiMi with high probability. We consider bands of the form {.t,x/: ˆX .αn/ iMi .t/−c1−η ˆh.t/ x ˆX .αn/ iMi .t/+c1−η ˆh.t/, t ∈Mi}, .14/ where ˆh is a function that consistently estimates some limiting function h that is bounded away from zero, and c1−η is the .1 − η/-quantile of the random variable supt∈Mi | ˜XiMi .t/ − XiMi .t/|=h.t/. This band has asymptotic coverage 1 − η. 
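Before turning to the choice of the width function ĥ, the completion step itself can be sketched in the same grid representation as above. Again, Rhat and muhat denote the estimated covariance matrix and mean on the grid, and the function below is an illustrative sketch rather than the paper's implementation.

```r
## Regularized reconstruction of the missing part of a curve and the estimated
## covariance V_i of the prediction error (Proposition 4).
complete_curve <- function(x, obs, Rhat, muhat, alpha, delta) {
  mis <- !obs
  zO  <- x[obs] - muhat[obs]                       # centred observed fragment
  ROO <- Rhat[obs, obs]; ROM <- Rhat[obs, mis]; RMM <- Rhat[mis, mis]
  A   <- ROO * delta + alpha * diag(sum(obs))      # matrix version of R_OO + alpha I

  # regularized best linear prediction mu_M + R_MO (R_OO + alpha I)^{-1} (X_O - mu_O)
  x_mis <- muhat[mis] + t(ROM) %*% solve(A, zO) * delta

  # kernel of V_i = R_MM - R_MO (R_OO + alpha I)^{-1} R_OO (R_OO + alpha I)^{-1} R_OM
  B <- solve(A, ROM)
  V <- RMM - delta^2 * t(ROM) %*% solve(A, ROO %*% B)

  list(prediction   = as.vector(x_mis),
       pointwise_sd = sqrt(pmax(diag(V), 0)),      # basis for variable-width bands (14)
       rel_error    = sqrt(sum(diag(V)) / sum(diag(Rhat))))   # relative error (13)
}
```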
One can choose ˆh = 1, leading to a band with constant width, but typically one prefers a band whose width at time t reflects the uncertainty of the prediction of the missing function at t. We use ˆh.t/=max{ ˆh0, ˆvi.t/} where ˆvi.t/ is the estimated standard deviation of the limiting predictive distribution at time t, i.e. the square root of the diagonal of the kernel of ˆVi, and ˆh0 is a threshold guaranteeing that the limiting function h is bounded away from 0. For example, the choice ˆh0 = 0:2 supt∈Mi ˆvi.t/ works well in practice. If the distribution of the data can be considered as Gaussian, the quantile c1−η can be computed by simulation as follows. Generate a large number of independent realizations of the Gaussian process with mean 0 and covariance operator ˆVi, divide them by ˆh.t/, compute the maxima of their absolute values and determine the .1 − η/-quantile of this sample. The 792 D. Kraus simulation of the trajectories and the computation of the maxima are performed on a fine grid of points. Note that the width of the band does not converge to 0 because it is a prediction band, i.e. it must contain, with high probability, a random function. We conlude this section with a theoretical remark. Although the prediction bands proposed work well in practice, as is documented in the simulation study in Section 5, for a strictly rigorous justification arguments based on proposition 4 (which is a consequence of theorem 2) need to be extended. Proposition 4 guarantees the convergence in distribution in the sense of the topology of the L2-norm of the Hilbert space L2.[0,1]/. This justifies the construction of prediction regions in the form of balls in L2.[0,1]/ which, however, are not practical because they cannot be plotted. For prediction bands, the convergence is needed in the sense of the uniform topology. For this, we need to leave the geometric world of L2.[0,1]/ and to switch to the space of continuous functions C.[0, 1]/. Under modified assumptions (which would include conditions on sample paths, such as H¨older continuity), it seems possible to prove the convergence in the uniform topology. We do not pursue this theoretical study but give arguments indirectly justifying the use of the bands. Suppose that the asymptotic approximation that is suggested by theorem 2 and proposition 4 is considered applicable if the L2-distance from the limiting variable is sufficiently small. The probability that this L2-distance exceeds some ">0 is, in light of Chebyshev’s inequality, bounded as P. ˆX .αn/ iMi − ˜XiMi 2 2 >"/ "−2E. ˆX .αn/ iMi − ˜XiMi 2 2/. However, convergence in the L2-norm does not imply uniform convergence because large deviations may occur on a small set of arguments. Let us compute the Lebesgue measure γ of the set where | ˆX .αn/ iMi − ˜XiMi | deviates more than " from 0. We compute γ.{t : | ˆX .αn/ iMi .t/ − ˜XiMi .t/| > "}/ "−2 ˆX .αn/ iMi − ˜XiMi 2 2 by using Chebyshev’s inequality. Taking expectations on both sides, we obtain on the right-hand side the same bound as before. Hence, if the bound is considered to be sufficiently small for the asymptotic approximation in the L2-norm to be applicable, then also the expected Lebesgue measure of the set of large pointwise deviations will be negligible. 5. 
Simulations

A simulation study was designed to address the following goals: to investigate the performance of generalized cross-validation as a selector of the regularization parameter, to verify the validity and accuracy of the prediction intervals and bands, and to explore the effect of the observation pattern. We generate random samples of curves of the form
$$X(t) = \sum_{k=1}^{100} 2^{1/2} \nu_k^{1/2} \xi_k \cos(2\pi k t) + \sum_{k=1}^{100} 2^{1/2} \omega_k^{1/2} \eta_k \sin(2\pi k t), \quad t \in [0,1],$$
where $\xi_k$ and $\eta_k$ are independent standard normal variables and the eigenvalues are of the form $\nu_k = 3^{-(2k-1)}$ and $\omega_k = 3^{-2k}$. The three most important components represent 67%, 22% and 7% of the total variability. For each curve we independently generate a random period on which the curve is not observed, and the functional values on this period are removed. For the $i$th function, the missing period $M_i$ is simulated in the form $M_i = [C_i - E_i, C_i + E_i] \cap [0,1]$ with $C_i = d\, U_{i,1}^{1/2}$ and $E_i = f\, U_{i,2}$, where $d$ and $f$ are parameters and $U_{i,1}$ and $U_{i,2}$ are independent variables uniformly distributed on $[0,1]$ (a small implementation sketch of this design is given below). The performance of our procedures is measured on one curve in the sample, say $X_1$. For this curve, we use a fixed (non-random) missing period to guarantee that values computed in different simulation runs have the same meaning. In all simulations, we use $L = 1000$ repetitions.

For the first two sets of simulations, we set $d = 1.4$ and $f = 0.2$. This leads to an observation pattern with similar characteristics to those in our motivating data set. The cross-sectional probability of observation ranges from 99% at time 0 to 79% at time 1. The percentage of complete curves is 39%. The median length of the missing period (given that the curve has a missing period) is 0.15. For the curve $X_1$, on which the performance is measured, we set $M_1 = (0.4, 0.7)$.

First, we investigate the performance of generalized cross-validation based on complete curves. As a measure of quality of the prediction of a missing quantity, we use the mean-squared prediction error (MSPE), which is the average over all simulation runs of the squared distances between the predicted value and the true value, i.e., $L^{-1} \sum_{l=1}^{L} \bigl(\hat\beta^{(\alpha)[l]}_{1jM_1} - \hat\beta^{[l]}_{1jM_1}\bigr)^2$ for the $j$th score and $L^{-1} \sum_{l=1}^{L} \bigl\|\hat X^{(\alpha)[l]}_{1M_1} - X^{[l]}_{1M_1}\bigr\|^2$ for the missing part of the trajectory, where the superscript $[l]$ indicates that the value pertains to the $l$th generated sample.
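As referenced above, here is a minimal R sketch of the data-generating process of this simulation study; the grid size and the coding of missing values by NA are choices made for this illustration, not taken from the paper.

```r
## Generate n partially observed curves of the form used in the simulation study.
simulate_sample <- function(n, d = 1.4, f = 0.2, ngrid = 200) {
  tt <- (seq_len(ngrid) - 0.5) / ngrid               # equidistant grid on [0, 1]
  K  <- 1:100
  nu <- 3^(-(2 * K - 1))                             # eigenvalues of the cosine part
  om <- 3^(-2 * K)                                   # eigenvalues of the sine part
  basis <- cbind(sqrt(2) * cos(2 * pi * outer(tt, K)),
                 sqrt(2) * sin(2 * pi * outer(tt, K)))
  scores <- matrix(rnorm(n * 2 * length(K)), n) * rep(sqrt(c(nu, om)), each = n)
  X <- scores %*% t(basis)                           # n curves evaluated on the grid

  # remove the random period M_i = [C_i - E_i, C_i + E_i] from each curve
  Ci <- d * sqrt(runif(n)); Ei <- f * runif(n)
  for (i in seq_len(n)) X[i, tt > Ci[i] - Ei[i] & tt < Ci[i] + Ei[i]] <- NA
  list(t = tt, X = X)
}
```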
Table 1 shows values of the mean-squared prediction error for the first three principal scores and for the missing part of the trajectory. Table 1 also includes the variability of the target quantities (i.e., the true eigenvalues for the scores and the trace of the true covariance operator $R$ for the trajectory) to put the values into context.

Table 1. Performance of the generalized cross-validation selection procedure†

                                       MSPE for α = c·α_gcv and the following values of c:   Median degrees of freedom
Target quantity (variability)     n      0.04     0.2       1       5      25                for α = α_gcv
Score 1 (333)                   100      1.91    1.55    1.32    1.61    3.78                      7.68
                                500      0.60    0.44    0.36    0.42    1.07                     12.73
Score 2 (111)                   100      0.46    0.37    0.35    0.44    0.80                      8.61
                                500      0.16    0.13    0.12    0.15    0.27                     13.71
Score 3 (37)                    100      1.45    1.13    0.95    1.08    2.00                      8.62
                                500      0.48    0.34    0.28    0.29    0.53                     13.71
Missing trajectory (500)        100     10.07    7.90    6.95    8.24   15.16                      7.98
                                500      4.04    2.79    2.24    2.30    3.48                     15.02

†MSPE and the variability of the target quantity are multiplied by 1000.

The mean-squared prediction error is reported for α set to the value selected by generalized cross-validation and to values slightly smaller or larger, in the form of multiples of the selected value. We see that the method successfully approximates the best value of α and can be recommended as the tuning-parameter selector. The accuracy increases with increasing sample size n; however, it should be noted that the mean-squared prediction error cannot converge to 0 because there is always some uncertainty due to the randomness of the target quantity, as discussed in Sections 3.4 and 4.3. The last column of Table 1 reports the median of the effective degrees of freedom corresponding to the selected value of α. It is seen that in all cases the typical number of degrees of freedom is in a reasonable relation to the sample size.

The second set of simulations explores the properties of the approximate distribution of the deviation of the prediction from the predicted quantity that is established in Propositions 3 and 4. We simulate from the same distribution and observation pattern as before. The regularization parameter is selected by generalized cross-validation. We consider prediction intervals and bands of the form (11) and (14), respectively, with nominal coverage 95%. We compute bands with both constant and variable width, as discussed in Section 4.3. Empirical coverage probabilities (i.e., the percentage of cases in which the unobserved quantity was covered by the constructed region) are reported in Table 2. We see that the proposed intervals and bands have coverage close to the nominal level and, therefore, provide useful information on the probable values of the scores or the missing trajectory. Table 2 also reports the median of the relative error measures (12) and (13). For instance, we can see that the approximate distribution is relatively more spread out for less variable (higher-index) scores. This is in line with the conclusions from Table 1, where we observed a similar relationship between the MSPE and the variability of the target quantity. Hence the relative error measures (12) and (13), which can be computed from the data, seem to be valuable indicators of the accuracy of the reconstruction procedure.

Table 2. Empirical coverage of prediction regions (intervals for scores; bands with constant and variable width for curves) and the median relative error measure

           Score 1               Score 2               Score 3              Missing trajectory
  n   Coverage  Median      Coverage  Median      Coverage  Median      Coverage    Coverage    Median
        (%)     relative      (%)     relative      (%)     relative    (constant   (variable   relative
                error                 error                 error       width, %)   width, %)   error
100    97.2     0.073        95.2     0.056        94.5     0.143         94.3        96.7      0.123
500    97.4     0.042        95.0     0.036        96.3     0.092         94.2        98.4      0.07

Table 3. Standardized mean-squared prediction error for different observation patterns

                              Score 1          Score 2          Score 3       Missing trajectory
      Observation             Observation pattern of the sample:
  n   pattern (X1)              A      B         A      B         A      B         A      B
100        I                  0.022  0.045     0.052  0.093     0.035  0.067     0.040  0.075
           II                 0.039  0.073     0.078  0.128     0.107  0.155     0.076  0.136
500        I                  0.006  0.013     0.018  0.031     0.010  0.023     0.013  0.024
           II                 0.019  0.027     0.037  0.051     0.060  0.076     0.035  0.051

In the last set of simulations, we study the effect of the observation pattern on the accuracy of our methods.
We vary the amount of observed information both for X1 (whose characteristics are to be reconstructed) and for the whole sample (which is used to learn the reconstruction procedure). Two settings are used for the missing period of X1: I, M1 = .0:4, 0:7/; II, M1 = .0:4,0:9/. For the simulation of the missing periods of other curves in the sample, we simulate Mi of the form given earlier in this section, with parameter pairs A, d = 1:4 and f = 0:2, and B, d = 1:4 and f = 0:5. Basic characteristics of the observation pattern for A were discussed before; for B, the cross-sectional observation probability varies from 95% at t = 0 to 50% at t = 1, 21% of curves are complete and the average length of missing periods (among incomplete curves) is 0.29. Configuration IA was used in the first two sets of simulations; other combinations contain less observed information. Results are reported in Table 3 where mean-squared prediction errors are presented after standardization by the true variance of the predicted quantity, i.e. by the variance of the missing part of the score, var.β1jM1 /, or by the trace of the covariance operator of the missing part of the trajec- Partially Observed Functional Data 795 tory, tr.RM1M1 /; after this standardization it is possible to compare values under pattern I with their counterparts computed under II. We see that the precision of estimation decreases as the amount of observed information (either on the curve of interest or on the sample) decreases. 6. An illustration: ambulatory blood pressure monitoring data Heart rate profiles displayed in Fig. 1 and their first derivative plotted in Fig. 2 were obtained from raw observations by penalized spline smoothing described in the supplementary file that is available on line. The curves were registered by shifting the individual timescales so that every person’sbedtimeis23(i.e.11p.m.);individualbedtimeswereavailablefromaquestionnaire.The methodology that is developed in this paper requires that the observation periods be independent of the curves. The expert opinion is that this is a realistic assumption; in addition, we performed exploratory graphical checks that did not indicate any problem with regard to this assumption. From the shape of the mean functions of the profiles and their first derivatives it is obvious that on average heart rate profiles have a decreasing shape in this part of the day and they decrease fastest around the bed time. We wish to understand the main sources of variability between individual heart rate profiles. In Fig. 3 we plot the first three eigenfunctions of the profiles and of their derivatives as perturbations of the mean shape (see Ramsay and Silverman (2005), section 8.3.1) i.e. we plot the mean profile plus and minus a suitable multiple of each eigenfunction (the eigenfunctions are multiplied by 0:9 ˆλ 1=2 j ). For the profiles, we see that the most important component is the global level of heart rate, followed by a component describing the difference between the day and night values and a component that can be interpreted as a time shift. In terms of the first derivative, the first component quantifies the global level of the speed of decrease, the second component captures a shift in time and the third characterizes whether the individual’s heart rate decreases rather suddenly or more gradually. The first three components explain a large proportion of the total variability and provide enough flexibility to capture individual shape features, e.g. 
the increasing trend of some curves in regions where the mean and most curves decrease.

[Figure 2 appears here. Fig. 2: (a) subset of the sample of the first derivatives of heart rate profiles and (b) several curves in detail.]

[Figure 3 appears here. Fig. 3: (a)–(c) first three eigenfunctions of heart rate profiles and (d)–(f) of their first derivative, plotted as perturbations of the mean: (a) principal component 1, 87.2%; (b) principal component 2, 9.3%; (c) principal component 3, 2.1%; (d) principal component 1, 59.5%; (e) principal component 2, 33.8%; (f) principal component 3, 4.5%.]

Let us now focus on the individual level. To illustrate our prediction method for principal scores, we first consider the curve that is plotted as short dashes in Figs 1(b) and 2(b). The functional values are missing on a subset of the time interval and hence the principal scores cannot be computed directly. They can, however, be predicted. We give the results for the profile only (one can proceed analogously for the first derivative). The predicted values for the first three components are (−28.7, 2.9, −1.9). Their prediction standard deviations quantifying the uncertainty are (1.7, 2.3, 1.8). Mainly for the first two components, these are relatively small compared with the standard deviations of the intrinsic variability (24.0, 7.8, 3.7) (the square roots of the eigenvalues); the corresponding relative errors are (0.07, 0.29, 0.48). It is not surprising that the best precision is achieved for the first component: this component dominates the spectrum and is quite simple (roughly constant), so even a fraction of the curve provides a relatively large amount of information about the score.

Next, we illustrate the method on the completely observed function plotted as the chain curves in Figs 1(b) and 2(b), from which we artificially remove observations in the time interval [23.75, 26]. Using the remaining part for the prediction, we estimate the scores by (5.84, 4.43, 4.18) (with prediction standard deviations (2.12, 2.68, 2.01)), which is quite close to the true values (5.76, 4.55, 4.32) computed from the complete curve (recall, however, that there will always be some random non-vanishing discrepancy between the predicted and true values because we predict random variables by their conditional expectations).

Finally, we illustrate the functional reconstruction procedure. In Fig. 4 we plot the two curves (and their derivatives) that we considered before and the reconstructed missing parts along with 95% prediction bands. For the originally complete function (Figs 4(b) and 4(d)), we chose a difficult scenario: the missing period is relatively large (2.25 h) and it contains a non-trivial change of shape of the curve, mainly in terms of the first derivative, which is decreasing in the observed region and increasing in the missing period. However, it is seen that the completion procedure can recover the missing part of the information, as the predicted curve (thick) approximates the true function (thin) very well. It is interesting that our method captures to some extent the presence of a local minimum in the first derivative. This illustrates the usefulness of the reconstruction procedure: without it, important shape features like this would be concealed from the analyst.

[Figure 4 appears here. Fig. 4: (a), (b) observed and reconstructed heart rate profiles and (c), (d) derivatives, along with 95% prediction bands, for (a), (c) an incompletely observed curve and (b), (d) a complete curve with an artificially introduced missing period.]

At first glance, some of the bands may seem wide, but one needs to keep in mind that they are prediction (not confidence) bands and, therefore, must cover the random trajectory (rather than a non-random function) with high probability. The uncertainty of the completion is in fact not large in proportion to the intrinsic variability of the stochastic process: the relative error is 0.10 and 0.11 for the curves in Figs 4(a) and 4(c), and 4(b) and 4(d), respectively. A referee pointed out that the prediction bands for the derivatives are narrower than those for the curves. This is not a general phenomenon: it is possible to construct simple examples with prediction bands for derivatives that are wider than those for curves, or examples with no such inequality. Differentiation is an operation that changes the covariance structure of functional data in a complex manner.

We compared our method with that of Yao et al. (2005a) applied to the raw heart rate values (not preprocessed by smoothing). Although their method was primarily developed for sparsely observed curves, it can also be used in our situation. The main results regarding the covariance structure of the profiles were similar for both methods. The proportion of variance explained by the first three principal components was 82.9%, 10.8% and 3.4%. The first three eigenfunctions had a similar shape and interpretation with both methods. There was a high degree of agreement between the principal scores obtained by the two methods. The method of Liu and Müller (2009) can reconstruct derivatives. However, our method seems to be the only currently available method that can perform principal component analysis of derivatives under incompleteness. This is an important asset of our method over the other approach, provided that the data are sufficiently dense on subsets of the domain.

Acknowledgements

This work was done within the 'Swiss kidney project on genes in hypertension', which is a collaboration between Murielle Bochud (Principal Investigator), M. Burnier, O. Devuyst, P.-Y. Martin, M. Mohaupt, F. Paccaud, A. Péchère-Bertschi, B. Vogt, D. Ackermann, H. Alwan, Y. Bouatou, N. Dhayat, G. Ehret, I. Guessous, P. Monney, M.-E. Mueller, B. Ponte, M. Pruijm, S. Reverdin, P. Vuistiner, Z. Kutalik and S. Estoppey. The project was funded by the Swiss National Science Foundation. Special thanks are given to Murielle Bochud for her support and interest, and for her understanding of the importance of methodological developments in statistics. The hospitality of the Institute of Social and Preventive Medicine Lausanne is gratefully acknowledged. I am also grateful to the Joint Editor, the Associate Editor and two referees for their interesting comments and encouragement.

Appendix A: Main proofs

Here we prove Theorems 1 and 2. Propositions 1–4 are proven in the supplementary document that is available online. Recall that we denote by $\|\cdot\|$ the $L^2$-norm of square integrable functions on a domain $S$ that is obvious from the context ($S$ will be $[0,1]$ or $O_i$ or $M_i$).
For linear operators, the symbols · ∞ and · 2 are used for the operator norm and the Hilbert–Schmidt norm respectively, where the operator will be a mapping between L2 .S1/ and L2 .S2/ with S1 and S2 that is obvious from the context. For definitions of basic notions from operator theory, we refer to Bosq (2000). A.1. Proof of theorem 1 We neglect the fact that the data are centred by the estimated mean function and assume that the mean is known and equal to 0. The result remains valid when the curves are centred empirically, as the additional terms are negligible. It is enough to prove the inequality in the statement of the theorem; the remaining assertions follow easily. We write | ˆβ .α/ ijMi − ˜βijMi | | ˆβ .α/ ijMi − ˜β .α/ ijMi |+| ˜β .α/ ijMi − ˜βijMi |, which is a decomposition into the estimation error and approximation error. If we show that both errors converge in L2 .P/ to 0, the result will follow. We denote the approximation error A1 =| ˜β .α/ ijMi − ˜βijMi | and compute E.A2 1/=E{ XiOi , ˜a.α/ ij − ˜a2 iji } = R 1=2 OiOi . ˜a.α/ ij − ˜aij/ 2 = R 1=2 OiOi .R.α/−1 OiOi −R−1 OiOi /rij 2 = ∞ k=1 λOiOik 1 λOiOik +α − 1 λOiOik 2 rij, ϕOiOik 2 =α ∞ k=1 αλOiOik .λOiOik +α/2 rij, ϕOiOik 2 λ2 OiOik =O.α/, where λOiOik and ϕOiOik are the eigenvalues and eigenfunctions of ROiOi and the result follows from the fact that αλOiOik=.λOiOik +α/2 1 and Picard’s condition (7). Let us turn to the estimation error | ˆβ .α/ ijMi − ˜β .α/ ijMi |. The computation of expectations is complicated by the fact that the quantities ˆROiOi and ˆrij are obtained from the whole sample including the ith function and thus are dependent on the ith function. We overcome this complication by first considering a modified problem with estimates of ROiOi and rij independent of the ith function and then showing Partially Observed Functional Data 799 that this modification is asymptotically negligible. Specifically, we introduce ˆβ .α/ ijMi.−i/ = ˆR .α/−1 OiOi.−i/ ˆrij.−i/ with ˆR .α/ OiOi.−i/ = ˆROiOi.−i/ +αIOi and ˆrij.−i/ = ˆROiMi.−i/ ˆϕjMi.−i/. Here ˆROiOi.−i/ and ˆROiMi.−i/ are suboperators of the estimated covariance operator ˆR.−i/ that is computed from all functions except the ith, and ˆϕjMi.−i/ is a subfunction of the jth eigenfunction ˆϕj.−i/ of ˆR.−i/. We decompose | ˆβ .α/ ijMi − ˜β .α/ ijMi | as follows: | ˆβ .α/ ijMi − ˜β .α/ ijMi | | ˆβ .α/ ijMi − ˆβ .α/ ijMi.−i/|+| ˆβ .α/ ijMi.−i/ − ˜β .α/ ijMi |, .15/ and we show that both terms converge in L2 .P/ to 0. For the second term on the right-hand side in inequality (15), A2 =| ˆβ .α/ ijMi.−i/ − ˜β .α/ ijMi |, we have E.A2 2/=E{E.| ˆβ .α/ ijMi.−i/ − ˜β .α/ ijMi |2 |{XkOk :k =i}/} =E{E.| XiOi , ˆa.α/ ij.−i/ − ˜a.α/ ij 2 |{XkOk :k =i}/} =E{ R 1=2 OiOi . ˆa.α/ ij.−i/ − ˜a.α/ ij / 2 }: Using the definitions of ˆa.α/ ij.−i/ and ˜a.α/ ij and the triangle inequality, we obtain R 1=2 OiOi . ˆa.α/ ij.−i/ − ˜a.α/ ij / R 1=2 OiOi ˆR .α/−1 OiOi.−i/. ˆROiMi.−i/ −ROiMi / ˆϕjMi.−i/ + R 1=2 OiOi ˆR .α/−1 OiOi.−i/ROiMi . ˆϕjMi.−i/ − ˆsjϕjMi / + R 1=2 OiOi . ˆR .α/−1 OiOi.−i/ −R.α/−1 OiOi /ROiMi.−i/ϕjMi with ˆsj =sgn ˆϕj.−i/, ϕj . Denote these three terms A21, A22 and A23 respectively. We see that A21 R 1=2 OiOi ∞ ˆR .α/−1 OiOi.−i/ ∞ ˆROiMi.−i/ −ROiMi ∞ ˆϕjMi.−i/ : Here, R 1=2 OiOi ∞ is a finite constant, ˆR .α/−1 OiOi.−i/ ∞ α−1 and ˆϕjMi.−i/ ˆϕj.−i/ =1. Using proposition 1 we obtain E.A2 21/ α−2 O.n−1 /. For the term A22 we have the bound A22 R 1=2 OiOi ∞ ˆR .α/−1 OiOi.−i/ ∞ ROiMi ∞ ˆϕjMi.−i/ − ˆsjϕjMi : In light of proposition 2, we see that E. ˆϕjMi.−i/ − ˆsjϕjMi 2 / E. 
ˆϕj.−i/ − ˆsjϕj 2 /=O.n−1 /. This implies that E.A2 22/ α−2 O.n−1 /. For the term A23, first note that ˆR .α/−1 OiOi.−i/ −R.α/−1 OiOi =R.α/−1 OiOi .R.α/ OiOi − ˆR .α/ OiOi.−i// ˆR .α/−1 OiOi.−i/ =R.α/−1 OiOi .ROiOi − ˆROiOi.−i// ˆR .α/−1 OiOi.−i/: Therefore, we see that A23 R 1=2 OiOi R.α/−1 OiOi ∞ ˆROiOi.−i/ −ROiOi ∞ ˆR .α/−1 OiOi.−i/ ∞ ROiMi ∞ ˆϕjMi.−i/ : The first, third and fifth term are dominated by α−1=2 , α−1 and 1 respectively. The fourth term is a finite constant. Using these bounds and proposition 1 we obtain E.A2 23/ α−3 O.n−1 /. Hence with the help of the Cauchy–Schwarz inequality we finally obtain that E.A2 2/ α−3 O.n−1 /. It remains to analyse the first term on the right-hand side of inequality (15). It reflects the effect of omitting the ith observation in the estimation. As this effect is of order O.n−2 / in terms of mean-squared difference, this term is negligible compared with the second term. In particular, it can be shown that E{. ˆβ .α/ ijMi − ˆβ .α/ ijMi.−i//2 } α−3 O.n−2 /. We omit the technical details. A.2. Proof of theorem 2 To simplify the proof of theorem 2 we assume that the mean is known to be 0 and no centring is performed. The difference due to the estimation of the mean is of negligible order in comparison with other terms. Similarly to the proof of theorem 1, we split the prediction error into the estimation error and regularization error as follows: ˆX .α/ iMi − ˜XiMi ˆX .α/ iMi − ˜X .α/ iMi + ˜X .α/ iMi − ˜XiMi : For the regularization error we compute 800 D. Kraus E. ˜X .α/ iMi − ˜XiMi 2 /= . ˜A .α/ i − ˜Ai/R 1=2 OiOi 2 2 = αRMiOi R−1 OiOi R.α/−1 OiOi R 1=2 OiOi 2 2 α RMiOi R−1 OiOi 2 2 α1=2 R.α/−1 OiOi R 1=2 OiOi 2 ∞ =α ˜Ai 2 2 sup k∈N α1=2 λ 1=2 OiOik λOiOik +α 2 O.α/: We turn to the estimation error. Similarly to the proof of theorem 1 we avoid the dependence between ˆA .α/ i and XiOi in ˆX .α/ iMi = ˆA .α/ i XiOi by considering ˆX .α/ iMi.−i/ = ˆA .α/ i.−i/XiOi , where the estimator of the covariance operator in the prediction operator is replaced by its analogue based on all curves except the ith. The difference is negligible in comparison with the remaining terms; for an analogous discussion see the proof of theorem 1. The modified estimation error equals E. ˆX .α/ iMi.−i/ − ˜X .α/ iMi 2 /=E{ . ˆRMiOi.−i/ ˆR .α/−1 OiOi.−i/ −RMiOi R.α/−1 OiOi /R 1=2 OiOi 2 2} E{ . ˆRMiOi.−i/ −RMiOi / ˆR .α/−1 OiOi.−i/R 1=2 OiOi 2 + RMiOi . ˆR .α/−1 OiOi.−i/ −R.α/−1 OiOi /R 1=2 OiOi 2}2 : The proof is complete on computing E{ . ˆRMiOi.−i/ −RMiOi / ˆR .α/−1 OiOi.−i/R 1=2 OiOi 2 2} E. ˆRMiOi.−i/ −RMiOi 2 2 ˆR .α/−1 OiOi.−i/ 2 ∞ R 1=2 OiOi 2 ∞/ E. ˆRMiOi.−i/ −RMiOi 2 2α−2 λOiOi1/ =α−2 O.n−1 /, E{ RMiOi . ˆR .α/−1 OiOi.−i/ −R.α/−1 OiOi /R 1=2 OiOi 2 2} E. RMiOi 2 ∞ ˆR .α/−1 OiOi.−i/ 2 ∞ × ˆROiOi.−i/ −ROiOi 2 2 R.α/−1 OiOi R 1=2 OiOi 2 ∞/ RMiOi 2 ∞α−2 E. ˆROiOi.−i/ −ROiOi 2 2α−1 / =α−3 O.n−1 /: References Antoniadis, A. and Sapatinas, T. (2003) Wavelet methods for continuous-time prediction using Hilbert-valued autoregressive processes. J. Multiv. Anal., 87, 133–158. Aston, J. A. D. and Kirch, C. (2012) Detecting and estimating changes in dependent functional data. J. Multiv. Anal., 109, 204–220. Benko, M., H¨ardle, W. and Kneip, A. (2009) Common functional principal components. Ann. Statist., 37, 1–34. Bosq, D. (2000) Linear Processes in Function Spaces. New York: Springer. Bugni, F. A. (2012) Specification test for missing functional data. Econmetr. Theor., 28, 959–1002. Cai, T. T. and Hall, P. (2006) Prediction in functional linear regression. Ann. 
Statist., 34, 2159–2179. Cardot, H., Ferraty, F. and Sarda, P. (1999) Functional linear model. Statist. Probab. Lett., 45, 11–22. Cardot, H., Mas, A. and Sarda, P. (2007) CLT in functional linear regression models. Probab. Theor. Reltd Flds, 138, 325–361. Dauxois, J., Pousse, A. and Romain, Y. (1982) Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. J. Multiv. Anal., 12, 136–154. Delaigle, A. and Hall, P. (2013) Classification using censored functional data. J. Am. Statist. Ass., 108, 1269–1283. Didericksen, D., Kokoszka, P. and Zhang, X. (2012) Empirical properties of forecasts with the functional autoregressive model. Computnl Statist., 27, 285–298. Ferraty, F. and Romain, Y. (eds) (2011) The Oxford Handbook of Functional Data Analysis. Oxford: Oxford University Press. Ferraty, F. and Vieu, P. (2006) Nonparametric Functional Data Analysis. New York: Springer. Goldberg, Y., Ritov, Y. and Mandelbaum, A. (2014) Predicting the continuation of a function with applications to call center data. J. Statist. Planng Inf., 147, 53–65. Partially Observed Functional Data 801 Groetsch, C. W. (1993) Inverse Problems in the Mathematical Sciences. Braunschweig: Vieweg. Hall, P. and Horowitz, J. L. (2007) Methodology and convergence rates for functional linear regression. Ann. Statist., 35, 70–91. Hall, P. and Hosseini-Nasab, M. (2006) On properties of functional principal components analysis. J. R. Statist. Soc. B, 68, 109–126. Hall, P., M¨uller, H.-G. and Wang, J.-L. (2006) Properties of principal component methods for functional and longitudinal data analysis. Ann. Statist., 34, 1493–1517. He, G., M¨uller, H.-G. and Wang, J.-L. (2003) Functional canonical analysis for square integrable stochastic processes. J. Multiv. Anal., 85, 54–77. He, G., M¨uller, H.-G., Wang, J.-L. and Yang, W. (2010) Functional linear regression via canonical analysis. Bernoulli, 16, 705–729. Horv´ath, L., Huˇskov´a, M. and Kokoszka, P. (2010) Testing the stability of the functional autoregressive process. J. Multiv. Anal., 101, 352–367. Horv´ath, L. and Kokoszka, P. (2012) Inference for Functional Data with Applications. New York: Springer. Horv´ath, L., Kokoszka, P. and Reeder, R. (2013) Estimation of the mean of functional time series and a twosample problem. J. R. Statist. Soc. B, 75, 103–122. James, G. M. and Hastie, T. J. (2001) Functional linear discriminant analysis for irregularly sampled curves. J. R. Statist. Soc. B, 63, 533–550. James, G. M., Hastie, T. J. and Sugar, C. A. (2000) Principal component models for sparse functional data. Biometrika, 87, 587–602. Jaruˇskov´a, D. (2013) Testing for a change in covariance operator. J. Statist. Planng Inf., 143, 1500–1511. Jolliffe, I. T. (2002) Principal Component Analysis. New York: Springer. Kallenberg, O. (2002) Foundations of Modern Probability. New York: Springer. Kargin, V. and Onatski, A. (2008) Curve forecasting by functional autoregression. J. Multiv. Anal., 99, 2508–2526. Kraus, D. and Panaretos, V. M. (2012) Dispersion operators and resistant second-order functional data analysis. Biometrika, 99, 813–832. Krzanowski, W. J. (2000) Principles of Multivariate Analysis. Oxford: Oxford University Press. Liebl, D. (2013) Modeling and forecasting electricity spot prices: a functional data perspective. Ann. Appl. Statist, 7, 1562–1592. Liu, B. and M¨uller, H.-G. (2009) Estimating derivatives for samples of sparsely observed functions, with application to online auction dynamics. J. Am. Statist. 
Ass., 104, 704–717. Mas, A. (2007) Testing for the mean of random curves: a penalization approach. Statist. Inf. Stoch. Processes, 10, 147–163. M¨uller, H.-G. and Stadtm¨uller, U. (2005) Generalized functional linear models. Ann. Statist., 33, 774–805. Panaretos, V. M., Kraus, D. and Maddocks, J. H. (2010) Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. J. Am. Statist. Ass., 105, 670–682. Pruijm, M., Ponte, B., Ackermann, D., Vuistiner, P., Paccaud, F., Guessous, I., Ehret, G., Eisenberger, U., Mohaupt, M., Burnier, M., Martin, P.-Y. and Bochud, M. (2013) Heritability, determinants and reference values of renal length: a family-based population study. Eur. Radiol., 23, 2899–2905. Ramsay, J. O., Hooker, G. and Graves, S. (2009) Functional Data Analysis with R and MATLAB. New York: Springer. Ramsay, J. O. and Silverman, B. W. (2002) Applied Functional Data Analysis. New York: Springer. Ramsay, J. O. and Silverman, B. W. (2005) Functional Data Analysis. New York: Springer. Sangalli, L. M., Secchi, P., Vantini, S. and Veneziani, A. (2009) A case study in exploratory functional data analysis: geometrical features of the internal carotid artery. J. Am. Statist. Ass., 104, 37–48. Yao, F., M¨uller, H.-G. and Wang, J.-L. (2005a) Functional data analysis for sparse longitudinal data. J. Am. Statist. Ass., 100, 577–590. Yao, F., M¨uller, H.-G. and Wang, J.-L. (2005b) Functional linear regression analysis for longitudinal data. Ann. Statist., 33, 2873–2903. Supporting information Additional ‘supporting information’ may be found in the on-line version of this article: ‘Supplementary document: Components and completion of partially observed functional data’. Supplementary document: Components and completion of partially observed functional data David Kraus Institute of Social and Preventive Medicine, University Hospital Lausanne, Switzerland Summary. This supplementary document describes computational details of the proposed methods and provides proofs of Propositions 1, 2, 3 and 4. 1. Computation 1.1. Preliminary steps In most applications, functional data are observed at discrete time points and are possibly subject to measurement error, so it is necessary to preprocess the raw data using smoothing techniques to obtain functions or their derivatives. In the context of partially observed functional data, the measurement time points are located only in observation periods Oi, while there are no measurements in missing periods Mi. We assume that the measurement points are dense in the observation periods, so that it is possible to apply smoothing techniques to obtain the functional values of the ith curve from the measured values of this curve. We use spline smoothing with a roughness penalty, as described in Ramsay and Silverman (2005, Chapter 5), but other methods like kernel smoothing can be used as well. In our experience, a simple approach works well: we apply the smoothing procedure to all values measured for the ith curve but use the computed smooth curve only for t ∈ Oi (ignoring it on Mi where measurements are not available to make it reliable). In practice, the observation and missing periods are typically not given (because they are not designed) and one needs to define them. For instance, one can define Mi to consist of the periods before the first and after the last measurement time and of all gaps between two consecutive measurement times that are larger than a certain threshold g. 
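A possible implementation of this rule is sketched below; representing the observation period O_i as a union of closed intervals, and the names used, are conventions of this illustration only.

```r
## Determine the observation period of one curve from its measurement times,
## treating gaps longer than g as part of the missing period M_i.
observed_intervals <- function(times, g) {
  times <- sort(times)
  gaps  <- diff(times)
  cut   <- which(gaps > g)                    # gaps over which we do not smooth
  starts <- times[c(1, cut + 1)]
  ends   <- times[c(cut, length(times))]
  # O_i is the union of these intervals; M_i is its complement in [0, 1],
  # including the periods before the first and after the last measurement.
  data.frame(start = starts, end = ends)
}

## Example: measurements with a gap around 0.5 and threshold g = 0.1
## observed_intervals(c(seq(0, 0.4, by = 0.05), seq(0.7, 1, by = 0.05)), g = 0.1)
```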
The value of g is the largest length of intervals without measurements over which we are willing to smooth. The choice of g depends on the particular setting; in general, if, for example, one considers K equidistant points in [0, 1] (e.g., K = 10) as the minimum reliable design for smoothing on the whole domain [0, 1], then g = 1/K seems reasonable. Sometimes, registration of functional data is needed. Shift registration (Ramsay and Silverman, 2005, Section 7.2) is easy to implement for incomplete functions: in the registration criterion the sample mean of partially observed functions is computed by the method described in the next subsection and the distance of each shifted curve from the sample mean is computed by numerical integration over the observed period of the curve; the criterion is minimised by the Procrustes method as usual. Methods based on warping can be modified similarly but further investigation of their performance is needed. 1.2. Principal component analysis, functional reconstruction For practical computation we must use finite dimensional representations of functions and operators. Two traditional approaches exist: we can use either basis expansions or evaluation on a grid 2 David Kraus of points. It is difficult to use the basis approach in our situation because incompletely observed functions are available on different subsets of the time domain. The grid approach is more suited for this type of data since it works directly with time arguments. Let tk = (k−0.5)/d, k = 1, . . . , d be a fine grid of equidistant points on which all functions and kernels of integral operators will be evaluated. Denote by xi the d-dimensional vector of values of Xi at points tk; this vector contains missing values on components corresponding to tk ∈ Mi while for tk ∈ Oi, its values are obtained by evaluation of the spline representation of Xi. Denote by X the (n × d)-dimensional data matrix with xi, i = 1, . . . , n in rows. The vector m of values of the mean function µ on the grid is estimated by ˆm equal to the vector of column means of X computed from available (not missing) data in each column. The covariance kernel ρ of the operator R evaluated on the grid corresponds to the (d × d)-matrix R with entries Rkl = ρ(tk, tl) and is estimated by the sample covariance matrix ˆR with entry ˆRkl computed from the data matrix X using all complete pairs of observations in columns k, l. To estimate the eigenvalues and eigenfunctions, one performs eigen-decomposition of the matrix ˆR. Denote ∆ = 1/d, the distance between the points of the grid. If the eigenvalues and eigenvectors of ˆR are ˆκj and ˆuj, j = 1, . . . , d, then the eigenvalues of the operator ˆR are ˆλj = ˆκj∆ and the corresponding eigenfunctions ˆϕj evaluated on the grid are ˆfj = ˆuj∆−1/2 . The observed part ˆβijOi = XiOi − ˆµOi , ˆϕjOi of the jth principal score of the ith curve is computed by numerical quadrature as ˆβijOi = xiOi − ˆmOi ,ˆfjOi ∆, where the latter inner product is the usual inner product of vectors and the vectors with subscript Oi are subvectors of the original vectors consisting of elements with indices k such that tk ∈ Oi. Within the grid representation, the evaluation of an integral operator B in the sense of numerical integration corresponds to matrix multiplication: for a function h, Bh is computed as Bh∆, where the vector h and the matrix B are the values of h and of the kernel of B on the grid. 
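A compact R sketch of the grid computations just described is as follows. It assumes that the data matrix X is n × d with NA on the missing periods; the pairwise-complete covariance used by cov is one possible version of the estimator of Section 2 and is an assumption of this illustration.

```r
## Mean, covariance and eigen-decomposition from incompletely observed curves
## evaluated on an equidistant grid of d points on [0, 1].
fpca_incomplete <- function(X) {
  d     <- ncol(X)
  delta <- 1 / d                                      # grid spacing
  m     <- colMeans(X, na.rm = TRUE)                  # mean from available data per column
  R     <- cov(X, use = "pairwise.complete.obs")      # covariance from complete pairs
  eig   <- eigen(R, symmetric = TRUE)
  lambda <- eig$values * delta                        # eigenvalues of the operator R-hat
  phi    <- eig$vectors / sqrt(delta)                 # eigenfunctions evaluated on the grid
  list(mean = m, cov = R, values = lambda, functions = phi, delta = delta)
}
```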
From a purely computational point of view, even linear operators that have no integral representation may be represented by matrices. In particular, the identity operator I used in ridge regularisation is represented by the matrix I equal to the identity matrix divided by ∆; indeed, its value at h is Ih∆ = h, thus it maps the argument on itself. The regularised operator ˆR (α) OiOi is represented by the matrix ˆR (α) OiOi = ˆROiOi +αIOi , where the subscript Oi denotes the submatrix corresponding to grid points in Oi. Analogously, the operators ˆRMiMi , ˆRMiOi etc. are given by the corresponding submatrices of ˆR. Then the matrix representation of the prediction operator ˆA (α) i is computed as ˆA (α) i = ˆROiMi ˆR (α)−1 OiOi ∆−1 . The regularised prediction of the missing part of the principal score and of the missing part of the trajectory can be computed as ˆβ (α) ij = ˆA (α) i (xiOi − ˆmOi )∆,ˆfjMi ∆, ˆx (α) iMi = ˆA (α) i (xiOi − ˆmOi )∆ + ˆmMi . The covariance operator ˆVi for the missing trajectory is obtained as ˆVi = ˆRMiMi − ˆA (α) i ˆROiOi ˆA (α)T i ∆2 and the variance for the score is ˆv2 ij = ˆfjMi , ˆVi ˆfjMi ∆2 . The effective degrees of freedom can be computed directly using the series in (9) truncated at d terms, with the eigenvalues ˆλOiOik of ˆROiOi obtained from the eigenvalues of the matrix ˆROiOi like in the case of those of ˆR discussed above. Alternatively, one can use the matrix trace formula trace( ˆR (α)−1 OiOi ˆROiOi ∆−1 )∆. The computation of the residual sum of squares for scores Supplementary document: Partially observed functional data 3 is straightforward; in the case of trajectories, the squared norms of functions are computed as the squared norms of vectors, multiplied by ∆. The generalised cross-validation score can be minimised numerically by a Newton-type iterative procedure. In particular, we use the method “L-BFGS-B” available in the function optim in the R package (R Core Team, 2013). For the reliability of the optimisation procedure, we found it useful to scale the input parameters: the minimisation is run with (xi − m)/s in place of xi (and, consequently, with ˆR/s2 in place of ˆR, ˆλOiOij/s2 in place of ˆλOiOij etc.); once the optimal value of α is found, it is multiplied by s2 to return to the original scale and perform other computations with original data. The value s2 = ˆλOiOi1 works well. The evaluation of the generalised crossvalidation score can be unstable for very small values of α. Therefore, we run the minimisation routine with a lower limit for α, namely with α0 = max(ε1/2 , α∗), where ε is the value of machine epsilon and α∗ is such that the effective degrees of freedom equal n/4 (which is a reasonable upper bound for the number of free parameters). We initialise the iterative procedure with α equal to max(¯λOiOi , α0) where ¯λOiOi is the average of the eigenvalues ˆλOiOij. 2. Proofs 2.1. Proof of Proposition 1 We use the notation Zi = Xi − µ. For part (a), denote ¯µ(t) = J(t)µ(t) and write E ˆµ − µ 2 ≤ E( ˆµ − ¯µ + ¯µ − µ )2 = E ˆµ − ¯µ 2 + 2 E( ˆµ − ¯µ ¯µ − µ ) + E ¯µ − µ 2 . (1) The first term on the right-hand side of (1) equals E J n i=1 Oi n i=1 OiZi 2 = n−2 1 0 n j=1 n k=1 E n2 J(t) ( n i=1 Oi(t))2 Oj(t)Zj(t)Ok(t)Zk(t) dt = n−2 1 0 n j=1 E n2 J(t)Oj(t) ( n i=1 Oi(t))2 E Zj(t)2 dt, where the last equality follows from the independence of (O1, . . . , On) and (Z1, . . . , Zn), and from the independence of Zj and Zk for j = k. 
Rewrite the first expectation in the integrand as E n2 J(t)Oj(t) ( n i=1 Oi(t))2 1[n−1 n i=1 Oi(t)>δ1] + E n2 J(t)Oj(t) ( n i=1 Oi(t))2 1[n−1 n i=1 Oi(t)≤δ1] . For all t ∈ [0, 1], the first summand is bounded from above by δ−2 1 while the second summand is dominated by n2 supt∈[0,1] P(n−1 n i=1 Oi(t) ≤ δ1). Hence we see that E ˆµ − ¯µ 2 ≤ n−1 δ−2 1 + n2 sup t∈[0,1] P n−1 n i=1 Oi(t) ≤ δ1 E Z1 2 = O(n−1 ). For the last term in (1), we obtain 1 0 E(J(t) − 1)µ(t)2 dt = 1 0 P n i=1 Oi(t) = 0 µ(t)2 dt 4 David Kraus ≤ sup t∈[0,1] P n−1 n i=1 Oi(t) ≤ δ1 µ 2 = O(n−2 ). The second term on the right-hand side of (1) is dominated by 2(E ˆµ− ¯µ 2 )1/2 (E ¯µ−µ 2 )1/2 ≤ O(n−1 ). Putting these results together completes the proof of part (a). The proof of part (b) is similar. Rewrite ˆR − R = ( ˆR − ˇR) + ( ˇR − ¯R) + ( ¯R − R), (2) where ˇR and ¯R are integral operators with kernels ˇρ(s, t) = I(s, t) n i=1 Ui(s, t) n i=1 Ui(s, t)Zi(s)Zi(t), and ¯ρ(s, t) = I(s, t)r(s, t). The first term on the right-hand side of (2) reflects the effect of estimation of the mean. By direct computation, we see that E ˆR − ˇR 2 2 = E [0,1]2 I(s, t){ˆµst(s) − µ(s)}2 {ˆµst(t) − µ(t)}2 dsdt = E [0,1]2 I(s, t) ( n i=1 Ui(s, t))4 n i=1 Ui(s, t)Zi(s) 2 n i=1 Ui(s, t)Zi(t) 2 dsdt. Developing the sums in the integrand and using the independence of the functions and observation indicators and the Cauchy–Schwarz inequality, we can show that the above quantity is dominated by n−2 [0,1]2 E n2 I(s, t) ( n i=1 Ui(s, t))2 {(E Z1(s)4 E Z1(t)4 )1/2 + ρ(s, t)2 }dsdt ≤ O(n−2 ), where the last inequality is due to the fact that the first expectation in the integrand is bounded by δ−2 2 +n2 sup(s,t)∈[0,1]2 P(n−1 n i=1 Ui(s, t) ≤ δ2), which can be shown by manipulations similar to those in part (a). Next, analogously to part (a) we obtain for the second and third term on the right-hand side of (2) that E ˇR − ¯R 2 2 ≤ n−1 δ−2 2 + n2 sup (s,t)∈[0,1]2 P n−1 n i=1 Ui(s, t) ≤ δ2 E Z1 ⊗ Z1 − R 2 2 = O(n−1 ) (here ⊗ denotes the tensor product) and E ¯R − R 2 2 ≤ O(n−2 ). Combining these bounds we obtain the assertion of part (b). 2.2. Proof of Proposition 2 Lemma 4.2 of Bosq (2000) and the inequality between the operator norm and Hilbert–Schmidt norm yield that |ˆλj − λj| ≤ ˆR − R ∞ ≤ ˆR − R 2 for all j. The first result then follows from part (b) of Proposition 1. For the second part, Lemma 4.3 of Bosq (2000) gives the inequality Supplementary document: Partially observed functional data 5 ˆϕj − ˆsjϕj ≤ aj ˆR − R ∞, where aj is a constant depending on the eigenvalue spacings. Note that this lemma is formulated in Bosq (2000) for fully observed functions but an inspection of the proof shows that the inequality holds for any two compact linear operators in place of ˆR, R. This inequality, the dominance of the Hilbert–Schmidt norm over the operator norm and part (b) of Proposition 1 complete the proof. 2.3. Proof of Proposition 3 Rewrite ˆβ (αn) ijMi − βijMi = (ˆβ (αn) ijMi − ˜βijMi ) + (˜βijMi − βijMi ) and use Theorem 1 to obtain the first part of the proposition. Compute v2 ij = var(˜βijMi − βijMi ) = ϕjMi , RMiMi ϕjMi − ϕjMi , RMiOi R−1 OiOi ROiMi ϕjMi . The convergence in probability of ˆϕjMi , ˆRMiMi ˆϕjMi to ϕjMi , RMiMi ϕjMi is a direct consequence of Propositions 1 and 2. The last term in the expression for v2 ij and the corresponding term in the estimator ˆv2 ij equal ˜aij, ROiOi ˜aij , ˆa (αn) ij , ˆROiOi ˆa (αn) ij , respectively. 
In their difference ˆa (αn) ij , ( ˆROiOi − ROiOi )ˆa (αn) ij + ( ˆa (αn) ij , ROiOi ˆa (αn) ij − ˜aij, ROiOi ˜aij ), the convergence of the second term to zero was shown in the proof of Theorem 1. For the first term we compute | ˆa (αn) ij , ( ˆROiOi − ROiOi )ˆa (αn) ij | ≤ ˆROiOi − ROiOi ∞ ˆa (αn) ij 2 ≤ OP (n−1/2 )α−2 n ˆROiMi 2 ∞ → 0. This completes the proof of the consistency of ˆv2 ij. The remaining assertions are obvious. 2.4. Proof of Proposition 4 We can rewrite ˆX (αn) iMi − XiMi = ( ˆX (αn) iMi − ˜XiMi ) + ( ˜XiMi − XiMi ). Due to Theorem 2, the L2 -norm of the first term on the right-hand side converges to 0 in probability. The second term is the limiting stochastic process. The consistency of the covariance estimator can be proven like in the proof of Proposition 3. The assertion for the Gaussian case follows immediately from the fact that the limiting process is a linear function of Xi. References Bosq, D. (2000). Linear Processes in Function Spaces. Springer, New York. R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer, New York. D. Classification of functional fragments by regularized linear classifiers with domain selection By David Kraus and Marco Stefanucci Biometrika, 106(1):161–180, 2019 DOI: 10.1093/biomet/asy060 112 Biometrika (2019), 106, 1, pp. 161–180 doi: 10.1093/biomet/asy060 Printed in Great Britain Advance Access publication 17 December 2018 Classification of functional fragments by regularized linear classifiers with domain selection By DAVID KRAUS Department of Mathematics and Statistics, Masaryk University, Kotláˇrská 2, 61137 Brno, Czech Republic david.kraus@mail.muni.cz AND MARCO STEFANUCCI Department of Statistical Sciences, Sapienza University of Rome, Piazzale Aldo Moro 5, 00185 Roma, Italy marco.stefanucci@uniroma1.it Summary We consider classification of functional data into two groups by linear classifiers based on one-dimensional projections of functions. We reformulate the task of finding the best classifier as an optimization problem and solve it by the conjugate gradient method with early stopping, the principal component method, and the ridge method. We study the empirical version with finite training samples consisting of incomplete functions observed on different subsets of the domain and show that the optimal, possibly zero, misclassification probability can be achieved in the limit along a possibly nonconvergent empirical regularization path. We propose a domain extension and selection procedure that finds the best domain beyond the common observation domain of all curves. In a simulation study we compare the different regularization methods and investigate the performance of domain selection. Our method is illustrated on a medical dataset, where we observe a substantial improvement of classification accuracy due to domain extension. Some key words: Classification; Conjugate gradient; Domain selection; Functional data; Partial observation; Regularization; Ridge method. 1. Introduction We consider classification of a functional observation into one of two groups. Classification of functional data is a rich, longstanding topic and is comprehensively surveyed in Baíllo et al. (2011b). 
Delaigle & Hall (2012a) showed that depending on the relative geometric positions of the difference of the group means, representing the signal, and the covariance operator, summarizing the structure of the noise, certain classifiers can have zero misclassification probability. This remarkable phenomenon, called perfect classification, is a special property of the infinite-dimensional setting and cannot occur in the multivariate context, except in degenerate cases. Delaigle & Hall (2012a) showed that a particularly simple class of linear classifiers, based on a carefully chosen one-dimensional projection of the function to be classified, can achieve this optimal error rate either exactly or in the limit along a sequence of approximations. Berrendero et al. (2018) further elucidated the perfect classification phenomenon from the point c 2018 Biometrika Trust Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 162 D. Kraus AND M. Stefanucci of view of the Feldman–Hájek dichotomy between mutual singularity and absolute continuity of two Gaussian measures on abstract spaces with respect to each other. Motivated by these findings, we reformulate the problem of determining the best classifier as a quadratic optimization problem on a function space or, equivalently, a linear inverse problem. These problems are ill-posed; however, unlike with most inverse problems, this is not a complication but rather an advantage in the sense that the more ill-posed the problem is, the better the optimal misclassification probability. We use regularization techniques, such as the method of conjugate gradients with early stopping and ridge regularization, to solve the optimization problem, obtaining a class of regularized linear classifiers. The optimal misclassification rate is the limit along the regularization path of solutions which themselves may not converge. We study the empirical version of the problem, where the objective function in the constrained minimization must be estimated from finite training data, and make two contributions. First, we show that it is possible to construct an empirical regularization path towards the possibly nonexistent unconstrained solution such that the classification error converges to its best value, possibly zero. We do this for conjugate gradient, principal component and ridge classification in a truly infinite-dimensional manner, in the sense that the convergence takes place along a path with decreasing regularization and holds without restrictions on the mean difference between classes. Second, all our methods and theory are developed in the setting of partially observed functional data, where trajectories are observed only on subsets of the domain. This type of incomplete data, also called functional fragments, is increasingly common in applications; see, for example, Bugni (2012), Delaigle & Hall (2013), Liebl (2013), Goldberg et al. (2014), Kraus (2015), Delaigle & Hall (2016) and Gromenko et al. (2017). The principal difficulty for inference with fragments is that temporal averaging is precluded by the incompleteness of the observed functions. Our formulation as an optimization problem enables us to overcome this issue under certain assumptions, because only averaging across individuals in the training data is needed, and not individual curves. 
Since the observation domains may vary in the training sample and the new curve to be classified may be observed on a different subset, it is natural to ask which domain should be used. We propose a domain selection strategy that looks for the best classifier with domain ranging from a minimum common domain to the entire domain of the function to be classified. For various methods of selecting the best observation points, see Ferraty et al. (2010), Delaigle et al. (2012), Pini & Vantini (2016), Berrendero et al. (2018) and Stefanucci et al. (2018). Our simulation study confirms that domain selection can considerably reduce the misclassification rate. Further simulations compare the performances of the three types of regularization. Among other findings, this study shows that the principal component and conjugate gradient classifiers often achieve comparable error rates but that the latter usually needs a lower dimension of the regularization subspace, in agreement with a theoretical result we provide. Application to a dataset on the geometric features of the internal carotid artery in patients with and without aneurysm demonstrates the utility of our proposed approach. These data consist of trajectories observed on intervals of different lengths. Previous analyses of the data used the common domain of all curves in classification. With our results we can include information beyond this minimum domain, which leads to a substantial drop in the error rate of discrimination between risk groups. General references on functional data analysis include Ramsay & Silverman (2005) and Horváth & Kokoszka (2012). Further relevant references are Cuesta-Albertos et al. (2007) for other methods based on one-dimensional projections, Berrendero et al. (2016) for variable selection in classification, Bongiorno & Goia (2016) and Dai et al. (2017) for classification beyond the Gaussian setting, and Cuevas (2014) for an overview. Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 163 2. Regularized linear classification 2.1. Projection classifiers We regard functional observations as random elements of the separable Hilbert space L2(I) of square-integrable functions on a compact domain I equipped with inner product f , g = I f (t)g(t) dt and norm f = f , f 1/2. In most applications I is an interval and the observations are curves, but our results can be extended to other objects, such as surfaces or images. We consider classification of a Gaussian random function, X , into one of two groups of Gaussian random functions: group 0 has mean μ0; group 1 has mean μ1. Both groups have covariance operator R defined as the integral operator (Rf )(·) = I ρ(· , t)f (t) dt with kernel ρ(s, t) = cov{X (s), X (t)}. In this section we assume that μ0, μ1 and R are known, which corresponds to the asymptotic situation with an infinite training sample. To simplify the presentation we assume throughout the paper that the new observation to be classified may come from either of the two classes with equal prior probability. The general case is treated in the Supplementary Material. Like Delaigle & Hall (2012a) we consider the class of centroid classifiers that are based on one-dimensional projections of the form X , ψ , where ψ is a function in L2(I). If X belongs to group j (j = 0, 1), the distribution of X , ψ is normal with mean μj, ψ and variance ψ, Rψ . Denote the corresponding Gaussian densities by fψ,j. 
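As a small numerical check of this projection step, the following R sketch simulates group-1 curves on a grid and verifies that the Riemann-sum approximation of the projection has mean ⟨μ1, ψ⟩ and variance ⟨ψ, Rψ⟩. The grid tt, the squared-exponential kernel (borrowed from the simulations of § 5.1), the mean mu1 and the direction psi are all illustrative assumptions, not quantities from the original analysis.

set.seed(1)
tt    <- seq(0, 1, length.out = 100)      # observation grid on I = [0, 1]
Delta <- tt[2] - tt[1]                    # grid spacing
R     <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))  # illustrative covariance kernel
mu1   <- 0.6 * tt                         # hypothetical group-1 mean; group 0 has mean 0
psi   <- sin(2 * pi * tt)                 # an arbitrary projection direction

## Simulate group-1 curves X ~ N(mu1, R) on the grid and project them
L <- chol(R + 1e-10 * diag(length(tt)))   # small jitter for numerical stability
X <- sweep(matrix(rnorm(5000 * length(tt)), 5000) %*% L, 2, mu1, "+")
proj <- drop(X %*% psi) * Delta           # Riemann sums approximating <X, psi>

## Empirical moments of the projections versus <mu1, psi> and <psi, R psi>
c(mean(proj), sum(mu1 * psi) * Delta)
c(var(proj),  drop(psi %*% R %*% psi) * Delta^2)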
The optimal classifier based on X , ψ assigns X to the class Cψ(X ) given by Cψ(X ) = 1{fψ,1( X ,ψ )/fψ,0( X ,ψ )>1} = 1{ X −μ0,ψ 2− X −μ1,ψ 2>0} = 1{Tψ (X )>0}, where Tψ(X ) = X − ¯μ, ψ μ, ψ with ¯μ = (μ0 + μ1)/2 and μ = μ1 − μ0. The misclassification probability of this classifier is D(ψ) = P0{Cψ(X ) = 1}/2 + P1{Cψ(X ) = 0}/2 = P0( X − ¯μ, ψ μ, ψ > 0) = P0( X − μ0, ψ > | μ, ψ |/2) = 1 − | μ, ψ | 2 ψ, Rψ 1/2 , (1) where Pj is the distribution of curves in group j and is the standard normal cumulative distribution function. To find the best function ψ, one would ideally like to maximize |Z(ψ)|, where Z(ψ) = μ, ψ ψ, Rψ 1/2 . Similarly to Delaigle & Hall (2012a) and Berrendero et al. (2018), we see that if R−1/2μ < ∞, then by the Cauchy–Schwarz inequality, | μ, ψ | ψ, Rψ 1/2 = | R−1/2μ, R1/2ψ | ψ, Rψ 1/2 R−1/2μ R1/2ψ ψ, Rψ 1/2 = R−1/2 μ . (2) If, moreover, R−1μ < ∞, then the equality is achieved for ψ = R−1μ. For this choice of ψ, or anymultipleofit,theprobabilityofmisclassificationis1− ( R−1/2μ /2),whichispositivedue Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 164 D. Kraus AND M. Stefanucci to the finiteness of R−1/2μ , which can be seen as the signal-to-noise ratio. If R−1/2μ < ∞, then regardless of whether R−1μ < ∞ or not, two Gaussian measures with mean difference μ and covariances R are mutually absolutely continuous and 1− ( R−1/2μ /2) is the Bayes error for distinguishing them, i.e., the lowest possible misclassification probability for this problem among all possible classifiers (Berrendero et al., 2018). If R−1/2μ < ∞ but R−1μ = ∞, then the Bayes risk cannot be achieved by a projection classifier based on a bounded linear functional of the form X , ψ for some ψ ∈ L2(I). One can, however, use the theory of reproducing kernel Hilbert spaces to define a linear classifier that achieves the Bayes risk. We do not pursue this line of development here because, as will be seen in § 2.2, approximations in the form of projections can asymptotically achieve the Bayes risk. The maximization of |Z(ψ)| can be solved as the task of maximizing μ, ψ subject to ψ, Rψ = 1. Using Lagrange multipliers μ, ψ + λ(1 − ψ, Rψ ) and taking the Fréchet derivative with respect to ψ, one obtains the equation 2λRψ = μ. Solutions for all λ > 0, if they exist, i.e., if R−1μ < ∞, yield the same optimal misclassification probability. Without loss of generality we take λ = 1/2. Thus, minimizing the error rate translates into the unconstrained quadratic optimization problem to maximize μ, ψ − ψ, Rψ /2, or minimize ψ, Rψ /2 − μ, ψ , (3) i.e., into the linear problem Rψ = μ. 2.2. Regularization Ifψ = R−1μdoesnotexistinL2(I),i.e., R−1μ = ∞,thereisnomaximizerof|Z(ψ)|.One can instead consider an approximating, regularized problem that can be solved. Regularization is typically used to solve, in a stable way, ill-posed inverse problems for which a solution exists. In such contexts, the path of regularized solutions converges to the solution to the problem of interest. Here it may be that no solution exists, but paths of regularized solutions towards the possibly nonexistent solution still turn out to be useful, since the misclassification probability converges to the optimal value along these paths. If a solution exists, one can approximate it by an iterative numerical method. This approach can also be used when no solution exists. The idea is to construct a sequence of iterations of an appropriate numerical optimization method. 
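In a discretized implementation these quantities take a simple form. The following R sketch, under the same illustrative grid, kernel and mean difference as above, evaluates the error (1) for a given direction and solves the grid analogue of Rψ = μ; the small constant added to the diagonal is an assumption made because the discretized operator is severely ill-conditioned, which already anticipates the role of regularization.

tt    <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
R     <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))   # kernel rho(s, t)
mu    <- 0.6 * tt                                               # illustrative mean difference
D <- function(psi) {                                            # misclassification probability (1)
  num <- abs(sum(mu * psi) * Delta)                             # |<mu, psi>|
  den <- sqrt(drop(psi %*% R %*% psi) * Delta^2)                # <psi, R psi>^{1/2}
  1 - pnorm(num / (2 * den))
}
## The operator equation R psi = mu becomes the linear system (R Delta) psi = mu;
## the raw system is numerically singular, so a tiny ridge is added to the diagonal.
psi_inv <- solve(R * Delta + 1e-6 * diag(length(tt)), mu)
D(mu)        # naive direction psi = mu
D(psi_inv)   # typically far below D(mu); cf. the bound in (2)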
The number of steps taken along this divergent sequence towards the nonexistent solution can be seen as a regularization parameter. The conjugate gradient method is particularly suitable for this situation. The first m steps of the conjugate gradient method applied to the linear inverse problem Rψ = μ, or equivalently to the minimization of the quadratic functional ψ, Rψ /2 − μ, ψ , are described in Algorithm 1. This formulation is based on the multivariate version in Phatak & de Hoog (2002, § 5), where one can find further references and details on how applying the conjugate gradient method to the normal equations in linear regression leads to partial least squares regression. The functions νj are conjugate directions in the sense that νj, Rνk = 0 for j |= k, and the functions ζj are called residuals in numerical analysis and are orthogonal, i.e., ζj, ζk = 0 for j |= k. In step j, the algorithm moves from the current approximate solution ˆψCG j along the conjugate direction νj with step length hj that minimizes the quadratic objective. The residual is then updated to ζj+1. The new conjugate direction νj+1 is obtained by projecting the residual ζj+1 onto the orthogonal complement of the span of the previous conjugate directions, where orthogonality is in the sense of the inner product · , R(·) . Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 165 Algorithm 1. Conjugate gradient regularized classification direction. Initialize ψCG 0 = 0, ν0 = ζ0 = μ Repeat for j = 0, . . . , m − 1 hj = νj, ζj / νj, Rνj ψCG j+1 = ψCG j + hjνj ζj+1 = μ − RψCG j+1 (= ζj − hjRνj) gj = − ζj+1, Rνj / νj, Rνj νj+1 = ζj+1 + gjνj Output ψCG m The conjugate gradient approach is an example of dimension reduction regularization. The method solves the minimization problem (3) with ψ restricted to the Krylov subspace Km(R, μ) spanned by μ, Rμ, . . . , Rm−1μ and also by the first m conjugate directions νj or the first m residuals ζj; that is, it seeks to minimize ψ, Rψ /2 − μ, ψ subject to ψ ∈ Km(R, μ). The projection direction that solves this minimization is ψCG m . Another popular choice is to minimize ψ, Rψ /2 − μ, ψ subject to ψ ∈ Em(R), where Em(R) is the subspace spanned by the first m eigenfunctions, ϕ1, . . . , ϕm, of R in the spectral decomposition R = ∞ j=1 λjϕj ⊗ ϕj, with λ1 λ2 · · · > 0 being the eigenvalues. The solution ψPC m = m j=1 λ−1 j μ, ϕj ϕj gives the principal component classifier of Delaigle & Hall (2012a). In general one can minimize ψ, Rψ /2 − μ, ψ subject to ψ ∈ Sm, where Sm is the mdimensional subspace generated by some functions s1, . . . , sm such that the sj (j = 1, 2, . . . ) generate the range of R. Let Pm be the projection operator that projects onto Sm, and let Rm = PmRPm and R− m = PmR−1Pm. Then the solution of the regularized minimization problem is ψm = R− m μ. More explicitly, considering solutions of the form ψm = m j=1 cjsj leads to the m-variate minimization of cT Qc/2 − uT c where the matrix Q is such that Qjk = sj, Rsk and the vector u has components uj = μ, sj , i.e., to the solution with coefficients c = Q−1u. In the case of the Krylov subspace, the iterative conjugate gradient method given in Algorithm 1 is, however, preferred because the matrix Q is ill-conditioned. We can also take another approach to regularization, based on ridge regression. 
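Before turning to the ridge approach, a direct grid transcription of Algorithm 1 and of the principal component direction may make the comparison concrete. The sketch below reuses the illustrative grid, kernel and mean difference of the previous sketches; the function names and the choice m = 1, …, 6 are assumptions for illustration only.

tt <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
R  <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))
mu <- 0.6 * tt
ip  <- function(f, g) sum(f * g) * Delta            # <f, g> as a Riemann sum
Rop <- function(f) drop(R %*% f) * Delta            # (R f)(t) on the grid

cg_direction <- function(m) {                       # m steps of Algorithm 1
  psi <- numeric(length(mu)); nu <- zeta <- mu
  for (j in seq_len(m)) {
    Rnu  <- Rop(nu)
    h    <- ip(nu, zeta) / ip(nu, Rnu)              # step length h_j
    psi  <- psi + h * nu
    zeta <- zeta - h * Rnu                          # residual zeta_{j+1} = mu - R psi_{j+1}
    g    <- -ip(zeta, Rnu) / ip(nu, Rnu)
    nu   <- zeta + g * nu                           # next conjugate direction
  }
  psi
}
pc_direction <- function(m) {                       # psi_m^PC = sum_j lambda_j^{-1} <mu, phi_j> phi_j
  e   <- eigen(R * Delta, symmetric = TRUE)
  phi <- e$vectors[, 1:m, drop = FALSE] / sqrt(Delta)
  drop(phi %*% (drop(crossprod(phi, mu)) * Delta / e$values[1:m]))
}
D <- function(psi)                                  # misclassification probability (1)
  1 - pnorm(abs(ip(mu, psi)) / (2 * sqrt(ip(psi, Rop(psi)))))
## For each m the conjugate gradient error is typically no larger than the principal component error
sapply(1:6, function(m) c(CG = D(cg_direction(m)), PC = D(pc_direction(m))))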
Optimizing the misclassification probability in a ball with radius θ1/2 leads to the task of minimizing ψ, Rψ /2− μ, ψ subject to ψ 2 θ or, equivalently, minimizing ψ, Rψ /2− μ, ψ + α ψ 2/2, where α 0 is a regularization parameter. The solution is ψR α = R−1 α μ, where Rα = R + αI and I denotes the identity operator. Despite its practical performance and amenability to theoretical analysis, the functional ridge classifier does not seem to have been considered before. There is an important difference between the conjugate gradient method and the other approaches.While the principal component and ridge methods regularize the problem without the main goal in mind, the conjugate gradient approach greedily follows the goal of optimal classification. Indeed, the conjugate gradient method as an iterative optimization procedure constructs the regularization path focusing on the minimization of the misclassification probability, whereas the other approaches regularize by modifying the operator to be inverted regardless of the goal. Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 166 D. Kraus AND M. Stefanucci From a computational point of view the conjugate gradient method is simplest because it does not require inversion or eigendecomposition. 2.3. Properties of regularization paths While ψm, the solution regularized by a subspace constraint, in general need not converge as m → ∞ since a solution to the unconstrained minimization problem may not exist, the misclassification probability associated with the linear classifier given by ψm converges along the regularization path. The following and all other results are proved in the Appendix. Proposition 1. The misclassification probability of the regularized linear classifier based on ψm = R− m μ converges to 1 − ( R−1/2μ /2) as m → ∞. This result holds regardless of whether the unconstrained minimization problem (3) has a solution, i.e., regardless of whether R−1μ < ∞. The limiting misclassification probability is positive if R−1/2μ < ∞ or zero if R−1/2μ = ∞. As discussed earlier, the optimal error is achieved exactly by the one-dimensional projection onto ψ = R−1μ, when R−1μ < ∞. Even when R−1μ = ∞, both of the dimension reduction techniques, namely the conjugate gradient and principal component methods, and also ridge regularization as we will soon see, achieve the optimal limiting error rate along a possibly nonconvergent path of one-dimensional projection directions. It is natural to investigate and compare how quickly the misclassification rate approaches the limit for the two main types of subspace regularization. It turns out that the conjugate gradient classifier, being a greedy, goal-oriented procedure, performs as well as or better than the principal component classifier with the same dimension. Proposition 2. Regardless of whether the optimal misclassification probability can be achieved exactly or along a regularization path, i.e., whether R−1μ < ∞ or R−1μ = ∞, and regardless of whether the optimal misclassification probability is zero or positive, i.e., whether R−1/2μ = ∞ or R−1/2μ < ∞, the misclassification probability of the principal component classifier using m components is higher than or equal to the misclassification probability of the m-step conjugate gradient classifier. Phatak & de Hoog (2002, § 6.2) showed in the multivariate setting that ‘PLS fits closer than PCR’. 
In infinite dimensions, in the context of kernel partial least squares, Blanchard & Krämer (2010, Theorem 1) showed that the partial least squares solution is closer to the true solution of the inverse problem than is the principal component solution with the same number of components. Unlike these results, our Proposition 2 does not assume the existence of a solution and instead focuses on the values of the misclassification probability. Although Proposition 2 suggests that the conjugate gradient method will typically use fewer components than the principal component method to achieve the best result, the resulting misclassification probability with the best number of components need not be better. We address this in the simulation study. A similar phenomenon was previously studied in the literature on partial least squares in finite dimensions and in the functional setting by Febrero-Bande et al. (2017). As in the case of subspace regularization, below we obtain the convergence of the error probability of the ridge classifier, whether or not the unconstrained minimization problem (3) has a solution, i.e., regardless of whether R−1μ < ∞. The limiting misclassification probability is positive if R−1/2μ < ∞ or zero if R−1/2μ = ∞. Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 167 Proposition 3. The misclassification probability of the regularized linear classifier based on ψR α = R−1 α μ converges to 1 − ( R−1/2μ /2) as α → 0+. 3. Empirical classifiers for fragmentary functions 3.1. Construction of classifiers with incomplete training samples So far we have assumed that the parameters of each group are known. We now present the empirical version with a finite training dataset, and show that under regularity conditions such classifiers can achieve asymptotically the same optimal error rate as if there were infinite training data. We aim to do this not only in the case of fully observed functions but also in the case of incomplete curves. Incompleteness can occur in the training data, with each curve possibly observed on a different domain, as well as in the new curve that we wish to classify. One strategy would be to consider all curves on the intersection of their observation domains, if it is nonempty. However, such a restriction can be too severe and is unnecessary. We will construct classifiers that use the observed new curve on a set I, which may be its entire observation set or a subset thereof, without requiring that all training curves be completely observed on I. For group j let there be a training sample consisting of nj curves, Xj1, . . . , Xjnj . The training data are assumed to be mutually independent. Curves may be observed incompletely, with values known only on a subset Oji of the domain and with no information about the values on the complement. The observation domains are assumed to be independent of the curves and consist of a finite union of intervals. We let Oji(t) denote the indicator of the curve Xji being observed at time t. Similarly, let Uji(s, t) indicate observation at times s and t, i.e., Uji(s, t) = Oji(s)Oji(t). The mean μj of group j can be estimated by the cross-sectional average ˆμj(t) = 1{Nj(t)>0} Nj(t) nj i=1 Oji(t)Xji(t) (j = 0, 1), where Nj(t) = nj i=1 Oji(t) is the total number of observed curves in group j at time t. The covariance kernel ρ(s, t) can be estimated by the empirical covariance using pairwise complete observations of groupwise centred curves. 
Formally, the estimator is ˆρ(s, t) = M1(s, t) ˆρ1(s, t) + M2(s, t) ˆρ2(s, t) M1(s, t) + M2(s, t) , where Mj(s, t) = nj i=1 Uji(s, t) and ˆρj(s, t) = 1{Mj(s,t)>0} Mj(s, t) nj i=1 Uji(s, t){Xji(s) − ˆμjst(s)}{Xji(t) − ˆμjst(t)} with ˆμjst(s) = 1{Mj(s,t)>0}Mj(s, t)−1 nj i=1 Uji(s, t)Xji(s). If Nj(t) = 0 or Mj(s, t) = 0, the estimators are defined as ˆμj(t) = 0 or ˆρj(s, t) = 0, respectively. This happens with asymptotically vanishing probability under Assumption 1 below. Suppose that the new independent curve to be classified, Xnew, is observed on the domain Onew. Let us fix the target domain I ⊆ Onew on which we aim to apply the classifier to Xnew. The empirical classifier ˆC ˆψ trained on partially observed curves is defined like the theoretical one, with unknown quantities replaced by their estimators. It assigns Xnew restricted to I to the class Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 168 D. Kraus AND M. Stefanucci ˆC ˆψ(Xnew) = 1{ ˆT ˆψ (Xnew)>0}, where ˆT ˆψ(Xnew) = Xnew − ˜μ, ˆψ ˆμ, ˆψ . Here ˜μ = ( ˆμ0 + ˆμ1)/2 and ˆμ = ˆμ1 − ˆμ0, with ˆμj being the estimators defined above restricted to I. The projection direction ˆψ is one of ˆψCG m , ˆψPC m or ˆψR α , constructed respectively by conjugate gradient, principal component or ridge regularization applied to ˆμ and ˆR, where ˆR is the integral operator with kernel ˆρ(s, t) introduced above, restricted to I × I. All methods discussed in the previous section can be formulated in terms of the population parameters, i.e., the mean difference and covariance operator, and not in terms of individual observations in the training set. The population parameters can be consistently estimated by averaging individual observations, whereas temporal averaging of individual curves, for example in inner products, is impossible due the incompleteness of the observed functions. In particular, the conjugate gradient method can be applied to fragmentary training data, whereas the usual algorithms for multivariate or functional partial least squares, such as those in De Jong (1993), Hastie et al. (2009,Algorithm 3.3) and Delaigle & Hall (2012b, § 4.2 andAppendixA.2), involve the computation of certain scores, i.e., inner products, for individual curves. 3.2. Asymptotic behaviour along the empirical regularization path We aim to study the behaviour of classifiers on incomplete training samples of increasing size with decreasing amounts of regularization. Previous asymptotic results in related settings include those of Delaigle & Hall (2013), who established the consistency of empirical principal component classifiers based on partially observed training data. In the setting of complete curves, Berrendero et al. (2018) used dimension reduction regularization by evaluation of curves at a finite set of arguments; they proved consistency of the empirical version but did not study the asymptotics for decreasing amounts of regularization, i.e., they did not consider letting the dimension grow. Baíllo et al. (2011a) studied optimal classifiers for Gaussian measures based on Radon–Nikodym derivatives and investigated the performance of their empirical version in the special class of processes with triangular covariance functions. In contrast, all of our methods, including the ridge approach not considered previously, have been developed for fragmentary training samples and shown to achieve the Bayes error rate for general Gaussian processes along the empirical regularization path, as we now explain. 
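To make the construction concrete, the following R sketch builds these estimators from fragments stored as rows of a matrix with NA outside the observation domain, and then forms the empirical ridge classifier on a target domain I. The data-generating step, the fixed value of α (in practice chosen by crossvalidation, § 3.3) and all object names are illustrative assumptions; the pairwise-complete sample covariance used here differs from the estimator above only in its M versus M − 1 divisor.

set.seed(2)
tt  <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
rho <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))
L   <- chol(rho + 1e-10 * diag(100))
gen_frag <- function(n, mu) {                    # fragments observed on [0, E], E ~ U(0.5, 1)
  X <- sweep(matrix(rnorm(n * 100), n) %*% L, 2, mu, "+")
  for (i in 1:n) X[i, tt > runif(1, 0.5, 1)] <- NA
  X
}
X0 <- gen_frag(50, rep(0, 100))                  # group 0
X1 <- gen_frag(50, 0.3 * dbeta(tt, 5, 5))        # group 1, illustrative mean difference

## Cross-sectional group means and pooled pairwise-complete covariance estimate
mhat <- function(X) { m <- colMeans(X, na.rm = TRUE); m[is.nan(m)] <- 0; m }
mu0 <- mhat(X0); mu1 <- mhat(X1)
M0 <- crossprod(!is.na(X0)); M1 <- crossprod(!is.na(X1))     # pairwise counts M_j(s, t)
S0 <- cov(X0, use = "pairwise.complete.obs"); S0[is.na(S0)] <- 0
S1 <- cov(X1, use = "pairwise.complete.obs"); S1[is.na(S1)] <- 0
Rhat <- (M0 * S0 + M1 * S1) / pmax(M0 + M1, 1)               # pooled estimate of rho(s, t)

## Empirical ridge classifier on the target domain I = [0, 0.75]
idx  <- which(tt <= 0.75)
mud  <- (mu1 - mu0)[idx]
psi  <- solve(Rhat[idx, idx] * Delta + 0.1 * diag(length(idx)), mud)   # hat psi^R_alpha
xnew <- drop(sweep(matrix(rnorm(100), 1) %*% L, 2, 0.3 * dbeta(tt, 5, 5), "+"))  # new group-1 curve,
xnew <- xnew[idx]                                                      # assumed observed at least on I
Tnew <- sum((xnew - (mu0 + mu1)[idx] / 2) * psi) * Delta * sum(mud * psi) * Delta
as.integer(Tnew > 0)                             # predicted group label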
The following assumptions will be needed for the derivation of asymptotic properties of empirically trained regularized linear classifiers. Assumption 1. The distributions in groups j = 0, 1 satisfy EPj ( X 4) < ∞. Assumption 2. For a domain I, there exists δ > 0 such that the observation patterns in training samples j = 0, 1 satisfy, as nj → ∞, sup (s,t)∈I×I pr n−1 j Mj(s, t) > δ = O(n−2 j ). Assumption 1 guarantees the consistency of the empirical mean and covariance operator for samples of completely observed curves; see, for example, Bosq (2000) or Horváth & Kokoszka (2012). Kraus (2015, Proposition 1) showed, under the additional Assumption 2 with I equal to the entire domain of the curves, that the root-n consistency of the sample mean and covariance restricted to I continues to hold in the fragmentary setting. In particular, it follows that ˆμj − μj = Op(n −1/2 j ) and hence ˆμ − μ = Op(n−1/2) for n = min(n0, n1) → ∞, and also that ˆR − R ∞ = Op{(n0 + n1)−1/2}, where · ∞ is the operator norm. When I is a subset of Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 169 the domain, analogous results hold for the restrictions of the functions and integral kernels to I. Assumption 2 means that at all pairs of time-points there is an asymptotically nonnegligible fraction of observed values. Assumption 2 is less restrictive than the requirement that there be complete curves in the sample. It can be satisfied, for example, in situations where the observed curves consist of several shorter fragments. If the assumption is not satisfied because the data contain only one short fragment per curve, other estimation methods can be used; see, for example, Delaigle & Hall (2016) and Descary & Panaretos (2019). We now study the asymptotic behaviour of the empirical classifier when the number mn of steps of the conjugate gradient algorithm grows as the training sample size grows. Under certain conditions on the regularization path, we establish the convergence of the misclassification probability of the conjugate gradient classifier trained on collections of functional fragments to the same optimal limit as for the theoretical conjugate gradient classifier with an infinite training sample, regardless of whether the limiting error rate is zero or positive and regardless of whether the limit can be theoretically achieved exactly or along the path. Theorem 1. Suppose that Assumption 1 holds. Assume that n = min(n0, n1) → ∞ and mn → ∞ in such a way that mn Cn1/2 for some C > 0 and n−1/2 ω−1 mn γ (mn) + n−1 ω−3 mn → 0, (4) where ωmn is the smallest eigenvalue of the mn×mn matrix H with entries hjk = κj, Rκk for κj = Rj−1μ and the mn-vector γ (mn) is defined as γ (mn) = H−1d with d being the mn-vector having components dj = μ, κj . Then the misclassification probability of the empirical regularized linearclassifierbasedon ˆψCG mn convergesinprobabilitytotheoptimalmisclassificationprobability 1 − ( R−1/2μ /2). Condition (4) guarantees that the number of components does not grow too fast in relation to the growing number of training observations and to the increased ill-conditioning of the theoretical problem. Condition (4) is analogous to (5.10) in Delaigle & Hall (2012b) for partial least squares. The vector γ (mn) contains the coefficients of the theoretical regularized solution ψCG mn with respect to the non-orthogonal basis κ1, . . . , κmn of the Krylov subspace Kmn (R, μ), i.e., ψmn = mn j=1 γ (mn) j κj. 
The eigenvalues of H are called the Ritz values in numerical analysis. For details on connections with partial least squares see Lingjærde & Christophersen (2000). In the proof given in the Appendix we use the results of Delaigle & Hall (2012b) on the consistency of partial least squares regression for functional data. These results were obtained for situations that differ from our setting in several ways. In particular, we work with functional fragments instead of complete curves, the conjugate gradient path differs from partial least squares regression, e.g., inthe groupcentringinthe estimation ofthe covariance, and we do notrequirethat the population inverse problem, Rψ = μ in our context, have a solution. However, inspection of the underlying technical arguments in Delaigle & Hall (2012b) shows that appropriate analogous results can be obtained and used in our setting, as we explain in the proof. Next, we show that the empirically trained principal component classifier with an increasing number of components asymptotically achieves the optimal misclassification probability. Theorem 2. Suppose thatAssumption 1 holds.Assume that n = min(n0, n1) → ∞ and mn → ∞ in such a way that λ4 mn n → ∞ and λ2 mn n( mn j=1 aj)−2 → ∞, where a1 = 23/2(λ1 − λ2)−1 and aj = 23/2max{(λj−1 − λj)−1, (λj − λj+1)−1} for j = 2, 3, . . . . Then the misclassification Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 170 D. Kraus AND M. Stefanucci probability of the empirical regularized linear classifier based on ˆψPC mn converges in probability to the optimal misclassification probability 1 − ( R−1/2μ /2). The conditions on the principal component regularization path are the same as in the case of functional principal component regression (Cardot et al., 1999). Unlike in the functional linear model, it is not assumed that the inverse problem has a solution, since the goal is not to estimate the possibly nonexistent bounded linear regression functional. Finally, the empirical ridge classifier with finite training data asymptotically attains the same optimal error rate as its theoretical counterpart. Unlike for the conjugate gradient and principal component classifiers, the conditions on the ridge path classifier do not involve parameters of the distributions because no subspace is constructed. Theorem 3. Suppose that Assumption 1 holds. Assume that n = min(n0, n1) → ∞ and αn → 0+ in such a way that α4 nn → ∞. Then the misclassification probability of the empirical regularized linear classifier based on ˆψR αn converges in probability to the optimal misclassification probability 1 − ( R−1/2μ /2). 3.3. Selection of the regularization parameter The regularization parameter can be selected by minimizing an estimate of the misclassification probability. We use leave-one-out crossvalidation. The Supplementary Material provides details of crossvalidation in the presence of incomplete curves. The best value of the regularization parameter is searched for over a grid of values, such as the values corresponding to integer degrees of freedom up to some maximum value. The number of degrees of freedom for the subspace methods is the dimension of the subspace, and for the ridge method it is defined as the trace of ( ˆR + αI )−1 ˆR, i.e., n0+n1 j=1 ˆλj/(ˆλj + α) where ˆλj are the eigenvalues of ˆR. The maximum number of degrees of freedom we use is one fifth of the number of curves. 4. 
Domain selection To classify the new curve Xnew observed on Onew, we apply the classifier on the target domain I ⊆ Onew, the choice of which we now consider. One possibility would be to restrict attention to the intersection of the observation domains of all curves, say I0, if it is nonempty. An obvious drawback of this approach is that one can lose discriminatory power because any differences between the classes may be more pronounced outside I0. An advantage of our approach is its capability of working with incomplete curves, since the empirical construction of the projection direction requires only the estimation of μ and R on the target domain. Hence one can look at a domain larger than I0. A natural choice is the largest subset of Onew that contains enough data for estimation of the classifier, i.e., satisfies Assumption 2, and contains enough functions for validation in the crossvalidation procedure, i.e., has a sufficiently large set V. In this way one hopes to capture the widest range of shapes of the group difference. On the other hand, it could be that not even this maximal domain, Imax, will lead to the best classification accuracy, because one includes more uncertainty in the estimation due to the missing values and because the mean difference may not be important in the added part of the domain. Therefore, it seems reasonable to also consider intermediate choices between I0 and Imax. Here we present a domain selection strategy for the most common case of interval observation sets. The idea, worked out in detail in Stefanucci et al. (2018), is to construct the classifier on a series of intervals that range from the common domain I0 to the maximal domain Imax, extending the working interval by a fixed percentage at each step. More formally, we consider a sequence Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 171 of nested intervals I0 ⊂ I1 ⊂ · · · ⊂ Ik ⊂ · · · ⊂ IK = Imax, starting from I0 and ending in IK = Imax, and build the classifier on each interval. The regularization parameter for the kth domain is selected by crossvalidation as described in the Supplementary Material. Among these K + 1 candidates we select the one that minimizes the crossvalidation estimate of error. The search strategy can be extended by considering larger systems of candidate domains; for example, one could vary the two endpoints independently. The idea can be generalized to other situations, such as non-interval observation sets, multivariate functional data or functions indexed by multivariate arguments. In each situation one needs to define a meaningful system of domains and optimize the crossvalidation score over the system. 5. Simulations 5.1. Behaviour of regularized classifiers on complete data In this section we illustrate the behaviour of the three estimators of ψ in different settings. We consider Gaussian processes on [0, 1] with covariance kernel ρ(s, t) = exp(−|s − t|2/0.01) and mean function depending on the group label. Group 0 has mean μ0(t) = 0 in each setting. Group 1 has mean μ1(t) = μ(t), for which we consider eight different forms: (i) ct, (ii) c(t−0.5)2, (iii) c(t−0.5)3, (iv) c sin(20t), (v) cϕ1(t), (vi) cϕ10(t), (vii) cb(t; 5, 5), and (viii) cb(t; 2, 6), where ϕj is the jth eigenfunction of the kernel ρ and b(t; α, β) = tα−1(1 − t)β−1 is the beta density. In each case the parameter c is selected to yield a reasonable misclassification rate. 
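A sketch of this data-generating mechanism in R is given below; the constants multiplying the mean functions are illustrative placeholders rather than the values used in the study, and only a few of the eight forms of μ(t) are spelled out.

set.seed(3)
tt  <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
rho <- outer(tt, tt, function(s, t) exp(-abs(s - t)^2 / 0.01))   # covariance kernel
eig <- eigen(rho * Delta, symmetric = TRUE)
phi <- eig$vectors / sqrt(Delta)                 # eigenfunctions phi_j of the kernel
mu_forms <- list(                                # a few of the eight forms of mu(t)
  linear    = 1.0 * tt,                          # (i), with an illustrative constant c
  eigen1    = 0.5 * phi[, 1],                    # (v)
  eigen10   = 0.5 * phi[, 10],                   # (vi)
  beta_asym = 0.3 * dbeta(tt, 2, 6)              # (viii), b(t; 2, 6) up to normalization
)
L <- chol(rho + 1e-10 * diag(100))               # simulate Gaussian curves on the grid
gen_group <- function(n, mu) sweep(matrix(rnorm(n * 100), n) %*% L, 2, mu, "+")
X0 <- gen_group(50, rep(0, 100))                 # group 0: mean zero
X1 <- gen_group(50, mu_forms$eigen1)             # group 1 under setting (v)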
In each of 5000 repetitions we generated 50 curves from each group and evaluated them on a grid of 100 equispaced points in [0, 1]. We also generated a new observation that could arise from group 0 or group 1 with equal probability. Then we constructed the regularized classification direction by the principal component, conjugate gradient and ridge methods with m degrees of freedom and predicted the label of the new observation. We considered m = 1, . . . , 20, corresponding to a reasonable minimum of five observations per degree of freedom. Figure 1 shows the misclassification proportion over the 5000 repetitions as a function of m for the eight different choices of μ(t). As expected, the conjugate gradient method performs well in all settings and is not much affected by the shape of μ(t). By contrast, the performance of the principal component classifier depends strongly on μ(t). To see this, consider the two extreme situations in settings (v) and (vi). The classification error of the principal component approach is close to that of the conjugate gradient method in case (v), where μ(t) is the first eigenfunction, but is much higher at lower dimensions in case (vi), where μ(t) is the tenth eigenfunction. In the latter case, the principal component method reaches the same level of error as the conjugate gradient method only when m = 10 or more. These findings agree with Proposition 2 and with the conclusions of Delaigle & Hall (2012a) and Febrero-Bande et al. (2017), who pointed out that principal components need more degrees of freedom than partial least squares to achieve good performance. In this regard ridge regularization seems to lie between the two subspace methods, but is more similar to the conjugate gradient method in most cases. In particular, it does not completely fail at low degrees of freedom in case (vi), because it does not construct a subspace that could miss the important information; however, it also suffers in this situation, where μ(t) is on the tail of the spectrum, because ridge penalization shrinks higher-index spectral components more than lower-index components. Nevertheless, with sufficiently many degrees of freedom, the three methods behave similarly. Additional simulation results, reported in the Supplementary Material, show that similar conclusions can be drawn when functions have nonsmooth trajectories and that the capability to discriminate between two groups with different means is robust with respect to the assumption of equal covariances. Results for increased training sample size are also provided in the Supplementary Material.

Fig. 1. Misclassification rate (%) versus degrees of freedom for different forms of μ(t): (i) linear, (ii) quadratic, (iii) cubic, (iv) sinusoidal, (v) first eigenfunction, (vi) tenth eigenfunction, (vii) symmetric beta, and (viii) asymmetric beta. The different curves represent the principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers.

Table 1. Misclassification rates (%), with standard errors in parentheses, achieved by classifiers with degrees of freedom selected by crossvalidation in the different settings; for each classifier the numbers in the second row are the minimum misclassification rates

        (i)           (ii)          (iii)        (iv)         (v)          (vi)          (vii)         (viii)
PC   13.0 (0.34)   8.3 (0.28)    1.3 (0.11)   2.5 (0.16)   7.2 (0.26)   7.6 (0.27)   10.7 (0.31)   26.2 (0.44)
      8.1           6.1           0.1          2.2          2.4          7.4           6.1          20.4
CG    8.6 (0.28)   6.5 (0.25)    0.7 (0.09)   2.1 (0.14)   2.6 (0.16)   7.8 (0.27)    6.1 (0.24)   20.9 (0.41)
      8.1           5.7           0.1          2.1          2.2          7.2           5.7          19.9
R     8.4 (0.28)   7.7 (0.27)    0.7 (0.09)   2.2 (0.15)   2.4 (0.15)   7.9 (0.27)    6.1 (0.24)   20.8 (0.41)
      7.9           6.5           0.2          2.0          2.3          7.3           5.7          20.0

PC, principal component classifier; CG, conjugate gradient classifier; R, ridge classifier.

5.2. Performance of crossvalidation for selection of degrees of freedom

We used simulation to investigate the performance of leave-one-out crossvalidation in choosing the correct level of regularization. The settings were the same as in § 5.1, but classification was done using the number of degrees of freedom selected by leave-one-out crossvalidation. We summarize the classification errors in Table 1. Crossvalidation performs well as a selector of the best level of regularization since the misclassification rate in Table 1 is in each case close to the corresponding minimum error in Fig. 1. The principal component method appears to perform worst, while the conjugate gradient and ridge methods have comparable performance. The latter two methods nearly achieve the respective minimum errors. Table 2 reports the mean and median selected degrees of freedom. The principal component method often uses considerably more degrees of freedom than the other methods. This is particularly interesting in case (v), where the mean difference equals the first eigenfunction and so one component should be the best choice in theory.

Table 2. Mean and median (in parentheses) degrees of freedom selected by crossvalidation

        (i)        (ii)         (iii)      (iv)        (v)        (vi)        (vii)      (viii)
PC    8.2 (7)    14.3 (15)    9.9 (9)    10.9 (10)   4.6 (4)    11.9 (11)   5.3 (4)    8.6 (6)
CG    5.4 (3)    10.7 (11)    3.4 (2)     4.5 (2)    2.4 (1)     4.9 (3)    2.7 (1)    8.6 (7)
R     6.4 (3)    11.6 (13)    6.0 (3)     6.1 (4)    2.7 (1)     9.3 (8)    3.4 (1)    6.7 (3)

PC, principal component classifier; CG, conjugate gradient classifier; R, ridge classifier.

Fig. 2. Misclassification rate (%) plotted as a function of the domain extension, for μ(t) being the (a) Be(2, 6), (b) Be(5, 5) or (c) Be(6, 2) density, for the principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers with selected degrees of freedom. Classification is performed on the domains [0, u] with u ∈ [0.5, 0.9], and the error values are plotted against u.
These results again illustrate the general phenomenon that the principal component approach is inappropriate for inference about means, due to the possible lack of informativeness of the principal components about the mean and the extra uncertainty associated with their estimation.

5.3. Missing data and domain extension

We now demonstrate the usefulness of the domain extension approach presented in § 4, using Gaussian processes on [0, 1] with the same covariance as in § 5.1 and considering three scenarios for the mean difference in the form of a multiple of a beta density, (a) b(t; 2, 6), (b) b(t; 5, 5) and (c) b(t; 6, 2), which reflect situations where discrimination due to a peak is in the left, central and right parts of the domain, respectively. We sampled 50 curves from each group on a sequence of 100 equispaced points in [0, 1]. Then we generated endpoints of the observation interval for each curve from the uniform distribution on (0.5, 1); that is, each curve was observed between 0 and the endpoint and treated as missing beyond the endpoint. The new observation had an endpoint sampled between 0.5 and 1. So the first half of [0, 1], I0 = [0, 0.5], was the common observation domain of all curves. We considered extensions of I0 to Ik = [0, 0.5 + 0.05k] (k = 0, . . . , 8). For each interval of this form that was contained in the observation domain of the curve to be classified, we estimated the classifiers, choosing the best degrees of freedom via crossvalidation, and classified the new curve. This procedure was repeated 1000 times. We plot the behaviour of the resulting classification error as a function of the endpoint of the extended domain in Fig. 2. When the peak of the mean difference is in the left part of [0, 1], extending the domain does not lead to better classification. In this case the interval where the means mainly differ corresponds to the part of the domain where all the data are available, and inflating the domain only increases the uncertainty due to missing data. In the second case, the peak of the mean difference is exactly at 0.5, and extending the domain leads to little improvement. The third scenario is the opposite of the first, as the discrimination is mainly in the right part of [0, 1]. In this case, extending the domain reduces the error considerably because good classification is only possible by employing the right part of the domain. The classification error is about 45% when using only I0, but drops to about 20% when using also the part of the interval where the data are partially observed.

Table 3. Misclassification rates (%), with standard errors in parentheses, achieved by classifiers with domain and degrees of freedom selected by crossvalidation in the different settings; the minimum and maximum misclassification rates are given in square brackets

        (a)                        (b)                       (c)
PC   18.1 (0.38) [11.3, 33.7]   11.9 (0.32) [11.4, 15.2]   31.1 (0.46) [21.8, 46.0]
CG   19.6 (0.39) [15.4, 25.7]    7.4 (0.26) [5.6, 9.3]     30.4 (0.46) [19.2, 45.7]
R    22.4 (0.42) [17.2, 22.8]    6.9 (0.25) [5.4, 8.6]     28.4 (0.45) [20.7, 45.9]

PC, principal component classifier; CG, conjugate gradient classifier; R, ridge classifier.

5.4. Performance with selected domain

Domain extension may or may not improve the performance of classifiers, depending on the interplay between the form of the mean difference, the covariance structure and the missingness pattern.
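The selection step itself can be sketched compactly. The R code below mimics the setup of § 5.3 and chooses among the nested domains Ik by a leave-one-out estimate of the error; for brevity it uses only the ridge direction with a fixed α instead of also crossvalidating the degrees of freedom, and it validates only on curves fully observed on the candidate domain, whereas the procedure described in the Supplementary Material also handles incomplete validation curves. All function and object names are illustrative assumptions.

set.seed(4)
tt  <- seq(0, 1, length.out = 100); Delta <- tt[2] - tt[1]
rho <- outer(tt, tt, function(s, t) exp(-(s - t)^2 / 0.01))
L   <- chol(rho + 1e-10 * diag(100))
gen_frag <- function(n, mu) {                      # curves observed on [0, E], E ~ U(0.5, 1)
  X <- sweep(matrix(rnorm(n * 100), n) %*% L, 2, mu, "+")
  for (i in 1:n) X[i, tt > runif(1, 0.5, 1)] <- NA
  X
}
X0 <- gen_frag(50, rep(0, 100))                    # group 0
X1 <- gen_frag(50, 0.4 * dbeta(tt, 6, 2))          # group 1, scenario (c): peak on the right

ridge_rule <- function(X0, X1, idx, alpha = 0.1) { # train a ridge classifier on domain idx
  mhat <- function(X) { m <- colMeans(X, na.rm = TRUE); m[is.nan(m)] <- 0; m }
  pcov <- function(X) { S <- cov(X, use = "pairwise.complete.obs"); S[is.na(S)] <- 0; S }
  m0 <- mhat(X0)[idx]; m1 <- mhat(X1)[idx]
  M0 <- crossprod(!is.na(X0))[idx, idx]; M1 <- crossprod(!is.na(X1))[idx, idx]
  Rh <- (M0 * pcov(X0)[idx, idx] + M1 * pcov(X1)[idx, idx]) / pmax(M0 + M1, 1)
  psi <- solve(Rh * Delta + alpha * diag(length(idx)), m1 - m0)
  function(x)                                      # the Delta factors cancel in the sign of T
    as.integer(sum((x - (m0 + m1) / 2) * psi) * sum((m1 - m0) * psi) > 0)
}
loo_error <- function(X0, X1, idx) {               # leave-one-out error estimate on domain idx
  err <- n <- 0
  for (g in 0:1) for (i in 1:nrow(if (g == 0) X0 else X1)) {
    x <- (if (g == 0) X0 else X1)[i, ]
    if (any(is.na(x[idx]))) next                   # validate only on curves observed on idx
    cl <- ridge_rule(if (g == 0) X0[-i, ] else X0, if (g == 1) X1[-i, ] else X1, idx)
    err <- err + (cl(x[idx]) != g); n <- n + 1
  }
  err / n
}
domains <- lapply(0:8, function(k) which(tt <= 0.5 + 0.05 * k))  # I_0, ..., I_8
cv <- sapply(domains, function(idx) loo_error(X0, X1, idx))
which.min(cv)                                      # index of the selected domain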
In practice, the user is not an oracle with access to misclassification errors for candidate subsets whose estimates are plotted in Fig. 2, and hence would select the best domain by crossvalidation. In Table 3 we report simulation results for classifiers with both domain and degrees of freedom selected by crossvalidation, for the same configurations as in § 5.3. Selection of the domain leads to a considerable improvement of the error rate compared with the worst-performing domain. On the other hand, this improvement has some limitations and a gap remains between the achieved value and the best value; this can be explained by the fact that crossvalidation provides only an estimate of the error, not the true value. 6. AneuRisk data example We apply the proposed method to theAneuRisk dataset from an interdisciplinary project aimed at investigating the effects of blood vessel morphology, blood fluid dynamics and biomechanical properties of the vascular wall on the pathogenesis of cerebral aneurysms. An introduction to the data can be found in Sangalli et al. (2014b). This dataset has previously been analysed in several works that focused on different methodological aspects, such as function and derivative estimation (Sangalli et al., 2009b), exploratory analysis and classification (Sangalli et al., 2009a), and alignment and clustering (Sangalli et al., 2014a), among others. The data consist of measurements of the radius and curvature of the internal carotid artery in a sample of 65 patients, 33 of which have an aneurysm at the bifurcation of the vessel or after it, while the other 32 either have an aneurysm before the bifurcation, which is much less dangerous, or are healthy. The goal is to classify the patients based on the morphology of their internal carotid artery. In this example we work with only one of the observed variables, the radius. The data have previously been pre-processed, registered and smoothed, and are observed on a grid of 2000 points in the interval [−100.3, 5.1], where the argument represents the distance between the observation point and the terminal bifurcation of the internal carotid artery, with positive values indicating points inside the skull. As we can see in Fig. 3, the data are partially observed because Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 175 −100 −80 −60 −40 −20 0 1 2 3 4 Internal carotid artery arc length (mm) Radius(mm) Fig. 3. Radius along the carotid artery from the AneuRisk dataset, along with the mean of the group of subjects with an aneurysm after the bifurcation (dotted) and the mean of the group of subjects with an aneurysm before the bifurcation or without an aneurysm (dashed). Curves for two example subjects are highlighted as solid lines. Note the different start and end points for different subjects in the study. the start and end points are different from subject to subject. All subjects are observed on the subset I0 = [−32.9, −7.4], which corresponds to 24.3% of the whole domain. We first apply the regularized linear classifiers to curves restricted to the common domain I0. The classification error estimated by crossvalidation is 29.2% for the principal component method, 29.2% for the conjugate gradient method, and 32.3% for ridge regularized classification. We compare the above procedure with a different approach consisting of a multivariate classification method applied to principal component scores. 
The covariance kernel is estimated from observations centred to their respective group means, its eigenfunctions are computed, and quadratic discriminant analysis is applied to the inner products of the uncentred curves with the eigenfunctions. This procedure is similar to that in Sangalli et al. (2009a). The best classifier of this type turns out to exhibit a misclassification error of 32.3%, obtained with two eigenfunctions. These values show that in this dataset, when attention is restricted to the common domain I0, our proposed method is comparable to the more standard multivariate technique. Next, we consider classification on extended domains including observed values outside the commondomainI0.WebuildthesequenceofdomainsI0, . . . , IK byenlargingthedomainateach step by 1.25% of the complement of I0. This step size is a compromise between the fineness of the grid and the computational cost. We consider extended domains up to K = 40, corresponding to I40 = [−66.6, −1.2], because not enough subjects have observed values outside this interval for reliable estimation and crossvalidation. All regularized linear classification methods benefit from the domain extension; in particular, the error rate for the principal component method drops from 29.2% to 23.2%, for the conjugate gradient method from 29.2% to 25.8%, and for ridge regularization from 32.3% to 25%. The best domain is I10 = [−41.3, −5.8] for the conjugate gradient method and I11 = [−42.2, −5.7] for the other two methods. The alternative method based on multivariate classification of scores cannot be applied on extended domains since the individual scores of incomplete curves cannot be computed, although they can be predicted (Kraus, 2015). By contrast, the proposed methods are entirely formulated in terms of distributional parameters, which can be consistently estimated from incomplete data, unlike individual quantities. Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 176 D. Kraus AND M. Stefanucci Acknowledgement The AneuRisk data and useful comments were kindly provided by Laura Sangalli. The work of David Kraus was supported by the Czech Science Foundation. We are grateful to two referees, an associate editor and the editor for helpful suggestions and corrections. Supplementary material Supplementary material available at Biometrika online includes the derivation of classifiers under unequal prior class probabilities, algorithmic details of crossvalidation, and additional simulation and real-data results. Appendix Proof of Proposition 1 The misclassification probability for ψm is D(ψm) given in (1). Since ψm ∈ Sm, we compute | μ, ψm | ψm, Rψm 1/2 = μ, R− m μ μ, R− m RR− m μ 1/2 = (R− m )1/2 μ . By Lebesgue’s monotone convergence theorem, the right-hand side converges to R−1/2 μ , finite or infinite, and therefore the limiting misclassification probability that is attained along the regularization path ψm, as m → ∞, is 1 − ( R−1/2 μ /2). Proof of Proposition 2 The conjugate gradient method minimizes the quadratic objective function in the Krylov subspace Km(R, μ) whose elements are in the form η = m−1 k=0 ck Rk μ = p(R)μ, where p is a polynomial of order lower than m. Then η ∈ Km(R, μ) can be written as η = ∞ j=1 p(λj)bjϕj with bj = μ, ϕj . 
The objective function at η equals η, Rη /2 − μ, η = p(R)μ, Rp(R)μ /2 − μ, p(R)μ = ∞ j=1 b2 j {p(λj)2 λj/2 − p(λj)} = ∞ j=1 b2 j 2λj q(λj){q(λj) − 2}, (A1) where q(λ) = p(λ)λ is a polynomial of degree at most m such that q(0) = 0.The conjugate gradient method seeks the polynomial with these properties that minimizes the objective function. To prove the proposition we shall find a polynomial q with the required properties such that the objective function above is smaller than or equal to the objective function for the principal component classifier. The principal component classifier uses ψPC m = m j=1 λ−1 j bjϕj, and the objective function at ψPC m is ψPC m , RψPC m /2 − μ, ψPC m = − m j=1 b2 j 2λj . (A2) Consider the polynomial of degree m, q(λ) = 1 − (−1)m λ − λ1 λ1 · · · λ − λm λm , Downloadedfromhttps://academic.oup.com/biomet/article-abstract/106/1/161/5250873bygueston28February2019 Classification of functional fragments 177 with q(0) = 0. We see that q(λj) = 1 for j = 1, . . . , m, so the first m summands in the series (A1) and (A2) are equal. For j > m we have that 0 q(λj) 2 due to the properties of the eigenvalue sequence; so q(λj){q(λj) − 2} 0 and therefore the corresponding summands in the series (A1) are negative, whereas they are zero in the series (A2). Hence, for this polynomial, ∞ j=1 b2 j 2λj q(λi){q(λi) − 2} − m j=1 b2 j 2λj , and so the objective at the conjugate gradient solution must be smaller than or equal to the objective at the principal component solution. The inequality between the minima of the quadratic objective function implies the inequality between the misclassification probabilities stated in the proposition. Proof of Proposition 3 Proceeding as in the proof of Proposition 1, we need to show that μ, R−1 α μ μ, R−1 α RR−1 α μ 1/2 = ∞ j=1 b2 j λj+α ∞ j=1 λjb2 j (λj+α)2 1/2 −−−→ α→0+ ∞ j=1 b2 j λj 1/2 = R−1/2 μ , where bj = μ, ϕj is the coefficient of μ in the eigenbasis. If ∞ j=1 b2 j /λj < ∞, the convergence follows from Lebesgue’s monotone convergence theorem. Otherwise, we use the inequality ∞ j=1 λjb2 j /(λj + α)2 ∞ j=1 b2 j /(λj + α) to bound the left-hand side expression from below by { ∞ j=1 b2 j /(λj + α)}1/2 , which diverges to infinity again by Lebesgue’s theorem. Proof of Theorem 1 The probability of misclassifying a new observation using the conjugate gradient classifier based on ˆψCG mn is D( ˆψCG mn ) = 1 − {|Z( ˆψCG mn )|/2}. We need to show that the fraction in Z( ˆψCG mn ) converges in probability to R−1/2 μ /2 along the regularization path satisfying the assumptions of the theorem. To deal with the numerator in Z( ˆψCG mn ), one can show that μ, ˆψCG mn − μ, ψCG mn = Op n−1/2 ω−1 mn γ (mn) + n−1 ω−3 mn . (A3) This result follows from an analogue of (5.9) in Theorem 5.3 of Delaigle & Hall (2012b) and intermediate results in the proof of that theorem which can be established in our context. The necessary modifications of the proofs of Theorems 5.1, 5.2 and 5.3 in Delaigle & Hall (2012b) are as follows. All results remain valid for incomplete instead of complete curves, because the proofs depend only on the root-n consistency of the covariance estimators, which holds also for functional fragments (Kraus, 2015, Proposition 1). Moreover, the derivations in Delaigle & Hall (2012b) can be repeated without assuming that the theoretical solution ψ = R−1 μ exists as an element of L2 (I). 
Proof of Theorem 1

The probability of misclassifying a new observation using the conjugate gradient classifier based on $\hat\psi^{\mathrm{CG}}_{m_n}$ is $D(\hat\psi^{\mathrm{CG}}_{m_n})=1-\Phi\{|Z(\hat\psi^{\mathrm{CG}}_{m_n})|/2\}$. We need to show that the fraction in $Z(\hat\psi^{\mathrm{CG}}_{m_n})$ converges in probability to $\|R^{-1/2}\mu\|/2$ along the regularization path satisfying the assumptions of the theorem. To deal with the numerator in $Z(\hat\psi^{\mathrm{CG}}_{m_n})$, one can show that
\[
\langle\mu,\hat\psi^{\mathrm{CG}}_{m_n}\rangle-\langle\mu,\psi^{\mathrm{CG}}_{m_n}\rangle
= O_p\bigl\{n^{-1/2}\omega_{m_n}^{-1}\gamma(m_n)+n^{-1}\omega_{m_n}^{-3}\bigr\}. \tag{A3}
\]
This result follows from an analogue of (5.9) in Theorem 5.3 of Delaigle & Hall (2012b) and intermediate results in the proof of that theorem, which can be established in our context. The necessary modifications of the proofs of Theorems 5.1, 5.2 and 5.3 in Delaigle & Hall (2012b) are as follows. All results remain valid for incomplete instead of complete curves, because the proofs depend only on the root-$n$ consistency of the covariance estimators, which holds also for functional fragments (Kraus, 2015, Proposition 1). Moreover, the derivations in Delaigle & Hall (2012b) can be repeated without assuming that the theoretical solution $\psi=R^{-1}\mu$ exists as an element of $L^2(I)$. Indeed, the proofs in Delaigle & Hall (2012b) are based on stochastic expansions of $\hat R^j\psi=\hat R^jR^{-1}\mu$, in our notation, about $R^j\psi=R^jR^{-1}\mu=R^{j-1}\mu$ and derived quantities, but the same steps can be followed for $\hat R^{j-1}\hat\mu$ about $R^{j-1}\mu$ in our setting. In other words, it can be shown that $\hat\psi^{\mathrm{CG}}_{m_n}$ and $\psi^{\mathrm{CG}}_{m_n}$ converge to each other without assuming that $\psi^{\mathrm{CG}}_{m_n}$ converges. Similarly, for the denominator in $Z(\hat\psi^{\mathrm{CG}}_{m_n})$ we have that
\[
\langle\hat\psi^{\mathrm{CG}}_{m_n},R\hat\psi^{\mathrm{CG}}_{m_n}\rangle-\langle\psi^{\mathrm{CG}}_{m_n},R\psi^{\mathrm{CG}}_{m_n}\rangle
= O_p\bigl\{n^{-1/2}\omega_{m_n}^{-1}\gamma(m_n)+n^{-1}\omega_{m_n}^{-3}\bigr\}. \tag{A4}
\]
This last result is analogous to (7.27) of Delaigle & Hall (2012b), whose proof can be repeated with the same modifications for our situation as before. Therefore, regardless of whether $\|R^{-1}\mu\|$ or $\|R^{-1/2}\mu\|$ is finite or infinite, the theoretical and empirical regularized quantities approach each other at the rates given in (A3) and (A4). The result on $D(\hat\psi^{\mathrm{CG}}_{m_n})$ then follows as in the proof of Proposition 1.

Proof of Theorem 2

We show that $D(\hat\psi^{\mathrm{PC}}_{m_n})=1-\Phi\{|Z(\hat\psi^{\mathrm{PC}}_{m_n})|/2\}$ converges in probability to $1-\Phi(\|R^{-1/2}\mu\|/2)$. The strategy of the proof is similar to that of Theorem 3.1 in Cardot et al. (1999) for the principal component approach to the functional linear model. The difference lies in the incompleteness of the functional data and in that we do not assume that the underlying theoretical inverse problem has a solution. We write
\[
\|\hat\psi^{\mathrm{PC}}_{m_n}-\psi^{\mathrm{PC}}_{m_n}\|
\le \|\hat R^-_{m_n}-R^-_{m_n}\|_\infty\|\hat\mu\|+\|R^-_{m_n}\|_\infty\|\hat\mu-\mu\|.
\]
Proceeding as in the proof of Lemma 5.1 in Cardot et al. (1999), we can show that
\[
\|\hat R^-_{m_n}-R^-_{m_n}\|_\infty
\le \hat\lambda^{-1}_{m_n}\lambda^{-1}_{m_n}\|\hat R-R\|_\infty
+2\lambda^{-1}_{m_n}\|\hat R-R\|_\infty\sum_{j=1}^{m_n}a_j.
\]
Here $\hat\lambda_j$ are the eigenvalues of $\hat R$ in descending order and $\hat\phi_j$ are the corresponding eigenfunctions. In establishing the above inequality one uses the facts that $|\hat\lambda_j-\lambda_j|\le\|\hat R-R\|_\infty$ and $\|\hat\phi_j-\mathrm{sign}\langle\hat\phi_j,\phi_j\rangle\phi_j\|\le a_j\|\hat R-R\|_\infty$, which are known from Bosq (2000, Lemmas 4.2 and 4.3) for the empirical covariance operator from complete curves but hold also for functional fragments; see the proof of Proposition 2 in the supplementary document for Kraus (2015). Since $\|\hat R-R\|_\infty=O_p(n^{-1/2})$, we see that
\[
\hat\lambda^{-1}_{m_n}\lambda^{-1}_{m_n}\|\hat R-R\|_\infty 1_{[\hat\lambda_{m_n}>\lambda_{m_n}/2]}
\le 2\lambda^{-2}_{m_n}\|\hat R-R\|_\infty=\lambda^{-2}_{m_n}O_p(n^{-1/2}).
\]
Since the probability of the event $[\hat\lambda_{m_n}<\lambda_{m_n}/2]$ is bounded by $\lambda^{-2}_{m_n}O(n^{-1})$ and hence converges to 0, it follows that $\hat\lambda^{-1}_{m_n}\lambda^{-1}_{m_n}\|\hat R-R\|_\infty=\lambda^{-2}_{m_n}O_p(n^{-1/2})$. Combining this with the facts that $\|\hat\mu\|=O_p(1)$, $\|R^-_{m_n}\|_\infty=\lambda^{-1}_{m_n}$ and $\|\hat\mu-\mu\|=O_p(n^{-1/2})$ gives
\[
\|\hat\psi^{\mathrm{PC}}_{m_n}-\psi^{\mathrm{PC}}_{m_n}\|
\le \lambda^{-2}_{m_n}O_p(n^{-1/2})+\lambda^{-1}_{m_n}O_p(n^{-1/2})\sum_{j=1}^{m_n}a_j.
\]
Similar arguments can be used in the analysis of the denominator in $Z(\hat\psi^{\mathrm{PC}}_{m_n})$. In conclusion, we obtain that the estimation errors for the quantities in the numerator and denominator converge to zero at the rates
\[
\langle\mu,\hat\psi^{\mathrm{PC}}_{m_n}\rangle-\langle\mu,\psi^{\mathrm{PC}}_{m_n}\rangle
= \lambda^{-2}_{m_n}O_p(n^{-1/2})+\lambda^{-1}_{m_n}O_p(n^{-1/2})\sum_{j=1}^{m_n}a_j, \tag{A5}
\]
\[
\langle\hat\psi^{\mathrm{PC}}_{m_n},R\hat\psi^{\mathrm{PC}}_{m_n}\rangle-\langle\psi^{\mathrm{PC}}_{m_n},R\psi^{\mathrm{PC}}_{m_n}\rangle
= \lambda^{-2}_{m_n}O_p(n^{-1/2})+\lambda^{-1}_{m_n}O_p(n^{-1/2})\sum_{j=1}^{m_n}a_j. \tag{A6}
\]
In light of (A5) and (A6), the asymptotic behaviour of the misclassification probability is driven by the behaviour of the theoretical classifier addressed in Proposition 1.

Proof of Theorem 3

We show that the fraction $|Z(\hat\psi^{\mathrm{R}}_{\alpha_n})|$ converges in probability to $\|R^{-1/2}\mu\|/2$ as $n\to\infty$. For the numerator we write
\[
\langle\mu,\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle-\langle\mu,R^{-1}_{\alpha_n}\mu\rangle
= \langle\mu,(\hat R^{-1}_{\alpha_n}-R^{-1}_{\alpha_n})\hat\mu\rangle
+\langle\mu,R^{-1}_{\alpha_n}(\hat\mu-\mu)\rangle. \tag{A7}
\]
For the first term on the right we find that
\[
|\langle\mu,(\hat R^{-1}_{\alpha_n}-R^{-1}_{\alpha_n})\hat\mu\rangle|
\le \|\mu\|\,\|\hat R^{-1}_{\alpha_n}-R^{-1}_{\alpha_n}\|_\infty\|\hat\mu\|
= \|\mu\|\,\|\hat R^{-1}_{\alpha_n}(\hat R_{\alpha_n}-R_{\alpha_n})R^{-1}_{\alpha_n}\|_\infty\|\hat\mu\|
\le \|\mu\|\,\|\hat R^{-1}_{\alpha_n}\|_\infty\|\hat R_{\alpha_n}-R_{\alpha_n}\|_\infty\|R^{-1}_{\alpha_n}\|_\infty\|\hat\mu\|
= \alpha_n^{-2}O_p(n^{-1/2}),
\]
since $\|\hat R^{-1}_{\alpha_n}\|_\infty\le\alpha_n^{-1}$, $\|R^{-1}_{\alpha_n}\|_\infty\le\alpha_n^{-1}$, $\|\hat\mu\|=O_p(1)$ and $\|\hat R_{\alpha_n}-R_{\alpha_n}\|_\infty=\|\hat R-R\|_\infty=O_p\{(n_0+n_1)^{-1/2}\}$ (Kraus, 2015, Proposition 1). For the second term on the right-hand side of (A7), we obtain
\[
|\langle\mu,R^{-1}_{\alpha_n}(\hat\mu-\mu)\rangle|
\le \|\mu\|\,\|R^{-1}_{\alpha_n}\|_\infty\|\hat\mu-\mu\|
= \alpha_n^{-1}O_p(n^{-1/2}).
\]
The quantity in the denominator of $Z(\hat\psi^{\mathrm{R}}_{\alpha_n})$ can be rewritten as
\[
\langle\hat\psi^{\mathrm{R}}_{\alpha_n},R\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle-\langle\psi^{\mathrm{R}}_{\alpha_n},R\psi^{\mathrm{R}}_{\alpha_n}\rangle
= \langle\hat\psi^{\mathrm{R}}_{\alpha_n}-\psi^{\mathrm{R}}_{\alpha_n},R\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle
+\langle\psi^{\mathrm{R}}_{\alpha_n},R(\hat\psi^{\mathrm{R}}_{\alpha_n}-\psi^{\mathrm{R}}_{\alpha_n})\rangle. \tag{A8}
\]
The first term on the right is
\[
\langle\hat\psi^{\mathrm{R}}_{\alpha_n}-\psi^{\mathrm{R}}_{\alpha_n},R\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle
= \langle\hat R^{-1}_{\alpha_n}\hat\mu-R^{-1}_{\alpha_n}\mu,R\hat R^{-1}_{\alpha_n}\hat\mu\rangle
= \langle R^{-1}_{\alpha_n}(R_{\alpha_n}-\hat R_{\alpha_n})\hat R^{-1}_{\alpha_n}\hat\mu,R\hat R^{-1}_{\alpha_n}\hat\mu\rangle
+\langle R^{-1}_{\alpha_n}(\hat\mu-\mu),R\hat R^{-1}_{\alpha_n}\hat\mu\rangle. \tag{A9}
\]
For the first summand in (A9) we have
\[
|\langle R^{-1}_{\alpha_n}(R_{\alpha_n}-\hat R_{\alpha_n})\hat R^{-1}_{\alpha_n}\hat\mu,R\hat R^{-1}_{\alpha_n}\hat\mu\rangle|
\le \|\hat\mu\|^2\,\|\hat R^{-1}_{\alpha_n}\|_\infty^2\,\|RR^{-1}_{\alpha_n}\|_\infty\|\hat R-R\|_\infty
= \alpha_n^{-2}O_p(n^{-1/2}),
\]
using properties mentioned previously and the fact that $\|RR^{-1}_{\alpha_n}\|_\infty\le 1$, and for the second summand we have
\[
|\langle R^{-1}_{\alpha_n}(\hat\mu-\mu),R\hat R^{-1}_{\alpha_n}\hat\mu\rangle|
\le \|RR^{-1}_{\alpha_n}\|_\infty\|\hat R^{-1}_{\alpha_n}\|_\infty\|\hat\mu\|\,\|\hat\mu-\mu\|
= \alpha_n^{-1}O_p(n^{-1/2}).
\]
Putting these results together, we see that the absolute value of the first term on the right-hand side of (A8) is dominated by $\alpha_n^{-2}O_p(n^{-1/2})$. The second term on the right-hand side of (A8) can be analysed in a similar way to the first two terms on the right-hand side of (A7), with $RR^{-1}_{\alpha_n}\mu$ in place of $\mu$. Thus we bound its absolute value from above by $\alpha_n^{-2}O_p(n^{-1/2})$. These results imply that the estimation errors vanish at the rates
\[
\langle\mu,\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle-\langle\mu,\psi^{\mathrm{R}}_{\alpha_n}\rangle=\alpha_n^{-2}O_p(n^{-1/2}),
\qquad
\langle\hat\psi^{\mathrm{R}}_{\alpha_n},R\hat\psi^{\mathrm{R}}_{\alpha_n}\rangle-\langle\psi^{\mathrm{R}}_{\alpha_n},R\psi^{\mathrm{R}}_{\alpha_n}\rangle=\alpha_n^{-2}O_p(n^{-1/2}).
\]
Hence the empirical classifier has the same limiting error as the theoretical one addressed in Proposition 3.

References

Baíllo, A., Cuevas, A. & Cuesta-Albertos, J. A. (2011a). Supervised classification for a family of Gaussian functional models. Scand. J. Statist. 38, 480–98.
Baíllo, A., Cuevas, A. & Fraiman, R. (2011b). Classification methods for functional data. In The Oxford Handbook of Functional Data Analysis, F. Ferraty & Y. Romain, eds. Oxford: Oxford University Press, pp. 259–97.
Berrendero, J. R., Cuevas, A. & Torrecilla, J. L. (2016). Variable selection in functional data classification: A maxima-hunting proposal. Statist. Sinica 26, 619–38.
Berrendero, J. R., Cuevas, A. & Torrecilla, J. L. (2018). On the use of reproducing kernel Hilbert spaces in functional classification. J. Am. Statist. Assoc. 113, 1210–8.
Blanchard, G. & Krämer, N. (2010). Kernel partial least squares is universally consistent. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh & M. Titterington, eds., vol. 9 of Proceedings of Machine Learning Research. International Joint Conferences on Artificial Intelligence (IJCAI) Organization, pp. 57–64.
Bongiorno, E. G. & Goia, A. (2016). Classification methods for Hilbert data based on surrogate density. Comp. Statist. Data Anal. 99, 204–22.
Bosq, D. (2000). Linear Processes in Function Spaces. New York: Springer.
Bugni, F. A. (2012). Specification test for missing functional data. Economet. Theory 28, 959–1002.
Cardot, H., Ferraty, F. & Sarda, P. (1999). Functional linear model. Statist. Prob. Lett.
45, 11–22. Cuesta-Albertos, J. A., del Barrio, E., Fraiman, R. & Matrán, C. (2007). The random projection method in goodness of fit for functional data. Comp. Statist. Data Anal. 51, 4814–31. Cuevas, A. (2014). A partial overview of the theory of statistics with functional data. J. Statist. Plan. Infer. 147, 1–23. Dai, X., Müller, H.-G. & Yao, F. (2017). Optimal Bayes classifiers for functional data and density ratios. Biometrika 104, 545–60. De Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemomet. Intel. Lab. Syst. 18, 251–63. Delaigle, A. & Hall, P. (2012a). Achieving near perfect classification for functional data. J. R. Statist. Soc. B 74, 267–86. Delaigle, A. & Hall, P. (2012b). Methodology and theory for partial least squares applied to functional data. Ann. Statist. 40, 322–52. Delaigle, A. & Hall, P. (2013). Classification using censored functional data. J. Am. Statist. Assoc. 108, 1269–83. Delaigle, A. & Hall, P. (2016).Approximating fragmented functional data by segments of Markov chains. Biometrika 103, 779–99. Delaigle, A., Hall, P. & Bathia, N. (2012). Componentwise classification and clustering of functional data. Biometrika 99, 299–313. Descary, M.-H. & Panaretos, V. M. (2019). Recovering covariance from functional fragments. Biometrika 106, 145–60. Febrero-Bande, M., Galeano, P. & González-Manteiga, W. (2017). Functional principal component regression and functional partial least-squares regression: An overview and a comparative study. Int. Statist. Rev. 85, 61–83. Ferraty, F., Hall, P. & Vieu, P. (2010). Most-predictive design points for functional data predictors. Biometrika 97, 807–24. Goldberg, Y., Ritov, Y. & Mandelbaum, A. (2014). Predicting the continuation of a function with applications to call center data. J. Statist. Plan. Infer. 147, 53–65. Gromenko, O., Kokoszka, P. & Sojka, J. (2017). Evaluation of the cooling trend in the ionosphere using functional regression with incomplete curves. Ann. Appl. Statist. 11, 898–918. Hastie, T. J., Tibshirani, R. J. & Friedman, J. H. (2009). The Elements of Statistical Learning. New York: Springer, 2nd ed. Horváth, L. & Kokoszka, P. (2012). Inference for Functional Data with Applications. New York: Springer. Kraus, D. (2015). Components and completion of partially observed functional data. J. R. Statist. Soc. B 77, 777–801. Liebl, D. (2013). Modeling and forecasting electricity spot prices: A functional data perspective. Ann. Appl. Statist. 7, 1562–92. LingjÆrde, O. C. & Christophersen, N. (2000). Shrinkage structure of partial least squares. Scand. J. Statist. 27, 459–73. Phatak, A. & de Hoog, F. (2002). Exploiting the connection between PLS, Lanczos methods and conjugate gradients: Alternative proofs of some properties of PLS. J. Chemomet. 16, 361–7. Pini, A. & Vantini, S. (2016). The interval testing procedure: A general framework for inference in functional data analysis. Biometrics 72, 835–45. Ramsay, J. O. & Silverman, B. W. (2005). Functional Data Analysis. New York: Springer, 2nd ed. Sangalli, L. M., Secchi, P. & Vantini, S. (2014a). Analysis of AneuRisk65 data: k-mean alignment. Electron. J. Statist. 8, 1891–904. Sangalli, L. M., Secchi, P. & Vantini, S. (2014b). AneuRisk65: A dataset of three-dimensional cerebral vascular geometries. Electron. J. Statist. 8, 1879–90. Sangalli, L. M., Secchi, P., Vantini, S. & Veneziani, A. (2009a).A case study in exploratory functional data analysis: Geometrical features of the internal carotid artery. J. Am. Statist. Assoc. 104, 37–48. 
Sangalli, L. M., Secchi, P., Vantini, S. & Veneziani, A. (2009b). Efficient estimation of three-dimensional curves and their derivatives by free-knot regression splines, applied to the analysis of inner carotid artery centrelines. J. R. Statist. Soc. C 58, 285–306.
Stefanucci, M., Sangalli, L. M. & Brutti, P. (2018). PCA-based discrimination of partially observed functional data, with an application to AneuRisk65 data set. Statist. Neer. 72, 246–64.

[Received on 22 August 2017. Editorial decision on 2 August 2018]

Supplementary material for "Classification of functional fragments by regularized linear classifiers with domain selection"

BY DAVID KRAUS
Department of Mathematics and Statistics, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic
david.kraus@mail.muni.cz

AND MARCO STEFANUCCI
Department of Statistical Sciences, Sapienza University of Rome, Piazzale Aldo Moro 5, 00185 Roma, Italy
marco.stefanucci@uniroma1.it

SUMMARY

The Supplementary Material provides the derivation of classifiers under unequal prior class probabilities, algorithmic details of cross-validation and additional simulation and real data results.

Some key words: Classification; Conjugate gradients; Domain selection; Functional data; Partial observation; Regularization; Ridge method.

S1. DERIVATIONS UNDER UNEQUAL PRIOR CLASS PROBABILITIES

Let $\pi_j$ be the prior probability of class $j$ ($j=0,1$). The optimal classifier based on the one-dimensional projection $\langle X,\psi\rangle$ assigns $X$ to the class $C_\psi(X)$ given by
\[
C_\psi(X)
= 1_{\{\pi_1 f_{\psi,1}(\langle X,\psi\rangle)>\pi_0 f_{\psi,0}(\langle X,\psi\rangle)\}}
= 1_{\{\langle X-\mu_0,\psi\rangle^2-\langle X-\mu_1,\psi\rangle^2>2\langle\psi,R\psi\rangle\log(\pi_0/\pi_1)\}}
= 1_{\{\langle X-\bar\mu,\psi\rangle\langle\mu,\psi\rangle>\langle\psi,R\psi\rangle\log(\pi_0/\pi_1)\}},
\]
where $\bar\mu=(\mu_0+\mu_1)/2$ and $\mu=\mu_1-\mu_0$. The effect of unequal prior class probabilities is a shift of the decision boundary, and the classifier is invariant with respect to multiplication of $\psi$ by a non-zero constant. Since $\langle X-\bar\mu,\psi\rangle=\langle X-\mu_0,\psi\rangle-\langle\mu,\psi\rangle/2=\langle X-\mu_1,\psi\rangle+\langle\mu,\psi\rangle/2$, the misclassification probability for an observation coming from class 0 or 1 with probabilities $\pi_0$, $\pi_1$ is
\[
\pi_0 P_0\{C_\psi(X)=1\}+\pi_1 P_1\{C_\psi(X)=0\}
= \pi_0 P_0\{(\langle X-\mu_0,\psi\rangle-\langle\mu,\psi\rangle/2)\langle\mu,\psi\rangle>\langle\psi,R\psi\rangle\log(\pi_0/\pi_1)\}
+\pi_1 P_1\{(\langle X-\mu_1,\psi\rangle+\langle\mu,\psi\rangle/2)\langle\mu,\psi\rangle<\langle\psi,R\psi\rangle\log(\pi_0/\pi_1)\}
\]
\[
= \pi_0 P_0\Bigl\{\frac{\langle X-\mu_0,\psi\rangle}{\langle\psi,R\psi\rangle^{1/2}}>\frac{\langle\psi,R\psi\rangle^{1/2}}{|\langle\mu,\psi\rangle|}\log(\pi_0/\pi_1)+\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\Bigr\}
+\pi_1 P_1\Bigl\{\frac{\langle X-\mu_1,\psi\rangle}{\langle\psi,R\psi\rangle^{1/2}}<\frac{\langle\psi,R\psi\rangle^{1/2}}{|\langle\mu,\psi\rangle|}\log(\pi_0/\pi_1)-\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\Bigr\}
\]
\[
= \pi_0\Bigl[1-\Phi\Bigl\{\frac{\langle\psi,R\psi\rangle^{1/2}}{|\langle\mu,\psi\rangle|}\log(\pi_0/\pi_1)+\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\Bigr\}\Bigr]
+\pi_1\Phi\Bigl\{\frac{\langle\psi,R\psi\rangle^{1/2}}{|\langle\mu,\psi\rangle|}\log(\pi_0/\pi_1)-\frac{|\langle\mu,\psi\rangle|}{2\langle\psi,R\psi\rangle^{1/2}}\Bigr\}.
\]
Since the function $\pi_0[1-\Phi\{z^{-1}\log(\pi_0/\pi_1)+z/2\}]+\pi_1\Phi\{z^{-1}\log(\pi_0/\pi_1)-z/2\}$ is decreasing in $z>0$, the minimization of the misclassification probability is equivalent to the maximization of $|\langle\mu,\psi\rangle|/\langle\psi,R\psi\rangle^{1/2}$, as in the case of equal prior probabilities discussed in the main body of the paper. If $\|R^{-1/2}\mu\|<\infty$, the upper bound for the above fraction is $\|R^{-1/2}\mu\|$ and the corresponding misclassification probability equals
\[
\pi_0\Bigl[1-\Phi\Bigl\{\frac{\log(\pi_0/\pi_1)}{\|R^{-1/2}\mu\|}+\frac{\|R^{-1/2}\mu\|}{2}\Bigr\}\Bigr]
+\pi_1\Phi\Bigl\{\frac{\log(\pi_0/\pi_1)}{\|R^{-1/2}\mu\|}-\frac{\|R^{-1/2}\mu\|}{2}\Bigr\}.
\]
When $\|R^{-1/2}\mu\|<\infty$, that is, when the Gaussian measures with means $\mu_0$, $\mu_1$ and covariance $R$ are mutually absolutely continuous, this is the optimal misclassification probability among all classifiers, i.e., the Bayes error, as shown in Theorem 2 of Berrendero et al. (2018). The Bayes error is achieved by $\psi=R^{-1}\mu$, if $\|R^{-1}\mu\|<\infty$.
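For curves evaluated on a common grid, the decision rule above can be sketched in a few lines of Python. This is an illustration of ours, not part of the supplement; the trapezoidal discretization of inner products and all argument names are assumptions.

```python
# A minimal sketch of the projection classifier with unequal prior probabilities.
import numpy as np

def inner(f, g, t):
    # L2 inner product approximated by the trapezoidal rule
    return np.trapz(f * g, t)

def classify(x, psi, mu0, mu1, R, t, pi0=0.5, pi1=0.5):
    """Assign x to class 1 iff <x - mubar, psi><mu, psi> > <psi, R psi> log(pi0/pi1)."""
    mu = mu1 - mu0
    mubar = 0.5 * (mu0 + mu1)
    lhs = inner(x - mubar, psi, t) * inner(mu, psi, t)
    Rpsi = np.trapz(R * psi[np.newaxis, :], t, axis=1)   # (R psi)(s) = int R(s,t) psi(t) dt
    rhs = inner(psi, Rpsi, t) * np.log(pi0 / pi1)
    return int(lhs > rhs)
```

With pi0 = pi1 the threshold term vanishes and the rule reduces to the equal-prior classifier of the main paper.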
We can proceed as in the case of equal probabilities and apply regularization techniques to the inverse problem $R\psi=\mu$. All theoretical results presented for the case of equal probabilities can be restated and reproved with the above form of the optimal error rate for the general case, including the situation with $\|R^{-1/2}\mu\|=\infty$, in which case the optimal error rate is zero and the two Gaussian measures are mutually singular. In the empirical version of the problem one either estimates the prior class probabilities by $n_j/(n_0+n_1)$, if the training sample can be seen as a sample from the mixture of populations with these probabilities, or uses some fixed values.

S2. SELECTION OF THE REGULARIZATION PARAMETER AND DOMAIN BY CROSS-VALIDATION

Given the target domain $I$, the regularization method and the regularization parameter, Algorithm S1 describes the estimation of the misclassification probability by cross-validation.

Algorithm S1. Estimation of the misclassification probability by cross-validation
1. Set $V=\{(j,i): j\in\{0,1\},\ i\in\{1,\dots,n_j\},\ O_{ji}\supseteq I\}$.
2. Repeat for $(j,i)\in V$:
   (a) Estimate the mean and covariance function restricted to $I$ using all training functions except $X_{ji}$.
   (b) Estimate the projection direction $\hat\psi$ using the given regularization method and regularization parameter.
   (c) Apply $\hat C_{\hat\psi}$ to the restriction of $X_{ji}$ to $I$ and save the predicted class label to $c_{ji}$.
   (d) Set the misclassification indicator $\delta_{ji}=1_{[c_{ji}\neq j]}$.
3. Output $\sum_{(j,i)\in V}\delta_{ji}/|V|$.

The misclassification probability is estimated for a grid of values of the regularization parameter using Algorithm S1. The value that minimizes the error is selected. When selecting the domain as well, one repeats the above process for each candidate domain in place of $I$. Once the regularization parameter and possibly the domain are selected, the classifier is re-estimated using all training curves and applied to the new curve $X_{\mathrm{new}}$.
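A rough Python sketch of this cross-validation loop is given below. The data representation (value vector, boolean observation mask and class label per curve) and the placeholder fit_classifier are assumptions of ours, standing in for any of the regularized classifiers of the paper; this is not the authors' code.

```python
# Sketch of Algorithm S1 (assumed data layout and placeholder classifier factory).
import numpy as np

def cv_error(curves, domain_mask, fit_classifier, reg_par):
    # curves: list of (values, observed_mask, label); domain_mask marks the target domain I
    V = [k for k, (_, o, _) in enumerate(curves) if np.all(o[domain_mask])]
    errors = []
    for k in V:
        train = [c for j, c in enumerate(curves) if j != k]   # all training curves except X_ji
        clf = fit_classifier(train, domain_mask, reg_par)      # estimates mean, covariance, psi-hat
        x, _, label = curves[k]
        errors.append(int(clf(x[domain_mask]) != label))       # misclassification indicator
    return np.mean(errors) if errors else np.nan
```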
S3. ADDITIONAL SIMULATION RESULTS

S3.1. Processes with non-smooth trajectories

Fig. S1 presents simulation results comparing the behaviour of classifiers on the conjugate gradient, principal component and ridge regularization paths for Gaussian processes with non-smooth trajectories. We considered the Ornstein–Uhlenbeck process with covariance function $\rho(s,t)=\exp(-|s-t|)$. We used the same configurations for the mean difference between the classes as in Subsection 5.1 of the main body of the paper, except in cases (v) and (vi), where the mean difference now was the first and tenth eigenfunction of the Ornstein–Uhlenbeck covariance kernel. The main conclusion of Subsection 5.1 of the paper remains valid in this situation. All three regularization methods reach about the same best error rate, but the conjugate gradient method does so with fewer degrees of freedom than the other methods. The principal component method appears to be less stable than in the case of the smooth process of Subsection 5.1, which can probably be explained by the increased error in the estimation of the eigenfunctions.

[Fig. S1: Misclassification rate (%) versus degrees of freedom for non-smooth processes for different forms of $\mu(t)$, (i) linear, (ii) quadratic, (iii) cubic, (iv) sinusoidal, (v) first eigenfunction, (vi) tenth eigenfunction, (vii) symmetric beta, (viii) asymmetric beta, for principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers.]

S3.2. Behaviour under different covariance operators in groups

The methods presented in the paper are derived under the assumption of equal covariance operators in both groups. Fig. S2 shows simulation results when this assumption is violated. We used Gaussian processes with covariance function $\exp(-|s-t|^2/0.01)$ in one group and $\exp(-|s-t|)$ in the other group. We considered the same scenarios for the mean difference as in Subsection 5.1 of the paper, except for scenarios (v) and (vi), where the mean difference was the first and tenth eigenfunction of the mixture covariance $0.5\exp(-|s-t|^2/0.01)+0.5\exp(-|s-t|)$. We conclude that the findings of Subsection 5.1 are robust with respect to the assumption of equal covariance operators. The principal component classifier again appears to be the least preferable method. Moreover, the error rates in this situation with different covariances lie between the error rates obtained when the two groups both have one of the considered covariance structures. Hence, if there is a difference in the means, unequal covariances do not appear to have a serious negative effect on the performance of the classifiers.

[Fig. S2: Misclassification rate (%) versus degrees of freedom for processes with unequal covariance operators for different forms of $\mu(t)$, (i) linear, (ii) quadratic, (iii) cubic, (iv) sinusoidal, (v) first eigenfunction, (vi) tenth eigenfunction, (vii) symmetric beta, (viii) asymmetric beta, for principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers.]

S3.3. Performance under increasing training sample size

We performed additional simulations to study the effect of the training sample size. Fig. S3 presents results for the same settings as in Subsection 5.1 of the paper but with 100 training observations in each group, twice as many as in the paper. Overall, the misclassification rates in Fig. S3 are slightly lower than in Fig. 1 of the paper due to the reduction of the estimation error. The difference is, however, small, suggesting that at the considered training sample sizes the estimation error is a relatively unimportant part of the total misclassification error.

[Fig. S3: Misclassification rate (%) versus degrees of freedom for 200 training observations for different forms of $\mu(t)$, (i) linear, (ii) quadratic, (iii) cubic, (iv) sinusoidal, (v) first eigenfunction, (vi) tenth eigenfunction, (vii) symmetric beta, (viii) asymmetric beta, for principal component (solid), conjugate gradient (dotted) and ridge (dashed) classifiers.]
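For reference, Gaussian sample paths with the covariance functions used above can be simulated on a grid as in the following sketch; the grid, sample sizes, jitter term and random seed are our assumptions and are not taken from the supplement.

```python
# Assumed simulation sketch for the covariance kernels of Sections S3.1-S3.2.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
S, T = np.meshgrid(t, t, indexing="ij")

cov_ou = np.exp(-np.abs(S - T))                  # Ornstein-Uhlenbeck kernel (rough paths)
cov_smooth = np.exp(-np.abs(S - T) ** 2 / 0.01)  # smooth kernel used in one group

def sample_paths(cov, n):
    # multivariate normal on the grid; small jitter keeps the matrix numerically valid
    return rng.multivariate_normal(np.zeros(len(t)), cov + 1e-8 * np.eye(len(t)), size=n)

X_rough = sample_paths(cov_ou, 50)
X_smooth = sample_paths(cov_smooth, 50)
```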
S4. PERFORMANCE ON BENCHMARK DATA

We applied the proposed methods to two datasets, referred to as the wheat data and the phoneme data, on which Delaigle & Hall (2012) and Berrendero et al. (2018) previously compared functional classifiers. See these papers for references to the original sources of the data. We repeated with our classifiers their procedure, which consisted of randomly splitting the data into a training set and a test set, building the classifier on the training set and applying it to the test set to compute the proportion of misclassified curves, and repeating this whole process two hundred times to estimate the misclassification rate. Table S1 reports the results. We can see that misclassification rates decrease with increasing training sample size. Overall, on these data all classifiers appear to perform similarly, and similarly to other methods studied in Delaigle & Hall (2012) and Berrendero et al. (2018). The ridge method might seem to perform slightly worse than the other two on the wheat data, but in view of the standard errors we do not over-interpret this and other differences.

Table S1. Misclassification rate (%) and its standard error achieved for wheat and phoneme data

            Training sample size    PC            CG            R
Wheat       30                      0.94 (1.89)   0.93 (2.06)   2.48 (2.79)
            50                      0.36 (1.23)   0.58 (1.84)   1.73 (3.02)
Phoneme     30                      24.1 (4.79)   23.3 (3.87)   22.1 (2.90)
            50                      21.7 (2.76)   21.6 (2.12)   21.0 (2.07)
            100                     20.1 (1.67)   20.1 (1.51)   20.1 (1.55)

PC, principal components; CG, conjugate gradients; R, ridge.

REFERENCES

BERRENDERO, J. R., CUEVAS, A. & TORRECILLA, J. L. (2018). On the use of reproducing kernel Hilbert spaces in functional classification. Journal of the American Statistical Association. To appear.
DELAIGLE, A. & HALL, P. (2012). Achieving near perfect classification for functional data. Journal of the Royal Statistical Society. Series B. Statistical Methodology 74, 267–286.

E.
Inferential procedures for partially observed functional data

By David Kraus
Journal of Multivariate Analysis, 173:583–603, 2019
DOI: 10.1016/j.jmva.2019.05.002

Inferential procedures for partially observed functional data

David Kraus
Department of Mathematics and Statistics, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic

Article history: Received 19 September 2018; Received in revised form 14 May 2019; Accepted 15 May 2019; Available online 27 May 2019.
AMS 2010 subject classifications: primary 62M99; secondary 62G10.
Keywords: Bootstrap; Covariance operator; Functional data; K-sample test; Partial observation; Principal components.

Abstract

In functional data analysis it is usually assumed that all functions are completely, densely or sparsely observed on the same domain. Recent applications have brought attention to situations where each functional variable may be observed only on a subset of the domain while no information about the function is available on the complement. Various advanced methods for such partially observed functional data have already been developed but, interestingly, some essential methods, such as K-sample tests of equal means or covariances and confidence intervals for eigenvalues and eigenfunctions, are lacking. Without requiring any complete curves in the data, we derive asymptotic distributions of estimators of the mean function, covariance operator and eigenelements and construct hypothesis tests and confidence intervals. To overcome practical difficulties with storing large objects in computer memory, which arise due to partial observation, we use the nonparametric bootstrap approach. The proposed methods are investigated theoretically, in simulations and on a fragmentary functional data set from medical research.

1. Introduction

Functional data analysis is an established field [17,28,34,54] with well-developed methodologies for common types of observation of random curves, i.e., full (or dense) and sparse observation regimes. Due to new applications, recent years have seen the emergence of a new type of observation of functional data, called functional fragments or partially observed functional data. For various examples see Bugni [6], Delaigle and Hall [14], Liebl [38], Gellar et al. [21], Goldberg et al. [23], Kraus [35], Delaigle and Hall [15], Gromenko et al. [24], Kneip and Liebl [32], Dawson and Müller [13], Mojirsheibani and Shaw [45], Stefanucci et al. [55], Descary and Panaretos [16], Kraus and Stefanucci [37] or Liebl and Rameseder [40].

Functional data are collections of observations of random elements of a function space, such as curves, images, surfaces or spatio-temporal fields. We consider random functions in a separable Hilbert space. Without loss of generality we work with the space $L^2([0,1])$ of square-integrable functions on $[0,1]$ equipped with the inner product $\langle f,g\rangle=\int_0^1 f(t)g(t)\,\mathrm{d}t$ and norm $\|f\|=\langle f,f\rangle^{1/2}$, but our results are applicable to more general spaces. Partially observed functional data consist of realizations of random functions that are not observed on the entire domain. Each function in the sample may be observed on a different subset of the domain and no information is available on the function values at arguments in the complement of this subset.
For the $i$th functional variable $X_i\in L^2([0,1])$ there is a subset $O_i\subseteq[0,1]$ such that $X_i(t)$ is observed for $t\in O_i$ and not observed for $t\in[0,1]\setminus O_i$. The observation sets may be random, corresponding to data that are missing by happenstance, or non-random for designed experiments. We assume that the observation sets are mutually independent and independent of the curves. We refer to Liebl and Rameseder [40] for a study of the case of dependent missingness.

Although some advanced procedures, such as goodness-of-fit tests, regression, classification and reconstruction methods, have been developed for functional fragments, basic methods of inference about the fundamental characteristics of functional variables are still missing. In particular, the asymptotic distribution of estimators of the mean function and covariance operator, K-sample tests of equal means or covariances, and confidence intervals for eigenvalues and eigenfunctions have not been studied yet in the setting of incomplete functions. Users who wish to perform these basic tasks currently have only one option: to omit the partially observed functions and apply existing procedures to the complete data only. This approach is not only clearly sub-optimal due to a possibly large loss of information and the resulting decay of power and accuracy, but also hardly or totally inapplicable in situations where the data contain few or no complete curves. In this paper, we address this deficiency of existing methodology and develop essential methods of inference about the mean and covariance structure of incomplete functional data.

Random functions are characterized by the mean function $\mu=\mathrm{E}\,X$ and the covariance operator $R: L^2([0,1])\to L^2([0,1])$ defined as $(Rf)(\cdot)=\int_0^1\rho(\cdot,t)f(t)\,\mathrm{d}t$, $f\in L^2([0,1])$, where $\rho(s,t)=\mathrm{cov}\{X(s),X(t)\}$ is the covariance function, assuming it exists. The covariance structure is best understood via principal component analysis, or the eigendecomposition of $R$ in the form
\[
R=\sum_{m=1}^{\infty}\lambda_m\,\phi_m\otimes\phi_m,
\]
where $\lambda_1\ge\lambda_2\ge\cdots\ge 0$ are the eigenvalues, $\phi_m$ are the corresponding orthonormal eigenfunctions, and $(a\otimes b)f=\langle b,f\rangle a$ for $a,b,f\in L^2([0,1])$. For a theoretical background see, e.g., Bosq [5].

We find appropriate assumptions on the observation pattern that enable us to establish the asymptotic distribution of estimators of $\mu$ and $R$. We develop tests for comparing the mean functions in K populations of functional data based on samples of fragments. Next, we propose several tests of equal covariance operators in K samples. We also construct confidence intervals for the eigenvalues and eigenfunctions estimated from incomplete data.

The practical implementation of methods for functional fragments is more complicated than for complete curves. The main difficulty is that temporal averaging (e.g., in inner products for dimension reduction) is impossible due to missing values. This leads to asymptotic distributions whose parameters follow rather complicated formulas. More importantly, since dimension reduction is not possible, the asymptotic distributions are, upon discretization, characterized by large objects (matrices or arrays) that are difficult or even impossible to store and manipulate in computer memory. The bootstrap turns out to be a solution to this problem.
We provide specific algorithms for resampling functional fragments for mean and covariance testing and for confidence intervals for eigenelements. In a simulation study we investigate the performance of the proposed tests, focusing in particular on the impact of missingness on the different tests and on the effect of the interplay between missingness and the form of differences between groups. The study shows that the proposed methods are superior to the currently only available approach based on omitting incomplete curves.

The proposed methodology is applied to a data set of temporal profiles of heart rate. The data consist of several hundred curves recorded by an automatic device during several hours in the evening during the transition from the day to the night regime of heart activity. The profiles are not always available on the entire domain of interest because the device either did not measure or did not record measurements, or the person switched off the device. These fragmentary data were previously analysed in Kraus [35], where further details can be found.

Section 2 develops methods of inference about means in one and K samples. Section 3 deals with tests about covariance operators and with inference about principal components. Section 4 presents bootstrap approximations. Results of the simulation study and the data example are reported in Sections 5 and 6. In the Appendix we provide a central limit theorem for non-identically distributed functional variables needed in the asymptotic analysis of fragments, and proofs of all theorems. Additional simulation results and further results of the data analysis are also provided.

2. Mean inference from incomplete curves

2.1. Estimation of the mean function

In this section we focus on inference about the mean of functional data. Let us first consider estimation of the mean function $\mu$ of a homogeneous population. Let there be $n$ independent functional observations. Each curve $X_i$, $i\in\{1,\dots,n\}$, may be observed incompletely, with values known only for arguments in a subset $O_i\subseteq[0,1]$, with no information on the complement of $O_i$. The observation sets may be non-random or random. They are assumed to be mutually independent and independent of the curves and to consist of a finite union of intervals. We denote by $O_i(t)$ the indicator that the value of $X_i(t)$ is observed. The mean function $\mu(t)$ can be estimated by the cross-sectional average of the available observations
\[
\hat\mu(t)=\frac{J(t)}{N(t)}\sum_{i=1}^n O_i(t)X_i(t),
\]
where $N(t)=\sum_{i=1}^n O_i(t)$ is the number of available observations at time $t$ and $J(t)=1_{[N(t)>0]}$. The estimator is defined to be zero when $N(t)=0$. In Kraus [35, Proposition 1] it was shown that under non-restrictive assumptions on the observation pattern the estimator $\hat\mu$ is consistent for the mean function $\mu$; namely, it was proven that $\mathrm{E}\|\hat\mu-\mu\|^2=O(n^{-1})$ as $n\to\infty$. We now aim to provide the asymptotic distribution of the estimator. The result will be essential in the derivation of the limiting distribution of the test statistics that we construct afterwards.

We denote $\pi_i(t)=\mathrm{E}\,O_i(t)=\Pr\{O_i(t)=1\}$ and $\bar\pi(t)=n^{-1}\sum_{i=1}^n\pi_i(t)$. Furthermore, we denote by $U_i(s,t)=O_i(s)O_i(t)$ the indicator of observing the function values at the pair of arguments $s$ and $t$, and define $\nu_i(s,t)=\mathrm{E}\,U_i(s,t)$, $\bar\nu(s,t)=n^{-1}\sum_{i=1}^n\nu_i(s,t)$ and $M(s,t)=\sum_{i=1}^n U_i(s,t)$.
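For illustration, a minimal Python sketch (our assumption of how fragments might be stored, not the paper's implementation) of the estimators $\hat\mu(t)$, $N(t)$ and $\hat\pi(t)$ for curves discretized on a grid, with np.nan marking unobserved values:

```python
# Assumed NaN-masked storage: X is an (n x q) array, np.nan outside each O_i.
import numpy as np

def mean_fragments(X):
    O = ~np.isnan(X)                    # O_i(t): observation indicators
    N = O.sum(axis=0)                   # N(t): number of curves observed at t
    J = N > 0
    mu_hat = np.zeros(X.shape[1])
    mu_hat[J] = np.nansum(X[:, J], axis=0) / N[J]   # hat mu(t); zero where N(t) = 0
    pi_hat = N / X.shape[0]             # hat pi(t) = N(t)/n
    return mu_hat, N, pi_hat
```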
We need to introduce conditions on the observation pattern as follows.

Condition 1.
(a) Let there be a function $\pi(t)$ such that $\pi_0=\inf_{t\in[0,1]}\pi(t)>0$ and $\sup_{t\in[0,1]}|\bar\pi(t)-\pi(t)|\to 0$ as $n\to\infty$.
(b) Let there be a function $\nu(s,t)$ such that $\bar\nu(s,t)\to\nu(s,t)$ for all $s,t\in[0,1]$.
(c) Let there be a value $\nu_0>0$ such that for each $(s,t)\in[0,1]^2$ either $\nu(s,t)\ge\nu_0$ or $\nu(s,t)=0$, and let $\sup_{(s,t)\in[0,1]^2}|\bar\nu(s,t)-\nu(s,t)|\to 0$ as $n\to\infty$.

Condition (a) guarantees the consistency of the estimator $\hat\mu$, see Kraus [35]. Condition (b) is needed for the weak convergence of the estimator. Condition (c) is needed for consistent estimation of the covariance operator of the limiting distribution. We emphasize that no complete curves are required, since these conditions may be satisfied even when the sample contains only fragments. We illustrate this attractive property in the simulation study in Section 5. When the observation indicators $O_1,\dots,O_n$ are identically distributed, Condition (a) is satisfied if $\pi(t)=\Pr\{O_i(t)=1\}$ is bounded away from zero, Condition (b) is satisfied automatically, and Condition (c) is satisfied if for each $(s,t)\in[0,1]^2$, $\nu(s,t)=\Pr\{O_i(s)=1,O_i(t)=1\}$ is either bounded away from zero or equal to zero. The case of non-identically distributed observation indicators may be relevant, for example, for designed experiments in which non-random, designed observation sets may vary across subjects. By $\|\cdot\|_2$ below we denote the Hilbert–Schmidt norm of an operator.

Theorem 1. Assume that $\mathrm{E}(\|X_1\|^2)<\infty$. Let Conditions 1(a) and 1(b) hold. Then $n^{1/2}\{\hat\mu(\cdot)-\mu(\cdot)\}$ and $N(\cdot)^{1/2}\{\hat\mu(\cdot)-\mu(\cdot)\}$ are asymptotically distributed as mean zero Gaussian processes with covariance operators $K'$, $K$ with kernels
\[
\kappa'(s,t)=\pi(s)^{-1}\pi(t)^{-1}\nu(s,t)\rho(s,t),
\qquad
\kappa(s,t)=\pi(s)^{-1/2}\pi(t)^{-1/2}\nu(s,t)\rho(s,t),
\]
respectively. If, moreover, Condition 1(c) is satisfied, then $K'$ and $K$ can be consistently estimated by the operators $\hat K'$ and $\hat K$ with kernels $\hat\kappa'(s,t)=\hat\pi(s)^{-1}\hat\pi(t)^{-1}\hat\nu(s,t)\hat\rho(s,t)$ and $\hat\kappa(s,t)=\hat\pi(s)^{-1/2}\hat\pi(t)^{-1/2}\hat\nu(s,t)\hat\rho(s,t)$, respectively, i.e., $\mathrm{E}\|\hat K'-K'\|_2^2\to 0$ and $\mathrm{E}\|\hat K-K\|_2^2\to 0$, where $\hat\pi(t)=N(t)/n$, $\hat\nu(s,t)=M(s,t)/n$, $\hat\rho(s,t)$ is the empirical covariance based on all complete pairs of function values at $s,t$, and the value of the kernels is set to 0 whenever $\hat\pi(s)$ or $\hat\pi(t)$ is 0.

The proof of this and the other theorems is provided in the Appendix. Since the observable functional variables may be non-identically distributed due to possibly non-identically distributed observation indicators, the proof uses a central limit theorem for non-identically distributed functional random variables given in the Appendix.

Notice that the covariance kernels $\kappa'(s,t)$ and $\kappa(s,t)$ of the limiting distributions are zero when $\nu(s,t)=0$, regardless of the value of $\rho(s,t)$. Therefore, it is not necessary to estimate $\rho(s,t)$ at such points. This is why Condition 1(c) does not require the function $\nu(s,t)$ to be bounded away from zero on the entire domain $[0,1]^2$, which is needed for the estimation of $R$, as will be seen in Section 3, Condition 2(a). This means that the theorem applies also in the context of short fragments of curves considered, e.g., by Delaigle and Hall [15] or Descary and Panaretos [16], where each curve in the sample is observed on a short interval and no completely observed curves are available.

2.2. Tests of equality of means in several populations

Let us now consider K independent samples of functional data. Let the jth sample ($j\in\{1,\dots,K\}$) consist of independent curves
$X_{j1},\dots,X_{jn_j}$ coming from the same distribution with mean $\mu_j$ and covariance operator $R_j$. The functions may not be observed completely. It is assumed that for each function $X_{ji}$ its values are available on a subset $O_{ji}$. Let the observation subsets be mutually independent and independent of the curves. Our aim is to test the null hypothesis that $\mu_1=\cdots=\mu_K$ against the general alternative that the null does not hold.

The literature on hypothesis testing for means of functional data is rich. See, for example, [2,3,8,9,18,28,39,43,49,52,53,56,57,59]. In the literature on complete functional samples there exist two main approaches to comparing mean functions. One is based on the $L^2$ distance between the means and one uses projections on finite-dimensional subspaces.

The assessment of the hypothesis will be based on the contrasts of the group means and a null estimate of the common mean, i.e., on the differences $\hat\mu_j-\hat\mu$, $j\in\{1,\dots,K\}$. Here we use $\hat\mu_j(t)=J_j(t)N_j(t)^{-1}\sum_{i=1}^{n_j}O_{ji}(t)X_{ji}(t)$, $j\in\{1,\dots,K\}$, with $N_j(t)=\sum_{i=1}^{n_j}O_{ji}(t)$ and $J_j(t)=1_{[N_j(t)>0]}$. The estimator $\hat\mu$ is obtained as a weighted average of the group means in the form $\hat\mu(t)=\sum_{j=1}^K\hat w_j(t)\hat\mu_j(t)$ with weights
\[
\hat w_j(t)=\frac{N_j(t)/\hat r_j^2}{\sum_{k=1}^K N_k(t)/\hat r_k^2},
\]
where $\hat r_j^2=\mathrm{tr}\,\hat R_j$ is the trace of the estimated covariance operator in the jth sample (the estimators $\hat R_j$ are discussed later). The role of the scaling by $\hat r_j^2$ is to account for possibly different covariance structures in the samples. This way of combining estimated means of heteroscedastic samples is inspired by the univariate case and its standard multivariate extensions. If the covariance structures are known to be the same in all samples, the factors $\hat r_j^2$ can be replaced by the trace of an estimator of the common covariance operator, which leads to the estimated mean based on the pooled sample of curves.

The first test we propose is inspired by the method of Cuevas et al. [9], who in the context of fully observed functional data developed an ANOVA test based on the $L^2$ norms of the contrasts of the group means and the pooled sample mean. A two-sample version of the test using the nonparametric bootstrap was proposed by Benko et al. [3]. Horváth et al. [29] studied a two-sample test based on the $L^2$ norm in the context of functional time series. The standardized contrast processes $N_j(\cdot)^{1/2}\{\hat\mu_j(\cdot)-\hat\mu(\cdot)\}/\hat r_j$, $j\in\{1,\dots,K\}$, can be collected into a K-dimensional vector that is a random element of the product space $\{L^2([0,1])\}^K$ with inner product $\langle f,g\rangle=\sum_{j=1}^K\langle f_j,g_j\rangle$ for $f=(f_1,\dots,f_K)^\top$, $g=(g_1,\dots,g_K)^\top$. We use its $L^2$ norm as the test statistic, i.e., we base the test on
\[
T_{L^2}=\sum_{j=1}^K\bigl\|N_j(\cdot)^{1/2}\{\hat\mu_j(\cdot)-\hat\mu(\cdot)\}/\hat r_j\bigr\|^2
=\sum_{j=1}^K\int_0^1 N_j(t)\{\hat\mu_j(t)-\hat\mu(t)\}^2/\hat r_j^2\,\mathrm{d}t \tag{1}
\]
and reject when the value of the statistic is significantly large.
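For curves discretized on a common grid, the weighted pooled mean and the statistic (1) can be sketched as follows. This is an assumed discretization of ours (trapezoidal integration and NaN coding of unobserved values), not the paper's implementation, and the traces $\hat r_j^2$ are supposed to be computed elsewhere.

```python
# Sketch of the L2-norm statistic (1) for K groups of fragments on a grid t.
import numpy as np

def t_l2(groups, r2, t):
    # groups: list of (n_j x q) arrays with np.nan outside the observation sets
    N = [(~np.isnan(X)).sum(axis=0) for X in groups]
    mu = [np.where(Nj > 0, np.nansum(X, axis=0) / np.maximum(Nj, 1), 0.0)
          for X, Nj in zip(groups, N)]
    w = np.array([Nj / r2j for Nj, r2j in zip(N, r2)], dtype=float)
    w /= np.maximum(w.sum(axis=0), 1e-12)             # hat w_j(t)
    mu_pool = sum(wj * muj for wj, muj in zip(w, mu))  # null estimate of the common mean
    return sum(np.trapz(Nj * (muj - mu_pool) ** 2 / r2j, t)
               for Nj, muj, r2j in zip(N, mu, r2))
```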
Another main approach to curve mean testing uses dimension reduction. See, e.g., Aue et al. [2], Horváth and Kokoszka [28] or Horváth et al. [29]. The idea is to focus on a finite number of important features of the infinite-dimensional data. The functional observations are projected on a finite-dimensional subspace and multivariate ANOVA or a similar multivariate procedure is applied to the resulting vectors of Fourier scores. This strategy is not directly applicable in the situation of incompletely observed curves because, unlike in the fully observed case, Fourier scores of functional fragments cannot be computed by numerical integration as inner products of the functional variable and the basis function, since the functional variable is not available on the entire domain.

Let $\hat\psi_1,\dots,\hat\psi_d$ be some linearly independent functions in $L^2([0,1])$. Without loss of generality we assume that they are orthonormal. These functions may be either deterministic or random (estimated from the data). In the construction of our projection tests we use Fourier scores of the standardized contrast processes with respect to the basis functions $\hat\psi_l$. We denote these scores
\[
Q_{jl}=\bigl\langle N_j(\cdot)\{\hat\mu_j(\cdot)-\hat\mu(\cdot)\},\hat\psi_l\bigr\rangle/(\hat r_j n_j^{1/2}),
\qquad j\in\{1,\dots,K\},\ l\in\{1,\dots,d\},
\]
and collect them in the score vector $Q=(Q_{11},\dots,Q_{1d},\dots,Q_{K1},\dots,Q_{Kd})^\top$. The score statistic is the quadratic form
\[
T_d=Q^\top\hat V^- Q, \tag{2}
\]
where $\hat V^-$ is the Moore–Penrose pseudoinverse of the estimated $(Kd)\times(Kd)$ covariance matrix of $Q$ whose entry at the position with index $(jl,km)$ is
\[
\hat V_{jl,km}=\langle\hat\pi_j^{1/2}\hat\psi_l,\hat V_{jk}(\hat\pi_k^{1/2}\hat\psi_m)\rangle
=\int_{[0,1]^2}\hat\pi_j(s)^{1/2}\hat\psi_l(s)\hat v_{jk}(s,t)\hat\psi_m(t)\hat\pi_k(t)^{1/2}\,\mathrm{d}s\,\mathrm{d}t
\]
for $j,k\in\{1,\dots,K\}$, $l,m\in\{1,\dots,d\}$. Here $\hat V_{jk}$ is the covariance operator with kernel
\[
\hat v_{jk}(s,t)=\sum_{l=1}^K\hat r_j^{-1}\{\delta_{jl}-N_j(s)^{1/2}\hat w_l(s)N_l(s)^{-1/2}\}\hat\kappa_l(s,t)\{\delta_{kl}-N_k(t)^{1/2}\hat w_l(t)N_l(t)^{-1/2}\}\hat r_k^{-1}, \tag{3}
\]
where $\delta_{jk}$ is the Kronecker delta. The test rejects for large values of $T_d$.

Analogously to the case of one group considered in Section 2.1, we denote for $j\in\{1,\dots,K\}$, $i\in\{1,\dots,n_j\}$ the following quantities characterizing the observation patterns in each group: $\pi_{ji}(t)=\mathrm{E}\,O_{ji}(t)=\Pr\{O_{ji}(t)=1\}$, $\bar\pi_j(t)=n_j^{-1}\sum_{i=1}^{n_j}\pi_{ji}(t)$, $U_{ji}(s,t)=O_{ji}(s)O_{ji}(t)$, $\nu_{ji}(s,t)=\mathrm{E}\,U_{ji}(s,t)$, $\bar\nu_j(s,t)=n_j^{-1}\sum_{i=1}^{n_j}\nu_{ji}(s,t)$ and $M_j(s,t)=\sum_{i=1}^{n_j}U_{ji}(s,t)$. Under mild assumptions we obtain the asymptotic distribution of both test statistics.

Theorem 2. For $j\in\{1,\dots,K\}$ assume that $n_j\to\infty$, $n_j/(n_1+\cdots+n_K)\to a_j>0$ and $\mathrm{E}\|X_{j1}\|^2<\infty$. Let the observation patterns in each group satisfy Condition 1. Then under the null hypothesis of equal means we obtain the following results:
(i) The test statistic $T_{L^2}$ is asymptotically distributed as $\sum_{k=1}^\infty\gamma_k C_k$, where $C_k$ are independent chi-square distributed variables with one degree of freedom and $\gamma_k$ can be consistently estimated by the eigenvalues of the operator $\hat V$ given in (3).
(ii) Assume that there exist linearly independent non-random functions $\psi_1,\dots,\psi_d$ such that $\|\hat\psi_l-\psi_l\|\to 0$ in probability for $l\in\{1,\dots,d\}$. Then the test statistic $T_d$ is asymptotically chi-square distributed with $(K-1)d$ degrees of freedom.

The test statistic based on the $L^2$ norm is not distribution-free, but the critical values can be obtained straightforwardly by simulation, provided that the eigenvalues of $\hat V$ consistently estimate $\gamma_k$. Similarly, the consistency of the operator $\hat V$ (and hence of the matrix $\hat V$) is needed for the score statistic. The consistency of $\hat V$ is guaranteed by Condition 1(c). It may sometimes happen that $M_j(s,t)$ is low for some $s,t$, making the estimator $\hat V$ less reliable. For this reason, and also for computational reasons, to avoid the estimation of the limiting covariance one can use the bootstrap method, as we describe in Section 4.
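The projection scores $Q_{jl}$ defined earlier in this subsection can be sketched as follows for a user-supplied orthonormal basis evaluated on the grid. This is our illustration, not the paper's code; the inputs mu, mu_pool and N come from the previous sketch, n holds the group sizes $n_j$ and r2 the estimated traces of the group covariance operators.

```python
# Sketch of the scores Q_jl = <N_j(.) {mu_j(.) - mu(.)}, psi_l> / (r_j n_j^{1/2}).
import numpy as np

def projection_scores(mu, mu_pool, N, n, r2, psi, t):
    # psi: (d x q) array of orthonormal basis functions on the grid t
    Q = []
    for muj, Nj, nj, r2j in zip(mu, N, n, r2):
        contrast = Nj * (muj - mu_pool)
        for psi_l in psi:
            Q.append(np.trapz(contrast * psi_l, t) / (np.sqrt(r2j) * np.sqrt(nj)))
    return np.array(Q)   # ordered as (Q_11, ..., Q_1d, ..., Q_K1, ..., Q_Kd)
```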
In the literature on complete functional data, the most common choice of the basis functions for the projection test is derived from principal component analysis (see Horváth and Kokoszka [28] and references therein, or Fremdt et al. [19]). The approach uses several leading eigenfunctions of the pooled sample covariance operator. The motivation for this choice is the property that the first eigenfunctions capture the principal modes of variation, the most important features of random deviations of the functional variables from the mean. Another approach is to use a fixed set of basis functions, such as several elements of the Fourier basis of sines and cosines or several orthonormal Legendre polynomials.

For several reasons we prefer deterministic bases to the basis of eigenfunctions. One drawback of the latter approach is that the principal components of variability may be only weakly related or entirely unrelated (orthogonal) to the differences between the mean functions, resulting in a test that is weak or inconsistent against this alternative. It may of course happen that the deterministic functions we choose are orthogonal to the alternative too, or that the leading eigenfunctions capture the mean differences well. However, with fixed functions it is at least possible to say before the analysis which alternatives can be detected. With principal components it is not known beforehand which departures from the null can be captured, because the eigenfunctions are usually unknown. Moreover, their property of capturing the largest portion of variability, which is typically the main argument for using them, is not exactly what one wishes in mean testing. In fact, one would rather wish to maximize the signal-to-noise ratio or non-centrality, which, for example, in the case of components with equal magnitude of means would mean minimizing variability. In reality, the true interplay between the magnitude of the components of the mean difference and their variability is not known, and we therefore prefer fixed functions.

The choice of the number of basis functions is important with projection methods. For the approach using eigenfunctions, we follow the recommendation of Horváth et al. [29] to use the smallest number of components needed to explain at least 85% of the total variability. For the method using fixed functions, in light of the above discussion of the relation between power and variability, we do not base the choice of d on the explained variability. Instead, we can specify what shape differences we wish to detect and use the corresponding basis functions. For example, using just d = 3 Legendre polynomials, describing constant, monotonic as well as convex or concave non-monotonic differences, seems to be a good choice in many applications.

3. Covariance inference under partial observation

3.1. Asymptotics for the estimated covariance operator and principal components

Given a collection of independent realizations of curves $X_1,\dots,X_n$ with mean function $\mu$ and covariance operator $R$ observed on subsets $O_1,\dots,O_n$, the covariance function $\rho(s,t)$ can be estimated by the empirical covariance using pairwise complete observations, that is, by
\[
\hat\rho(s,t)=\frac{I(s,t)}{M(s,t)}\sum_{i=1}^n U_i(s,t)\{X_i(s)-\hat\mu_{st}(s)\}\{X_i(t)-\hat\mu_{st}(t)\},
\]
where $I(s,t)=1_{[M(s,t)>0]}$ and
\[
\hat\mu_{st}(s)=\frac{1_{[M(s,t)>0]}}{M(s,t)}\sum_{i=1}^n U_i(s,t)X_i(s).
\]
If $M(s,t)=0$, we define $\hat\rho(s,t)=0$ and $\hat\mu_{st}(s)=0$.
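A minimal sketch (again under the assumed NaN-masked storage of fragments, not the paper's implementation) of the pairwise-complete estimators $\hat\rho(s,t)$, $\hat\mu_{st}$ and $M(s,t)$:

```python
# Pairwise-complete covariance estimator for an (n x q) array with np.nan
# outside the observation sets.
import numpy as np

def covariance_fragments(X):
    O = (~np.isnan(X)).astype(float)
    Z = np.where(np.isnan(X), 0.0, X)
    M = O.T @ O                                   # M(s, t): number of complete pairs
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_st_s = (Z.T @ O) / M                 # hat mu_st(s): mean of X(s) over complete pairs
        mean_st_t = (O.T @ Z) / M                 # hat mu_st(t)
        rho = (Z.T @ Z) / M - mean_st_s * mean_st_t
    rho[M == 0] = 0.0                             # convention when no pair is observed
    return rho, M
```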
Under certain assumptions on the observation pattern, the operator $\hat R$ with kernel $\hat\rho(s,t)$ was shown to be a consistent estimator of $R$ in Kraus [35, Proposition 1]. In the theorem below we give the asymptotic distribution under a set of conditions, for which we denote $E_i(s,t,u,v)=O_i(s)O_i(t)O_i(u)O_i(v)$, the indicator that the observation of $X_i$ at the points $s,t,u,v$ is available, and set $\theta_i(s,t,u,v)=\Pr\{E_i(s,t,u,v)=1\}$, $\bar\theta(s,t,u,v)=n^{-1}\sum_{i=1}^n\theta_i(s,t,u,v)$ and $L(s,t,u,v)=\sum_{i=1}^n E_i(s,t,u,v)$.

Condition 2.
(a) Let there be a function $\nu(s,t)$ such that $\nu_0=\inf_{(s,t)\in[0,1]^2}\nu(s,t)>0$ and $\sup_{(s,t)\in[0,1]^2}|\bar\nu(s,t)-\nu(s,t)|\to 0$ as $n\to\infty$.
(b) Let there be a function $\theta(s,t,u,v)$ such that $\bar\theta(s,t,u,v)\to\theta(s,t,u,v)$ for all $s,t,u,v\in[0,1]$.
(c) Let there be a value $\theta_0>0$ such that for each $(s,t,u,v)\in[0,1]^4$ either $\theta(s,t,u,v)\ge\theta_0$ or $\theta(s,t,u,v)=0$, and let $\sup_{(s,t,u,v)\in[0,1]^4}|\bar\theta(s,t,u,v)-\theta(s,t,u,v)|\to 0$ as $n\to\infty$.

Condition (a) means that there are enough observations at all pairs of arguments. The condition is needed for the consistency of $\hat R$; see Kraus [35] for a proof under an essentially equivalent condition. Condition (b) guarantees the weak convergence in the theorem below, and the additional Condition (c) guarantees that the covariance of the asymptotic distribution can be estimated. We stress that these conditions do not require that the data contain any complete curves. They may be satisfied even in situations where all functional observations are fragmentary. When the observation indicators $O_1,\dots,O_n$ are identically distributed, Condition (a) is satisfied if $\nu(s,t)=\Pr\{O_i(s)=1,O_i(t)=1\}$ is bounded away from zero, Condition (b) is satisfied automatically and Condition (c) is satisfied if for each $(s,t,u,v)\in[0,1]^4$, $\theta(s,t,u,v)=\Pr\{O_i(s)=1,O_i(t)=1,O_i(u)=1,O_i(v)=1\}$ is either bounded away from zero or equal to zero.

Theorem 3. Assume that $\mathrm{E}(\|X_1\|^4)<\infty$. Let Conditions 2(a) and 2(b) hold. Then $n^{1/2}(\hat R-R)$ and the operator with kernel $M(\cdot,\cdot)^{1/2}\{\hat\rho(\cdot,\cdot)-\rho(\cdot,\cdot)\}$ are asymptotically distributed as mean zero Gaussian operators whose covariance operators $H'$, $H$ have kernels
\[
\eta'(s,t,u,v)=\nu(s,t)^{-1}\nu(u,v)^{-1}\theta(s,t,u,v)\{\zeta(s,t,u,v)-\rho(s,t)\rho(u,v)\},
\]
\[
\eta(s,t,u,v)=\nu(s,t)^{-1/2}\nu(u,v)^{-1/2}\theta(s,t,u,v)\{\zeta(s,t,u,v)-\rho(s,t)\rho(u,v)\},
\]
respectively, where $\zeta(s,t,u,v)=\mathrm{E}[\{X(s)-\mu(s)\}\{X(t)-\mu(t)\}\{X(u)-\mu(u)\}\{X(v)-\mu(v)\}]$. If, in addition, Condition 2(c) is satisfied, then $H'$ and $H$ can be consistently estimated by the operators $\hat H'$ and $\hat H$ with kernels $\hat\eta'(s,t,u,v)=\hat\nu(s,t)^{-1}\hat\nu(u,v)^{-1}\hat\theta(s,t,u,v)\{\hat\zeta(s,t,u,v)-\hat\rho(s,t)\hat\rho(u,v)\}$ and $\hat\eta(s,t,u,v)=\hat\nu(s,t)^{-1/2}\hat\nu(u,v)^{-1/2}\hat\theta(s,t,u,v)\{\hat\zeta(s,t,u,v)-\hat\rho(s,t)\hat\rho(u,v)\}$, respectively, i.e., $\mathrm{E}\|\hat H'-H'\|_2^2\to 0$ and $\mathrm{E}\|\hat H-H\|_2^2\to 0$, where $\hat\eta'(s,t,u,v)$ and $\hat\eta(s,t,u,v)$ are set to 0 whenever $\hat\nu(s,t)$ or $\hat\nu(u,v)$ is 0, $\hat\theta(s,t,u,v)=L(s,t,u,v)/n$ and $\hat\zeta(s,t,u,v)$ is the empirical fourth central moment of the functional random variable computed using all complete quadruples of function values at the arguments $s,t,u,v$.

The weak convergence in the theorem above is on the separable Hilbert space of Hilbert–Schmidt operators equipped with the Hilbert–Schmidt norm $\|\cdot\|_2$.
The limiting covariance operator H is an operator that maps a Hilbert–Schmidt operator F with kernel f (u, v) to an operator with kernel ∫ 1 0 ∫ 1 0 η(s, t, u, v)f (u, v)dudv, similarly for other objects in the theorem. Next, we study the estimators ˆλm and ˆϕm of the eigenvalues and eigenfunctions of R. The estimators are obtained by the eigendecomposition of ˆR. Their root-n consistency was established by Kraus [35, Proposition 2]. Here we find the approximate distribution of the fluctuation of the estimators around their true counterparts (with appropriate sign for the eigenfunctions as usual). Theorem 4. Assume that E(∥X1∥4 ) < ∞ and R has eigenvalues with multiplicity 1. Let Conditions 2(a) and 2(b) hold. Denote by H ′∞ a random operator following the limiting Gaussian distribution of n1/2 ( ˆR − R) with mean zero and covariance H′ given in Theorem 3. Then, for n → ∞, we obtain the following results: (i) n1/2 (ˆλm − λm) is asymptotically distributed as ⟨H ′∞ ϕm, ϕm⟩, which is a normal variable with mean zero and variance ∫ [0,1]4 ϕm(s)ϕm(t)η′ (s, t, u, v)ϕm(u)ϕm(v)dsdtdudv. (ii) n1/2 ( ˆϕm − ˆsmϕm), where ˆsm = sign⟨ˆϕm, ϕm⟩, is asymptotically distributed as the Gaussian random function QmH ′∞ ϕm, where Qm = ∞∑ k=1 k̸=m ϕk ⊗ ϕk λm − λk . The limiting covariance operator of n1/2 ( ˆϕm − ˆsmϕm) is ∞∑ k=1 k̸=m ∞∑ l=1 l̸=m ϕk ⊗ ϕl (λm − λk)(λm − λl) ∫ [0,1]4 ϕk(s)ϕm(t)η′ (s, t, u, v)ϕm(u)ϕl(v)dsdtdudv. If, additionally, Definition 2(c) is satisfied, then the limiting variance and covariance above can be consistently estimated by plugging-in estimates from Theorem 3. The theorem is proved in the Appendix with the help of perturbation theory. The theorem generalizes the classic results of Dauxois et al. [11] who considered completely observed functions. See Kokoszka and Reimherr [33] for related results for functional time series. In the case of complete Gaussian curves Dauxois et al. [11] showed that the limiting covariance structure of the empirical covariance operator simplifies [see also 46] which eventually leads to a simpler form of the limiting variance of the empirical eigenvalue, namely to 2λ2 m. No such simplification is in general possible in the case of incomplete curves, even if they are Gaussian. Therefore, to make inference about eigenvalues or eigenfunctions, e.g., to construct confidence intervals, one possibility is to estimate the function η′ (s, t, u, v) and use the complicated expressions above for the limiting covariance structure. In Section 4 we provide an alternative approach based on the bootstrap which enables to avoid the possibly unstable estimation of η′ and computer memory demanding storage and manipulation with the estimate. D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 589 3.2. Testing the equality of covariance operators We now study tests for equality of covariance operators of several populations. Let there be K independent samples of partially observed functions with mean µj and covariance Rj in the jth sample, as described in Section 2.2. We aim to test the null hypothesis that R1 = · · · = RK against the general alternative. The general problem of hypothesis testing for covariance operators was previously studied in various contexts by various methods. See, e.g., [3,4,7,20,25,26,30,31, 36,44,46,49–51,57,58]. 
Tests of the null hypothesis of equal covariance operators can be based on the differences between the estimators ˆRj and the null estimator ˆR which is the pooled covariance operator with kernel ˆρ(s, t) = K ∑ j=1 ˆwj(s, t) ˆρj(s, t), where ˆwj(s, t) = Mj(s, t) ∑K k=1 Mk(s, t) . The differences are expressed by the contrast operators with kernels Mj(·, ·)1/2 {ˆρj(·, ·) − ˆρ(·, ·)}. We propose two types of tests measuring the importance of the contrasts: one approach is based on the Hilbert–Schmidt norm of the contrasts and one is based on their projections on a subspace. The first approach is inspired by methods that were previously considered in the case of fully observed functions, e.g., by Boente et al. [4]. The importance of the contrasts is expressed by the Hilbert–Schmidt norm. The test statistic takes the form SHS = K ∑ j=1 ∥Mj(·, ·)1/2 {ˆρj(·, ·) − ˆρ(·, ·)}∥2 2 = K ∑ j=1 ∫ [0,1]2 Mj(s, t){ˆρj(s, t) − ˆρ(s, t)}2 dsdt (4) (in this notation we identify kernels and the corresponding operators). The second approach uses projections of the contrasts onto a finite-dimensional subspace of the space of Hilbert–Schmidt operators. This type of tests was used for complete functions in various settings, e.g., by Horváth et al. [27], Panaretos et al. [46], Panaretos et al. [47], Kraus and Panaretos [36], Fremdt et al. [20], and Jarušková [30]. It is natural to project on the subspace generated by the leading eigenfunctions of ˆR because they carry information about the object of interest, the covariance operator (unlike in the case of mean functions where we prefer to use a fixed basis for the projection test). Let ˆϕ1, . . . , ˆϕd be the first d eigenfunctions of ˆR. Then the operators ˆUlm = { ˆϕl ⊗ ˆϕl, l = m, ( ˆϕl ⊗ ˆϕm + ˆϕm ⊗ ˆϕl)/21/2 , l < m with kernels ˆull(s, t) = ˆϕl(s) ˆϕl(t) and ˆulm(s, t) = {ˆϕl(s) ˆϕm(t) + ˆϕm(s) ˆϕl(t)}/21/2 , l < m form an orthonormal basis of a d(d + 1)/2-dimensional subspace of HS(L2 ([0, 1])). The Fourier coefficients of the projection of the jth standardized contrast on this subspace are Rjlm = ⟨Mj(·, ·){ˆρj(·, ·) − ˆρ(·, ·)}/n 1/2 j , ˆUlm⟩ = ∫ [0,1]2 Mj(s, t){ˆρj(s, t) − ˆρ(s, t)}ˆulm(s, t)dsdt/n 1/2 j . (5) Denote by R the Kd(d + 1)/2-dimensional score vector with components Rjlm, j ∈ {1, . . . , K}, 1 ≤ l ≤ m ≤ d. The test statistic measures the size of the projection of the contrast operators on the subspace. It takes the form Sd = R ˆW− R, (6) where ˆW− is the Moore–Penrose pseudoinverse of the estimator of the asymptotic covariance matrix whose entry with indices (jlm, kpq) is ˆWjlm,kpq = ⟨ˆνj(·, ·)1/2 ˆulm(·, ·), ˆBjk{ˆνk(·, ·)1/2 ˆupq(·, ·)}⟩ = ∫ [0,1]4 ˆνj(s, t)1/2 ˆulm(s, t) ˆβjk(s, t, u, v)ˆupq(u, v)ˆνk(u, v)1/2 dsdtdudv, (7) j, k = 1, . . . , K, 1 ≤ l ≤ m ≤ d, 1 ≤ p ≤ q ≤ d. The kernel of ˆBjk is ˆβjk(s, t, u, v) = K ∑ l=1 {δjl − Mj(s, t)1/2 ˆwl(s, t)Ml(s, t)−1/2 }ˆηl(s, t, u, v) × {δkl − Mk(u, v)1/2 ˆwl(u, v)Ml(u, v)−1/2 }. (8) We now give the asymptotic distribution of the Hilbert–Schmidt and projection statistics. 590 D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 Theorem 5. For j ∈ {1, . . . , K} assume that nj → ∞, nj/(n1 + · · · + nK ) → aj > 0, E ∥Xj1∥4 < ∞ and all eigenvalues of Rj have multiplicity 1. Let the observation patterns in each group satisfy Definition 2. 
Then under the null hypothesis of equal covariance operators we obtain the following results:

(i) The test statistic $S_{\mathrm{HS}}$ is asymptotically distributed as $\sum_{k=1}^\infty \delta_k C_k$, where the $C_k$ are independent chi-square distributed variables with one degree of freedom and the weights $\delta_k$ can be consistently estimated by the eigenvalues of the operator $\hat{\mathcal{B}}$ whose kernels are given in (8).

(ii) The test statistic $S_d$ is asymptotically chi-square distributed with $(K-1)d(d+1)/2$ degrees of freedom.

The asymptotic distribution of $S_{\mathrm{HS}}$ can be approximated by simulation as in Boente et al. [4]. Section 4 presents a practical bootstrap implementation of these tests in which it is not necessary to compute the operator $\hat{\mathcal{B}}$.

Tests based directly on covariance operators are not the only option. As an alternative we explore the approach of Pigoli et al. [50], who argue that although covariance operators are contained in the Hilbert space of Hilbert–Schmidt operators, they do not form a linear subspace, and who propose distances other than those based on the difference of covariances, such as the Procrustes distance and the square root distance. This direction of research was further investigated by Cabassi et al. [7] and Masarotto [44]. One of the proposals of Pigoli et al. [50] was to use the Hilbert–Schmidt distance between square root covariance operators, $d_{\mathrm{sqrt}}(R_1, R_2) = \|R_1^{1/2} - R_2^{1/2}\|_2$. They report good power for a two-sample test of equal covariances in the setting of complete functions based on this distance between estimated operators, $d_{\mathrm{sqrt}}(\hat R_1, \hat R_2)$. We extend this approach to $K$ samples consisting of partially observed functions.

Since the data may contain incomplete functions, the empirical covariance operators $\hat R_j$ used before may have negative eigenvalues. To be able to work with empirical square root covariance operators, we need to modify the covariance estimators to ensure that they are non-negative definite. We use
\[
\hat R_{j+} = \sum_{l=1}^{n_j} (\hat\lambda_{jl})_+ \,\hat\phi_{jl} \otimes \hat\phi_{jl},
\]
where $(\hat\lambda_{jl})_+ = \max(\hat\lambda_{jl}, 0)$ is the positive part of the eigenvalue $\hat\lambda_{jl}$ of $\hat R_j$ and $\hat\phi_{jl}$ is the corresponding eigenfunction. As discussed in Kraus [35], negative eigenvalues are typically of small magnitude in comparison with the leading eigenvalues and are therefore negligible in practice.

For a test statistic, we need to use the distance $d_{\mathrm{sqrt}}$ to define a null estimator of $R$ and contrasts between the group estimators $\hat R_{j+}$ and the null estimator. The common covariance operator can be estimated by
\[
\hat R_{\mathrm{sqrt}} = \left( \frac{\sum_{j=1}^K n_j \hat R_{j+}^{1/2}}{\sum_{j=1}^K n_j} \right)^{\!2},
\]
which is the weighted Fréchet mean of the group-specific operators, i.e., the minimizer with respect to $R$ of $\sum_{j=1}^K n_j\, d_{\mathrm{sqrt}}(\hat R_{j+}, R)^2$. The attained minimum of this objective function,
\[
S_{\mathrm{sqrt}} = \sum_{j=1}^K n_j\, d_{\mathrm{sqrt}}(\hat R_{j+}, \hat R_{\mathrm{sqrt}})^2
= \sum_{j=1}^K \big\| n_j^{1/2} (\hat R_{j+}^{1/2} - \hat R_{\mathrm{sqrt}}^{1/2}) \big\|_2^2, \qquad (9)
\]
can serve as a test statistic for comparing covariance operators in $K$ samples. The statistic summarizes the size of the contrasts between the group and null estimators of the square root covariance operator. Following Pigoli et al. [50], we use resampling to approximate the null distribution of the statistic.

Notice that the contrasts between the group and null estimators in $S_{\mathrm{sqrt}}$ and $S_{\mathrm{HS}}$ are weighted differently. In $S_{\mathrm{HS}}$ we weight the contrast kernels by $M_j(s,t)^{1/2}$, which in the fragmentary setting reflects the accuracy of the estimation of the covariance kernel at each point of $[0,1]^2$ due to the number of observations available at that point. In $S_{\mathrm{sqrt}}$ this would not be meaningful because the square root covariance operator is a function of the entire covariance operator, and thus the accuracy of the estimation of the square root covariance kernel at one point depends also on the numbers of available observations at all other points. We therefore simply weight by $n_j^{1/2}$, reflecting the overall accuracy of the square root covariance estimator. Both $S_{\mathrm{HS}}$ and $S_{\mathrm{sqrt}}$ are the attained minima of the corresponding objective functionals that define the null estimators.
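To make the two statistics concrete, the following hedged R sketch evaluates discretized versions of $S_{\mathrm{HS}}$ in (4) and $S_{\mathrm{sqrt}}$ in (9). It assumes that each group has been summarized by a list with entries `rho` (the estimated covariance kernel on a $q$-point grid), `M` (the pairwise-complete counts) and `n` (the group size), for instance the output of the hypothetical `cov_fragments` helper above; these names and conventions are ours, not the paper's.

```r
# Hedged sketch: discretized versions of S_HS (4) and S_sqrt (9); h is the grid spacing.
psd_sqrt <- function(A, h) {
  e <- eigen(A * h, symmetric = TRUE)
  v <- pmax(e$values, 0)                              # positive-part modification R_{j+}
  (e$vectors %*% (sqrt(v) * t(e$vectors))) / h        # kernel of the square root operator
}

cov_test_stats <- function(fits, h) {
  W    <- Reduce(`+`, lapply(fits, `[[`, "M"))
  rho0 <- Reduce(`+`, lapply(fits, function(f) f$M * f$rho)) / pmax(W, 1)  # pooled kernel
  S_HS <- sum(sapply(fits, function(f) sum(f$M * (f$rho - rho0)^2))) * h^2 # statistic (4)
  roots <- lapply(fits, function(f) psd_sqrt(f$rho, h))
  n     <- sapply(fits, `[[`, "n")
  root0 <- Reduce(`+`, Map(`*`, roots, n)) / sum(n)   # weighted Frechet mean on the sqrt scale
  S_sqrt <- sum(unlist(Map(function(r, nj) nj * sum((r - root0)^2) * h^2, roots, n)))  # (9)
  c(S_HS = S_HS, S_sqrt = S_sqrt)
}
```

The square-root kernel is obtained here by truncating negative eigenvalues before taking the root, mirroring the positive-part modification described above.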
4. Practical implementation and bootstrap approximations

Functional data procedures are implemented in practice by discretization. Functional observations are evaluated at $q$ points of a grid in the domain. Functions then correspond to $q$-vectors (possibly with missing values), operators on the function space correspond to $(q \times q)$-matrices, and operators on operators correspond to four-way arrays with all dimensions equal to $q$. To make inference (tests and confidence intervals), one can use the asymptotic distributions found in the previous section. However, the implementation of such procedures would be excessively demanding in terms of computer memory, especially in the case of covariance inference. For example, when the evaluation grid consists of $q = 100$ points, arrays such as the one corresponding to the fourth moment kernel $\zeta(s,t,u,v)$ contain $q^4 = 10^8$ entries. To compare covariances in, e.g., $K = 3$ samples, one would have to work with an array with $K^2 q^4 = 9 \times 10^8$ entries, whose size already approaches the memory limits of usual computers, even if symmetry is exploited. In the case of multivariate, spatial or image data the number of evaluation points $q$ is typically much larger than for functions of a one-dimensional argument; Aston et al. [1] give an example of acoustic phonetic data with a bivariate, time–frequency argument and $q = 8100$. In conclusion, the size of the objects representing the asymptotic covariance structure for tests or confidence intervals may be far beyond memory limits.
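For orientation, the following lines reproduce the storage arithmetic above for dense double-precision arrays; this is our back-of-the-envelope calculation, not code from the paper.

```r
# Storage required for dense fourth-order arrays with 8-byte doubles (our arithmetic).
q <- 100; K <- 3
q^4 * 8 / 2^30        # one q x q x q x q array: about 0.75 GiB
K^2 * q^4 * 8 / 2^30  # K^2 such arrays for comparing covariances: about 6.7 GiB
8100^4                # number of entries for the q = 8100 example: about 4.3e15, infeasible
```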
Projection covariance tests for complete functions can avoid the computation, storage and manipulation of such large arrays by computing the principal scores of each function with respect to the required low number $d$ of eigenfunctions [20,27,46,47]. The covariance matrix of the scores then depends on easy-to-handle $d$-dimensional four-way arrays instead of large $q$-dimensional four-way arrays. This dimension reduction approach is not applicable in the case of incomplete functions because the principal scores $\langle X_{ji} - \hat\mu_j, \hat\phi_m \rangle$ cannot be computed when $X_{ji}$ is available only on a subset of its domain [they can only be predicted, see 35]. Therefore, even the computation of the projection test statistic (6) is difficult due to the large arrays that the matrix $\hat W$ depends on. The computation of the Hilbert–Schmidt statistic (4) and the square root covariance statistic (9) does not involve large four-way arrays. However, to use the asymptotic distribution of $S_{\mathrm{HS}}$ (see Theorem 5) one needs to estimate the eigenvalues of an operator on operators. Upon discretization and vectorization, this leads to a large eigenproblem of dimension $(Kq^2) \times (Kq^2)$, e.g., $30\,000 \times 30\,000$ for $K = 3$, $q = 100$. Again, dimension reduction cannot be used because of the incomplete functions.

To overcome these difficulties we use the bootstrap. For completely observed functional data, bootstrap tests of equal mean functions or covariance operators were studied by Benko et al. [3] and Paparoditis and Sapatinas [48,49]. In our missing data setting, all bootstrap procedures consist of appropriate resampling of fragmentary curves, which means that each bootstrap sample is again a collection of partially observed functions. The proposed procedures make it possible to avoid entirely the computation of each entry of the large four-way covariance array, as well as the storage and decomposition of the whole array.

The implementation of the tests of equal means is described in Algorithm 1. To correctly reproduce the limiting distribution of the group mean estimators under the null, the resampling is done separately in each group of groupwise centred fragmentary observations. The stratification guarantees that neither the missingness patterns nor distributional characteristics of the functions beyond the means need to be equal in all groups. The $L^2$ statistic is computed directly for each bootstrap sample and the observed value is then compared with the resampled values. The direct computation of the projection test statistic from observed or resampled data would require the estimation of the covariance functions $\hat v_{jk}$ in (3), which may be memory demanding and possibly unstable in regions with few complete pairs. We avoid this by estimating the covariance matrix of the score vector from the resampled score vectors, calculating the quadratic form statistic using the observed score vector and the bootstrap estimate of its covariance matrix, and comparing it with its asymptotic chi-square distribution.

Algorithm 1 Bootstrap approximation for tests of equal mean functions
1: Calculate $\hat\mu_j$ from the observed samples of fragments $X_{j1}, \dots, X_{jn_j}$, $j = 1, \dots, K$, and $\hat\mu$
2: Calculate the test statistic $T_{L^2}$ and the score vector $Q$
3: Set $X_{ji0} = X_{ji} - \hat\mu_j + \hat\mu$
4: For $b = 1, \dots, B$
5:   For each $j = 1, \dots, K$, sample with replacement from the fragments $X_{j10}, \dots, X_{jn_j0}$ to obtain fragments $X^*_{j10}, \dots, X^*_{jn_j0}$
6:   Calculate the statistic $T^{*(b)}_{L^2}$ and the score vector $Q^{*(b)}$ from $X^*_{j10}, \dots, X^*_{jn_j0}$, $j = 1, \dots, K$
7: Approximate the p-value of the $L^2$ test using $T_{L^2}$ and $T^{*(1)}_{L^2}, \dots, T^{*(B)}_{L^2}$
8: Calculate the empirical covariance matrix $\hat V^*$ of $Q^{*(1)}, \dots, Q^{*(B)}$ and the statistic $T_d = Q^\top \hat V^{*-} Q$
9: Approximate the p-value of the projection test using $T_d$ and the $\chi^2_{(K-1)d}$ distribution
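As a concrete illustration of Algorithm 1, the following R sketch implements the stratified bootstrap for the $L^2$ statistic only. It is a simplified illustration rather than the code accompanying the paper: fragments of group $j$ are assumed to be stored as a matrix with one row per curve and `NA` at unobserved grid points, the statistic omits the standardization by the scale factors $\hat r_j$ used in the definition of $T_{L^2}$, and all function names are ours.

```r
# Hedged sketch of Algorithm 1 (L2 statistic only); X_list is a list of K matrices,
# h is the grid spacing. Assumes every grid point is observed by at least one curve per group.
mean_fragments <- function(X) colMeans(X, na.rm = TRUE)

l2_stat <- function(X_list, h) {
  mus <- lapply(X_list, mean_fragments)
  Ns  <- lapply(X_list, function(X) colSums(!is.na(X)))      # N_j(t): curves observed at t
  mu0 <- Reduce(`+`, Map(`*`, mus, Ns)) / Reduce(`+`, Ns)    # pooled (null) mean
  sum(unlist(Map(function(m, N) N * (m - mu0)^2, mus, Ns))) * h
}

boot_mean_test <- function(X_list, h, B = 500) {
  T_obs <- l2_stat(X_list, h)
  mus <- lapply(X_list, mean_fragments)
  Ns  <- lapply(X_list, function(X) colSums(!is.na(X)))
  mu0 <- Reduce(`+`, Map(`*`, mus, Ns)) / Reduce(`+`, Ns)
  # step 3 of Algorithm 1: centre groupwise, then shift to the pooled mean
  X0 <- Map(function(X, m) sweep(sweep(X, 2, m), 2, mu0, FUN = "+"), X_list, mus)
  T_boot <- replicate(B, {
    Xb <- lapply(X0, function(X) X[sample(nrow(X), replace = TRUE), , drop = FALSE])
    l2_stat(Xb, h)                          # stratified resampling of fragments (steps 5-6)
  })
  mean(T_boot >= T_obs)                     # bootstrap p-value of the L2 test (step 7)
}
```

The resampled matrices keep their `NA` patterns, so each bootstrap sample is again a collection of partially observed functions, as required by the algorithm.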
Algorithm 2 describes the bootstrap implementation of confidence intervals for the eigenelements. Resampling is applied to the fragments and the eigenelements are computed. The resampled eigenfunction is reflected about zero if necessary so that its sign agrees with that of the empirical eigenfunction of the observed data. Standard methods for constructing confidence intervals can then be used. Since we again wish to avoid the calculation of variance estimates of the eigenelements (see Theorem 4), we use the normal or basic bootstrap method [12, Chapter 5]. Intervals for eigenvalues are constructed on the logarithmic scale and then transformed back to the original scale. This is appropriate in general because in the case of completely observed Gaussian curves the asymptotic variance of $n^{1/2}(\hat\lambda_m - \lambda_m)$ is $2\lambda_m^2$, and thus the log-transformation approximately stabilizes the variance.

Algorithm 2 Bootstrap confidence intervals for eigenvalues and eigenfunctions
1: Calculate $\hat R$ from the observed fragmentary functional data $X_1, \dots, X_n$
2: Calculate the eigenvalues $\hat\lambda_m$ and eigenfunctions $\hat\phi_m$ of $\hat R$
3: For $b = 1, \dots, B$
4:   Sample with replacement from the fragments $X_1, \dots, X_n$ to obtain fragments $X^*_1, \dots, X^*_n$
5:   Calculate $\hat R^*$ from $X^*_1, \dots, X^*_n$ and its eigenvalues $\hat\lambda^{*(b)}_m$ and eigenfunctions $\hat\phi^{*(b)}_m$
6:   Replace $\hat\phi^{*(b)}_m$ by $\operatorname{sign}\langle \hat\phi^{*(b)}_m, \hat\phi_m \rangle\, \hat\phi^{*(b)}_m$
7: Based on $\hat\lambda^{*(b)}_m$, $\hat\phi^{*(b)}_m$, $b = 1, \dots, B$, calculate bootstrap confidence intervals for $\lambda_m$ using the log-transformation and pointwise bootstrap confidence intervals for $\phi_m(t)$

Bootstrap covariance testing is described in Algorithm 3. Unlike in the case of mean testing, it is not possible to transform the data to the common null covariance structure and use stratified resampling. Bootstrap samples are instead drawn from the pooled sample of groupwise centred fragments, similarly to Paparoditis and Sapatinas [49, Subsection 2.2] for complete curves. Then, under the null hypothesis, if the characteristics of the observation patterns ($\theta_j$) and the fourth order moments ($\zeta_j$) are the same in all groups, the pooled resampling asymptotically replicates the limiting distributions of interest. The Hilbert–Schmidt norm and square root covariance statistics are computed directly, and significance is decided by comparing the observed statistics with the resampled ones. As in the case of mean testing, dimension reduction is impossible due to partial observation, and thus the computation of the covariance matrix of the score vector would require computing large four-way arrays. Instead, the bootstrap is used to estimate the covariance matrix of the score, and the quadratic statistic with this matrix is used.

Algorithm 3 Bootstrap approximation for tests of equal covariance operators
1: Calculate $\hat\mu_j$ and $\hat R_j$ from the observed samples of fragments $X_{j1}, \dots, X_{jn_j}$, $j = 1, \dots, K$, and $\hat R$
2: Perform the eigendecomposition of $\hat R$, determine $d$ and calculate $\hat U_{lm}$, $1 \le l \le m \le d$
3: Calculate the test statistics $S_{\mathrm{HS}}$ and $S_{\mathrm{sqrt}}$ and the score vector $\mathcal{R}$ with respect to $\hat U_{lm}$
4: Set $X_{ji0} = X_{ji} - \hat\mu_j$
5: For $b = 1, \dots, B$
6:   For each $j = 1, \dots, K$, sample with replacement from the pooled collection of fragments $X_{ji0}$, $j = 1, \dots, K$, $i = 1, \dots, n_j$, to obtain fragments $X^*_{j10}, \dots, X^*_{jn_j0}$
7:   Calculate the statistics $S^{*(b)}_{\mathrm{HS}}$ and $S^{*(b)}_{\mathrm{sqrt}}$ and the score vector $\mathcal{R}^{*(b)}$ with respect to $\hat U_{lm}$ from $X^*_{j10}, \dots, X^*_{jn_j0}$, $j = 1, \dots, K$
8: Approximate the p-value of the Hilbert–Schmidt norm test using $S_{\mathrm{HS}}$ and $S^{*(1)}_{\mathrm{HS}}, \dots, S^{*(B)}_{\mathrm{HS}}$, and the p-value of the square root covariance test using $S_{\mathrm{sqrt}}$ and $S^{*(1)}_{\mathrm{sqrt}}, \dots, S^{*(B)}_{\mathrm{sqrt}}$
9: Calculate the empirical covariance matrix $\hat W^*$ of $\mathcal{R}^{*(1)}, \dots, \mathcal{R}^{*(B)}$ and the statistic $S_d = \mathcal{R}^\top \hat W^{*-} \mathcal{R}$
10: Approximate the p-value of the projection test using $S_d$ and the $\chi^2_{(K-1)d(d+1)/2}$ distribution

While we do not provide formal proofs of the validity of the bootstrap approximations, these could be obtained along the lines of the proofs in Paparoditis and Sapatinas [48] and Paparoditis and Sapatinas [49] using our asymptotic results (Theorems 1–5). Note that in our setting the observation sets might be non-identically distributed (e.g., in the case of designed experiments), and hence the bootstrap is applied to possibly non-identically distributed observed fragments. Their average characteristics, however, converge under Definitions 1 and 2. It is possible to use the bootstrap even with mildly non-identically distributed data, as discussed in a general context by Liu [41], who shows that if the average moment characteristics of possibly non-identically distributed variables converge, the bootstrap is still applicable. The use of the bootstrap for the square root covariance test is based on empirical evidence from simulation studies (Section 5 and the Supplementary Material). Its theoretical justification would first require establishing the asymptotic distribution of the estimated square root covariance operator, which is not available even in the case of completely observed curves [50].
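A hedged sketch of Algorithm 3 restricted to the statistics $S_{\mathrm{HS}}$ and $S_{\mathrm{sqrt}}$ is given below; it reuses the hypothetical helpers `cov_fragments` and `cov_test_stats` introduced earlier and is not the paper's implementation (in particular, the projection statistic and the choice of $d$ are omitted).

```r
# Hedged sketch of Algorithm 3 (Hilbert-Schmidt and square root statistics only).
# X_list: list of K matrices of fragments (NA = unobserved); grid: common evaluation grid.
boot_cov_test <- function(X_list, grid, B = 500) {
  h <- mean(diff(grid))
  fits <- lapply(X_list, function(X) { f <- cov_fragments(X, grid); f$n <- nrow(X); f })
  S_obs <- cov_test_stats(fits, h)
  # step 4: centre each group at its own mean, then pool the centred fragments
  X0 <- do.call(rbind, Map(function(X, f) sweep(X, 2, f$mu), X_list, fits))
  n  <- sapply(X_list, nrow)
  S_boot <- replicate(B, {
    Xb <- lapply(n, function(nj) X0[sample(nrow(X0), nj, replace = TRUE), , drop = FALSE])
    fb <- lapply(Xb, function(X) { f <- cov_fragments(X, grid); f$n <- nrow(X); f })
    cov_test_stats(fb, h)                    # resampled statistics (step 7)
  })                                         # 2 x B matrix
  rowMeans(S_boot >= S_obs)                  # bootstrap p-values for S_HS and S_sqrt (step 8)
}
```

The pooled resampling in step 6 is reproduced by sampling each group's bootstrap fragments from the single matrix of centred fragments of all groups.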
5. Simulation results

The main goal of the study is to investigate the impact of partial observation on the performance of the different mean and covariance tests and to compare the proposed tests using complete and incomplete curves with the simple approach using complete curves only.

We repeatedly generate three samples of curves of sizes $n_1 = 80$, $n_2 = 100$, $n_3 = 120$. Curves in the $j$th sample take the form
\[
X(t) = \mu_j(t) + \lambda_{j0}^{1/2}\beta_{j0}h_j(t) + \sum_{k=1}^{20} \lambda_{jk}^{1/2}\beta_{jk}\,2^{1/2}\cos(k\pi t), \qquad t \in [0,1],
\]
where $\beta_{jk}$, $j \in \{1,2,3\}$, $k \in \{0,\dots,20\}$, are mutually independent standard normal variables. Additional simulations with $t_5$ distributed coefficients are reported in the Supplementary Material. In all simulations we use 1000 repetitions of the test procedures, each based on 500 bootstrap samples. All tests are performed at the nominal level of 5%. All results have been computed in R 3.4.

The tests are applied to complete trajectories, observation pattern (1), and to fragments obtained by deleting missing periods following several random or nonrandom patterns. Observation patterns (2) and (3) are nonrandom: under pattern (2), the period $[0, 0.5]$ is removed from 50% of the curves in the first sample, 50% in the second sample and 60% in the third sample; pattern (3) is symmetric about 0.5, i.e., the period $[0.5, 1]$ instead of $[0, 0.5]$ is missing in the same subset of curves. Under patterns (4)–(7), a random missing period is generated independently for each curve and removed from the trajectory. First, we consider random missing periods of the form $M = [C - E, C + E] \cap [0,1]$ with $C = dU_1^{1/2}$ and $E = fU_2$, where $U_1, U_2$ are independent variables uniformly distributed on $[0,1]$ and $d, f$ are parameters. For missingness pattern (4) we set $d = 1.4$ and $f = 0.2$; this gives 39% of completely observed curves, and the cross-sectional percentage of observed values decreases from 99% at time 0 to 79% at time 1. Pattern (5) is symmetric about 0.5. For pattern (6) we use the same model as for (4) and set $d = 1.2$ and $f = 0.5$; this leads to 7% of complete curves, and the cross-sectional probability of observation is 94% at 0 and decreases to about 45% near 1. Pattern (7) is again obtained by reflecting pattern (6) about 0.5. Pattern (8) consists of observation periods generated independently for each curve in the form $O = [U_1, U_2] \cap [0,1]$, where $U_1, U_2$ are independent variables uniformly distributed on $[a, C]$ and $[C, 1-a]$, respectively, $a = -0.3$ and $C$ is uniformly distributed on $[0,1]$; the percentage of complete curves in this case is 16%, and the cross-sectional observation probability is 77% at 0.5 and decreases to 44% towards both endpoints of the domain. Finally, for pattern (9) curves are observed on random intervals generated as $[C - 0.2, C + 0.2] \cap [0,1]$, where $C$ is uniformly distributed on $[0,1]$. This corresponds to fragments of curves of length at most 0.4, hence the datasets contain no complete curves; the median length of the observed fragments is 0.3, and the cross-sectional probability of observation is 0.3 in the middle of the domain and decreases towards the endpoints, where it is 0.15.
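For concreteness, the following hedged R sketch generates one sample from the model above together with random missing periods of the type used in patterns (4)–(7); the function name `simulate_sample` and its argument defaults (configuration A, pattern (4)) are our own choices, not part of the paper.

```r
# Hedged sketch of the simulation model with random missing periods M = [C-E, C+E] n [0,1].
simulate_sample <- function(n, grid, mu = function(t) 0 * t,
                            lambda0 = 0.5, hfun = function(t) rep(1, length(t)),
                            d = 1.4, f = 0.2) {
  q <- length(grid)
  X <- matrix(NA_real_, n, q)
  for (i in 1:n) {
    beta <- rnorm(21)                                  # beta_{j0}, ..., beta_{j,20}
    x <- mu(grid) + sqrt(lambda0) * beta[1] * hfun(grid)
    for (k in 1:20)
      x <- x + sqrt(3^(-k)) * beta[k + 1] * sqrt(2) * cos(k * pi * grid)
    C <- d * sqrt(runif(1)); E <- f * runif(1)         # centre and half-length of the gap
    keep <- grid < C - E | grid > C + E                # values inside the gap stay NA
    X[i, keep] <- x[keep]
  }
  X
}
```

For instance, `X1 <- simulate_sample(80, seq(0, 1, length.out = 101))` produces the first sample under configuration A and missingness pattern (4).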
In the study of the mean tests, four configurations of the mean functions are considered. Under configuration A the null hypothesis is satisfied: all mean functions are zero. Under configuration B the mean functions differ by a constant vertical shift: $\mu_1(t) = 0$, $\mu_2(t) = 0.18$, $\mu_3(t) = -0.1$. Under configuration C there are monotonic differences between the means: $\mu_1(t) = 0$, $\mu_2(t) = 0.35\exp(-4t)$, $\mu_3(t) = -0.25\exp(-3t)$. Under configuration D the means differ in a more complex, nonmonotonic way and they cross: $\mu_1(t) = 0$, $\mu_2(t) = 2t\exp(-3t)$, $\mu_3(t) = 0.1 - 8t^2\exp(-5t)$. We set $\lambda_{j0} = 0.5$, $\lambda_{jk} = 3^{-k}$ and $h_j(t) = 1$, that is, the covariance structure is the same in all three groups. Additional simulations with unequal covariance structures lead to similar results and are included in the Supplementary Material.

Table 1
Empirical rejection probability (in %) of the $L^2$ test, $T_{L^2}$, and the projection test, $T_d$, of equal means. A dash indicates the same value as on the preceding row. The observation patterns (1)–(9) and mean configurations A–D are described in the text.

                                A            B            C            D
Observation pattern          TL2   Td     TL2   Td     TL2   Td     TL2   Td
Tests using complete and incomplete curves (proposed approach)
(1)                          5.6   6.2    69    60     49    56     52    63
(2)                          5.4   6.7    59    52     28    29     38    50
(3)                          –     –      –     –      50    56     44    62
(4)                          4.4   6.5    66    58     51    57     51    62
(5)                          –     –      –     –      44    49     50    58
(6)                          5.4   7.1    58    51     50    55     42    49
(7)                          –     –      –     –      28    34     37    42
(8)                          5.4   5.8    55    47     34    37     42    48
(9)                          5.4   7.8    37    40     20    23     26    34
Tests using complete curves only (simple approach)
(2), (3)                     5.7   7.4    40    34     26    32     27    35
(4), (5)                     3.6   7.4    28    27     18    26     19    28
(6), (7)                     4.9   26.8    7    31      6    29      6    31
(8)                          4.0   11.5   13    22      8    20     10    21

We report in the first part of Table 1 the size and power of the $L^2$ test based on $T_{L^2}$ given in (1) and of the projection test based on $T_d$ given in (2) using $d = 3$ Legendre polynomials of order zero, one and two. Dashes in the table indicate situations where the true rejection probability is the same as in the entry above; such situations arise when the observation pattern is obtained by reflecting the preceding pattern and the processes $\{X(t) : t \in [0,1]\}$ and the time-reversed processes $\{X(1-t) : t \in [0,1]\}$ have the same distribution. We see in the first part of Table 1 that under the null hypothesis, configuration A, the rejection probability of the $L^2$ test is close to the nominal level. The size of the projection test seems to be somewhat above the nominal level due to the sample size, especially under observation pattern (9), where the missingness rate is the highest.
Our simulation study of power provides raw rejection probabilities in Table 1 and size-adjusted powers (using the method from Subsection 3.2 of Lloyd [42]) in Table S2 in the Supplementary Material. The possibility of size issues should be kept in mind in applications: especially in marginal cases, users should not simply compare p-values with a single threshold but should rather report them carefully.

Under scenario B the $L^2$ test is more powerful than the projection method. The reason is that the projection method uses, in addition to the constant basis function, two other terms (linear and quadratic) that do not contribute to the detection of the constant difference between the means but, on the other hand, increase the degrees of freedom and hence decrease the power. The $L^2$ method uses infinitely many directions in the space of alternatives, but these redundant features are downweighted by the decreasing eigenvalues (the constant difference of means agrees with the constant leading eigenfunction, which receives the highest weight in the $L^2$ statistic). Most partial observation patterns lead to a relatively small decrease of power because under this scenario the mean functions differ by a constant vertical shift, which is a very simple, global feature that is easily detected even with reduced, fragmented data. The loss of power is largest under pattern (9), where the reduction of observed data is also considerably larger than under the other patterns.

Both tests have comparable power under scenario C. Both tests lose power under observation pattern (2) because a large portion of data is missing on the interval $[0, 0.5]$, where the difference between the means is the largest; on the other hand, the reflected pattern (3) does not lead to a loss of power because curves are missing only in $[0.5, 1]$, where the means do not differ much. A similar effect is seen under observation patterns (6) and (7). Under scenario D the projection test seems to be slightly more powerful than the $L^2$ test (even after the size adjustment in Table S2 in the Supplementary Material) because the nonmonotonic differences between the mean functions are well captured by both the first three Legendre polynomials and the first three eigenfunctions, but the contribution of the latter is downweighted in the $L^2$ statistic, whereas the projection statistic treats all three components equally.

The second part of Table 1 shows, for each missingness pattern and mean configuration, the performance of the tests applied to the subset of complete curves only. The complete-curve approach would be the only possibility if the tests developed in this paper were not available. Results for the pairs of patterns (2) and (3), (4) and (5), (6) and (7) are presented on the same rows of the second part of the table because the subsets of complete curves are the same under both patterns in each pair. Pattern (9) is omitted because it contains no complete curves and hence inference is impossible without our methods.
Under patterns (2) (or (3)) and (4) (or (5)), the use of complete curves only, which form 46% and 39% of the whole sample, respectively, leads to a considerable loss of power in most situations. Configuration C under pattern (2) is an exception: here removing the incomplete curves does not decrease the power because they are observed on the subdomain $[0.5, 1]$, where the means do not differ much. Under patterns (6) (or (7)) and (8) there are only 7% and 16% complete curves, respectively. With such small sample sizes the projection test becomes unreliable in terms of level and the $L^2$ test loses almost all power.

Next, we study the behaviour of the tests for comparing covariance operators. Under all scenarios we generate mean zero trajectories. Configuration A satisfies the null hypothesis with $\lambda_{j0} = 0.5$, $\lambda_{jk} = 3^{-k}$ and $h_j(t) = 1$, $j \in \{1,2,3\}$. Under configuration B the same parameters are used except for the third sample, where the overall scale is larger, namely $\lambda_{3,0} = 1.5 \times 0.5$ and $\lambda_{3,k} = 1.5 \times 3^{-k}$. Under scenario C the first two eigenvalues in the third sample are interchanged, i.e., $\lambda_{3,0} = 3^{-1}$, $\lambda_{3,1} = 0.5$ and $\lambda_{3,k} = 3^{-k}$, $k \in \{2,\dots,20\}$; otherwise the parameters are the same as in A. Scenario D differs from A in that we set $h_3(t) = 1$ for $t \in [0, 0.5]$ and $h_3(t) = 2.2^{1/2}$ for $t \in (0.5, 1]$.

Table 2
Empirical rejection probability (in %) of the Hilbert–Schmidt norm test, $S_{\mathrm{HS}}$, projection test, $S_d$, and square root covariance test, $S_{\mathrm{sqrt}}$, of equal covariance operators. A dash indicates the same value as on the preceding row. The observation patterns (1)–(5) and covariance configurations A–D are described in the text.

                                  A                  B                  C                  D
Observation pattern         SHS  Sd   Ssqrt    SHS  Sd   Ssqrt    SHS  Sd   Ssqrt    SHS  Sd   Ssqrt
Tests using complete and incomplete curves (proposed approach)
(1)                         5.4  5.8  4.8      69   82   80       69   58   69       78   62   81
(2)                         4.6  6.4  4.9      54   63   41       37   32   38       76   64   54
(3)                         –    –    –        –    –    –        –    –    –        46   30   48
(4)                         5.0  5.1  5.8      64   74   72       61   53   62       72   56   73
(5)                         –    –    –        –    –    –        –    –    –        77   60   77
Tests using complete curves only (simple approach)
(2), (3)                    4.1  7.3  4.6      32   38   41       33   28   34       45   30   47
(4), (5)                    4.3  5.5  4.2      26   32   33       25   24   28       34   23   36

Table 2 shows the size and power of the Hilbert–Schmidt norm test based on $S_{\mathrm{HS}}$ in (4), the projection test based on $S_d$ in (6) with $d$ selected to explain at least 85% of the total variability of the null covariance estimate, and the square root covariance test based on $S_{\mathrm{sqrt}}$ in (9). As before, entries where the true rejection probability equals the one above are marked by a dash. We use only observation patterns (1)–(5); under the other patterns the amount of missing information is too large for second order inference.

Under the null hypothesis, configuration A, the first part of Table 2 shows that the rejection probability of all tests is close to the nominal level under all missingness patterns, with the projection test being slightly above the level in some cases. It is interesting to notice the different impact of missingness on the power in different situations. We report raw power in Table 2 and size-adjusted power in Table S4 in the Supplementary Material. While in many situations the loss of power due to missingness is similar for all three tests, in some situations the square root test appears to be more sensitive to missingness. For example, under scenario B and missingness pattern (2), the square root covariance test loses almost half of its power relative to no missingness, much more than the other two tests. This can be explained by the fact that the square root covariance estimator depends on the estimator of the covariance kernel at all arguments, which means that uncertainty due to missingness localized in a certain region of the domain, as under pattern (2), propagates. Similarly, under scenario D and pattern (2) the Hilbert–Schmidt and projection tests do not lose much power whereas the square root test does, because the difference between the covariances is due to the differences of $h_j(t)$ for $t \in [0.5, 1]$ while missingness occurs for $t \in [0, 0.5]$. For these reasons, under the same scenario, pattern (3) leads to a larger loss of power than pattern (2) for the Hilbert–Schmidt and projection tests, whereas the loss of power of the square root covariance test is not much higher than under pattern (2), where it was already high.
The second part of Table 2 shows results for the tests applied to the subset of complete curves only. As before, patterns (3) and (5) are shown on the same rows as patterns (2) and (4), respectively, because the subsets of complete curves are the same. We observe a large decrease of power in comparison with the power of the proposed tests in cases where the neglected incomplete curves carry information on the difference between the covariance operators. When the difference is mostly in the frequently missing region (e.g., configuration D, pattern (3)), removing the incomplete curves affects the power much less.

These results highlight the usefulness of the proposed methods as an efficient, and often the only viable, approach to testing with incomplete functions. In no situation did the proposed methods behave worse than the simple approach using complete curves only, and in many cases they behaved dramatically better. Additional results for non-Gaussian curves can be found in the Supplementary Material.

6. Application to partially observed heart rate temporal profiles

We illustrate our methods on curves describing the evolution of heart rate in 427 male participants in the period from 8 PM to 2 AM, corresponding to the domain $[20, 26]$. The data come from the Swiss Kidney Project on Genes in Hypertension. There are three groups of persons according to their age: younger than 40 years (164 persons), between 40 and 65 (180), and older than 65 (83). The curves and their first derivatives are plotted in Fig. 1. Although the percentage of observed values at each time or at each pair of time points is relatively high (Fig. 2), only 58% of the curves are complete. Plots of the estimated mean functions in Fig. 1 indicate differences between the age groups both in terms of the temporal profiles and in terms of their first derivative.

We first compare the group means of the heart rate profiles. The p-values of the $L^2$ test and of the projection test using three Legendre polynomials are 0.006 and less than 0.001, respectively, confirming the clearly visible differences. To compare the dynamics of heart rate during the transition between day and night, we test whether the means of the first derivative differ. The $L^2$ and projection tests have nearly zero p-values, meaning that the mean heart rate profiles differ between the age groups by more than a vertical shift. The plots suggest that it may be interesting to compare some pairs of groups. For example, while the mean profiles of the middle and oldest group differ significantly ($p < 0.01$ for both tests), they appear to be approximately parallel. The difference between the derivatives is indeed insignificant ($p = 0.07$ for the $L^2$ test, $p = 0.09$ for the projection test).

Without the methods developed in this paper one would have to use complete curves only. There are 249 complete functions (43, 110 and 96 in the three age groups). The projection test still detects the differences between the three groups ($p = 0.008$) but the $L^2$ test loses significance ($p = 0.066$). When comparing the second and third group, the projection test now fails to detect the difference ($p = 0.13$) and the $L^2$ test gives a marginally significant result ($p = 0.048$). This can be explained by the loss of power seen in the simulations, because the removed incomplete curves are more often observed at earlier times, where the difference between the two mean curves is also more pronounced.

Estimates of the covariance function, eigenvalues and eigenfunctions of the heart rate profiles and of their derivatives for each age group are plotted in Fig. 3 and Fig. 4.
Further plots can be found in the supplementary document. The plots suggest some differences between the groups. The variance and covariance appear to be higher in younger participants, especially earlier in the time interval (during the day). We assess the significance of these differences using the proposed tests. For the projection test we consider up to three principal components (plotted in the supplementary document), which corresponds to the projection onto a subspace of dimension six in the space of covariance operators. Table 3 reports the p-values. None of the tests rejects the null hypothesis at usual significance levels. Similarly, pairwise comparisons provided no overwhelming evidence of differences. It is of course possible that there are differences between the groups that may be detected with larger samples. To gain further insight into the structure of possible differences, one can inspect the values of the standardized score components $\mathcal{R}_{jlm}/\hat W^{1/2}_{jlm,jlm}$ (see (5) and (7)), whose graphical representation is provided in the supplement.

Fig. 1. Individual heart rate profiles and their first derivatives (left panels) and the corresponding group-specific and null estimates of the mean (right panels).

Fig. 2. Cross-sectional percentage of observed values (left) and percentage of pairwise complete observations (right).

Fig. 3. Estimated covariance functions of heart rate profiles (top row) and of their derivatives (bottom row) in the age groups.

Fig. 4. Estimated eigenvalues and eigenfunctions of heart rate profiles (top row) and of their derivatives (bottom row) in the age groups with pointwise 95% bootstrap confidence intervals.

Table 3
p-values of the Hilbert–Schmidt norm test, $S_{\mathrm{HS}}$, the square root covariance test, $S_{\mathrm{sqrt}}$, and the projection tests, $S_d$, with $d = 1, 2, 3$, for comparing the covariance structures of heart rate profiles and of their first derivatives in three age groups. The fraction of variance explained by the first $d$ principal components of the null covariance estimate is indicated in parentheses.

                     SHS      Ssqrt    S1               S2               S3
Curves               0.338    0.118    0.317 (88.2%)    0.439 (97.3%)    0.275 (99.1%)
First derivative     0.226    0.114    0.322 (62.6%)    0.131 (94.4%)    0.094 (98.7%)

Acknowledgments

We are grateful to all reviewers for their valuable comments and suggestions. This work was supported by the Czech Science Foundation under Grant GJ17-22950Y. Access to computing and storage facilities owned by parties and projects contributing to the MetaCentrum National Grid Infrastructure, provided under the programme ‘‘Projects of Large Research, Development, and Innovations Infrastructures’’ (CESNET LM2015042), is greatly appreciated.

Appendix A. A central limit theorem

We provide a general central limit theorem for independent but not necessarily identically distributed random elements of a separable Hilbert space. It is needed in the proofs, where non-identical distributions arise due to partial observation, but it is of more general interest. It extends the standard result for independent identically distributed functional variables [5, Theorem 2.7] by relaxing the assumption of identical distributions and by considering triangular arrays. The notation ∥ · ∥∞ below denotes the operator norm. Theorem 6. Let Yni, n ∈ {1, 2, . . . }, i ∈ {1, . . .
, n} be random elements of a separable Hilbert space H with mean zero, E ∥Yni∥2 < ∞ and covariance operators Cni. Let Yn1, . . . , Ynn be mutually independent for each n ∈ {1, 2, . . . }. Denote Sn = n−1/2 ∑n i=1 Yni and Gn = n−1 ∑n i=1 Cni. Assume that (i) ∥Gn − G ∥∞ → 0 as n → ∞ for some covariance operator G , (ii) for all ε > 0, n−1 n ∑ i=1 E(∥Yni∥2 1[∥Yni∥>n1/2∥Gn∥∞ε]) → 0 as n → ∞, (iii) tr Gn → tr G as n → ∞. Then Sn converges in distribution to a Gaussian random element with mean zero and covariance operator G . Appendix B. Proofs Proof of Theorem 1. We rewrite N1/2 ( ˆµ − µ) = ˆπ1/2 n1/2 ( ˆµ − µ). The main task is to establish the weak convergence of the process n1/2 ( ˆµ − µ) = 1 π Sn + ( J ˆπ − 1 π ) Sn + n1/2 (J − 1)µ, (B.1) where Sn = n−1/2 ∑n i=1 Oi(Xi − µ). We show that the first term on the right side of (B.1) converges in distribution to a mean zero Gaussian process with covariance operator with kernel π(s)−1 π(t)−1 ν(s, t)ρ(s, t) that can be consistently estimated by ˆπ(s)−1 ˆπ(t)−1 ˆν(s, t) ˆρ(s, t), and that the norms of the other two terms converge in probability to 0. The proof of the weak convergence of N1/2 ( ˆµ − µ) then follows from the convergence of ˆπ to π, the consistency of the estimator of its covariance kernel can be shown analogously. The weak convergence of Sn is shown with the help of Theorem 6, a central limit theorem for independent nonidentically distributed Hilbert space variables given in the Appendix. We apply the theorem with Yni = Oi(Xi − µ). The covariance operator Gn of Sn is given by the kernel ¯ν(s, t)ρ(s, t). Denote by G the covariance operator with kernel ν(s, t)ρ(s, t). Conditions of the central limit theorem Theorem 6 can be shown using Definition 1(b) as follows. Condition (i) of Theorem 6 is satisfied because ∥Gn − G ∥2 ∞ ≤ ∥Gn − G ∥2 2 = ∫ [0,1]2 {¯ν(s, t) − ν(s, t)}2 ρ(s, t)2 dsdt → 0 as n → ∞ by the dominated convergence theorem. Condition (ii) of Theorem 6 holds because n−1 n ∑ i=1 E(∥Yni∥2 1[∥Yni∥>n1/2∥Gn∥∞ε]) ≤ n−1 n ∑ i=1 E(∥Xi − µ∥2 1[∥Xi−µ∥>n1/2∥Gn∥∞ε]) = E(∥X1 − µ∥2 1[∥X1−µ∥>n1/2∥Gn∥∞ε]), which converges to 0 by the dominated convergence theorem. Finally, ∫ 1 0 ¯ν(t, t)ρ(t, t)dt → ∫ 1 0 ν(t, t)ρ(t, t)dt by the dominated convergence theorem again, and thus condition (iii) of Theorem 6 is satisfied. Hence the process Sn is asymptotically Gaussian with covariance kernel ν(s, t)ρ(s, t). The expectation of the squared norm of the second term on the right side of (B.1) can be rewritten as ∫ 1 0 E [{ J(t) ˆπ(t) − 1 π(t) }2 Sn(t)2 1[ ˆπ(t)≥π0/2] ] dt + ∫ 1 0 E [{ J(t) ˆπ(t) − 1 π(t) }2 Sn(t)2 1[ ˆπ(t)<π0/2] ] dt. (B.2) The first summand above is dominated by ∫ 1 0 E [ {π(t) − ˆπ(t)}2 π4 0 /4 Sn(t)2 ] dt ≤ ∫ 1 0 E [ {π(t) − ˆπ(t)}2 π4 0 /4 ] ρ(t, t)dt D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 599 which converges to zero by the dominated convergence theorem since E[{π(t)− ˆπ(t)}2 ] = {π(t)− ¯π(t)}2 +n−2 ∑n i=1 πi(t) {1 − πi(t)} → 0 for n → ∞. Next, we first compute { J(t) ˆπ(t) − 1 π(t) }2 1[ ˆπ(t)<π0/2] = [ J(t) { π(t) − ˆπ(t) ˆπ(t)π(t) }2 + {1 − J(t)} 1 π(t)2 ] 1[ ˆπ(t)<π0/2] ≤ [J(t)n2 /π2 0 + {1 − J(t)}/π2 0 ]1[ ˆπ(t)<π0/2] ≤ n2 /π2 0 1[ ˆπ(t)<π0/2]. 
Then the second summand in (B.2) is smaller than or equal to ∫ 1 0 E{n2 /π2 0 1[ ˆπ(t)<π0/2]Sn(t)2 }dt ≤ ∫ 1 0 n2 /π2 0 Pr{ ˆπ(t) < π0/2}ρ(t, t)dt ≤ n2 sup t∈[0,1] Pr{ ˆπ(t) < π0/2}/π2 0 tr R, which converges to 0 because, in light of Hoeffding’s inequality and Definition 1(a), for all t ∈ [0, 1], Pr{ ˆπ(t) < π0/2} ≤ exp[−2n{ ¯π(t) − π0/2}2 ] ≤ exp [ −2n { π0/2 − sup t∈[0,1] | ¯π(t) − π(t)| }2] → 0. This completes the proof of the convergence in probability of the norm of the second term on the right hand side of (B.1) to zero. The last term in (B.1) can be shown to converge to zero using similar arguments based on Hoeffding’s inequality. We now turn to the proof of the consistency of the estimator of the covariance kernel. To show that E ∫ [0,1]2 { ˆν(s, t) ˆρ(s, t) ˆπ(s) ˆπ(t) − ν(s, t)ρ(s, t) π(s)π(t) }2 dsdt → 0, we can split the integral into the integrals over A0 = {(s, t) ∈ [0, 1]2 : ν(s, t) = 0} and A1 = {(s, t) ∈ [0, 1]2 : ν(s, t) ≥ ν0} because Definition 1(c) implies that A0 ∪ A1 = [0, 1]2 . On A0 we obtain E ∫ A0 { ˆν(s, t) ˆρ(s, t) ˆπ(s) ˆπ(t) }2 {1[min( ˆπ(s), ˆπ(t))≥π0/2] + 1[min{ ˆπ(s), ˆπ(t)}<π0/2]}dsdt ≤ ∫ A0 E{ˆν(s, t)2 } E{ˆρ(s, t)2 }dsdt ( (π0/2)−4 + n4 sup (s,t)∈[0,1]2 Pr[min{ ˆπ(s), ˆπ(t)} < π0/2] ) . Here the integral converges to zero by the dominated convergence theorem as the integrand can be shown to go to 0 and the second term in the brackets asymptotically vanishes due to an exponential rate of decrease of the supremum that can be established with the help of Hoeffding’s inequality as before, hence the whole quantity above converges to 0. We now focus on A1. We rewrite ˆν(s, t) ˆρ(s, t) ˆπ(s) ˆπ(t) − ν(s, t)ρ(s, t) π(s)π(t) = ˆν(s, t) ˆπ(s) ˆπ(t) {ˆρ(s, t) − ρ(s, t)} + { ˆν(s, t) ˆπ(s) ˆπ(t) − ν(s, t) π(s)π(t) } ρ(s, t) (B.3) and show that the integral over A1 of the expectation of the square of each summand converges to zero. For the first summand we compute ∫ A1 E ([ ˆν(s, t) ˆπ(s) ˆπ(t) {ˆρ(s, t) − ρ(s, t)} ]2 {1[min( ˆπ(s), ˆπ(t))≥π0/2] + 1[min{ ˆπ(s), ˆπ(t)}<π0/2]} ) dsdt ≤ E ∫ A1 {ˆρ(s, t) − ρ(s, t)}2 dsdt [ (π0/2)−4 + n4 sup (s,t)∈[0,1]2 Pr(min{ ˆπ(s), ˆπ(t)} < π0/2) ] , where the integral term converges to 0 by similar arguments to those in the proof of Proposition 1 in Kraus [35] with the help of Definition 1(c) and the second term goes to 0 by Hoeffding’s inequality again. For the second summand on the right in (B.3) we can write ∫ A1 E [ I(s, t) { π(s)π(t)ˆν(s, t) − ˆπ(s) ˆπ(t)ν(s, t) ˆπ(s) ˆπ(t)π(s)π(t) }2] ρ(s, t)2 dsdt + ∫ A1 E [ {1 − I(s, t)} { ν(s, t) π(s)π(t) }2] ρ(s, t)2 dsdt. (B.4) Like before, we split the first term in (B.4) into two summands by writing ∫ A1 E [ I(s, t) { π(s)π(t)ˆν(s, t) − ˆπ(s) ˆπ(t)ν(s, t) ˆπ(s) ˆπ(t)π(s)π(t) }2 {1[min( ˆπ(s), ˆπ(t))≥π0/2] + 1[min{ ˆπ(s), ˆπ(t)}<π0/2]} ] ρ(s, t)2 dsdt. The first summand is bounded by 16π−8 0 ∫ A1 E[{π(s)π(t)ˆν(s, t) − ˆπ(s) ˆπ(t)ν(s, t)}2 ]ρ(s, t)2 dsdt, which converges to 0 by the dominated convergence theorem since the expectation in the integrand can be shown to converge to 0; the 600 D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 second summand in the displayed expression above is dominated by n4 π−4 0 ∥R∥2 2 sup(s,t)∈[0,1]2 Pr(min{ ˆπ(s), ˆπ(t)} < π0/2), which converges to 0 by Hoeffding’s inequality. Finally, the second term in (B.4) is dominated by sup(s,t)∈A1 Pr(ˆν(s, t) < ν0/2)π−4 0 ∥R∥2 2, which converges to 0 again by Hoeffding’s inequality. Proof of Theorem 2. Denote Zj(·) = Nj(·)1/2 { ˆµj(·) − ˆµ(·)}/ˆrj and Z = (Z1, . . . , ZK )⊤ . 
Under the null hypothesis we can write Z = ˆDH, where H = (H1, . . . , HK )⊤ with Hj = N 1/2 j ( ˆµj − µ) and ˆD is a bounded linear operator from {L2 ([0, 1])}K to {L2 ([0, 1])}K that maps an element f to an element g whose jth component is given by gj(t) = ∑K l=1( ˆDjlfl)(t) = ∑K l=1 ˆr−1 j {δjl − Nj(t)1/2 ˆwl(t)Jl(t)Nl(t)−1/2 }fl(t) (here δjl is the Kronecker delta and Jl(t)Nl(t)−1/2 is zero if Jl(t) = 1[Nl(t)>0] is zero). From Theorem 1 we see that H converges in distribution to the random element H∞ = (H∞ 1 , . . . , H∞ K )⊤ whose components are mutually independent Gaussian processes with mean zero and covariance operators Kj, j = 1 . . . , K analogous to the operator K in Theorem 1. The operator ˆD converges in probability to the operator D whose elements are defined by (Djlfl)(t) = r−1 j {δjl − πj(t)1/2 a 1/2 j wl(t)πl(t)−1/2 a −1/2 l }fl(t) with wl(t) = alπl(t)/r2 l /( ∑K k=1 akπk(t)/r2 k ) (the convergence is in the operator norm, i.e., ∥ ˆD − D∥∞ P −→ 0). Therefore, it follows from Slutsky’s and continuous mapping theorem that Z = ˆDH converges weakly to Z∞ = DH∞ . This is a K-dimensional mean zero Gaussian random process with cross-covariance operator between Z∞ j and Z∞ k equal to Vjk = ∑K l=1 DjlKlD∗ kl, j = 1, . . . , K, k = 1, . . . , K. These can be consistently estimated by plugging-in the estimators ˆDjl and ˆKl. The kernel of the estimator ˆVjk takes the form ˆvjk(s, t) = ∑K l=1 ˆr−1 j {δjl − Nj(s)1/2 ˆwl(s)Nl(s)−1/2 }ˆκl(s, t){δkl − Nk(t)1/2 ˆwl(t)Nl(t)−1/2 }ˆr−1 k . For (i), the continuous mapping theorem gives that the statistic TL2 = ∥Z∥2 converges weakly to the random variable ∥Z∞ ∥2 . The process Z∞ is a Gaussian random element of the separable Hilbert space {L2 ([0, 1])}K . Therefore, it can be expanded in a Karhunen–Loève series with Gaussian coefficients. Consequently, the distribution of its squared norm is that of the series given in the theorem. The consistency of ˆV implies the consistency of the estimated eigenvalues. To prove (ii), notice that the components of the score vector satisfy Qjl = ⟨ ˆπ 1/2 j Zj, ˆψl⟩. The continuous mapping theorem and Slutsky’s theorem in conjunction with the convergence of ˆψl imply that Q is asymptotically distributed as a Gaussian vector with mean zero and covariance matrix with entries Vjl,km = ⟨π 1/2 j ψl, Vjk(π 1/2 k ψm)⟩. The consistency of ˆVjl,km follows from the consistency of ˆVjk and ˆπj and convergence of ˆψl. The process ( ˆπ 1/2 1 Z1, . . . , ˆπ 1/2 K ZK ) lies in a (K − 1)-dimensional subspace of the K-dimensional product space {L2 ([0, 1])}K and the same holds for its limit. Therefore, the score vector lies in a (K − 1)d-dimensional subspace of RKd , leading to (K − 1)d degrees of freedom of the chi-square distribution. Proof of Theorem 3. The kernel of n1/2 ( ˆR − R) is n1/2 {ˆρ(s, t) − ρ(s, t)} = n1/2 {ˆρ(s, t) − ˇρ(s, t)} + 1 ν(s, t) σ(s, t) + { I(s, t) ˆν(s, t) − 1 ν(s, t) } σ(s, t) + n1/2 {I(s, t) − 1}ρ(s, t), (B.5) where ˇρ is defined like ˆρ with the true mean in place of the estimated mean and σ(s, t) = n−1/2 ∑n i=1 Ui(s, t)[{Xi(s) − µ(s)}{Xi(t) − µ(t)} − ρ(s, t)]. Let us focus on the second summand on the right side of (B.5). All the other terms are negligible in the appropriate sense as we explain later. The kernel σ(s, t) corresponds to the operator Sn = n−1/2 ∑n i=1 Yni, where Yni are the integral operators with kernels yni(s, t) = Ui(s, t)[{Xi(s) − µ(s)}{Xi(t) − µ(t)} − ρ(s, t)]. 
We will apply Theorem 6 to Yni, which is a triangular array of row-wise independent non-identically distributed zero-mean random elements of the separable Hilbert space of the Hilbert–Schmidt operators on L2 ([0, 1]). The covariance operator of Yni is the Hilbert–Schmidt operator Cni on Hilbert–Schmidt operators given by ⟨A1, CniA2⟩ = cov(⟨Yni, A2⟩, ⟨Yni, A1⟩) = ∫ [0,1]4 α1(s, t) cov{yni(s, t), yni(u, v)}α2(u, v)dsdtdudv, where A1, A2 are Hilbert–Schmidt operators with kernels α1, α2, respectively. The kernel of Cni is cni(s, t, u, v) = cov{yni(s, t), yni(u, v)} = θi(s, t, u, v){ζ(s, t, u, v)−ρ(s, t)ρ(u, v)}. The covariance operator of Sn is Gn = n−1 ∑n i=1 Cni with kernel ¯θ(s, t, u, v){ζ(s, t, u, v) − ρ(s, t)ρ(u, v)}. Like in the proof of Theorem 1, one can use the dominated convergence theorem to show that ∥Gn − G∥2 → 0, where G has kernel θ(s, t, u, v){ζ(s, t, u, v) − ρ(s, t)ρ(u, v)}. Thus condition (i) of Theorem 6 is verified. Condition (ii) can be verified like in the proof of Theorem 1. Next, condition (iii) is satisfied because tr Gn = ∫ [0,1]2 ¯θ(s, t, s, t){ζ(s, t, s, t)−ρ(s, t)2 }dsdt converges to tr G = ∫ [0,1]2 θ(s, t, s, t){ζ(s, t, s, t)−ρ(s, t)2 }dsdt. Therefore, Sn is asymptotically distributed as a Gaussian random operator with mean zero and covariance operator G and, consequently, by the continuous mapping theorem the second term on the right-hand side of (B.5) weakly converges to the mean zero Gaussian operator with covariance operator H′ given in Theorem 3. The operators corresponding to the first and fourth summand on the right side in (B.5) were shown to converge to zero in the proof of Proposition 1 in Kraus [35] in the sense that the expectation of their squared Hilbert–Schmidt norm converges to zero. Also, the Hilbert–Schmidt norm of the third term on the right in (B.5) converges to zero in mean square which can be shown by arguments analogous to those used for the second term on the right in (B.1) in the proof of Theorem 1. Therefore, in view of Slutsky’s lemma these terms are negligible for the weak convergence. D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 601 The weak convergence of the operator with kernel M(s, t)1/2 {ˆρ(s, t) − ρ(s, t)} follows from the convergence of ˆν(s, t) to ν(s, t). The consistency of the estimators of H′ and H can be proved along the lines of the proof for K ′ and K in Theorem 1. Proof of Theorem 4. The proof uses perturbation theory in which ˆR is regarded as a perturbed version of R, i.e., ˆR = R + ( ˆR − R). Recall that the perturbation satisfies E ∥ ˆR − R∥2 2 = O(n−1 ) [35, Proposition 1], and, therefore, ∥ ˆR − R∥∞ = OP (n−1/2 ). Similarly to the proof of Theorem 3.1 in [10], we rewrite n1/2 (ˆλm −λm) = n1/2 (ˆλm −λm)1Ωn +n1/2 (ˆλm −λm)1ΩC n , where Ωn = {ω : ∥ ˆR − R∥∞ < εn} for a numerical sequence εn satisfying n−1/2 ≪ εn ≪ n−1/4 . Since Pr(Ωn) → 1 as n → ∞, the term n1/2 (ˆλm − λm)1ΩC n converges to 0 in probability. For ∥ ˆR − R∥∞ sufficiently small, i.e., on Ωn for n large enough, we have by Corollary 3.4 of [22] that n1/2 (ˆλm − λm)1Ωn = n1/2 ⟨( ˆR − R)ϕm, ϕm⟩1Ωn + n1/2 O(∥ ˆR − R∥2 ∞)1Ωn . Here the last term converges to 0 in probability because εn ≪ n−1/4 and the first term on the right side converges in distribution to the limit given in part (i) of the theorem. Hence the result follows from Slutsky’s theorem. The expression for the limiting variance is obtained by rewriting var⟨H ′∞ ϕm, ϕm⟩ = var⟨H ′∞ , ϕm ⊗ ϕm⟩ = ⟨ϕm ⊗ ϕm, H′ (ϕm ⊗ ϕm)⟩. 
Next, we can write n1/2 (ˆsm ˆϕm −ϕm) = n1/2 (ˆsm ˆϕm −ϕm)1Ωn +n1/2 (ˆsm ˆϕm −ϕm)1ΩC n . For n sufficiently large, Corollary 3.3 of [22] gives n1/2 (ˆsm ˆϕm − ϕm)1Ωn = n1/2 Qm( ˆR − R)ϕm1Ωn + n1/2 O(∥ ˆR − R∥2 ∞)1Ωn . The first term on the right converges in distribution to the limiting distribution as claimed in part (ii) and the other terms converge in probability to 0. The limiting covariance operator is obtained by inspecting the cross-covariance operator for each pair of summands in the series QmH ′∞ ϕm. The cross-covariance between (ϕk⊗ϕk)H ′∞ ϕm = ⟨ϕk, H ′∞ ϕm⟩ϕk and (ϕl⊗ϕl)H ′∞ ϕm = ⟨ϕl, H ′∞ ϕm⟩ϕl is cov(⟨ϕk, H ′∞ ϕm⟩, ⟨ϕl, H ′∞ ϕm⟩)(ϕk ⊗ ϕl) = cov{⟨(ϕm ⊗ ϕk), H ′∞ ⟩, ⟨(ϕm ⊗ ϕl), H ′∞ ⟩}(ϕk ⊗ ϕl) = ⟨(ϕm ⊗ ϕk), H′ (ϕm ⊗ ϕl)⟩(ϕk ⊗ ϕl). The inner product in the last expression above equals the integral in part (ii) of the theorem. Proof of Theorem 5. Let ˆD be the linear operator on the product space HS(L2 ([0, 1]))K that maps F = (F1, . . . , FK )⊤ , where Fj are Hilbert–Schmidt operators on L2 ([0, 1]) with kernels fj(s, t), to G = (G1, . . . , GK )⊤ where Gj has kernel gj(s, t) = ∑K l=1{δjl − Mj(s, t)1/2 ˆwl(s, t)Il(s, t)Ml(s, t)−1/2 }fl(s, t). The mapping ˆD is a random linear operator on HS(L2 ([0, 1]))K that acts by pointwise multiplication and linear combination of integral kernels; ˆD itself is not an integral operator but it is bounded because the functions in the braces above are bounded. It converges in probability to the nonrandom bounded linear operator D that maps F to G with Gj with kernel ∑K l=1{δjl − νj(s, t)1/2 a 1/2 j wl(s, t)νl(s, t)−1/2 a −1/2 l } fl(s, t). The convergence is in the sense of the operator norm on linear operators on HS(L2 ([0, 1]))K , that is, ∥ ˆD−D∥∞ P −→ 0, where ∥D∥∞ = sup{∥DF∥2/∥F∥2 : F ∈ HS(L2 ([0, 1]))K } with ∥ · ∥2 being the Hilbert–Schmidt norm on HS(L2 ([0, 1]))K . Now consider the standardized contrasts Z = (Z1, . . . , ZK )⊤ with kernels zj(s, t) = Mj(s, t)1/2 {ˆρj(s, t) − ˆρ(s, t)}. They are obtained as Z = ˆDH , where H = (H1, . . . , HK )⊤ with Hj with kernel hj(s, t) = Mj(s, t)1/2 {ˆρ(s, t) − ρ(s, t)}. Under the null hypothesis Theorem 3 yields that H converges in distribution to H ∞ , a vector of K independent mean zero Gaussian random operators with covariance operators Hj. Therefore, Z = ˆDH converges in distribution to Z ∞ = DH ∞ by Slutsky’s and continuous mapping theorem. The covariance operator B of Z ∞ is given by the cross-covariance operators Bjk between the components Zj and Zk whose estimator ˆBjk has kernel ˆβjk(s, t, u, v) = K ∑ l=1 {δjl − Mj(s, t)1/2 ˆwl(s, t)Ml(s, t)−1/2 }ˆηl(s, t, u, v){δkl − Mk(u, v)1/2 ˆwl(u, v)Ml(u, v)−1/2 }. The test statistic SHS = ∥Z ∥2 2 is asymptotically distributed as ∥Z ∞ ∥2 2. The random variable Z ∞ is a Gaussian element of the separable Hilbert space HS(L2 ([0, 1]))K , therefore it can be expanded in a Karhunen–Loève series with independent Gaussian coefficients. Therefore, its squared norm is distributed as the series of independent chi-square variables weighted by the eigenvalues of the covariance operator and part (i) of the theorem follows. The components of the score vector satisfy Rjlm = ⟨ˆνj(·, ·)1/2 zj(·, ·), ˆUlm⟩. Due to the consistency of the estimated eigenfunctions [35, Proposition 2], the operator ˆUlm (up to the sign ambiguity for l ̸= m) converges to Ulm defined by the true eigenfunctions, with kernel ulm(s, t). 
Therefore, the score vector weakly converges to the mean zero Gaussian vector with components R∞ jlm = ⟨νj(·, ·)1/2 z∞ j (·, ·), Ulm⟩ = ⟨z∞ j (·, ·), νj(·, ·)1/2 ulm(·, ·)⟩ whose covariance matrix has entries Wjlm,kpq = ⟨νj(·, ·)1/2 ulm(·, ·), Bjk{νk(·, ·)1/2 upq(·, ·)}⟩, j, k ∈ {1, . . . , K}, 1 ≤ l ≤ m ≤ d, 1 ≤ p ≤ q ≤ d. The vector of operators with kernels νj(s, t)1/2 z∞ j (s, t) lies in a hyperplane in HS(L2 ([0, 1]))K , thus the matrix W has rank (K − 1)d(d + 1)/2. The consistency of ˆW follows from the convergence of all quantities involved. Hence the limiting distribution is the chi-square distribution as claimed in part (ii). Proof of Theorem 6. First, we prove the convergence in distribution of one-dimensional projections using Lindeberg’s central limit theorem. It follows from assumption (i) that for f ∈ H such that G f ̸= 0, var⟨Sn, f ⟩ = ⟨f , Gnf ⟩ → ⟨f , G f ⟩ as 602 D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 n → ∞. To verify Lindeberg’s condition, we compute n−1 n ∑ i=1 E(⟨Yni, f ⟩2 1[|⟨Yni,f ⟩|>n1/2⟨f ,Gnf ⟩1/2ε]) ≤ n−1 n ∑ i=1 E(∥Yni∥2 ∥f ∥2 1[∥Yni∥>n1/2⟨f ,Gnf ⟩1/2∥f ∥−1ε]). Now in light of assumption (i), there is a positive constant c such that for sufficiently large n, ⟨f , Gnf ⟩1/2 /∥Gn∥∞ > c, and the above expression is further dominated by n−1 ∑n i=1 E(∥Yni∥2 ∥f ∥2 1[∥Yni∥>n1/2∥Gn∥∞c∥f ∥−1ε]), which converges to 0 by assumption (ii). Hence one-dimensional projections converge, and due to Theorem 2.3 of Bosq [5], all finite-dimensional projections converge. To complete the proof, let us prove the tightness of the sequence Sn, n = 1, 2, . . . The idea of the proof is similar to that of Bosq [5, Theorem 2.7] but in the present situation the variables Yn1, . . . , Ynn are possibly non-identically distributed. Let vj and δj, j = 1, 2, . . . be the eigenfunctions and eigenvalues of the limiting operator G . Consider a sequence lk, k = 1, 2, . . . such that lk → ∞ for k → ∞. For ε > 0, let Nk, k = 1, 2, . . . be an increasing sequence of integers such that ∑∞ k=1 lkr2 Nk < ε, where r2 N = ∑∞ j=N δj. Define Bk = {x ∈ H : ∑∞ j=Nk ⟨x, vj⟩2 ≤ l−1 k }. It follows from assumptions (i) and (iii) that Pr(Sn ∈ BC k ) = P ( ∞∑ j=Nk ⟨Sn, vj⟩2 > l−1 k ) ≤ lk E ( ∞∑ j=Nk ⟨Sn, vj⟩2 ) = lk E ( ∥Sn∥2 − Nk−1 ∑ j=1 ⟨Sn, vj⟩2 ) = lk ( tr Gn − Nk−1 ∑ j=1 ⟨vj, Gnvj⟩ ) → lk ( tr G − Nk−1 ∑ j=1 ⟨vj, G vj⟩ ) = lk ∞∑ j=Nk ⟨vj, G vj⟩ = lkr2 Nk . Consider the compact set Kε = ∩∞ k=1Bk and compute lim sup n→∞ Pr(Sn ∈ KC ε ) ≤ lim sup n→∞ ∞∑ k=1 Pr(Sn ∈ BC k ) ≤ ∞∑ k=1 lim sup n→∞ Pr(Sn ∈ BC k ) ≤ ∞∑ k=1 lkr2 Nk < ε, where the second inequality is due to Fatou’s lemma. This proves the tightness. Appendix C. Supplementary data Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jmva.2019.05.002. The supplementary document available online contains further simulation results and additional graphs for the data application. R code is available online. References [1] J.A.D. Aston, D. Pigoli, S. Tavakoli, Tests for separability in nonparametric covariance operators of random surfaces, Ann. Statist. 45 (4) (2017) 1431–1461. [2] A. Aue, R. Gabrys, L. Horváth, P. Kokoszka, Estimation of a change-point in the mean function of functional data, J. Multivariate Anal. 100 (10) (2009) 2254–2269. [3] M. Benko, W. Härdle, A. Kneip, Common functional principal components, Ann. Statist. 37 (1) (2009) 1–34. [4] G. Boente, D. Rodriguez, M. Sued, Testing equality between several populations covariance operators, Ann. Inst. Statist. Math. (2017) 1–32. [5] D. 
Bosq, Linear Processes in Function Spaces, Springer, New York, 2000. [6] F.A. Bugni, Specification test for missing functional data, Econom. Theory 28 (5) (2012) 959–1002. [7] A. Cabassi, D. Pigoli, P. Secchi, P.A. Carter, Permutation tests for the equality of covariance operators of functional data with applications to evolutionary biology, Electron. J. Stat. 11 (2) (2017) 3815–3840. [8] G. Cao, L. Yang, D. Todem, Simultaneous inference for the mean function based on dense functional data, J. Nonparametr. Stat. 24 (2) (2012) 359–377. [9] A. Cuevas, M. Febrero, R. Fraiman, An anova test for functional data, Comput. Statist. Data Anal. 47 (1) (2004) 111–122. [10] J. Cupidon, D. Gilliam, R. Eubank, F. Ruymgaart, The delta method for analytic functions of random operators with application to functional data, Bernoulli 13 (4) (2007) 1179–1194. [11] J. Dauxois, A. Pousse, Y. Romain, Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference, J. Multivariate Anal. 12 (1) (1982) 136–154. [12] A.C. Davison, D.V. Hinkley, Bootstrap methods and their application, Cambridge University Press, Cambridge, 1997, p. x+582. [13] M. Dawson, H.-G. Müller, Dynamic modeling of conditional quantile trajectories, with application to longitudinal snippet data, J. Amer. Statist. Assoc. 113 (524) (2018) 1612–1624. [14] A. Delaigle, P. Hall, Classification using censored functional data, J. Amer. Statist. Assoc. 108 (504) (2013) 1269–1283. [15] A. Delaigle, P. Hall, Approximating fragmented functional data by segments of Markov chains, Biometrika 103 (4) (2016) 779–799. [16] M.-H. Descary, V.M. Panaretos, Recovering covariance from functional fragments, Biometrika 106 (1) (2019) 145–160. [17] F. Ferraty, Y. Romain (Eds.), The Oxford Handbook of Functional Data Analysis, Oxford University Press, Oxford, 2011, p. xviii+494. [18] C.B. Fogarty, D.S. Small, Equivalence testing for functional data with an application to comparing pulmonary function devices, Ann. Appl. Stat. 8 (4) (2014) 2002–2026. [19] S. Fremdt, L. Horváth, P. Kokoszka, J.G. Steinebach, Functional data analysis with increasing number of projections, J. Multivariate Anal. 124 (2014) 313–332. D. Kraus / Journal of Multivariate Analysis 173 (2019) 583–603 603 [20] S. Fremdt, J.G. Steinebach, L. Horváth, P. Kokoszka, Testing the equality of covariance operators in functional samples, Scand. J. Stat. 40 (1) (2013) 138–152. [21] J.E. Gellar, E. Colantuoni, D.M. Needham, C.M. Crainiceanu, Variable-domain functional regression for modeling ICU data, J. Amer. Statist. Assoc. 109 (508) (2014) 1425–1439. [22] D.S. Gilliam, T. Hohage, X. Ji, F. Ruymgaart, The Fréchet derivative of an analytic function of a bounded operator with some applications, Int. J. Math. Math. Sci. 2009 (2009). [23] Y. Goldberg, Y. Ritov, A. Mandelbaum, Predicting the continuation of a function with applications to call center data, J. Statist. Plann. Inference 147 (2014) 53–65. [24] O. Gromenko, P. Kokoszka, J. Sojka, Evaluation of the cooling trend in the ionosphere using functional regression with incomplete curves, Ann. Appl. Stat. 11 (2) (2017) 898–918. [25] J. Guo, B. Zhou, J.-T. Zhang, New tests for equality of several covariance functions for functional data, J. Amer. Statist. Assoc. (2018) To appear. [26] J. Guo, B. Zhou, J.-T. Zhang, Testing the equality of several covariance functions for functional data: a supremum-norm based test, Comput. Statist. Data Anal. 124 (2018) 15–26. [27] L. Horváth, M. Hušková, P. 
[27] L. Horváth, M. Hušková, P. Kokoszka, Testing the stability of the functional autoregressive process, J. Multivariate Anal. 101 (2) (2010) 352–367.
[28] L. Horváth, P. Kokoszka, Inference for Functional Data with Applications, Springer, New York, 2012, p. xiv+422.
[29] L. Horváth, P. Kokoszka, R. Reeder, Estimation of the mean of functional time series and a two-sample problem, J. R. Stat. Soc. Ser. B Stat. Methodol. 75 (1) (2013) 103–122.
[30] D. Jarušková, Testing for a change in covariance operator, J. Statist. Plann. Inference 143 (9) (2013) 1500–1511.
[31] A. Kashlak, J. Aston, R. Nickl, Inference on covariance operators via concentration inequalities: k-sample tests, classification, and clustering via Rademacher complexities, Sankhya A (2018).
[32] A. Kneip, D. Liebl, On the optimal reconstruction of partially observed functional data, Ann. Statist. (2019) to appear.
[33] P. Kokoszka, M. Reimherr, Asymptotic normality of the principal components of functional time series, Stochastic Process. Appl. 123 (5) (2013) 1546–1562.
[34] P. Kokoszka, M. Reimherr, Introduction to Functional Data Analysis, CRC Press, 2017.
[35] D. Kraus, Components and completion of partially observed functional data, J. R. Stat. Soc. Ser. B Stat. Methodol. 77 (4) (2015) 777–801.
[36] D. Kraus, V.M. Panaretos, Dispersion operators and resistant second-order functional data analysis, Biometrika 99 (4) (2012) 813–832.
[37] D. Kraus, M. Stefanucci, Classification of functional fragments by regularized linear classifiers with domain selection, Biometrika 106 (1) (2019) 161–180.
[38] D. Liebl, Modeling and forecasting electricity spot prices: a functional data perspective, Ann. Appl. Stat. 7 (3) (2013) 1562–1592.
[39] D. Liebl, Nonparametric testing for differences in electricity prices: the case of the Fukushima nuclear accident, Ann. Appl. Stat. (2019) to appear.
[40] D. Liebl, S. Rameseder, Partially observed functional data: the case of systematically missing parts, Comput. Statist. Data Anal. 131 (2019) 104–115.
[41] R.Y. Liu, Bootstrap procedures under some non-i.i.d. models, Ann. Statist. 16 (4) (1988) 1696–1708.
[42] C.J. Lloyd, Estimating test power adjusted for size, J. Stat. Comput. Simul. 75 (11) (2005) 921–933.
[43] A. Mas, Testing for the mean of random curves: a penalization approach, Stat. Inference Stoch. Process. 10 (2) (2007) 147–163.
[44] V. Masarotto, Procrustes Metric and Optimal Transport for Covariance Operators, Ph.D. thesis, École Polytechnique Fédérale de Lausanne, 2019.
[45] M. Mojirsheibani, C. Shaw, Classification with incomplete functional covariates, Statist. Probab. Lett. 139 (2018) 40–46.
[46] V.M. Panaretos, D. Kraus, J.H. Maddocks, Second-order comparison of Gaussian random functions and the geometry of DNA minicircles, J. Amer. Statist. Assoc. 105 (490) (2010) 670–682.
[47] V.M. Panaretos, D. Kraus, J.H. Maddocks, Second-order inference for functional data with application to DNA minicircles, in: Recent Advances in Functional Data Analysis and Related Topics, Springer, 2011, pp. 245–250.
[48] E. Paparoditis, T. Sapatinas, Bootstrap-based K-sample testing for functional data, arXiv:1409.4317v4, 2016.
[49] E. Paparoditis, T. Sapatinas, Bootstrap-based testing of equality of mean functions or equality of covariance operators for functional data, Biometrika 103 (3) (2016) 727–733.
[50] D. Pigoli, J.A. Aston, I.L. Dryden, P. Secchi, Distances and inference for covariance operators, Biometrika 101 (2) (2014) 409–422.
[51] A. Pini, L. Spreafico, S. Vantini, A. Vietti, Multi-aspect local inference for functional data: analysis of ultrasound tongue profiles, J. Multivariate Anal. 170 (2019) 162–185.
[52] A. Pini, A. Stamm, S. Vantini, Hotelling's $T^2$ in separable Hilbert spaces, J. Multivariate Anal. 167 (2018) 284–305.
[53] A. Pini, S. Vantini, The interval testing procedure: a general framework for inference in functional data analysis, Biometrics 72 (3) (2016) 835–845.
[54] J.O. Ramsay, B.W. Silverman, Functional Data Analysis, Springer, New York, 2005.
[55] M. Stefanucci, L.M. Sangalli, P. Brutti, PCA-based discrimination of partially observed functional data, with an application to AneuRisk65 data set, Stat. Neerl. 72 (3) (2018) 246–264.
[56] O. Vsevolozhskaya, M. Greenwood, D. Holodov, Pairwise comparison of treatment levels in functional analysis of variance with application to erythrocyte hemolysis, Ann. Appl. Stat. 8 (2) (2014) 905–925.
[57] J.-T. Zhang, Analysis of Variance for Functional Data, Chapman and Hall/CRC, 2013.
[58] J.-T. Zhang, X. Liang, One-way ANOVA for functional data via globalizing the pointwise F-test, Scand. J. Stat. 41 (1) (2014) 51–71.
[59] C. Zhang, H. Peng, J.-T. Zhang, Two samples tests for functional data, Commun. Statist. – Theory Methods 39 (4) (2010) 559–578.

Supplementary material for “Inferential procedures for partially observed functional data”

David Kraus
Department of Mathematics and Statistics, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic; david.kraus@mail.muni.cz

Abstract: This supplementary document contains additional simulation results and further results of the data analysis.

Key words and phrases: Bootstrap; covariance operator; functional data; K-sample test; partial observation; principal components.

S1 Extended simulation results

Table S1 is an extended version of Table 1 presented in the main body of the paper. It includes additional simulation results for tests of equal means for non-Gaussian curves and for groups with unequal covariance operators. The same model as in the paper is used, except that in the non-Gaussian case independent $t_5$-distributed coefficients are generated and in the case of unequal covariance operators we set $\lambda_{3,0} = 0.2$. Since the empirical size deviates from the nominal level in some cases, Table S2 additionally reports size-adjusted powers for the same settings, computed using the method described by Lloyd (2005, Subsection 3.2). Table S3 reports results for tests of equal covariance operators. In addition to the results presented in Table 2 in the main body of the paper, it contains results for $t_5$-distributed coefficients in the model for random curves. Table S4 reports size-adjusted powers for the same settings.
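The idea behind size adjustment can be conveyed by a brief R sketch. Lloyd (2005, Subsection 3.2) describes a more refined estimator of size-adjusted power than what is shown below; the simple recalibration of the critical value to the empirical null quantile, as well as all distributions and parameter values in the sketch, are assumptions made purely for illustration and are not the procedure used to produce Tables S2 and S4.

    ## Minimal sketch of size adjustment (illustration only, not Lloyd's estimator):
    ## replace the nominal critical value by the empirical null quantile, then
    ## recompute the rejection rate under the alternative at that critical value.
    set.seed(2)
    alpha <- 0.05
    R <- 10000

    ## Hypothetical test statistics with a chi-square reference distribution whose
    ## finite-sample null law deviates slightly from the nominal one.
    df <- 3
    stat.null <- rchisq(R, df) * 1.1           # simulations under the null (oversized test)
    stat.alt  <- rchisq(R, df, ncp = 4) * 1.1  # simulations under an alternative

    crit.nominal  <- qchisq(1 - alpha, df)            # nominal critical value
    crit.adjusted <- quantile(stat.null, 1 - alpha)   # size-adjusted (empirical) critical value

    c(empirical.size      = mean(stat.null > crit.nominal),
      raw.power           = mean(stat.alt  > crit.nominal),
      size.adjusted.power = mean(stat.alt  > crit.adjusted))

The raw power and the size-adjusted power differ precisely because the empirical size exceeds the nominal level, which is the situation that motivates reporting Tables S2 and S4 alongside Tables S1 and S3.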
S2 Additional results for the data analysis

Fig. S1 contains additional plots of the covariance function estimates of the heart rate data shown in the main body of the paper. Fig. S2 shows the null estimates of the covariance functions and their leading eigenfunctions that the projection covariance test uses. Components of the score vector standardized by their estimated standard deviations are plotted in Fig. S3.

Acknowledgements

We are grateful to all reviewers for their valuable comments and suggestions. This work was supported by the Czech Science Foundation under Grant GJ17-22950Y. Access to computing and storage facilities owned by parties and projects contributing to the MetaCentrum National Grid Infrastructure, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Table S1
Empirical rejection probability (in %) of the $L^2$ test, $T_{L^2}$, and projection test, $T_d$, of equal means. A dash indicates the same value as on the preceding row. The observation patterns (1)–(9) and mean configurations A–D are described in Section 5 of the paper.

Distrib.  Covar.   Observ.     A           B           C           D
          oper.    pattern   TL2   Td    TL2   Td    TL2   Td    TL2   Td
Gaussian  Equal    (1)       5.6   6.2   69    60    49    56    52    63
                   (2)       5.4   6.7   59    52    28    29    38    50
                   (3)       —     —     —     —     50    56    44    62
                   (4)       4.4   6.5   66    58    51    57    51    62
                   (5)       —     —     —     —     44    49    50    58
                   (6)       5.4   7.1   58    51    50    55    42    49
                   (7)       —     —     —     —     28    34    37    42
                   (8)       5.4   5.8   55    47    34    37    42    48
                   (9)       5.4   7.8   37    40    20    23    26    34
Gaussian  Unequal  (1)       4.2   5.2   79    75    58    63    57    67
                   (2)       4.0   5.6   66    62    28    32    37    52
                   (3)       —     —     —     —     56    62    47    66
                   (4)       4.0   5.7   77    72    58    62    55    64
                   (5)       —     —     —     —     50    55    53    63
                   (6)       3.9   4.9   64    60    55    57    43    52
                   (7)       —     —     —     —     29    36    38    46
                   (8)       4.5   7.0   64    62    39    42    47    54
                   (9)       4.0   6.5   42    48    23    25    27    38
t5        Equal    (1)       5.4   7.3   72    61    51    58    54    63
                   (2)       4.7   7.6   58    53    27    30    38    52
                   (3)       —     —     —     —     50    60    44    63
                   (4)       5.1   6.4   70    60    52    57    51    60
                   (5)       —     —     —     —     46    52    50    60
                   (6)       3.7   6.1   56    50    50    54    41    50
                   (7)       —     —     —     —     27    32    37    43
                   (8)       5.1   7.1   58    52    33    36    44    51
                   (9)       5.4   6.6   38    42    21    24    26    34
t5        Unequal  (1)       5.8   7.4   82    77    59    65    60    68
                   (2)       4.7   6.9   68    64    32    35    44    57
                   (3)       —     —     —     —     60    66    50    68
                   (4)       5.2   6.7   80    76    62    65    59    66
                   (5)       —     —     —     —     53    60    56    65
                   (6)       3.9   6.1   65    63    57    61    47    57
                   (7)       —     —     —     —     32    37    42    50
                   (8)       4.8   7.5   65    64    39    42    50    56
                   (9)       5.5   6.2   44    50    24    28    30    40

Table S2
Size-adjusted empirical power (in %) for the same settings as in Table S1.

Distrib.  Covar.   Observ.     B           C           D
          oper.    pattern   TL2   Td    TL2   Td    TL2   Td
Gaussian  Equal    (1)       66    56    47    52    49    59
                   (2)       56    43    25    23    34    41
                   (3)       —     —     47    48    40    54
                   (4)       68    52    52    48    52    54
                   (5)       —     —     45    43    51    51
                   (6)       58    46    50    49    42    45
                   (7)       —     —     28    29    37    37
                   (8)       54    45    34    34    41    45
                   (9)       36    33    20    17    26    27
Gaussian  Unequal  (1)       83    73    63    62    62    66
                   (2)       72    59    35    29    44    49
                   (3)       —     —     62    59    56    64
                   (4)       81    72    63    62    61    63
                   (5)       —     —     56    55    59    62
                   (6)       68    60    60    57    47    53
                   (7)       —     —     34    36    43    46
                   (8)       67    54    42    36    49    48
                   (9)       45    45    25    23    31    35
t5        Equal    (1)       71    55    50    51    52    57
                   (2)       60    44    28    23    39    42
                   (3)       —     —     51    47    46    53
                   (4)       69    53    51    53    50    56
                   (5)       —     —     44    45    49    55
                   (6)       60    48    53    52    45    48
                   (7)       —     —     31    30    40    40
                   (8)       57    44    32    30    43    44
                   (9)       38    38    21    20    26    31
t5        Unequal  (1)       80    71    58    59    58    62
                   (2)       68    56    32    27    44    48
                   (3)       —     —     61    56    50    60
                   (4)       80    71    62    60    59    62
                   (5)       —     —     53    54    56    61
                   (6)       70    61    61    57    51    54
                   (7)       —     —     37    35    46    47
                   (8)       66    56    40    35    51    49
                   (9)       43    45    23    24    28    36

Table S3
Empirical rejection probability (in %) of the Hilbert–Schmidt norm test, $S_{\mathrm{HS}}$, projection test, $S_d$, and square root covariance test, $S_{\mathrm{sqrt}}$, of equal covariance operators. A dash indicates the same value as on the preceding row. The observation patterns (1)–(5) and covariance configurations A–D are described in Section 5 of the paper.

Distrib.  Observ.     A                  B                  C                  D
          pattern   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt
Gaussian  (1)       5.4   5.8  4.8     69    82   80      69    58   69      78    62   81
          (2)       4.6   6.4  4.9     54    63   41      37    32   38      76    64   54
          (3)       —     —    —       —     —    —       —     —    —       46    30   48
          (4)       5.0   5.1  5.8     64    74   72      61    53   62      72    56   73
          (5)       —     —    —       —     —    —       —     —    —       77    60   77
t5        (1)       3.6   5.7  4.2     26    32   35      30    26   35      38    41   44
          (2)       3.3   6.5  3.4     22    31   18      14    17   16      38    41   23
          (3)       —     —    —       —     —    —       —     —    —       16    16   20
          (4)       4.0   6.4  4.8     23    32   30      25    25   31      30    33   34
          (5)       —     —    —       —     —    —       —     —    —       36    38   40
Table S4
Size-adjusted empirical power (in %) for the same settings as in Table S3.

Distrib.  Observ.     B                  C                  D
          pattern   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt   SHS   Sd   Ssqrt
Gaussian  (1)       66    79   81      67    56   69      78    60   81
          (2)       54    59   42      38    29   38      78    60   55
          (3)       —     —    —       —     —    —       47    26   49
          (4)       64    73   69      61    52   59      72    56   71
          (5)       —     —    —       —     —    —       77    59   74
t5        (1)       32    29   39      36    23   38      44    38   48
          (2)       23    24   20      18    14   18      40    33   26
          (3)       —     —    —       —     —    —       20    12   23
          (4)       29    26   31      31    19   32      37    27   35
          (5)       —     —    —       —     —    —       43    32   41

Fig. S1. Estimated covariance functions of heart rate profiles (top row) and of their derivatives (bottom row) in age groups. [Panels: ≤40, (40,65], >65; both axes labelled Time.]

Fig. S2. The null estimate of the covariance function (left column) and its three leading principal components (right column) for heart rate profiles (top row) and for their first derivative (bottom row). [Legend: PC1 (88.2%), PC2 (9.1%), PC3 (1.8%) for the profiles; PC1 (62.6%), PC2 (31.9%), PC3 (4.3%) for the derivatives.]

Fig. S3. Standardized components of the score vector for testing equal covariances contrasting age groups against the null for heart rate profiles (top row) and for their derivatives (bottom row). [Panels: ≤40, (40,65], >65.]

References

Lloyd, C. J. (2005). Estimating test power adjusted for size. Journal of Statistical Computation and Simulation, 75(11):921–933.