MASARYK UNIVERSITY
FACULTY OF SCIENCE
DEPARTMENT OF MATHEMATICS AND STATISTICS

Theory and Practice of Kernel Smoothing

Habilitation Thesis

Jan Koláček

Brno 2014

Contents
Abstract 2
Preface 3
1 Introduction 4
2 Assumptions and notations 6
2.1 The univariate case 7
2.2 The multivariate case 7
3 Kernel estimation of a regression function 8
3.1 Choosing the shape of the kernel 9
3.2 Choosing the optimal bandwidth 9
3.2.1 Plug-in method 10
3.2.2 Iterative method 10
3.3 Kernel regression for correlated data 11
4 Boundary effects in kernel estimation 12
4.1 Boundary effects in kernel regression 12
4.2 Boundary effects in kernel estimation of a distribution function 12
5 Kernel estimation and reliability assessment 13
6 Multivariate kernel density estimation 14
7 The monograph 15
8 Conclusion and further research 16
References 17
Reprints of articles 24

Abstract

This habilitation thesis is a collection of the articles [2 – 10] published in international journals; four of them are indexed in the Web of Science database. Most of these articles have co-authors, namely Ivana Horová, Kamila Vopatová, R. J. Karunamuni and others, and the contributions of all authors to the joint articles are equivalent. The thesis also refers to the book [1], published by World Scientific in 2012, which summarizes the obtained results and provides their practical application in MATLAB.

Our research area is the theory of kernel smoothing, which has undergone an unprecedented expansion over the last twenty years. Kernel smoothing now belongs among the standard nonparametric techniques used in data processing and modeling. The foundations of the theory are described in the monographs [48, 74, 79]. The controlling factor in kernel smoothing is the smoothing parameter, called the bandwidth in the univariate case and the bandwidth matrix in the multivariate case. Our research has therefore focused primarily on the choice of this smoothing parameter.

For kernel estimates of a regression function, two new methods were proposed. The first assumes a cyclic design, in which the data repeat periodically, and was published in [10]. The second method was introduced in [7], and its statistical properties were derived in [3], which was accepted for publication last year. In connection with regression estimates we also studied boundary effects (see [25]) and kernel regression estimates for correlated data (see [4]).

A no less interesting topic in this area is the boundary effects that arise in kernel estimation. We focused in particular on boundary effects in kernel estimates of a distribution function. In [9] we dealt with suppressing these effects in estimates of the ROC curve, and in [8] we studied their influence and suppression in estimates of the hazard function. We also dealt with applications of kernel estimates in finance, specifically with estimating the indices and curves that describe the quality of scoring models (see [5]).

A very important part of our research is the generalization of the principles of univariate kernel estimation to the multivariate setting.
We first focused on kernel density estimates. In [6], an iterative method for finding the optimal bandwidth matrix was introduced, in particular its graphical interpretation in the special case of a two-dimensional space under the assumption of a diagonal matrix. The statistical properties of this method and its generalization to a full bandwidth matrix were derived in [2].

Preface

The thesis is a collection of the articles [2 – 10]. Four of them have been published in international journals indexed by Web of Science. The paper [3] was accepted in December 2013. The thesis also refers to the book [1], which is a summary of all results in our research area.

Our main research interest lies in the theory of kernel smoothing. Kernel methods are well known and intensively used by the nonparametric statistics community because they are a useful tool for local weighting. Kernel estimators combine two main advantages: a simple expression and ease of implementation. It is well known that the most important factor in kernel estimation is the choice of smoothing parameters. This choice is particularly important because of its role in controlling both the amount and the direction of smoothing. The problem has been widely discussed in many monographs and papers.

The following overview starts with a motivation of the theory of kernel smoothing and then briefly describes the main contributions of the book [1] and the papers [2 – 10]. In order to make the presentation more compact, the thesis consists of the author's selected papers in the area. In References one can find the list of other related publications of the author [11 – 28].

Pronouncement

Almost all papers included in this thesis have co-authors, namely I. Horová, K. Vopatová, R. J. Karunamuni, J. Zelinka, M. Řezáč and D. Lajdová. In all cases, the contributions of all authors were equivalent, since the results were based on common discussions. Formally, the author's contribution to the paper [10] was 100%, the author's contribution to the papers [3, 5, 7, 8, 9] was 50% and the author's contribution to the monograph [1] and the papers [2, 4, 6] was 33%.

Acknowledgement

I wish to thank all the co-authors for their friendly and always very helpful collaboration. I would like to express my gratitude to my colleague Prof. Ivana Horová for our numerous interesting discussions. And most importantly, I would like to thank my wife Veronika. Her support, encouragement, patience and love were the bedrock upon which the past eight years of my life have been built.

1 Introduction

Kernel smoothing belongs to a general category of techniques for nonparametric curve estimation, including nonparametric regression, density and hazard function estimators. These estimates depend on a smoothing parameter called a bandwidth, which controls the smoothness of the estimate, and on a kernel, which plays the role of a weight function. As far as the kernel function is concerned, a key parameter is its order, which is related both to the number of its vanishing moments and to the number of existing derivatives of the underlying curve to be estimated. The bandwidth choice is the crucial problem in kernel smoothing and the main topic of our research.

The first part of our research includes a methodology for nonparametric regression analysis, complemented with practical applications. In nonparametric regression estimation, a critical and inevitable step is to choose the smoothing parameter (bandwidth) to control the smoothness of the curve estimate.
The smoothing parameter considerably affects the features of the estimated curve. Although in practice one can try several bandwidths and choose one subjectively, automatic (data-driven) selection procedures could be useful in many situations; see [73] for more examples. Several automatic bandwidth selectors were proposed and studied in [37], [50], [49], [38] and the references therein. It is well recognized that these bandwidth estimates are subject to large sample variation; kernel estimates based on the bandwidths selected by these procedures can have very different appearances. Due to this large sample variation, classical bandwidth selectors might not be very useful in practice. This fact has motivated us to look for new bandwidth selection methods which give much more stable bandwidth estimates.

In connection with kernel regression analysis we have to mention one essential fact: the regression model assumes no correlation in measurements. In the case of independent observations, the literature on bandwidth selection methods is quite extensive. Nevertheless, if an autocorrelation structure of errors occurs in the data, classical bandwidth selectors do not always provide applicable results (see [35]). Many real data sets (especially time series) show autocorrelation. This has led us to study possibilities for overcoming the effect of dependence on the bandwidth selection.

The next part of our research is focused on the study of boundary effects in kernel estimation. In practical processing we encounter data which are bounded in some interval. The quality of the estimate in the boundary region is affected, since the "effective" window does not belong to this interval and therefore the finite equivalent of the moment conditions on the kernel function no longer applies. This phenomenon is called the boundary effect. Although there is a vast literature on boundary correction in the density estimation context, the boundary effects problem in the cumulative distribution function and regression function context has been studied less. Thus, we have focused our research on these areas of kernel smoothing.

As we have already mentioned, kernel smoothing is widely used in many statistical research areas. One of them is focused on studying discrimination measures used to determine how well models separate the two classes in a binary classification system. There are many possible ways to measure the performance of classification rules. It is often very helpful to have a method for displaying and summarizing performance over a wide range of conditions. This aim is fulfilled, e.g., by the ROC (Receiver Operating Characteristic) curve, the Information value curve, Lift, the Kolmogorov–Smirnov statistic and others. There are many problems in the estimation of these curves in practice, and the kernel smoothing approach seems to be very helpful. Thus, our research has been directed also to this area.

An important part of our research is devoted to the extension of the univariate kernel density estimate to the multivariate setting. As we have already explained, the typical question, motivated by the origins of this research area, asks to determine the optimal smoothing parameter (matrix). Some "classical" methods in the multivariate case were developed and widely discussed in the papers [31], [42], [41], [68], [39]. Tarn Duong's PhD thesis ([39]) provided a comprehensive survey of bandwidth matrix selection methods for kernel density estimation.
The papers [32], [40] investigated general density derivative estimators, i.e., kernel estimators of multivariate density derivatives using general (or unconstrained) bandwidth matrix selectors. We have followed these papers and proposed a new data-driven bandwidth matrix selection method. Similar ideas have been applied to kernel estimates of multivariate regression functions.

We would like to emphasize the great interest in and usefulness of all the mentioned problems in many fields of applied sciences (environmetrics, chemometrics, biometrics, medicine, econometrics, . . . ). Thus our works deal not only with the theoretical background of the considered problems but also with the application to real data. For example, see [4], where the utility of the proposed method was illustrated through an application to a time series of ozone data. For applications of smoothing methods in medicine see [14]. A wide range of applications in finance can be found in [5, 16, 17, 20, 21, 18]. The use of some proposed methods for modeling in environmetrics was described in [22]. See the list in "Other Publications of the Author" at the end of the thesis for more references.

Author's Contribution

Our interest is focused on an outstanding open problem: the optimal bandwidth matrix selection in the multivariate case. Although there exist several classical approaches, it is problematic to implement them in practice because of their computational difficulty. Our results concerning this problem are described in Section 6. The author considers these results to be the most valuable part of the thesis, since they can potentially constitute a significant step towards a more effectively computable solution of the problem.

In Section 3 we overview our results concerning two other related problems. The main part describes results concerning optimal bandwidth selection for univariate kernel regression, and the remaining part deals with the problem of autocorrelated data in kernel regression. Our investigations of boundary effects in kernel smoothing (Section 4) serve as a supporting ground for new techniques in reliability assessment (Section 5), and the results obtained there could be beneficial for applications in other research areas. Finally, Section 7 presents a monograph where all results of our research are summarized. An integral part of the book is a special toolbox in MATLAB. The toolbox is described in the book in detail and provides a practical implementation of the presented methods.

2 Assumptions and notations

In this section, we introduce a definition of the kernel and present the notation and general assumptions used in our research.

Definition 1. Let ν, k be nonnegative integers, 0 ≤ ν < k. Let K be a real valued function satisfying $K \in S_{\nu,k}$, where

$$S_{\nu,k} = \left\{ K \in \mathrm{Lip}[-1,1],\ \mathrm{support}(K) = [-1,1],\ \int_{-1}^{1} x^j K(x)\,dx = \begin{cases} 0, & 0 \le j < k,\ j \neq \nu, \\ (-1)^{\nu}\,\nu!, & j = \nu, \\ \beta_k \neq 0, & j = k. \end{cases} \right\} \qquad (1)$$

Such a function is called a kernel of order k. The integral conditions are often called moment conditions. A commonly used kernel function is the Gaussian kernel

$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

Nevertheless, this kernel has an unbounded support and thus it does not belong to the class $S_{\nu,k}$.
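As a concrete example, the Epanechnikov kernel $K(x) = \frac{3}{4}(1-x^2)$ on $[-1,1]$ belongs to $S_{0,2}$: for $\nu = 0$, $k = 2$ the conditions in (1) require $\int K = 1$, $\int xK = 0$ and $\beta_2 = \int x^2 K \neq 0$. The following MATLAB fragment is a minimal illustrative check of these moments; it is not part of the toolbox [59].

```matlab
% Numerical check of the moment conditions (1) for the Epanechnikov
% kernel K(x) = 3/4*(1 - x^2) on [-1,1], a kernel of order k = 2.
K  = @(x) 0.75*(1 - x.^2);
m0 = integral(K, -1, 1);                % j = nu = 0: should equal (-1)^0*0! = 1
m1 = integral(@(x) x.*K(x), -1, 1);     % 0 <= j < k, j ~= nu: should equal 0
m2 = integral(@(x) x.^2.*K(x), -1, 1);  % j = k: beta_2, nonzero (here 1/5)
fprintf('m0 = %g, m1 = %g, beta_2 = %g\n', m0, m1, m2);
```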
2.1 The univariate case

Let us consider a univariate function f (a density function or a regression function) which is to be estimated. We present a short overview of the notation and assumptions used in our research.

(N1) The positive number h is a smoothing parameter, also called a bandwidth. The bandwidth h depends on n, h = h(n): $\{h(n)\}_{n=1}^{\infty}$ is a nonrandom sequence of positive numbers.

(N2) $K_h(t) = \frac{1}{h} K\!\left(\frac{t}{h}\right)$, $K \in S_{0,k}$, k even, h > 0.

(N3) $V(\rho) = \int_{\mathbb{R}} \rho^2(x)\,dx$ for any square integrable scalar valued function ρ.

(A1) $K \in S_{0,k} \cap C^{\nu}[-1,1]$, $K^{(j)}(-1) = K^{(j)}(1) = 0$, $j = 0, 1, \dots, \nu$, $\nu \in \mathbb{N}$, i.e., $K^{(\nu)} \in S_{\nu,k+\nu}$ (see [46, 60]).

(A2) $f \in C^{k_0}$, $\nu + k \le k_0$, $f^{(\nu+k)}$ is square integrable.

(A3) $\lim_{n\to\infty} h = 0$, $\lim_{n\to\infty} n h^{2\nu+1} = \infty$.

2.2 The multivariate case

This part is devoted to the extension of the assumptions for the univariate case to the multivariate setting. Let us consider a d-dimensional space as the domain of the estimated function f.

(N1) $\mathcal{H}$ denotes a class of d × d symmetric positive definite matrices.

(N2) $V(g) = \int_{\mathbb{R}^d} g(x) g^T(x)\,dx$ for any square integrable vector valued function g.

(A1) The kernel function K satisfies the moment conditions
$$\int K(x)\,dx = 1, \quad \int x K(x)\,dx = \mathbf{0}, \quad \int x x^T K(x)\,dx = \beta_2 I_d,$$
where $I_d$ is the d × d identity matrix.

(A2) $H \in \mathcal{H}$, $H = H_n$ is a sequence of bandwidth matrices such that $n^{-1/2}|H|^{-1/2}(H^{-1})^j$, $j = 0, 1, \dots, \nu$, $\nu \in \mathbb{N}$, and the entries of H approach zero ($(H^{-1})^0$ is considered as equal to 1).

(A3) Each partial derivative of f of order j + 2, j = 0, 1, . . . , ν, is continuous and square integrable.

3 Kernel estimation of a regression function

Our research interests include the methodology of nonparametric regression analysis, combined with practical applications. The aim of regression analysis is to produce a reasonable analysis of an unknown regression function m. By reducing the observational errors it allows the interpretation to concentrate on important details of the mean dependence of Y on X. Kernel regression estimates are among the most popular nonparametric estimates.

Let us consider a standard regression model of the form

$$Y_i = m(x_i) + \varepsilon_i, \quad i = 1, \dots, n, \qquad (2)$$

where m is an unknown regression function and $Y_1, \dots, Y_n$ are observable data variables with respect to the design points $x_1, \dots, x_n$. The residuals $\varepsilon_1, \dots, \varepsilon_n$ are independent identically distributed random variables with $E(\varepsilon_i) = 0$, $\mathrm{var}(\varepsilon_i) = \sigma^2 > 0$, $i = 1, \dots, n$. We suppose a fixed equally spaced design, i.e., the design variables are not random and $x_i = i/n$, $i = 1, \dots, n$. In the case of a random design, where the design points $X_1, \dots, X_n$ are random variables with the same density f, all considerations are similar to the fixed design. A more detailed description of the random design can be found, e.g., in [79].

The most popular regression estimator was proposed by Nadaraya and Watson ([64] and [80]) and is defined as

$$\hat m_{NW}(x,h) = \frac{\sum_{i=1}^{n} K_h(x_i - x)\, Y_i}{\sum_{i=1}^{n} K_h(x_i - x)}. \qquad (3)$$

In order to complete the overview of commonly used nonparametric methods for estimating m(x), we mention these estimators:

• local linear estimator ([76, 36])
$$\hat m_{LL}(x,h) = \frac{1}{n} \sum_{i=1}^{n} \frac{\{\hat s_2(x,h) - \hat s_1(x,h)(x_i - x)\}\, K_h(x_i - x)\, Y_i}{\hat s_2(x,h)\,\hat s_0(x,h) - \hat s_1(x,h)^2}, \qquad (4)$$
where $\hat s_r(x,h) = \frac{1}{n} \sum_{i=1}^{n} (x_i - x)^r K_h(x_i - x)$, r = 0, 1, 2,

• Priestley–Chao estimator ([66])
$$\hat m_{PCH}(x,h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x_i - x)\, Y_i, \qquad (5)$$

• Gasser–Müller estimator ([44])
$$\hat m_{GM}(x,h) = \sum_{i=1}^{n} Y_i \int_{s_{i-1}}^{s_i} K_h(t - x)\,dt, \qquad (6)$$
where $s_i = \frac{x_i + x_{i+1}}{2} = \frac{2i+1}{2n}$, $i = 1, \dots, n-1$, $s_0 = 0$, $s_n = 1$.

One can see from these formulas that kernel estimators can generally be expressed as

$$\hat m(x,h) = \sum_{i=1}^{n} W_i^{(j)}(x,h)\, Y_i, \qquad (7)$$

where the weights $W_i^{(j)}(x,h)$, $j \in \{NW, LL, PCH, GM\}$, correspond to the weights of the estimators $\hat m_{NW}$, $\hat m_{LL}$, $\hat m_{PCH}$ and $\hat m_{GM}$ defined above; a minimal implementation of the Nadaraya–Watson weights is sketched below.
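The following MATLAB fragment is a minimal sketch of the Nadaraya–Watson estimator (3) with the Epanechnikov kernel on the fixed design $x_i = i/n$; it is illustrative only and does not reproduce the toolbox [59] code. Note that the factor 1/h of $K_h$ cancels in the ratio (3), so the unscaled kernel can be used.

```matlab
% Nadaraya-Watson estimate (3) on the fixed design x_i = i/n.
% Minimal illustrative sketch; not the toolbox [59] implementation.
n  = 100;
xi = (1:n)'/n;                             % design points
Y  = sin(2*pi*xi) + 0.2*randn(n,1);        % data from model (2)
h  = 0.08;                                 % bandwidth, chosen ad hoc here
x  = (0:0.01:1)';                          % evaluation grid
K  = @(u) 0.75*(1 - u.^2).*(abs(u) <= 1);  % Epanechnikov kernel
W  = K((xi' - x)/h);                       % numel(x)-by-n weight matrix
mNW = (W*Y)./sum(W, 2);                    % estimator (3); 1/h cancels
plot(xi, Y, '.', x, mNW, '-')
```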
In the univariate case, these estimators depend on a bandwidth, which is a smoothing parameter controlling the smoothness of the estimated curve, and on a kernel, which is considered as a weight function.

3.1 Choosing the shape of the kernel

The choice of the kernel does not influence the asymptotic behavior of the estimate as significantly as the bandwidth does. We assume $K \in S_{0,k}$ with the additional assumption that k is even, k > 0. More detailed procedures for choosing the optimal kernel are described in [1].

3.2 Choosing the optimal bandwidth

The choice of the smoothing parameter is a crucial problem in kernel regression. The literature on bandwidth selection is quite extensive, e.g., the monographs [79, 74, 75] and the papers [48, 33, 34, 67, 77, 37, 38, 58, 10]. Although in practice one can try several bandwidths and choose one subjectively, automatic (data-driven) selection procedures could be useful in many situations; see [73] for more examples. Most of these procedures are based on estimating the Average Mean Square Error. They are asymptotically equivalent and asymptotically unbiased (see [48, 33, 34]). However, in simulation studies ([58]) it is often observed that most selectors are biased toward undersmoothing and yield smaller bandwidths more frequently than predicted by asymptotic results. As a part of our research we developed two methods for optimal bandwidth selection.

3.2.1 Plug-in method

In the simulation study of [33], it was observed that standard criteria give smaller bandwidths more frequently than predicted by the asymptotic theorems. [33] provided an explanation for the cause and suggested a procedure to overcome the difficulty. By applying this procedure, we have introduced a method for bandwidth selection which gives much more stable bandwidth estimates (see [10]). As a result, we have obtained a type of plug-in method. Our ideas are based on the assumption of a "cyclic design", that is, we suppose m to be a smooth periodic function and the estimate is obtained by applying the kernel to the extended series $Y_i$, $i = -n+1, -n+2, \dots, 2n$, where generally $Y_{j+ln} = Y_j$ for $j = 1, \dots, n$ and $l \in \mathbb{Z}$. Similarly $x_i = i/n$, $i = -n+1, -n+2, \dots, 2n$. The main result of the paper [10] is the plug-in estimator of the optimal bandwidth h

$$\hat h_{PI} = \left( \frac{\hat\sigma^2\, V(K)\, (k!)^2}{2kn\, \beta_k^2\, \hat A_k} \right)^{\frac{1}{2k+1}}, \qquad (8)$$

where $\hat A_k$ is an estimate of the functional $\int \big(m^{(k)}(x)\big)^2\,dx$.

We would like to point out the computational aspect of the proposed estimator. It has preferable properties compared to the classical methods because there is no need to minimize any error function. Also, the sample size necessary for computing the estimate is far smaller than for classical methods. On the other hand, a minor disadvantage could be the fact that we need a "starting" approximation of the unknown parameter h. We should also note that the proposed method was developed for a rather limited case: the cyclic design.
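Continuing the regression sketch above, the plug-in bandwidth (8) is a closed-form expression once $\hat\sigma^2$ and $\hat A_k$ are available. The fragment below is a hedged illustration for k = 2 and the Epanechnikov kernel ($V(K) = 3/5$, $\beta_2 = 1/5$); it uses a simple difference-based variance estimate and a placeholder value for $\hat A_2$, not the estimators actually constructed in [10].

```matlab
% Plug-in bandwidth (8) for k = 2 and the Epanechnikov kernel,
% where V(K) = 3/5 and beta_2 = 1/5. Sketch only: sigma2 is a simple
% difference-based variance estimate and A2hat is a placeholder for an
% estimate of int (m''(x))^2 dx; [10] builds its own estimators.
k = 2; VK = 3/5; beta2 = 1/5;
sigma2 = sum(diff(Y).^2)/(2*(n-1));   % difference-based estimate of sigma^2
A2hat  = 8*pi^4;                      % true value of int (m'')^2 for m = sin(2*pi*x)
hPI = (sigma2*VK*factorial(k)^2 ./ ...
      (2*k*n*beta2^2*A2hat)).^(1/(2*k+1));
```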
3.2.2 Iterative method

Successful approaches to bandwidth selection in kernel density estimation can be transferred to the case of kernel regression. An iterative method for kernel density estimation was developed and widely discussed in [54]. The ideas of this paper were extended to the regression case. The obtained selector was introduced in [7] and its statistical properties were derived in [3]. The proposed method is based on an optimally balanced relation between the integrated variance and the integrated square bias

$$AIV\{\hat m(\cdot, h_{opt})\} - 2k\, AISB\{\hat m(\cdot, h_{opt})\} = 0, \qquad (9)$$

where

$$AIV\{\hat m(\cdot, h)\} = \frac{\sigma^2 V(K)}{nh} \quad \text{and} \quad AISB\{\hat m(\cdot, h)\} = \frac{1}{n} \sum_{i=1}^{n} \big( E\hat m(x_i, h) - m(x_i) \big)^2.$$

The main idea consists in finding a fixed point of the equation

$$h = \frac{\hat\sigma^2\, V(K)}{2kn\, \widehat{AISB}\{\hat m(\cdot, h)\}}. \qquad (10)$$

We use Steffensen's iterative method with the starting approximation $\hat h_0 = 2/n$. This approach leads to an iterative quadratically convergent process (see [54]).
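Equation (10) is of fixed-point form $h = g(h)$, and Steffensen's method accelerates the iteration without derivatives. The sketch below shows the Steffensen step itself; the stand-in function g is only a placeholder for the right-hand side of (10), whose $\widehat{AISB}$ term from [7] is not reproduced here.

```matlab
% Steffensen iteration for a fixed-point equation h = g(h). In the
% method of Section 3.2.2, g(h) would be the right-hand side of (10);
% the g below is only a stand-in so that the sketch runs.
g = @(h) 0.5*(h + 0.1./sqrt(h));      % placeholder contraction, not from [7]
h = 2/n;                              % starting approximation h_0 = 2/n
for it = 1:50
    g1 = g(h);  g2 = g(g1);
    denom = g2 - 2*g1 + h;
    if abs(denom) < eps, break, end
    hNew = h - (g1 - h)^2/denom;      % Steffensen (Aitken) update
    if abs(hNew - h) < 1e-10, h = hNew; break, end
    h = hNew;
end
hIT = h;                              % iterative bandwidth estimate
```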
3.3 Kernel regression for correlated data

As mentioned above, the literature on bandwidth selection methods is quite extensive in the case of independent observations. Nevertheless, if an autocorrelation structure of errors occurs in the data, classical bandwidth selectors do not always provide applicable results (see [35]). There exist several possibilities for overcoming the effect of dependence on the bandwidth selection. In the paper [4] we used the results of [35] and [10] and developed a new flexible plug-in approach for estimating the optimal smoothing parameter. The utility of the method was illustrated through a simulation study and an application to the time series of ozone data obtained from the Vernadsky station in Antarctica.

4 Boundary effects in kernel estimation

In practical processing we encounter data which are bounded in some interval. The quality of the estimate in the boundary region is affected, since the "effective" window [x − h, x + h] does not belong to this interval, so the finite equivalent of the moment conditions on the kernel function no longer applies. This phenomenon is called the boundary effect. There are several methods to cope with boundary effects. One of them is based on the construction of special boundary kernels. Their construction was described in detail, for instance, in [63] or [51]. These kernels can be used successfully in kernel regression, but their use in density or distribution function estimation often gives inappropriate results. Although there is a vast literature on boundary correction in the density estimation context, the boundary effects problem in the distribution function and regression function context has been studied less. Thus we focused our research on these areas of kernel smoothing.

4.1 Boundary effects in kernel regression

If the support of the true regression curve is bounded, then most nonparametric methods give estimates that are severely biased in regions near the endpoints. To be specific, the bias of $\hat m(x)$ is of order $O(h)$ rather than $O(h^2)$ for $x \in [0, h] \cup [1-h, 1]$. This boundary problem affects the global performance visually and also in terms of a slower rate of convergence in the usual asymptotic analysis. It has been recognized as a serious problem and many works are devoted to reducing the effects. [44, 45, 46] and [63] discussed boundary kernel methods. Another approach to the boundary problem is reflection methods, which generally consist in reflecting the data about the boundary points and then estimating the regression function. These methods were discussed, e.g., in [69, 47]. The reflection principles used in kernel density estimation can also be adapted to kernel regression. The regression estimator with the assumption of the "cyclic" model described in [10] can also be considered a special case of a reflection technique. A short comparative study of methods for eliminating boundary effects was given in [25].

4.2 Boundary effects in kernel estimation of a distribution function

We have also focused on boundary correction in kernel estimation of a cumulative distribution function (CDF), which is important for other applications – especially for kernel estimation of ROC curves and hazard functions. In the paper [9], we developed a new kernel type estimator of the ROC curve that removes boundary effects near the endpoints of the support. The estimator is based on a new boundary corrected kernel estimator of distribution functions and builds on the ideas of [56, 57], developed for boundary correction in kernel density estimation. The basic technique of construction of the proposed estimator is a type of generalized reflection method involving reflecting a transformation of the observed data. In fact, the proposed method generates a class of boundary corrected estimators. We have derived expressions for the bias and variance of the proposed estimator. Furthermore, the proposed estimator has been compared with the "classical estimator" using simulation studies. Using similar ideas as in [9], we have developed a new kernel estimator of the hazard function. The method was proposed in [8]; it successfully removes boundary effects and performs considerably better than classical estimators.

5 Kernel estimation and reliability assessment

The following part of our research is focused on studying discrimination measures used for assessing how well models separate the two classes in a binary classification system. There are many possible ways of measuring the performance of classification rules. It is often very helpful to know a way of displaying and summarizing performance over a wide range of conditions. This aim is fulfilled by the ROC (Receiver Operating Characteristic) curve. It is a single curve summarizing the distribution functions of the scores of the two classes. In our research, we have followed the financial sphere, where the discrimination power of scoring models is evaluated. However, most of the studied indices have wide application in many other areas where models with binary output are used, like biology, medicine, engineering and so on. References on this topic are quite extensive, see, e.g., [72, 29, 78]. In [5], we summarized the most important quality measures and gave some alternatives to them. All of the mentioned indices are based on the density or on the distribution function; therefore one can suggest the technique of kernel smoothing for their estimation. More detailed studies on all indices can also be found, e.g., in [20, 21]. Finally, a new conservative approach to quality assessment was proposed in [18].

6 Multivariate kernel density estimation

An important part of our research is devoted to the extension of the univariate kernel density estimate to the multivariate setting. Let a d-variate random sample $X_1, \dots, X_n$ be drawn from a density f. The kernel density estimator $\hat f$ at the point $x \in \mathbb{R}^d$ is defined as

$$\hat f(x, H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i), \qquad (11)$$

where K is a kernel function, which is often taken to be a d-variate symmetric probability density function, H is a d × d symmetric positive definite matrix, and $K_H$ is the scaled kernel function

$$K_H(x) = |H|^{-1/2} K(H^{-1/2} x),$$

with |H| the determinant of the matrix H. In the univariate case, kernel estimates depend on a bandwidth, which is a smoothing parameter controlling the smoothness of the estimated curve, and on a kernel, which is considered as a weight function.
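As an illustration of (11), the following MATLAB sketch evaluates a bivariate estimate with a full bandwidth matrix. For the Gaussian kernel $K = \varphi_I$ one has $K_H = \varphi_H$, the $N(0, H)$ density, since $|H|^{-1/2}\varphi_I(H^{-1/2}x) = \varphi_H(x)$. The fragment assumes the Statistics Toolbox functions mvnrnd and mvnpdf and is not the toolbox [59] code.

```matlab
% Bivariate kernel density estimate (11) with a full bandwidth matrix H.
% For the Gaussian kernel, K_H equals the N(0,H) density phi_H.
X = mvnrnd([0 0], [1 0.7; 0.7 1], 200);   % sample: n = 200, d = 2
H = [0.09 0.03; 0.03 0.09];               % a symmetric positive definite H
fhat = @(x) mean(mvnpdf(X, x, H));        % estimator (11) at a point x (1-by-2)
[u, v] = meshgrid(-3:0.1:3);
z = arrayfun(@(a, b) fhat([a b]), u, v);  % evaluate on a grid
contour(u, v, z)
```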
The choice of the smoothing parameter is a crucial problem in kernel density estimation. The literature on bandwidth selection is quite extensive, e.g., the monographs [79], [74], [75] and the papers [61], [65], [71], [55], [30]. As far as the kernel estimation of density derivatives is concerned, this problem has received significantly less attention. In the paper [50], an adaptation of the least squares cross-validation method was proposed for the bandwidth choice in kernel density derivative estimation. In the paper [52], an automatic procedure for the simultaneous choice of the bandwidth, the kernel and its order for kernel density and density derivative estimates was proposed. But this procedure can only be applied when the explicit minimum of the Asymptotic Mean Integrated Square Error of the estimate is available. It is known that this minimum exists only for d = 2 and a diagonal matrix H. In the paper [6], the basic formula for the corresponding procedure was given.

The need for nonparametric density estimates for recovering the structure in multivariate data is greater, since parametric modelling is more difficult than in the univariate case. The extension of the univariate kernel methodology is not without problems. The most general smoothing parameterization of the kernel estimator in d dimensions requires the specification of all entries of the d × d positive definite bandwidth matrix. The multivariate kernel density estimator we have dealt with is a direct extension of the univariate estimator (see, e.g., [79]).

Successful approaches to univariate bandwidth selection can be transferred to the multivariate setting. The least squares cross-validation and plug-in methods in the multivariate case were developed and widely discussed in the papers [31], [42], [41], [68], [39]. Some papers (e.g., [23], [6], [19]) focused on constrained parameterizations of the bandwidth matrix, such as a diagonal matrix. It is a well-known fact that visualization is an important component of nonparametric data analysis. In the paper [6], this effective strategy was used to clarify the process of the bandwidth matrix choice using bivariate functional surfaces. The paper [53] brought a short communication on a kernel gradient estimator. Tarn Duong's PhD thesis ([39]) provided a comprehensive survey of bandwidth matrix selection methods for kernel density estimation. The papers [32], [40] investigated general density derivative estimators, i.e., kernel estimators of multivariate density derivatives using general (or unconstrained) bandwidth matrix selectors. They defined the kernel estimator of the multivariate density derivative and provided results for the Mean Integrated Square Error convergence asymptotically and for finite samples. Moreover, the relationship between the convergence rate and the bandwidth matrix was established there. They also developed estimates for the class of normal mixture densities.

We have followed the mentioned papers and in [2] we proposed a new data-driven bandwidth matrix selection method. This method is based on an optimally balanced relation between the integrated variance and the integrated squared bias, see [54]. Similar ideas have been applied to kernel estimates of regression functions (see [7] or [3]). We have discussed the statistical properties and relative rates of convergence of the proposed method as well.
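For comparison with the selectors discussed above, the least squares cross-validation criterion for the density itself can be written, with the Gaussian kernel, as $CV(H) = n^{-2}\sum_{i,j}\varphi_{2H}(X_i - X_j) - \frac{2}{n(n-1)}\sum_{i\neq j}\varphi_H(X_i - X_j)$, using $K_H * K_H = \varphi_{2H}$. A minimal MATLAB sketch follows (again assuming the Statistics Toolbox and the sample X from the sketch above); for brevity it searches only over scalar multiples of the sample covariance, whereas the full selectors search over all of $\mathcal{H}$.

```matlab
% Least squares cross-validation for the density estimate (11) with the
% Gaussian kernel. Sketch only: H is restricted to c*cov(X), c > 0.
S = cov(X);
copt = fminbnd(@(c) lscv(X, c*S), 1e-3, 1);
Hcv = copt*S;                                    % selected bandwidth matrix

function val = lscv(X, H)
    n = size(X, 1);  d = size(X, 2);
    T1 = 0;  T2 = 0;
    for i = 1:n
        T1 = T1 + sum(mvnpdf(X, X(i,:), 2*H));   % includes the i = j terms
        T2 = T2 + sum(mvnpdf(X, X(i,:), H));
    end
    T2 = T2 - n*mvnpdf(zeros(1,d), zeros(1,d), H);  % drop the i = j terms
    val = T1/n^2 - 2*T2/(n*(n-1));
end
```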
7 The monograph

The knowledge obtained in our research on kernel smoothing theory has resulted in the monograph [1]. The book provides a brief comprehensive overview of the statistical theory. We do not concentrate on details, since there exist a number of excellent monographs developing the statistical theory ([79, 48, 62, 74, 75, 70] etc.). Instead, the emphasis is put on the implementation of the presented methods in MATLAB. All created programs are included in a special toolbox which is an integral part of the book. This toolbox contains many MATLAB scripts useful for kernel smoothing of a density, distribution function, regression function, hazard function and multivariate density, and also for kernel estimation in reliability assessment. The toolbox can be downloaded from a public web page (see [59]). The toolbox is divided into six parts according to the chapters of the book. All scripts are included in a user interface which is easy to work with. Each chapter of the book contains a detailed help for the related part of the toolbox.

The monograph is intended for newcomers to the field of smoothing techniques, and it is also appropriate for a wide audience: advanced graduate and PhD students and researchers from both statistical science and interface disciplines.

8 Conclusion and further research

The previous text summarizes all our results in kernel smoothing, which belongs to a general category of techniques for nonparametric curve estimation. We have studied several parts of kernel smoothing theory. The most interesting theoretical results were obtained in multivariate kernel estimation and in the choice of the optimal smoothing parameter. We have also paid attention to the use of our results in many fields of applied sciences like environmetrics, biometrics, medicine or econometrics. Thus our works deal not only with the theoretical background of the considered problems but also with the application to real data.

In further research we would like to aim at extending our previous results to modeling functional data sets. A functional data set can be defined as observations of a random variable which takes values in an infinite dimensional space (a functional space). Thus the analysis of functional data seems to be a natural extension of our ideas. For more about functional data analysis see, e.g., [43].

References

Publications Included in the Thesis

[1] I. Horová, J. Koláček, and J. Zelinka, Kernel Smoothing in MATLAB: Theory and Practice of Kernel Smoothing. Singapore: World Scientific Publishing Co. Pte. Ltd., 2012.

[2] I. Horová, J. Koláček, and K. Vopatová, “Full bandwidth matrix selectors for gradient kernel density estimate,” Computational Statistics & Data Analysis, vol. 57, no. 1, pp. 364–376, 2013.

[3] J. Koláček and I. Horová, “Selection of bandwidth for kernel regression,” Communications in Statistics – Theory and Methods, to appear.

[4] I. Horová, J. Koláček, and D. Lajdová, “Kernel regression model for total ozone data,” Journal of Environmental Statistics, vol. 4, no. 2, pp. 1–12, 2013.

[5] M. Řezáč and J. Koláček, “Lift-based quality indexes for credit scoring models as an alternative to Gini and KS,” Journal of Statistics: Advances in Theory and Applications, vol. 7, no. 1, pp. 1–23, 2012.

[6] I. Horová, J. Koláček, and K. Vopatová, “Visualization and bandwidth matrix choice,” Communications in Statistics – Theory and Methods, vol. 41, no. 4, pp. 759–777, 2012.

[7] J. Koláček and I. Horová, “Iterative bandwidth method for kernel regression,” Journal of Statistics: Advances in Theory and Applications, vol. 8, no. 2, pp. 91–103, 2012.
[8] J. Koláček and R. J. Karunamuni, “A generalized reflection method for kernel distribution and hazard functions estimation,” Journal of Applied Probability and Statistics, vol. 6, no. 2, pp. 73–85, 2011.

[9] J. Koláček and R. J. Karunamuni, “On boundary correction in kernel estimation of ROC curves,” Austrian Journal of Statistics, vol. 38, no. 1, pp. 17–32, 2009.

[10] J. Koláček, “Plug-in method for nonparametric regression,” Computational Statistics, vol. 23, no. 1, pp. 63–78, 2008.

Other Publications of the Author

[11] K. Vopatová, I. Horová, and J. Koláček, “Bandwidth matrix selectors for multivariate kernel density estimation,” in Theoretical and Applied Issues in Statistics and Demography, pp. 123–130, Barcelona: International Society for the Advancement of Science and Technology (ISAST), 2013.

[12] K. Konečná, I. Horová, and J. Koláček, “Conditional density estimations,” in Theoretical and Applied Issues in Statistics and Demography, pp. 39–45, Barcelona: International Society for the Advancement of Science and Technology (ISAST), 2013.

[13] D. Lajdová, J. Koláček, and I. Horová, “Kernel regression model with correlated errors,” in Theoretical and Applied Issues in Statistics and Demography, pp. 81–88, Barcelona: International Society for the Advancement of Science and Technology (ISAST), 2013.

[14] M. Trhlík, R. Soumarová, P. Bartoš, M. Těžká, J. Koláček, K. Vopatová, I. Horová, and P. Šupíková, “Neoadjuvant chemotherapy for primary advanced ovarian cancer,” in The International Journal of Gynecological Cancer – October 2012, vol. 22, issue 8, supplement 3, E517, 2013.

[15] I. Horová, J. Koláček, K. Vopatová, and J. Zelinka, “Contribution to bandwidth matrix choice for multivariate kernel density estimate,” in Proceedings of the 58th World Statistics Congress, ISI 2011, 2011.

[16] M. Řezáč and J. Koláček, “Adjusted empirical estimate of information value for credit scoring models,” in Proceedings ASMDA 2011, (Rome), pp. 1162–1169, Edizioni ETS, 2011.

[17] J. Koláček and M. Řezáč, “Quality measures for predictive scoring models,” in Proceedings ASMDA 2011 (R. Manca and C. H. Skiadas, eds.), (Rome, Italy), pp. 720–727, Edizioni ETS, 2011.

[18] J. Koláček and M. Řezáč, “A conservative approach to assessment of discriminatory models,” in Workshop of the Jaroslav Hájek Center and Financial Mathematics in Practice I, Book of Short Papers (I. Horová and J. Zelinka, eds.), (Brno), pp. 30–36, Masaryk University, 2011.

[19] K. Vopatová, I. Horová, and J. Koláček, “Bandwidth matrix choice for bivariate kernel density derivative,” in Proceedings of the 25th International Workshop on Statistical Modelling, (Glasgow, UK), pp. 561–564, 2010.

[20] J. Koláček and M. Řezáč, “Assessment of scoring models using information value,” in 19th International Conference on Computational Statistics, Paris, France, August 22–27, 2010: Keynote, Invited and Contributed Papers, (Paris), pp. 1191–1198, SpringerLink, 2010.

[21] M. Řezáč and J. Koláček, “On aspects of quality indexes for scoring models,” in 19th International Conference on Computational Statistics, Paris, France, August 22–27, 2010: Keynote, Invited and Contributed Papers, (Paris), pp. 1517–1524, SpringerLink, 2010.

[22] I. Horová, J. Koláček, J. Zelinka, and A. H. El-Shaarawi, “Smooth estimates of distribution functions with application in environmental studies,” in Advanced Topics on Mathematical Biology and Ecology, (Mexico), pp. 122–127, WSEAS Press, 2008.
[23] I. Horová, J. Koláček, J. Zelinka, and K. Vopatová, “Bandwidth choice for kernel density estimates,” in Proceedings IASC, (Yokohama), pp. 542–551, IASC, 2008.

[24] J. Koláček, “An improved estimator for removing boundary bias in kernel cumulative distribution function estimation,” in Proceedings in Computational Statistics COMPSTAT’08, (Porto), pp. 549–556, Physica-Verlag, 2008.

[25] J. Koláček and J. Poměnková, “A comparative study of boundary effects for kernel smoothing,” Austrian Journal of Statistics, vol. 35, no. 2, pp. 281–289, 2006.

[26] J. Koláček, “Use of Fourier transformation for kernel smoothing,” in Proceedings in Computational Statistics COMPSTAT’04, pp. 1329–1336, Springer, 2004.

[27] J. Koláček, “Some stabilized bandwidth selectors for nonparametric regression,” Journal of Electrical Engineering, vol. 54, no. 12, pp. 65–68, 2003.

[28] J. Koláček, “Problems of automatic data-driven bandwidth selectors for nonparametric regression,” Journal of Electrical Engineering, vol. 53, no. 12, pp. 48–51, 2002.

Other References

[29] R. Anderson. The credit scoring toolkit: theory and practice for retail credit risk management and decision automation. Oxford University Press, 2007.

[30] R. Cao, A. Cuevas, and W. González Manteiga. A comparative study of several smoothing methods in density estimation. Computational Statistics and Data Analysis, 17(2):153–176, 1994.

[31] J. E. Chacón and T. Duong. Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. Test, 19(2):375–398, 2010.

[32] J. E. Chacón, T. Duong, and M. P. Wand. Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica, 21(2):807–840, 2011.

[33] S. Chiu. Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika, 77(1):222–226, 1990.

[34] S. Chiu. Some stabilized bandwidth selectors for nonparametric regression. Annals of Statistics, 19(3):1528–1546, 1991.

[35] C. K. Chu and J. S. Marron. Choosing a kernel regression estimator. Statistical Science, 6(4):404–419, 1991.

[36] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.

[37] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1979.

[38] B. Droge. Some comments on cross-validation. Technical Report 1994-7, Humboldt Universität Berlin, 1996.

[39] T. Duong. Bandwidth selectors for multivariate kernel density estimation. PhD thesis, School of Mathematics and Statistics, University of Western Australia, October 2004.

[40] T. Duong, A. Cowling, I. Koch, and M. P. Wand. Feature significance for multivariate kernel density estimation. Computational Statistics & Data Analysis, 52(9):4225–4242, 2008.

[41] T. Duong and M. Hazelton. Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. Journal of Multivariate Analysis, 93(2):417–433, 2005.

[42] T. Duong and M. Hazelton. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3):485–506, 2005.

[43] F. Ferraty and P. Vieu. Nonparametric functional data analysis: theory and practice. Springer, 2006.

[44] T. Gasser and H.-G. Müller. Kernel estimation of regression functions. In T. Gasser and M. Rosenblatt, editors, Smoothing Techniques for Curve Estimation, volume 757 of Lecture Notes in Mathematics, pages 23–68. Springer Berlin / Heidelberg, 1979.
[45] T. Gasser, H.-G. Müller, and V. Mammitzsch. Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, Series B (Methodological), 47(2):238–252, 1985.

[46] B. Granovsky and H.-G. Müller. Optimizing kernel methods: a unifying variational principle. International Statistical Review, 59(3):373–388, 1991.

[47] P. Hall and T. E. Wehrly. A geometrical method for removing edge effects from kernel-type nonparametric regression estimators. Journal of the American Statistical Association, 86(415):665–672, 1991.

[48] W. Härdle. Applied Nonparametric Regression. Cambridge University Press, Cambridge, 1st edition, 1990.

[49] W. Härdle, P. Hall, and J. Marron. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83(401):86–95, 1988.

[50] W. Härdle, J. S. Marron, and M. P. Wand. Bandwidth choice for density derivatives. Journal of the Royal Statistical Society, Series B (Methodological), 52(1):223–232, 1990.

[51] I. Horová. Boundary kernels. In Summer Schools MATLAB 94, 95, pages 17–24. Brno: Masaryk University, 1997.

[52] I. Horová, P. Vieu, and J. Zelinka. Optimal choice of nonparametric estimates of a density and of its derivatives. Statistics & Decisions, 20(4):355–378, 2002.

[53] I. Horová and K. Vopatová. Kernel gradient estimate. In F. Ferraty, editor, Recent Advances in Functional Data Analysis and Related Topics, pages 177–182. Springer-Verlag Berlin Heidelberg, 2011.

[54] I. Horová and J. Zelinka. Contribution to the bandwidth choice for kernel density estimates. Computational Statistics, 22(1):31–47, 2007.

[55] M. C. Jones and R. F. Kappenman. On a class of kernel density estimate bandwidth selectors. Scandinavian Journal of Statistics, 19(4):337–349, 1991.

[56] R. Karunamuni and T. Alberts. A generalized reflection method of boundary correction in kernel density estimation. Canadian Journal of Statistics, 33:497–509, 2005.

[57] R. Karunamuni and S. Zhang. Some improvements on a boundary corrected kernel density estimator. Statistics & Probability Letters, 78:497–507, 2008.

[58] J. Koláček. Kernel Estimation of the Regression Function (in Czech). PhD thesis, Masaryk University, Brno, February 2005.

[59] J. Koláček and J. Zelinka. MATLAB toolbox, 2012.

[60] J. S. Marron and D. Nolan. Canonical kernels for density estimation. Statistics & Probability Letters, 7(3):195–199, 1988.

[61] J. S. Marron and D. Ruppert. Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological), 56(4):653–671, 1994.

[62] H.-G. Müller. Nonparametric regression analysis of longitudinal data. Springer, New York, 1988.

[63] H.-G. Müller. Smooth optimum kernel estimators near endpoints. Biometrika, 78(3):521–530, 1991.

[64] E. A. Nadaraya. On estimating regression. Theory of Probability and its Applications, 9(1):141–142, 1964.

[65] B. Park and J. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85(409):66–72, 1990.

[66] M. B. Priestley and M. T. Chao. Non-parametric function fitting. Journal of the Royal Statistical Society, Series B (Methodological), 34(3):385–392, 1972.

[67] J. Rice. Bandwidth choice for nonparametric regression. Annals of Statistics, 12(4):1215–1230, 1984.

[68] S. Sain, K. Baggerly, and D. Scott. Cross-validation of multivariate densities. Journal of the American Statistical Association, 89(427):807–817, 1994.

[69] E. Schuster. Incorporating support constraints into nonparametric estimators of densities. Communications in Statistics – Theory and Methods, 14(5):1123–1136, 1985.
[70] D. W. Scott. Multivariate density estimation: theory, practice, and visualization. Wiley, 1992.

[71] D. W. Scott and G. R. Terrell. Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association, 82(400):1131–1146, 1987.

[72] N. Siddiqi. Credit risk scorecards: developing and implementing intelligent credit scoring. Wiley and SAS Business Series. Wiley, 2006.

[73] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B (Methodological), 47:1–52, 1985.

[74] B. W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, London, 1986.

[75] J. S. Simonoff. Smoothing Methods in Statistics. Springer-Verlag, New York, 1996.

[76] C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.

[77] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), 36(2):111–147, 1974.

[78] L. Thomas. Consumer credit models: pricing, profit, and portfolios. Oxford University Press, 2009.

[79] M. Wand and M. Jones. Kernel smoothing. Chapman and Hall, London, 1995.

[80] G. S. Watson. Smooth regression analysis. Sankhya: The Indian Journal of Statistics, Series A, 26(4):359–372, 1964.

Reprints of articles

Computational Statistics and Data Analysis 57 (2013) 364–376, doi:10.1016/j.csda.2012.07.006

Full bandwidth matrix selectors for gradient kernel density estimate

Ivana Horová (a), Jan Koláček (a), Kamila Vopatová (b)
(a) Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic
(b) Department of Econometrics, University of Defence, Brno, Czech Republic

Article history: Received 4 July 2011; Received in revised form 2 July 2012; Accepted 5 July 2012; Available online 10 July 2012

Keywords: Asymptotic mean integrated square error; Multivariate kernel density; Unconstrained bandwidth matrix

Abstract: The most important factor in multivariate kernel density estimation is a choice of a bandwidth matrix. This choice is particularly important, because of its role in controlling both the amount and the direction of multivariate smoothing. Considerable attention has been paid to constrained parameterization of the bandwidth matrix such as a diagonal matrix or a pre-transformation of the data. A general multivariate kernel density derivative estimator has been investigated. Data-driven selectors of full bandwidth matrices for a density and its gradient are considered. The proposed method is based on an optimally balanced relation between the integrated variance and the integrated squared bias. The analysis of statistical properties shows the rationale of the proposed method. In order to compare this method with cross-validation and plug-in methods the relative rate of convergence is determined. The utility of the method is illustrated through a simulation study and real data applications. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

Kernel density estimates are one of the most popular nonparametric estimates.
In a univariate case, these estimates depend on a bandwidth, which is a smoothing parameter controlling smoothness of an estimated curve, and a kernel which is considered as a weight function. The choice of the smoothing parameter is a crucial problem in the kernel density estimation. The literature on bandwidth selection is quite extensive, e.g., monographs Wand and Jones (1995), Silverman (1986) and Simonoff (1996), papers Marron and Ruppert (1994), Park and Marron (1990), Scott and Terrell (1987), Jones and Kappenman (1991) and Cao et al. (1994).

As far as the kernel estimate of density derivatives is concerned, this problem has received significantly less attention. In the paper Härdle et al. (1990), an adaptation of the least squares cross-validation method is proposed for the bandwidth choice in the kernel density derivative estimation. In the paper Horová et al. (2002), an automatic procedure for the simultaneous choice of the bandwidth, the kernel and its order for kernel density and density derivative estimates was proposed. But this procedure can only be applied when the explicit minimum of the Asymptotic Mean Integrated Square Error of the estimate is available. It is known that this minimum exists only for d = 2 and a diagonal matrix H. In the paper Horová et al. (2012), the basic formula for the corresponding procedure is given.

The need for nonparametric density estimates for recovering structure in multivariate data is greater since a parametric modeling is more difficult than in the univariate case. The extension of the univariate kernel methodology is not without its problems. The most general smoothing parameterization of the kernel estimator in d dimensions requires the specification of all entries of the d × d positive definite bandwidth matrix. The multivariate kernel density estimator we are going to deal with is a direct extension of the univariate estimator (see, e.g., Wand and Jones (1995)).

Successful approaches to the univariate bandwidth selection can be transferred to the multivariate settings. The least squares cross-validation and plug-in methods in the multivariate case have been developed and widely discussed in the papers Chacón and Duong (2010), Duong and Hazelton (2005b,a), Sain et al. (1994) and Duong (2004). Some papers (e.g., Horová et al. (2008, 2012) and Vopatová et al. (2010)) have been focused on constrained parameterization of the bandwidth matrix such as a diagonal matrix. It is a well-known fact that a visualization is an important component of the nonparametric data analysis. In the paper Horová et al. (2012), this effective strategy was used to clarify the process of the bandwidth matrix choice using bivariate functional surfaces. The paper Horová and Vopatová (2011) brings a short communication on a kernel gradient estimator. Tarn Duong's PhD thesis (Duong, 2004) provides a comprehensive survey of bandwidth matrix selection methods for kernel density estimation. Papers Chacón et al. (2011) and Duong et al.
(2008) investigated general density derivative estimators, i.e., kernel estimators of multivariate density derivatives using general (or unconstrained) bandwidth matrix selectors. They defined the kernel estimator of the multivariate density derivative and provided results for the Mean Integrated Square Error convergence asymptotically and for finite samples. Moreover, the relationship between the convergence rate and the bandwidth matrix has been established here. They also developed estimates for the class of normal mixture densities.

The paper is organized as follows: In Section 2 we describe kernel estimates of a density and its gradient and give a form of the Mean Integrated Square Error and the exact MISE calculation for a d-variate normal kernel as well. The next sections are devoted to a data-driven bandwidth matrix selection method. This method is based on an optimally balanced relation between the integrated variance and the integrated squared bias, see Horová and Zelinka (2007a). Similar ideas were applied to kernel estimates of hazard functions (see Horová et al. (2006) or Horová and Zelinka (2007b)). It seems that the basic idea can be also extended to a kernel regression and we are going to investigate this possibility. We discuss the statistical properties and relative rates of convergence of the proposed method as well. Section 5 brings a simulation study and in the last section the developed theory is applied to real data sets.

2. Estimates of a density and its gradient

Let a d-variate random sample $X_1, \dots, X_n$ be drawn from a density f. The kernel density estimator $\hat f$ at the point $x \in \mathbb{R}^d$ is defined as

$$\hat f(x, H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i), \qquad (1)$$

where K is a kernel function, which is often taken to be a d-variate symmetric probability function, H is a d × d symmetric positive definite matrix and $K_H$ is the scaled kernel function $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$ with |H| the determinant of the matrix H. The kernel estimator of the gradient Df at the point $x \in \mathbb{R}^d$ is

$$\widehat{Df}(x, H) = \frac{1}{n} \sum_{i=1}^{n} DK_H(x - X_i), \qquad (2)$$

where $DK_H(x) = |H|^{-1/2} H^{-1/2} DK(H^{-1/2} x)$ and DK is the column vector of the partial derivatives of K. Since we aim to investigate both the density itself and its gradient in a similar way, we introduce the notation

$$\widehat{D^r f}(x, H) = \frac{1}{n} \sum_{i=1}^{n} D^r K_H(x - X_i), \quad r = 0, 1, \qquad (3)$$

where $D^0 f = f$, $D^1 f = Df$. We make some additional assumptions and notations:

(A1) The kernel function K satisfies the moment conditions
$$\int K(x)\,dx = 1, \quad \int x K(x)\,dx = \mathbf{0}, \quad \int x x^T K(x)\,dx = \beta_2 I_d,$$
where $I_d$ is the d × d identity matrix.

(A2) $H = H_n$ is a sequence of bandwidth matrices such that $n^{-1/2}|H|^{-1/2}(H^{-1})^r$, $r = 0, 1$, and the entries of H approach zero ($(H^{-1})^0$ is considered as equal to 1).

(A3) Each partial density derivative of order r + 2, r = 0, 1, is continuous and square integrable.

(N1) $\mathcal{H}$ is a class of d × d symmetric positive definite matrices.

(N2) $V(\rho) = \int_{\mathbb{R}} \rho^2(x)\,dx$ for any square integrable scalar valued function ρ.

(N3) $V(g) = \int_{\mathbb{R}^d} g(x) g^T(x)\,dx$ for any square integrable vector valued function g. In the rest of the text, $\int$ stands for $\int_{\mathbb{R}^d}$ unless it is stated otherwise.

(N4) $DD^T = D^2$ is a Hessian operator. Expressions like $DD^T = D^2$ involve ‘‘multiplications’’ of differentials in the sense that $\frac{\partial}{\partial x_i} \frac{\partial}{\partial x_j} = \frac{\partial^2}{\partial x_i \partial x_j}$. This means that $(D^2)^m$, $m \in \mathbb{N}$, is a matrix of the 2m-th order partial differential operators.

(N5) vecH is a $d^2 \times 1$ vector obtained by stacking the columns of H.
(N6) Let $d^* = d(d+1)/2$; vechH is the $d^* \times 1$ vector-half obtained from vecH by eliminating each of the above-diagonal entries.

(N7) The matrix $D_d$ of size $d^2 \times d^*$ of ones and zeros such that $D_d \mathrm{vech}H = \mathrm{vec}H$ is called the duplication matrix of order d.

(N8) $J_d$ denotes the d × d matrix of ones.

The quality of the estimate $\widehat{D^r f}$ can be expressed in terms of the Mean Integrated Square Error

$$\mathrm{MISE}_r\{\widehat{D^r f}(\cdot, H)\} = E \int \|\widehat{D^r f}(x, H) - D^r f(x)\|^2\,dx,$$

with ∥·∥ standing for the Euclidean norm, i.e., $\|v\|^2 = v^T v = \mathrm{tr}(v v^T)$. For the sake of simplicity we write the argument of $\mathrm{MISE}_r$ as H. This error function can also be expressed through the standard decomposition

$$\mathrm{MISE}_r(H) = \mathrm{IV}_r(H) + \mathrm{ISB}_r(H),$$

where $\mathrm{IV}_r(H) = \int \mathrm{Var}\{\widehat{D^r f}(x, H)\}\,dx$ is the integrated variance and

$$\mathrm{ISB}_r(H) = \int \|E\widehat{D^r f}(x, H) - D^r f(x)\|^2\,dx = \int \Big\| \int K(z)\, D^r f(x - H^{1/2} z)\,dz - D^r f(x) \Big\|^2 dx = \int \|(K_H * D^r f)(x) - D^r f(x)\|^2\,dx$$

is the integrated square bias (the symbol ∗ denotes convolution). Since $\mathrm{MISE}_r$ is not mathematically tractable, we employ the Asymptotic Mean Integrated Square Error. The $\mathrm{AMISE}_r$ theorem has been proved (e.g., in Duong et al. (2008)) and reads as follows:

Theorem 1. Let assumptions (A1)–(A3) be satisfied. Then $\mathrm{MISE}_r(H) \simeq \mathrm{AMISE}_r(H)$, where

$$\mathrm{AMISE}_r(H) = \underbrace{n^{-1} |H|^{-1/2}\, \mathrm{tr}\{(H^{-1})^r V(D^r K)\}}_{\mathrm{AIV}_r} + \underbrace{\frac{\beta_2^2}{4}\, \mathrm{vech}^T H\, \Psi_{4+2r}\, \mathrm{vech}H}_{\mathrm{AISB}_r}. \qquad (4)$$

The term $\Psi_{4+2r}$ involves higher order derivatives of f, and its subscript 4 + 2r, r = 0, 1, indicates the order of the derivatives used. It is a $d^* \times d^*$ symmetric matrix. It can be shown that

$$\int \|\{\mathrm{tr}(H D^2) D^r\} f(x)\|^2\,dx = \mathrm{vech}^T H\, \Psi_{4+2r}\, \mathrm{vech}H.$$

Then (4) can be rewritten as

$$\mathrm{AMISE}_r(H) = n^{-1} |H|^{-1/2}\, \mathrm{tr}\{(H^{-1})^r V(D^r K)\} + \frac{\beta_2^2}{4} \int \|\{\mathrm{tr}(H D^2) D^r\} f(x)\|^2\,dx, \quad r = 0, 1. \qquad (5)$$

Let $K = \varphi_I$ be the d-variate normal kernel and suppose that f is the normal mixture density

$$f(x) = \sum_{l=1}^{k} w_l\, \varphi_{\Sigma_l}(x - \mu_l),$$

where for each $l = 1, \dots, k$, $\varphi_{\Sigma_l}$ is the d-variate $N(0, \Sigma_l)$ normal density and $w = (w_1, \dots, w_k)^T$ is a vector of positive numbers summing to one. In this case, the exact formula for $\mathrm{MISE}_r$ was derived in Chacón et al. (2011). For r = 0, 1 it takes the form

$$\mathrm{MISE}_r(H) = 2^{-r} n^{-1} (4\pi)^{-d/2} |H|^{-1/2} (\mathrm{tr}\,H^{-1})^r + w^T \left[ (1 - n^{-1})\Omega_2 - 2\Omega_1 + \Omega_0 \right] w, \qquad (6)$$

where

$$(\Omega_c)_{ij} = (-1)^r\, \varphi_{cH+\Sigma_{ij}}(\mu_{ij}) \left[ \mu_{ij}^T (cH + \Sigma_{ij})^{-2} \mu_{ij} - 2\,\mathrm{tr}(cH + \Sigma_{ij})^{-1} \right]^r$$

with $\Sigma_{ij} = \Sigma_i + \Sigma_j$, $\mu_{ij} = \mu_i - \mu_j$.

3. Bandwidth matrix selection

The most important factor in multivariate kernel density estimates is the bandwidth matrix H. Because of its role in controlling both the amount and the direction of smoothing, this choice is particularly important. Let $H_{(A)MISE,r}$ stand for a bandwidth matrix minimizing $\mathrm{(A)MISE}_r$, i.e.,

$$H_{MISE,r} = \arg\min_{H \in \mathcal{H}} \mathrm{MISE}_r(H) \quad \text{and} \quad H_{AMISE,r} = \arg\min_{H \in \mathcal{H}} \mathrm{AMISE}_r(H).$$

As has been mentioned in former works (see, e.g., Duong and Hazelton (2005a,b)), the discrepancy between $H_{MISE,r}$ and $H_{AMISE,r}$ is asymptotically negligible in comparison with the random variation in the bandwidth matrix selectors that we consider. The problems of estimating $H_{MISE,r}$ and $H_{AMISE,r}$ are equivalent for most practical purposes. If we denote $D_H = \frac{\partial}{\partial\,\mathrm{vech}H}$, then using matrix differential calculus yields

$$D_H \mathrm{AMISE}_r(H) = -(2n)^{-1} |H|^{-1/2}\, \mathrm{tr}\{(H^{-1})^r V(D^r K)\}\, D_d^T \mathrm{vec}H^{-1} + n^{-1} |H|^{-1/2}\, r \left\{ -D_d^T \mathrm{vec}(H^{-1} V(D^r K) H^{-1}) \right\} + \frac{\beta_2^2}{2}\, \Psi_{4+2r}\, \mathrm{vech}H.$$

Unfortunately, there is no explicit solution of the equation

$$D_H \mathrm{AMISE}_r(H) = 0 \qquad (7)$$

(with the exception of d = 2, r = 0 and a diagonal bandwidth matrix H, see, e.g., Wand and Jones (1995)). But nevertheless the following lemma holds.
Lemma 2. AIVr (HAMISE,r ) = 4 d + 2r AISBr (HAMISE,r ). (8) Proof. See Complements for the proof. It can be shown (Chacón et al., 2011) that HAMISE,r = C0,r n−2/(d+2r+4) = O(n−2/(d+2r+4) Jd) and then AMISEr (HAMISE,r ) is of order n−4/(d+2r+4) . Since HAMISE,r resp. HMISE,r cannot be found in practice, the data-driven methods for selection of H have been proposed in papers Chacón and Duong (2010), Duong (2004), Duong and Hazelton (2005b), Sain et al. (1994) and Wand and Jones (1994) etc.. The performance of bandwidth matrix selectors can be assessed by its relative rate of convergence. We generalize the definition for the relative rate of convergence for the univariate case to the multivariate one. Let Hr be a data-driven bandwidth matrix selector. We say that Hr converges to HAMISE,r with relative rate n−α if vech(Hr − HAMISE,r ) = Op(Jd∗ n−α )vechHAMISE,r . (9) This definition was introduced by Duong (2004). Now, we remind cross-validation methods CVr (H) (Duong and Hazelton, 2005b; Chacón and Duong, 2012) which aim to estimate MISEr . CVr (H) is an unbiased estimate of MISEr (H) − trV(Dr f ) and CVr (H) = (−1)r tr    1 n2 n i,j=1 D2r (KH ∗ KH)(Xi − Xj) − 2 n(n − 1) n i,j=1 i̸=j D2r KH(Xi − Xj)    , (10) HCVr = arg min H∈H CVr (H). It can be shown that the relative rate of convergence to HMISE,r is n−d/(2d+4r+8) (Chacón and Duong, 2012) and to HAMISE,r is n− min{d,4}/(2d+4r+8) (see Duong and Hazelton (2005b) for r = 0). Plug-in methods for the bandwidth matrix selection were generalized to the multivariate case in Wand and Jones (1994). The idea consists of estimating the unknown matrix Ψ4+2r . The relative rate of convergence to HMISE,r and HAMISE,r is the same n−2/(d+2r+6) when d ≥ 2 (see, e.g., Chacón (2010) and Chacón and Duong (2012)). In papers Horová et al. (2008, 2012), a special method for bandwidth matrix selection for a bivariate density for the case of diagonal bandwidth matrix has been developed and the rationale of this method has been explained. This method is based on formula (8). As concerns the bandwidth matrix selection for the kernel gradient estimator, the aforementioned method was extended to this case in Vopatová et al. (2010) and Horová and Vopatová (2011). Because the problem of the bandwidth matrix choice both for density itself and its gradient are closely related one to each other, we address the problem of these choices together. 368 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 4. Proposed method and its statistical properties As mentioned, our method is based on Eq. (8) in the sense that a solution of DHAMISEr (H) = 0 is equivalent to solving Eq. (8). But AISBr (H) depends on the unknown density. Thus we adapt the similar idea as in the univariate case (Horová and Zelinka (2007a)) and use a suitable estimate of AISBr . Eq. (8) can be rewritten as (d + 2r)n−1 |H|−1/2 tr  (H−1 )r V(Dr K)  − β2 2  ∥{tr(HD2 )Dr }f (x)∥2 dx = 0. (11) Let us denote Λ(z) = (K ∗ K ∗ K ∗ K − 2K ∗ K ∗ K + K ∗ K)(z), ΛH(z) = |H|−1/2 Λ(H−1/2 z). Then the estimate of AISBr (H) can be considered as AISBr (H) =  ∥(KH ∗ Dr f )(x, H) − Dr f (x, H)∥2 dx. This estimate involves non-stochastic terms, therefore, according to Taylor (1989), Jones and Kappenman (1991) and Jones et al. (1991), we eliminated these terms and propose an (asymptotically unbiased) estimate AISBr (H) = tr    (−1)r n2 n i,j=1 i̸=j D2r ΛH(Xi − Xj)    . Now, instead of Eq. 
(11) we aim to solve the equation (d + 2r)n−1 |H|−1/2 tr  (H−1 )r V(Dr K)  − 4tr    (−1)r n2 n i,j=1 i̸=j D2r ΛH(Xi − Xj)    = 0. (12) Remark 1. The bandwidth matrix selection method based on Eq. (12) is called the Iterative method (IT method) and the bandwidth estimate is denoted HITr . Remark 2. In the following we assume that K is the standard normal density φI. Thus Λ(z) = φ4I(z) − 2φ3I(z) + φ2I(z) and β2 = 1. We are going to discuss statistical properties of the Iterative method which will show its rationality. Let Γr (H) stand for the left hand side of (11) and Γr (H) for the left hand side of (12). Theorem 3. Let the assumptions (A1) –(A3) be satisfied and K = φI. Then E(Γr (H)) = Γr (H) + o(∥vecH∥5/2 ), Var(Γr (H)) = 32n−2 |H|−1/2 ∥vecH∥−2r V(vecD2r Λ)V(f ) + o(n−2 |H|−1/2 ∥vecH∥−2r ). Proof. For the proof see Complements. As far as the convergence rate of the IT method is concerned, we are inspired with AMSE lemma (Duong, 2004; Duong and Hazelton, 2005a). The following theorem takes place. Theorem 4. Let the assumptions (A1) –(A3) be satisfied and K = φI. Then MSE{vechHITr } = O  n− min{d,4}/(d+2r+4) Jd∗  × vechHAMISE,r vechT HAMISE,r . Proof. Proof of theorem can be found in Complements. Corollary 5. The convergence rate to HAMISE,r is n− min{d,4}/(2d+4r+8) for the IT method. Remark 3. For the r-th derivative the cross-validation method is of order n− min{d,4}/(2d+4r+8) and the plug-in method is of order n−2/(d+2r+6) (with respect to HAMISE,r ). I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 369 5. Computational aspects and simulations Eq. (12) can be rewritten as |HITr |1/2 4 tr    (−1)r n n i,j=1 i̸=j D2r ΛHITr (Xi − Xj)    = (d + 2r)tr  (H−1 ITr )r V  Dr K  . This equation represents a nonlinear equation for d∗ unknown entries of HITr . In order to find all these entries we need additional d∗ − 1 equations. Below, we describe a possibility of obtaining these equations. We adopt a similar idea as in the case of the diagonal matrix (see also Terrell (1990), Scott (1992), Duong et al. (2008) and Horová and Vopatová (2011)). We explain this approach for the case d = 2 with the matrix HITr =  ˆh11,r ˆh12,r ˆh12,r ˆh22,r  . Let Σ be a sample covariance matrix Σ =  ˆσ2 11 ˆσ12 ˆσ12 ˆσ2 22  . The initial estimates of entries of HITr can be chosen as ˆh11,r = ˆh2 1,r = ( ˆσ2 11)(12+r)/12 n(r−4)/12 , ˆh22,r = ˆh2 2,r = ( ˆσ2 22)(12+r)/12 n(r−4)/12 , ˆh12,r = sign ˆσ12| ˆσ12|(12+r)/12 n(r−4)/12 . For details see Horová and Vopatová (2011). Hence ˆh22,r =  ˆσ2 22 ˆσ2 11 (12+r)/12 ˆh11,r , (13) ˆh2 12,r =  ˆσ2 12 ˆσ2 11 (12+r)/12 ˆh11,r (14) and further |HITr | = ˆh2 11,r  ( ˆσ11 ˆσ22)(12+r)/6 − ˆσ (12+r)/6 12   ˆσ (12+r)/3 11 = ˆh2 11,r S( ˆσij). Thus we arrive at the equation for the unknown ˆh11,r 4ˆh11,r  S( ˆσij)tr    (−1)r n n i,j=1 i̸=j D2r ΛHITr (Xi − Xj)    = (d + 2r)tr  (H−1 ITr )r V  Dr K  . (15) This approach is very important for computational aspects of solving Eq. (12). Putting Eqs. (13)–(15) forms one nonlinear equation for the unknown ˆh11,r and it can be solved by means of an appropriate iterative numerical method. This procedure gives the name of the proposed method. Evidently, this approach is computationally much faster than a general minimization process. To test the effectiveness of our estimator, we simulated its performance against the least squares cross-validation method. All simulations and computations were done in MATLAB. 
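To make the computation concrete, the following MATLAB sketch solves Eq. (12) for r = 0 with the standard normal kernel. It is a simplified illustration, not the authors' full procedure: the bandwidth matrix is restricted to the scaled identity H = h^2*I instead of the parametrization (13)-(15), the root is found with fzero rather than with an iterative scheme, and the data, the bracket and all variable names are our own choices.

rng(1);                                    % for reproducibility
n = 100; d = 2;
X = randn(n, d);                           % toy sample; replace with real data

% N(0, cI) density in R^d as a function of the squared norm z2 = ||z||^2
phi = @(z2, c) exp(-z2/(2*c)) / (2*pi*c)^(d/2);
Lam = @(z2) phi(z2,4) - 2*phi(z2,3) + phi(z2,2);   % Lambda = phi_{4I} - 2 phi_{3I} + phi_{2I}

D2  = max(sum(X.^2,2) + sum(X.^2,2).' - 2*(X*X.'), 0);  % squared pairwise distances
off = ~eye(n);                             % i ~= j pairs
VK  = (4*pi)^(-d/2);                       % V(K) for the standard normal kernel

% left-hand side of Eq. (12) for r = 0 with H = h^2*I, so |H|^{-1/2} = h^{-d}
Gamma0 = @(h) d*VK/(n*h^d) - (4/n^2)*sum(Lam(D2(off)/h^2))/h^d;

hIT = fzero(Gamma0, [0.05 2]);             % bracket may need widening for other data
fprintf('h_IT = %.3f\n', hIT);

% kernel density estimate (1) with the selected matrix H = h_IT^2 * I at a point x
fhat = @(x) mean(phi(sum((x - X).^2, 2), hIT^2));
fprintf('fhat([0 0]) = %.4f\n', fhat([0 0]));

The last two lines evaluate the density estimate (1) with the selected matrix; for a full bandwidth matrix, Eqs. (13) and (14) reduce the problem to the single unknown entry in the same way.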
The simulation is based on 100 replications of 6 bivariate normal mixture densities, labeled A–F. Means and covariance matrices of these distributions were generated randomly. Table 1 brings the list of the normal mixture densities. Densities A and B are unimodal, C and D are bimodal and E and F are trimodal. Their contour plots are displayed in Fig. 1. The sample size of n = 100 was used in all replications. We calculated the Integrated Square Error (ISE) ISEr { Dr f (·, H)} =  ∥ Dr f (x, H) − Dr f (x)∥2 dx for each estimated density and its derivative over all 100 replications. The logarithm of results is displayed in Tables 2 and 3 and in Fig. 2. Here ‘‘ITER’’ denotes the results for our proposed method, ‘‘LSCV’’ stands for the results of the Least Squares Cross-validation method (10) and ‘‘MISE’’ is a tag for the results obtained by minimizing (6). Finally, we compared computational times of all methods. Results are listed in Table 4. 370 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 Table 1 Normal mixture densities. Density Formula N(vecT µ, vecT Σ) A N  (−0.2686, −1.7905), (7.9294, −10.0673; −10.0673, 22.1150)  B N  (−0.6847, 2.6963), (16.9022, 9.8173; 9.8173, 6.0090)  C 1 2 N  (0.3151, −1.6877), (0.1783, −0.1821; −0.1821, 1.0116)  + 1 2 N  (1.1768, 0.3731), (0.2414, −0.8834; −0.8834, 4.2934)  D 1 2 N  (1.8569, 0.1897), (1.5023, −0.9259; −0.9259, 0.8553)  + 1 2 N  (0.3349, −0.2397), (2.3050, 0.8895; 0.8895, 1.2977)  E 1 3 N  (0.0564, −0.9041), (0.9648, −0.8582; −0.8582, 0.9332)  + 1 3 N  (−0.7769, 1.6001), (2.8197, −1.4269; −1.4269, 0.9398)  + 1 3 N  (1.0132, 0.4508), (3.9982, −3.7291; −3.7291, 5.5409)  F 1 3 N  (2.2337, −2.9718), (0.6336, −0.9279; −0.9279, 3.1289)  + 1 3 N  (−4.3854, 0.5678), (2.1399, −0.6208; −0.6208, 0.7967)  + 1 3 N  (1.5513, 2.2186), (1.1207, 0.8044; 0.8044, 1.0428)  Fig. 1. Contour plots for target densities. Table 2 Logarithm of ISE0 for bandwidth matrices. Target density A B C D E F ITER Mean −7.562 −6.345 −4.319 −4.918 −4.779 −5.103 Std 0.459 0.448 0.264 0.274 0.203 0.180 LSCV Mean −7.110 −5.781 −4.332 −4.957 −4.917 −5.138 Std 0.531 0.610 0.407 0.518 0.385 0.325 MISE Mean −7.865 −4.256 −4.168 −3.521 −2.763 −3.903 Std 0.397 0.418 0.188 0.340 0.237 0.164 6. Application to real data An important question arising in application to real data is which observed features – such as a local extremes – are really there. Chaudhuri and Marron (1999) introduced the SiZer (Significant Zero) method for finding structure in smooth data. Duong et al. (2008) proposed a framework for feature significance in d-dimensional data which combines kernel density derivative estimators and hypothesis tests for modal regions. Distributional properties are given for the gradient and curvature estimators, and pointwise tests extend the two-dimensional feature significance ideas of Godtliebsen et al. (2002). I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 371 Table 3 Logarithm of ISE1 for bandwidth matrices. Target density A B C D E F ITER Mean −7.618 −4.005 −0.888 −2.698 −1.991 −3.203 Std 0.289 0.405 0.055 0.099 0.030 0.032 LSCV Mean −5.364 −0.210 0.503 −1.214 −0.501 −1.544 Std 2.638 2.960 2.364 2.437 2.373 1.914 MISE Mean −7.939 −4.314 −1.813 −3.544 −2.732 −3.864 Std 0.391 0.443 0.311 0.359 0.241 0.172 Fig. 2. Box plots for log(ISE). Table 4 Average computational times (in seconds). 
Target density r A B C D E F ITER 0 0.0826 0.0685 0.0596 0.0801 0.0754 0.0591 1 0.8295 0.8201 0.8542 0.8605 0.8538 0.8786 LSCV 0 0.5486 0.5732 0.5182 0.4844 0.5004 0.5004 1 1.7936 1.6483 1.3113 1.3128 1.6495 1.5581 MISE 0 0.1927 0.1982 0.7126 0.5540 1.8881 2.4000 1 0.5236 0.3112 1.2653 1.3452 2.3089 4.1172 We started with the well-known ‘Old Faithful’ data set (Simonoff, 1996), which contains characteristics of 222 eruptions of the ‘Old Faithful Geyser’ in Yellowstone National Park, USA, during August 1978 and August 1979. Kernel density and first derivative estimates using the standard normal kernel based on the following bandwidth matrices obtained by the IT method HIT0 =  0.0703 0.7281 0.7281 9.801  , HIT1 =  0.2388 3.006 3.006 50.24  are displayed in Fig. 3. The intersections of ∂f /∂x1 = 0 and ∂f /∂x2 = 0 show the existence of extremes. The second data set is taken from UNICEF—‘‘The State of the World’s Children 2003’’. It contains 72 pairs of observations for countries with a GNI less than 1000 US dollars per capita in 2001. X1 variable describes the under-five mortality rate, i.e., the probability of dying between birth and exactly five years of age expressed per 1000 live births, and X2 is a life expectancy at birth, i.e., the number of years newborn children would live if subject to the mortality risks prevailing for 372 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 Fig. 3. ‘Old Faithful’ data contour plots—estimated density ˆf (left) and estimated partial derivatives ∂f /∂x1 = 0, ∂f /∂x2 = 0 (right). Fig. 4. ‘UNICEF Children’ data contour plots—estimated density ˆf (left) and estimated partial derivatives ∂f /∂x1 = 0, ∂f /∂x2 = 0 (right). Fig. 5. Swiss bank notes data contour plots—estimated density ˆf (left) and estimated partial derivatives ∂f /∂x1 = 0, ∂f /∂x2 = 0 (right). the cross-section of population at the time of their birth (UNICEF, 2003). These data have also been analyzed in Duong and Hazelton (2005b). Bandwidth matrices for the estimated density ˆf and its gradient Df are HIT0 =  1112.0 −138.3 −138.3 24.20  and HIT1 =  2426 −253.7 −253.7 38.38  , respectively. Fig. 4 illustrates the use of the iterative bandwidth matrices for the ‘UNICEF Children’ data set. We also analyzed a Swiss bank notes data set from Simonoff (1996). It contains measurements of the bottom margin and diagonal length of 100 real Swiss bank notes and 100 forged Swiss bank notes. Contour plots in Fig. 5 represent kernel estimates of the joint distribution of the bottom margin and diagonal length of the bills using bandwidth matrices HIT0 =  0.1227 −0.0610 −0.0610 0.0781  , HIT1 =  0.6740 −0.3159 −0.3159 0.4129  . The bills with longer diagonal and shorter bottom margin correspond to real bills. The density estimate shows a bimodal structure for the forged bills (bottom right part of the plot) and it seems that the gradient estimate does not match this structure. The elements of the bandwidth matrix for the gradient estimate are bigger I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 373 in magnitude than the ones of the bandwidth matrix for density estimate, as expected from the theory. Three bumps in the tails are too small and the gradient estimator is not able to distinguish them. 7. Conclusion We restricted ourselves on the use of the standard normal kernel. This kernel satisfies smoothness conditions and provides easy computations of convolutions. 
Due to these facts it was possible to compare the IT method with the LSCV method. The simulation study and application to real data show that the IT method provides a sufficiently reliable way of estimating arbitrary density and its gradient. The IT method is also easy implementable and seems to be less time consuming (see Horová and Zelinka (2007a) for d = 1, see also Table 4 for d = 2). Further assessment of the practical performance and an extension to a curvature density estimate would be very important further research. Although the theoretical comparison also involves PI methods, they are not included in the simulation study. This would be an interesting task for further research. 8. Complements We start with introducing some facts on matrix differential calculus and on the Gaussian density (see Magnus and Neudecker (1979, 1999) and Aldershof et al. (1995)). Let A, B be d × d matrices and r = 0, 1: 1◦ . tr(AT B) = vecT AvecB 2◦ . DH|H|−1/2 = −1 2 |H|−1/2 DT d vecH−1 3◦ . DHtr(H−1 A) = −DT d vec(H−1 AH−1 ) 4◦ .  φcI(z){tr(H1/2 D2 H1/2 zzT )D2r }f (x)dz = c{tr(HD2 )D2r }f (x) φcI(z){tr2 (H1/2 D2 H1/2 zzT )D2r }f (x)dz = 3c2 {tr2 (HD2 )D2r }f (x) φcI(z){trk (H1/2 D2 H1/2 zzT )tr(H1/2 DzT )D2r }f (x)dz = 0, k ∈ N0 5◦ . Λ(z) = φ4I(z) − 2φ3I(z) + φ2I(z), then using 4◦ yields Λ(z)dz = 0 Λ(z){tr(H1/2 D2 H1/2 zzT )D2r }f (x)dz = 0 Λ(z){tr2 (H1/2 D2 H1/2 zzT )D2r }f (x)dz = 6{tr2 (HD2 )D2r }f (x) Λ(z){trk (H1/2 D2 H1/2 zzT )tr(H1/2 DzT )D2r }f (x)dz = 0, k ∈ N0 6◦ .  Dk f (x)[Dk f (x)]T dx = (−1)k  D2k f (x)f (x)dx, k ∈ N 7◦ . Taylor expansion in the form (for r = 0, 1) D2r f (x − H1/2 z) = D2r f (x) − {zT H1/2 DD2r }f (x) + 1 2! {(zT H1/2 D)2 D2r }f (x) + · · · + (−1)k k! {(zT H1/2 D)k D2r }f (x) + o(∥H1/2 z∥k Jdr ). Sketch of the proof of Lemma 2: Proof. Consider Eq. (7) and multiply it from the left by 1 2 vechT H. Then (4n)−1 |H|−1/2 vechT Htr  (H−1 )r V(Dr K)  DT d vecH−1 + (2n)−1 |H|−1/2 rvechT H  DT d vec(H−1 V(Dr K))H−1  = β2 2 4 vechT HΨ4+2r vechH. The right hand side of this equation is AISBr . Further, if we use the facts on matrix calculus, we arrive at formula (8). We only present a sketch of proofs of theorems. Detailed proofs are available on request from the first author. Sketch of the proof of Theorem 3: Proof. In order to show the validity of the relation for the expected value of Γr (H), we evaluate E(AISBr (H)) and start with E tr  D2r ΛH(X1 − X2)  = tr  D2r ΛH(x − y)f (x)f (y)dxdy = tr  ΛH(x − y)f (x)D2r f (y)dxdy = tr  Λ(z)D2r f (x − H1/2 z)f (x)dzdx. 374 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 Taylor expansion, defined in 7◦ , and using 5◦ yields = tr  Λ(z)  5 i=0 (−1)i i! {(zT H1/2 D)i D2r }f (x) + o(∥H1/2 z∥5 )Jdr  f (x)dzdx = tr  Λ(z)  1 4! {(zT H1/2 D)4 D2r }f (x) + o(∥H1/2 z∥5 )Jdr  f (x)dzdx = 1 4! tr  Λ(z){tr2 (H1/2 D2 H1/2 zzT )D2r }f (x)f (x)dzdx + o(∥vecH∥5/2 ), using properties 5◦ and 6◦ we arrive at = 1 4 tr  {tr2 (HD2 )D2r }f (x)f (x)dx + o(∥vecH∥5/2 ) = (−1)r 4  ∥{tr(HD2 )Dr }f (x)∥2 dx + o(∥vecH∥5/2 ). To prove the second part of the Theorem it is sufficient to derive Var(AISBr (H)) Var(AISBr (H)) = Var    4 n2 n i,j=1 i̸=j trD2r ΛH(Xi − Xj)    . Since trD2r ΛH is symmetric about zero, we can use U-statistics, e.g., Wand and Jones (1995). In our case Var 4 n2 n i,j=1 i̸=j trD2r ΛH(Xi − Xj) = 32n−3 (n − 1)Var trD2r ΛH(X1 − X2) + 64n−3 (n − 1)(n − 2) × Cov{trD2r ΛH(X1 − X2), trD2r ΛH(X1 − X3)}. 
Most of terms are asymptotically negligible, therefore the formula written above reduces to 32n−2 E(trD2r ΛH(X1 − X2))2    ξ2 −64n−1 E2 trD2r ΛH(X1 − X2)    ξ0 + 64n−1 E(trD2r ΛH(X1 − X2)trD2r ΛH(X1 − X3))    ξ1 . (16) Let us express ξ0, ξ1 and ξ2. From previous computations of the expected value one can see that ξ0 is of order o(∥vecH∥3 ). ξ1 =  trD2r ΛH(x − y)trD2r ΛH(x − z)f (x)f (y)f (z)dxdydz =  Λ(u)Λ(v)f (x)trD2r f (x − H1/2 u)trD2r f (x − H1/2 v)dxdudv =  Λ(u)Λ(v)f (x)tr  5 i=0 (−1)i i! {D2r ai }f (x) + o(∥H1/2 u∥5 )Jdr  × tr  5 i=0 (−1)i i! {D2r bi }f (x) + o(∥H1/2 v∥5 )Jdr  dxdudv, where a = uT H1/2 D, b = vT H1/2 D =  Λ(u)Λ(v)f (x) 1 4!4! tr{D2r a4 }f (x)tr{D2r b4 }f (x)dxdudv + o(∥vecH∥4 ) = 1 4!4!  f (x)  Λ(z){tr2 (H1/2 D2 H1/2 zzT )D2r }f (x)dz 2 dx + o(∥vecH∥4 ) = 1 16  {tr2 (HD2 )D2r }f (x){tr2 (HD2 )D2r }f (x)f (x)dx + o(∥vecH∥4 ). Thus ξ1 is of order o(∥vecH∥3 ) and is negligible. I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 375 Finally ξ2 =  trD2r ΛH(x − y)trD2r ΛH(x − y)f (x)f (y)dxdy = |H|−1/2  tr2 (H−r D2r Λ(z))f (x)f (x − H1/2 z)dxdz = |H|−1/2 ∥vecH∥−2r V(vecD2r Λ)V(f ) + o(|H|−1/2 ∥vecH∥−2r ), which completes the proof of Theorem 3. Sketch of the proof of Theorem 4: Proof. Since Γr (H) P → Γr (H) then HITr P → HAMISE,r as n → ∞ and we can adopt ideas of AMSE lemma (Duong, 2004). We expand Γr (HITr ) = (Γr − Γr )(HITr ) + Γr (HITr ) = (1 + o(1))(Γr − Γr )(HAMISE,r ) + Γr (HAMISE,r ) + (1 + o(1))DT HΓr (HAMISE,r )vech(HITr − HAMISE,r ). We multiply the equation by vechHAMISE,r from the left side and remove all negligible terms. Then we obtain 0 = vechHAMISE,r (Γr − Γr )(HAMISE,r ) + vechHAMISE,r DT HΓr (HAMISE,r )vech(HITr − HAMISE,r ). It is easy to see that DT HΓr (HAMISE,r ) = aT n−2/(d+2r+4) and vechHAMISE,r = bn−2/(d+2r+4) for constant vectors a and b, which implies vech(HITr − HAMISE,r ) = −(baT )−1    C n4/(d+2r+4) vechHAMISE,r (Γr − Γr )(HAMISE,r ). Let us note that the matrix baT can be singular in some cases (e.g., for a diagonal bandwidth matrix) and thus the matrix C = −(baT )−1 does not exist. But this fact does not take any effect for the rate of convergence. Using results of Theorem 3 we express the convergence rate of MSE  (Γr − Γr )(HAMISE,r )  = Bias2 (Γr (HAMISE,r )) + Var(Γr (HAMISE,r )) = (o(∥vecHAMISE,r ∥5/2 ))2 + O(n−2 |HAMISE,r |−1/2 ∥vecHAMISE,r ∥−2r ) = (O(∥vecHAMISE,r ∥3 ))2 + O(n−2 |HAMISE,r |−1/2 ∥vecHAMISE,r ∥−2r ) = O(n−12/(d+2r+4) ) + O(n−(d+8)/(d+2r+4) ) = O(n− min{d+8,12}/(d+2r+4) ). Then MSE{vechHITr } = MSE  (Γr − Γr )(HAMISE,r )  C vechHAMISE,r vechT HAMISE,r CT n8/(d+2r+4) = O  n− min{d+8,12}/(d+2r+4)  O  n8/(d+2r+4) Jd∗  vechHAMISE,r vechT HAMISE,r = O  n− min{d,4}/(d+2r+4) Jd∗  vechHAMISE,r vechT HAMISE,r . Acknowledgments The research was supported by The Jaroslav Hájek Center for Theoretical and Applied Statistics (MŠMT LC 06024). K. Vopatová has been supported by the University of Defence through the Institutional development project UO FEM ‘‘Economics Laboratory’’. The authors thank the anonymous referees for their helpful comments and are also grateful to J.E. Chacón for a valuable discussion which contributed to improvement of this paper. References Aldershof, B., Marron, J., Park, B., Wand, M., 1995. Facts about the Gaussian probability density function. Applicable Analysis 59, 289–306. Cao, R., Cuevas, A., González Manteiga, W., 1994. A comparative study of several smoothing methods in density estimation. 
Computational Statistics and Data Analysis 17, 153–176. Chacón, J.E., Duong, T., 2012. Bandwidth selection for multivariate density derivative estimation, with applications to clustering and bump hunting. e-prints. http://arxiv.org/abs/1204.6160. Chacón, J.E., 2010. Multivariate kernel estimation, lecture. Masaryk University, Brno. Chacón, J.E., Duong, T., 2010. Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. Test 19, 375–398. Chacón, J.E., Duong, T., Wand, M.P., 2011. Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica 21, 807–840. 376 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 Chaudhuri, P., Marron, J.S., 1999. SiZer for exploration of structures in curves. Journal of the American Statistical Association 94, 807–823. Duong, T., 2004. Bandwidth selectors for multivariate kernel density estimation. Ph.D. Thesis. School of Mathematics and Statistics. University of Western Australia. Duong, T., Cowling, A., Koch, I., Wand, M.P., 2008. Feature significance for multivariate kernel density estimation. Computational Statistics and Data Analysis 52, 4225–4242. Duong, T., Hazelton, M., 2005b. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics 32, 485–506. Duong, T., Hazelton, M., 2005a. Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. Journal of Multivariate Analysis 93, 417–433. Godtliebsen, F., Marron, J.S., Chaudhuri, P., 2002. Significance in scale space for bivariate density estimation. Journal of Computational and Graphical Statistics 11, 1–21. Horová, I., Koláček, J., Vopatová, K., 2012. Visualization and bandwidth matrix choice. Communications in Statistics—Theory and Methods 759–777. Horová, I., Koláček, J., Zelinka, J., Vopatová, K., 2008. Bandwidth choice for kernel density estimates. In: Proceedings IASC. IASC, Yokohama, pp. 542–551. Horová, I., Vieu, P., Zelinka, J., 2002. Optimal choice of nonparametric estimates of a density and of its derivatives. Statistics and Decisions 20, 355–378. Horová, I., Vopatová, K., 2011. Kernel gradient estimate. In: Ferraty, F. (Ed.), Recent Advances in Functional Data Analysis and Related Topics. SpringerVerlag, Berlin, Heidelberg, pp. 177–182. Horová, I., Zelinka, J., 2007a. Contribution to the bandwidth choice for kernel density estimates. Computational Statistics 22, 31–47. Horová, I., Zelinka, J., 2007b. Kernel estimation of hazard functions for biomedical data sets. In: Härdle, W., Mori, Y., Vieu, P. (Eds.), Statistical Methods for Biostatistics and Related Fields. In: Mathematics and Statistics, Springer-Verlag, Berlin, Heidelberg, pp. 64–86. Horová, I., Zelinka, J., Budíková, M., 2006. Kernel estimates of hazard functions for carcinoma data sets. Environmetrics 17, 239–255. Härdle, W., Marron, J.S., Wand, M.P., 1990. Bandwidth choice for density derivatives. Journal of the Royal Statistical Society, Series B (Methodological) 52, 223–232. Jones, M.C., Kappenman, R.F., 1991. On a class of kernel density estimate bandwidth selectors. Scandinavian Journal of Statistics 19, 337–349. Jones, M.C., Marron, J.S., Park, B.U., 1991. A simple root n bandwidth selector. Annals of Statistics 19, 1919–1932. Magnus, J.R., Neudecker, H., 1979. Commutation matrix—some properties and application. Annals of Statistics 7, 381–394. Magnus, J.R., Neudecker, H., 1999. 
Matrix Differential Calculus with Applications in Statistics and Econometrics, second ed. Wiley. Marron, J.S., Ruppert, D., 1994. Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological) 56, 653–671. Park, B., Marron, J., 1990. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association 85, 66–72. Sain, S., Baggerly, K., Scott, D., 1994. Cross-validation of multivariate densities. Journal of the American Statistical Association 89, 807–817. Scott, D.W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. In: Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley. Scott, D.W., Terrell, G.R., 1987. Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association 82, 1131–1146. Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London. Simonoff, J.S., 1996. Smoothing Methods in Statistics. Springer-Verlag, New York. Taylor, C.C., 1989. Bootstrap choice of the smoothing parameter in kernel density estimation. Biometrika 76, 705–712. Terrell, G.R., 1990. The maximal smoothing principle in density estimation. Journal of the American Statistical Association 85, 470–477. UNICEF, 2003. The state of the world's children 2003. http://www.unicef.org/sowc03/index.html. Vopatová, K., Horová, I., Koláček, J., 2010. Bandwidth choice for kernel density derivative. In: Proceedings of the 25th International Workshop on Statistical Modelling. Glasgow, Scotland, pp. 561–564. Wand, M., Jones, M., 1995. Kernel Smoothing. Chapman and Hall, London. Wand, M.P., Jones, M.C., 1994. Multivariate plug-in bandwidth selection. Computational Statistics 9, 97–116.

Communications in Statistics – Theory and Methods

SELECTION OF BANDWIDTH FOR KERNEL REGRESSION

JAN KOLÁČEK, IVANA HOROVÁ

Abstract. The most important factor in kernel regression is the choice of a bandwidth. Considerable attention has been paid to extending the idea of an iterative method, known for kernel density estimates, to kernel regression. Data-driven selectors of the bandwidth for kernel regression are considered. The proposed method is based on an optimally balanced relation between the integrated variance and the integrated square bias, and it leads to an iterative, quadratically convergent process. The analysis of statistical properties, in particular the consistency of the underlying estimates, shows the rationale of the proposed method. The utility of the method is illustrated through a simulation study and real data applications.

Keywords and Phrases: kernel regression, bandwidth selection, iterative method. Mathematics Subject Classification: 62G08

1. Introduction

Kernel regression estimates are among the most popular nonparametric estimates. In the univariate case, these estimates depend on a bandwidth, a smoothing parameter controlling the smoothness of the estimated curve, and on a kernel, which plays the role of a weight function. The choice of the smoothing parameter is a crucial problem in kernel regression.
The literature on bandwidth selection is quite extensive, e.g., monographs [20, 17, 18], papers [7, 2, 3, 15, 19, 4, 5, 12, 13]. Although in practice one can try several bandwidths and choose a bandwidth subjectively, automatic (data-driven) selection procedures could be useful for many situations; see [16] for more examples. Most of these procedures are based on estimating of Average Mean Square Error. They are asymptotically equivalent and asymptotically unbiased (see [7, 2, 3]). However, in simulation studies ([12]), it is often observed that most selectors are biased toward undersmoothing and yield smaller bandwidths more frequently than predicted by asymptotic results. Successful approaches to the bandwidth selection in kernel density estimation can be transferred to the case of kernel regression. The iterative method for the kernel density has been developed and widely discussed in [9]. The proposed method is based on an optimally balanced relation between the integrated variance and the integrated square bias. The paper is organized as follows: In Section 2 we describe kernel estimates of a regression function and give a form of the Mean Integrated Square Error and its asymptotic alternative. The next section is devoted to a data-driven bandwidth selection method. This method is based on an optimally balanced relation between the integrated variance and the integrated squared bias, see [9]. Similar ideas were Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic. 1 Page 1 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 2 JAN KOL´AˇCEK, IVANA HOROV´A applied to kernel estimates of hazard functions (see [11] or [10]). It seems that the basic idea can be also extended to a kernel regression and we are going to investigate this possibility. We discuss the statistical properties of the proposed method as well. Section 4 brings a simulation study and in the last section the developed theory is applied to real data sets. 2. Univariate kernel regression Consider a standard regression model of the form (2.1) Yi = m(xi) + εi, i = 1, . . . , n, where m is an unknown regression function, Y1, . . . , Yn are observable data variables with respect to the design points x1, . . . , xn. The residuals ε1, . . . , εn are independent identically distributed random variables for which E(εi) = 0, var(εi) = σ2 > 0, i = 1, . . . , n. We suppose the fixed equally spaced design, i.e., design variables are not random and xi = i/n, i = 1, . . . , n. In the case of random design, where the design points X1, . . . , Xn are random variables with the same density f, all considerations are similar as for the fixed design. More detailed description of the random design can be found, e.g., in [20]. The aim of kernel smoothing is to find a suitable approximation m of the unknown function m. We consider the estimator proposed by Pristley and Chao [14] which is defined as (2.2) m(x, h) = 1 n n i=1 Kh(x − xi)Yi, for x ∈ (0, 1). The function K is called the kernel which is assumed to be symmetric about zero and be supported on [−1, 1], be such that V (K) = K(u)2 du < ∞ and have a finite second moment (i.e., u2 K(u)du = β2 < ∞). Set Kh(.) = 1 h K( . h ), h > 0. A parameter h is called a bandwidth. 
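For concreteness, estimator (2.2) takes only a few lines of MATLAB. The sketch below uses the Epanechnikov kernel and the regression function of Simulation 1 below, with a bandwidth close to the reported hopt = 0.1188; the data generation and the evaluation grid are our own choices.

rng(0);
n = 100;
x = (1:n)'/n;                              % fixed equally spaced design x_i = i/n
m = @(t) t.^3 .* (1 - t).^3;               % regression function of Simulation 1
Y = m(x) + 0.003*randn(n, 1);              % noise level sigma = 0.003 as in Simulation 1

K = @(u) 0.75*(1 - u.^2) .* (abs(u) <= 1); % Epanechnikov kernel on [-1,1]

h  = 0.12;                                 % close to the reported h_opt = 0.1188
xg = linspace(0, 1, 200)';                 % evaluation grid

% mhat(x,h) = (1/(n*h)) * sum_i K((x - x_i)/h) * Y_i
mhat = (K((xg - x')/h) * Y) / (n*h);

plot(x, Y, '.', xg, mhat, '-');
legend('data', 'Priestley-Chao estimate');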
The quality of a kernel regression estimator can be locally described by the Mean Square Error (MSE) or by a global criterion the Mean Integrated Square Error (MISE), which can be written as a sum of the Integrated Variance (IV) and the Integrated Square Bias (ISB) MISE m(·, h) = E 1 0 [m(x, h) − m(x)]2 dx = 1 0 Var m(x, h)dx IV + 1 0 [(Kh ∗ m)(x) − m(x)]2 dx ISB +O n−1 , (2.3) where ∗ denotes a convolution. Page 2 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 3 Since the MISE is not mathematically tractable we employ the Asymptotic Mean Integrated Square Error (AMISE) (2.4) AMISE{m(·, h)} = V (K)σ2 nh AIV + β2 2 2 V (m′′ )h4 AISB , where V (m′′ ) = 1 0 (m′′ (x)) 2 dx. The optimal bandwidth considered here is hopt, the minimizer of (2.4), i.e., hopt = arg min h∈Hn AMISE{m(·, h)}, where Hn = [an−1/5 , bn−1/5 ] for some 0 < a < b < ∞. The calculation gives (2.5) hopt = σ2 V (K) nβ2 2V (m′′) 1 5 . In nonparametric regression estimation a critical and inevitable step is to choose the smoothing parameter (bandwidth) to control the smoothness of the curve estimate. The smoothing parameter considerably affects the features of the estimated curve. One of the most widespread procedures for bandwidth selection is the crossvalidation method, also known as “leave-one-out” method. The method is based on modified regression smoother (2.2) in which one, say the j-th, observation is left out: m−j(xj, h) = 1 n n i=1 i=j Kh(xi − xj)Yi, j = 1, . . . , n. With using these modified smoothers, the error function which should be minimized takes the form (2.6) CV(h) = 1 n n i=1 {m−i(xi) − Yi}2 . The function CV(h) is commonly called a “cross-validation” function. Let ˆhCV stand for minimization of CV(h), i.e., ˆhCV = arg min h∈Hn CV(h). The literature on this criterion is quite extensive, e.g., [19, 4, 7, 5]. 3. Iterative method for kernel regression The proposed method is based on the following relation. It is easy to show that the equation holds (3.1) AIV {m(·, hopt)} − 4 AISB{ m(·, hopt)} = 0, Page 3 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 4 JAN KOL´AˇCEK, IVANA HOROV´A where AIV and AISB are terms used in (2.4). For estimating of AIV and AISB in (3.1) we use AIV {m(·, h)} = ˆσ2 V (K) nh , with ˆσ2 = 1 2n − 2 n i=2 (Yi − Yi−1)2 and AISB {m(·, h)} = 1 0 [(Kh ∗ m)(x, h) − m(x, h)]2 dx = 1 4n2h n i,j=1 i=j Λ xi − xj h YiYj, where Λ(z) = (K ∗K ∗K ∗K −2K ∗K ∗K +K ∗K)(z) (see Complements for more details, for properties of Λ(z) see [8]). To find the bandwidth estimate ˆhIT we solve the equation (3.2) AIV {m(·, h)} − 4AISB {m(·, h)} = 0, which leads to finding a fixed point of the equation (3.3) h = ˆσ2 V (K) 4nhAISB {m(·, h)} . We use Steffensen’s iterative method with the starting approximation ˆh0 = 2/n. This approach leads to an iterative quadratically convergent process (see [9]). 
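Under the assumption of the Epanechnikov kernel, the whole procedure can be sketched in a few lines of MATLAB. At the root of (3.2) the outer factors of h cancel, so (3.2) is equivalent to the scalar equation n*sigma^2*V(K) = sum over i ~= j of Lambda((x_i - x_j)/h)*Y_i*Y_j. The sketch below tabulates Lambda by numerical convolution and brackets this root on a grid, whereas the paper iterates (3.3) by Steffensen's method; the test function, grid and tolerances are our choices.

rng(0);
n = 100; x = (1:n)'/n;
m = @(t) t.^3.*(1 - t).^3;                 % regression function of Simulation 1
Y = m(x) + 0.003*randn(n, 1);

K  = @(u) 0.75*(1 - u.^2).*(abs(u) <= 1);  % Epanechnikov kernel
VK = 3/5;                                  % V(K) = int K^2

% tabulate Lambda = K*K*K*K - 2*K*K*K + K*K by numerical convolution
du = 1e-3; Ku = K(-1:du:1);
K2 = conv(Ku,Ku)*du;  K3 = conv(K2,Ku)*du;  K4 = conv(K2,K2)*du;
Lam = @(z) interp1(-4:du:4,K4,z,'linear',0) ...
         - 2*interp1(-3:du:3,K3,z,'linear',0) ...
         + interp1(-2:du:2,K2,z,'linear',0);

sig2 = sum(diff(Y).^2)/(2*n - 2);          % difference-based variance estimate

D = x - x'; off = ~eye(n); YY = Y*Y';      % pairwise design differences and Y_i*Y_j

g  = @(h) sum(Lam(D(off)/h).*YY(off)) - n*sig2*VK;   % rescaled form of (3.2)
hs = linspace(2/n, 0.6, 200);              % scan for a sign change
gv = arrayfun(g, hs);
k  = find(gv(1:end-1).*gv(2:end) < 0, 1);  % widen the grid if this is empty
hIT = fzero(g, hs([k k+1]));
fprintf('h_IT = %.4f (h_opt = 0.1188 in Simulation 1)\n', hIT);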
The solution ˆhIT of the equation (3.2) can be considered as a suitable approximation of hopt as it is confirmed by the following theorem. Theorem 3.1. Let m ∈ C2 [0, 1], m′′ be square integrable, lim h n→∞ = 0, lim nh n→∞ = ∞. Let P(h) stand for the left side of (3.1) and P(h) for the left side of (3.2). Then (3.4) E(P(h)) = P(h) + O n−1 , var(P(h)) = O n−1 . Theorem 3.1 states that P(h) is a consistent estimate of P(h). This result confirms that the solution of (3.3) may be expected to be reasonably close to hopt. Proof of Theorem 3.1 can be found in Complements. 4. Simulation study We carry out two simulation studies to compare the performance of the bandwidth estimates. The comparison is done in the following way. The observations, Yi, for i = 1, . . . , n = 100, are obtained by adding independent Gaussian random variables with mean zero and variance σ2 to some known regression function. Both regression functions used in our simulations are illustrated in Fig. 1. One hundred series are generated. For each data set, we estimate the optimal bandwidth by both mentioned methods, i.e., for each method we obtain 100 estimates. Since we know the optimal bandwidth, we compare it with the mean of estimates and look at their standard deviation, which describes the variability of methods. The Epanechnikov kernel K(x) = 3 4 (1 − x2 )I[−1,1] is used in all cases. Page 4 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 5 0 0.2 0.4 0.6 0.8 1 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0 0.2 0.4 0.6 0.8 1 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 Figure 1. Regression functions. Finally, we calculate the Integrated Square Error (ISE) ISE{m(·, h)} = 1 0 (m(x, h) − m(x)) 2 dx for each estimated regression function over all 100 replications. The logarithm of results are displayed in Tables 2, 4 and in Figures 3, 5. Here “IT” denotes the results for our proposed method, “CV” stands for the results of the cross-validation method. 4.1. Simulation 1. In this case, we use the regression function m(x) = x3 (1 − x)3 with σ2 = 0.0032 . Table 1 summarizes the sample means and the sample standard deviations of bandwidth estimates, E(ˆh) is the average of all 100 values and std(ˆh) is their standard deviation. Figure 2 illustrates the histogram of results of all 100 experiments. hopt = 0.1188 E(ˆh) std(ˆh) CV 0.1057 0.0297 IT 0.1184 0.0200 Table 1. Means and standard deviations Table 2 gives the mean and the standard deviations of log(ISE) for each method compared with log(ISE) for the regression estimate obtained with hopt. Figure 3 illustrates the histogram of log(ISE) of all 100 experiments. As we see, the standard deviation of all results obtained by the proposed method is less than the value for the case of cross-validation method and also the mean of these results is slightly closer to the theoretical optimal bandwidth. The comparison of results with respect to log(ISE) leads to the similar result. The reason is that the regression function is smooth and satisfies all the conditions supposed in the previous section. Thus the proposed method works very well in this case. 
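The log(ISE) values used in these comparisons can be reproduced by straightforward numerical integration; a minimal sketch, with our own discretization (trapezoidal rule on a 501-point grid):

rng(0);
n = 100; x = (1:n)'/n;
m = @(t) t.^3.*(1 - t).^3;
Y = m(x) + 0.003*randn(n, 1);
K = @(u) 0.75*(1 - u.^2).*(abs(u) <= 1);

xg  = linspace(0, 1, 501)';
% ISE(h) = int_0^1 (mhat(x,h) - m(x))^2 dx for the Priestley-Chao estimate
ISE = @(h) trapz(xg, ((K((xg - x')/h)*Y)/(n*h) - m(xg)).^2);
fprintf('log(ISE) at h = 0.12: %.3f\n', log(ISE(0.12)));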
Figure 2. Distribution of ĥ for both methods (histograms of the 100 estimates against hopt).

Table 2. Means and standard deviations of log(ISE).
         E(log(ISE))   std(log(ISE))
hopt     −14.4452      0.5421
IT       −14.3481      0.5193
CV       −14.2160      0.6276

4.2. Simulation 2. In the second example, we use the regression function m(x) = sin(πx) cos(3πx^5) with σ2 = 0.05. Table 3 summarizes the sample means and the sample standard deviations of the bandwidth estimates; E(ĥ) is the average of all 100 values and std(ĥ) is their standard deviation. Figure 4 illustrates the histogram of the results of all 100 experiments. Table 4 gives the means and standard deviations of log(ISE) for each method, compared with log(ISE) for the regression estimate obtained with hopt. Figure 5 illustrates the histogram of log(ISE) of all 100 experiments. Although the mean of ĥIT is not as close to hopt as the mean of ĥCV, the ISE values are better, and the variability of the proposed method is smaller in this case. We therefore conclude that the proposed method can provide better results for this regression model.

Figure 3. Logarithm of ISE.

Table 3. Means and standard deviations (hopt = 0.0585).
      E(ĥ)     std(ĥ)
CV    0.0633   0.0168
IT    0.0708   0.0072

Table 4. Means and standard deviations of log(ISE).
         E(log(ISE))   std(log(ISE))
hopt     −5.0932       0.3908
IT       −5.0560       0.3741
CV       −4.9525       0.3966

5. Application to real data

The main goal of this section is to make a comparison of the mentioned bandwidth estimators on a real data set. We use data from [1] and follow annual measurements of the level, in feet, of Lake Huron 1875–1972, i.e., the sample size is n = 98. We transform the data to the interval [0, 1] and use both selectors considered in the previous section to get the optimal bandwidth. We use the Epanechnikov kernel K(x) = (3/4)(1 − x2)I[−1,1]. All estimates of the optimal bandwidth are listed in Table 5.

Figure 4. Distribution of ĥ for both methods.

Figure 5. Logarithm of ISE.
Figure 6 illustrates the kernel regression estimate with the smoothing parameter ˆhCV = 0.0204 which was obtained by cross-validation method. Figure 7 shows the kernel regression estimate with the smoothing parameter ˆhIT = 0.0501. This value was found by our proposed method Page 8 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 9 Table 5. Optimal bandwidth estimates for Lake Huron data. iterative method ˆhIT = 0.0501 cross-validation ˆhCV = 0.0204 1860 1880 1900 1920 1940 1960 1980 5 6 7 8 9 10 11 12 Figure 6. Kernel regression estimate with ˆhCV = 0.0204. 1860 1880 1900 1920 1940 1960 1980 5 6 7 8 9 10 11 12 Figure 7. Kernel regression estimate with ˆhIT = 0.0501. Since we do not know the true regression function m(x) it is hard to assess objectively which one of kernel estimates is better. It is very important to realize the fact that the final decision about the estimate is partially subjective because the estimates of the bandwidth are only asymptotically optimal. The values summarized in the table and figures show that the estimate with the smoothing parameter Page 9 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 10 JAN KOL´AˇCEK, IVANA HOROV´A obtained by cross-validation criterion is undersmoothed. In the context of these considerations, the estimate with parameter obtained by the iterative method appears to be sufficient. 6. Conclusion A new bandwidth selector for kernel regression was proposed. The analysis of statistical properties shows the rationale of the proposed method. The advantage of the method is in computational aspects, since it makes possible to avoid the minimization process and only solves one nonlinear equation. 7. Acknowledgments This research was supported by Masaryk University, project MUNI/A/1001/2009. 8. Complements Proof of Theorem 3.1. Let us denote (8.1) P(h) = V (K)σ2 nh − h4 β2 2V (m′′ ) and let (8.2) P(h) = V (K)ˆσ2 nh − 1 n2h n i,j=1 i=j Λ xi − xj h YiYj stand for an estimate of P. The proposed method aims to solve the equation P(h) = 0. For a better clarity we use the notation for 1 0 in next. As the first step, we prove the following lemma. Lemma 8.1. For i, j = 1, . . . , n, i = j the formula holds hΛ xi − xj h YiYj = (K ∗ K) x − xi h − K x − xi h × (K ∗ K) x − xj h − K x − xj h YiYjdx. Proof. (K ∗ K) x − xi h − K x − xi h (K ∗ K) x − xj h − K x − xj h dx = (K ∗ K) x − xi h (K ∗ K) x − xj h dx − 2 (K ∗ K) x − xi h K x − xj h dx + K x − xi h K x − xj h dx. Set the three integrals in the sum as η1, η2, η3. We modify η3 by substitution t = x−xj h . Using the parity of K we get η3 = h 1−xj h −xj h K(t)K t − xi − xj h dt. Page 10 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? 
Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 11 Provided xj ∈ [0, 1] then, as h → ∞, −xj/h → −∞ and (1−xj)/h → ∞. Therefore η3 = h(K ∗ K) xi − xj h . Similarly we can obtain η2 = h(K ∗ K ∗ K) xi − xj h , η1 = h(K ∗ K ∗ K ∗ K) xi − xj h . Thus η1 − 2η2 + η3 = hΛ xi−xj h . We start with an evaluation of 1 n2h E n i,j=1 i=j Λ xi−xj h YiYj: 1 n2h E n i,j=1 i=j Λ xi − xj h YiYj L.8.1 = 1 n2h2 E n i,j=1 i=j (K ∗ K) x − xi h − K x − xi h × (K ∗ K) x − xj h − K x − xj h YiYjdx = 1 n2h2 n i,j=1 i=j (K ∗ K) x − xi h − K x − xi h × (K ∗ K) x − xj h − K x − xj h m(xi)m(xj)dx =    ∞ −∞ [(K ∗ K)(t) − K(t)] m(x − ht)dt I1    2 dx + O n−1 . Now, we approximate the integral I1 by the Taylor’s expansion of m(x − th) I1 = ∞ −∞ [(K ∗ K)(t) − K(t)] m(x) − thm′ (x) + t2 h2 2 m′′ (x) + O(t3 h3 ) dt. It is an easy exercise to see the moment conditions for (K ∗ K)(t) − K(t): ∞ −∞ (K ∗ K)(t) − K(t)dt = ∞ −∞ t2k+1 [(K ∗ K)(t) − K(t)]dt = 0, k ∈ N, ∞ −∞ t2 [(K ∗ K)(t) − K(t)]dt = 2β2. Thus I1 = h2 β2m′′ (x) + O h4 Page 11 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 12 JAN KOL´AˇCEK, IVANA HOROV´A and 1 n2h E n i,j=1 i=j Λ xi − xj h YiYj = h4 β2 2V (m′′ ) + O h6 + O n−1 . Finally EP(h) = V (K)σ2 nh − β2 2V (m′′ )h4 + O n−1 and (8.3) EP(h) = P(h) + O n−1 . Since it is assumed lim n→∞ nh = ∞ then EP(h) → P(h). Now, we derive the formula for varP(h). As the first we express varAISB = E(AISB)2 − E2 AISB. E(AISB)2 = 1 16n4h2 E    n i,j=1 i=j Λ xi − xj h YiYj    2 = 1 16n4h2 E    n i,j,k,l=1 i=j=k=l Λ xi − xj h Λ xk − xl h YiYjYkYl ζ1 + n i,j,k=1 i=j=k Λ xi − xj h Λ xi − xk h Y 2 i YjYk ζ2 + n i,j=1 i=j Λ2 xi − xj h Y 2 i Y 2 j ζ3    . Then we compute 1 16n4h2 Eζ1 = 1 16n4h2 n i,j,k,l=1 i=j=k=l Λ xi − xj h Λ xk − xl h m(xi)m(xj)m(xk)m(xl) = 1 16h2 Λ x − y h Λ u − v h m(x)m(y)m(u)m(v)dxdydudv + O n−1 = 1 16h2 Λ x − y h m(x)m(y)dxdy 2 + O n−1 = 1 16    ∞ −∞ Λ(t)m(x − th)m(x)dtdx    2 + O n−1 It is easy to see the moment conditions for Λ(z): ∞ −∞ Λ(z)dz = ∞ −∞ z2 Λ(z)dz = 0, ∞ −∞ z2k−1 Λ(z)dz = 0, k ∈ N, z4 Λ(z)dz = 6β2 2 ([8]). By using the second order Page 12 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 13 Taylor’s expansion of m(x − th) we obtain the result 1 16n4h2 Eζ1 = 1 32    m′′ (x)m(x)dx ∞ −∞ Λ(t)t2 h2 + O h3 dt    2 +O n−1 = O n−1 . Similarly, 1 16n4h2 Eζ2 = 1 16n4h2 n i,j,k=1 i=j=k Λ xi − xj h Λ xi − xk h m2 (xi) + σ2 m(xj)m(xk) = 1 16nh2 Λ x − y h Λ x − z h m2 (x) + σ2 m(y)m(z)dxdydz + O n−1 = 1 16n ∞ −∞ ∞ −∞ Λ(t)Λ(u) m2 (x) + σ2 m(x − th)m(x − uh)dtdudx + O n−1 = 1 64n m′′2 (x) m2 (x) + σ2 dx    ∞ −∞ Λ(t)t2 h2 dt    2 + O h6 n−1 + O n−1 . 
1 16n4h2 Eζ3 = 1 16n4h2 n i,j=1 i=j Λ2 xi − xj h m2 (xi) + σ2 m2 (xj) + σ2 = 1 16n2h2 Λ2 x − y h m2 (x) + σ2 m2 (y) + σ2 dxdy + O n−1 = 1 16n2h ∞ −∞ Λ2 (t) m2 (x) + σ2 m2 (x − th) + σ2 dtdx + O n−1 = V (Λ)V (m2 + σ2 ) 16n2h + O n−1 . By combining results for E(AISB)2 and E2 AISB we arrive at the expression varAISB = O n−1 . Since ˆσ2 is a consistent estimator of σ2 (see [6]) and varAISB is of order O n−1 , varP is a consistent estimator of varP. References [1] P.J. Brockwell and R.A. Davis. Time Series: Theory and Methods. Springer Series in Statistics. Springer, 2009. [2] S.T. Chiu. Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika, 77(1):222–226, 1990. [3] S.T. Chiu. Some stabilized bandwidth selectors for nonparametric regression. Annals of Statistics, 19(3):1528–1546, 1991. [4] P Craven and G Wahba. Smoothing noisy data with spline functions - estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1979. [5] Bernd Droge. Some comments on cross-validation. Technical Report 1994-7, Humboldt Universitaet Berlin, 1996. [6] Peter Hall, J. W. Kay, and D. M. Titterington. Asymptotically optimal difference-based estimation of variance in nonparametric regression. Biometrika, 77(3):521–528, 1990. Page 13 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 14 JAN KOL´AˇCEK, IVANA HOROV´A [7] W. H¨ardle. Applied Nonparametric Regression. Cambridge University Press, Cambridge, 1st edition, 1990. [8] I. Horov´a, J. Kol´aˇcek, and J. Zelinka. Kernel Smoothing in MATLAB. World Scientific, Singapore, 2012. [9] I. Horov´a and J. Zelinka. Contribution to the bandwidth choice for kernel density estimates. Computational Statistics, 22(1):31–47, 2007. [10] I. Horov´a and J. Zelinka. Kernel estimation of hazard functions for biomedical data sets. In Wolfgang. H¨ardle, Yuichi. Mori, and Philippe Vieu, editors, Statistical Methods for Biostatistics and Related Fields, Mathematics and Statistics, pages 64–86. Springer-Verlag Berlin Heidelberg, 2007. [11] I. Horov´a, J. Zelinka, and M. Bud´ıkov´a. Kernel estimates of hazard funcions for carcinoma data sets. Environmetrics, 17(3):239–255, 2006. [12] Jan Kol´aˇcek. Kernel Estimation of the Regression Function (in Czech). PhD thesis, Masaryk University, Brno, feb 2005. [13] Jan Kol´aˇcek. Plug-in method for nonparametric regression. Computational Statistics, 23(1):63–78, 2008. [14] M. B. Priestley and M. T. Chao. Non-parametric function fitting. Journal of the Royal Statistical Society. Series B (Methodological), 34(3):385–392, 1972. [15] J. Rice. Bandwidth choice for nonparametric regression. Annals of Statistics, 12(4):1215– 1230, 1984. [16] Bernard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society. Series B (Methodological), 47:1–52, 1985. [17] Bernard W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, London, 1986. [18] J. S. Simonoff. Smoothing Methods in Statistics. Springer-Verlag, New York, 1996. [19] M Stone. Cross-validatory choice and assessment of statistical predictions. 
Journal of the Royal Statistical Society Series B-Statistical Methodology, 36(2):111–147, 1974. [20] M.P. Wand and M.C. Jones. Kernel smoothing. Chapman and Hall, London, 1995. Page 14 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 jes Journal of Environmental Statistics February 2013, Volume 4, Issue 2. http://www.jenvstat.org Kernel Regression Model for Total Ozone Data Horov´a I., Kol´aˇcek J., Lajdov´a D. Department of Mathematics and Statistics Masaryk University Brno Abstract The present paper is focused on a fully nonparametric regression model for autocorrelation structure of errors in time series over total ozone data. We propose kernel methods which represent one of the most effective nonparametric methods. But there is a serious difficulty connected with them – the choice of a smoothing parameter called a bandwidth. In the case of independent observations the literature on bandwidth selection methods is quite extensive. Nevertheless, if the observations are dependent, then classical bandwidth selectors have not always provided applicable results. There exist several possibilities for overcoming the effect of dependence on the bandwidth selection. In the present paper we use the results of Chu and Marron (1991) and Kol´aˇcek (2008) and develop two methods for the bandwidth choice. We apply the above mentioned methods to the time series of ozone data obtained from the Vernadsky station in Antarctica. All discussed methods are implemented in Matlab. Keywords: total ozone, kernel, bandwidth selection. 1. Introduction Antarctica is significantly related to many environmental aspects and processes of the Earth. And thus its impact on the global climate system and water circulation in the world ocean is essential. The stratosphere ozone depletion over Antarctica was discovered at the beginning of the 1990s. The lowest total ozone contents (TOC) in Antarctica are usually observed in the first week of October. The formation of ozone depletion begins approximately in the second half of August, culminates in the first half of October, and dissolves in November. During the ozone depletion, the average ozone concentration varied at the time of its culmination in October from the original value over 300 Dobson Units (DU) in 1950s and 1960s to a level between 100 and 150 DU in 1990-2000 (see L´aska et al. (2009)). One DU is set as a 0.001 mm strong 2 Kernel Regression Model for Total Ozone Data layer of ozone under the pressure 1013 hPa and temperature 273 K. One of the issues resolved within the Czech–Ukrainian scientific cooperation implemented on the Vernadsky Station in Antarctica is the measurement of total ozone content (TOC) in the stratosphere. The Vernadsky station is located on the west coast of Antarctic peninsula (65◦S, 64◦W). These data were obtained from ground measurements predominantly taken with the Dobson No 031 spectrophotometer. Data can be found at UAC (2012). The data sets were processed as time points measuring the average daily amount of ozone. In order to analyze these data we have to take into account the autocorrelation structure of errors on such time series. We focus on kernel regression estimators of series of ozone data. 
These estimators depend on a smoothing parameter and it is well-known that selecting the correct smoothing parameter is difficult in the presence of correlated errors. There exist methods which are modifications of a classical cross-validation method for independent errors (the modified cross-validation method or the partitioned cross-validation method - see Chu and Marron (1991), H¨ardle and Vieu (1992)). In the present paper we develop a new flexible plug-in approach for estimating the optimal smoothing parameter. The utility of this method is illustrated through a simulation study and application to TOC data measured in periods August to April 2004-2005, 2005-2006, 2006-2007. 2. Procedure Development 2.1. Kernel regression model In nonparametric regression problems we are interested in estimating the mean function E(Y |x) = m(x) from a set of observations (xi, Yi), i = 1, . . . , n. Many methods such as kernel methods, regression splines and wavelet methods are currently available. The papers in this filed have been mostly focused on case where an unknown function m is hidden by a certain amount of a white noise. The aim of a regression analysis is to remove the white noise and produce a reasonable approximation to the unknown function m. Consider now the case when the noise is no longer white and instead contains a certain amount of a structure in the form of correlation. In particular, if data sets have been recorded over time from one object under a study, it is very likely that another response of the object will depend on its previous response. In this context we will be dealing with a time series case, where design points are fixed and equally spaced and thus our model takes the form Yi =m(i/n)+εi, i = 1, . . . , n, (1) and εi is an unknown ARMA process, i.e., E(εi) =0, var(εi) = σ2 , i = 1, . . . , n, cov(εi, εj) =γ|i−j| = σ2 ρ|i−j|, corr(εi, εj) = ρ|i−j| (2) and the stationary process γ0 = σ2 , ρt = γt γ0 , Journal of Environmental Statistics 3 where ρt is an autocorrelation function and γt is an autocovariance function. We consider the simplest situation (Opsomer et al. (2001), Chu and Marron (1991)) ρt/n = ρt. Simple and the most widely used regression smoothers are based on kernel methods (see e.g. monographs M¨uller (1987), H¨ardle (1990), Wand and Jones (1995)). These methods are local weighted averages of the response Y . They depend on a kernel which plays the role of a weighted function, and a smoothing parameter called a bandwidth which controls the smoothness of the estimate. Appropriate kernel regression estimators were proposed by Priestley and Chao (1972), Nadaraya (1964) and Watson (1964), Stone (1977), Cleveland (1979) and Gasser and M¨uller (1979). These estimators were shown to be asymptotically equivalent (Lejeune (1985), M¨uller (1987), Wand and Jones (1995)) and without the lost of generality we consider the Nadaraya–Watson (NW) estimators m of m. The NW estimator of m at the point x ∈ (0, 1) is defined as m(x, h) = n i=1 Kh(xi − x)Yi n i=1 Kh(xi − x) , (3) for a kernel function K, where Kh(.) = 1 h K( . h ), and h is a nonrandom positive number h = h(n) called the bandwidth. Before studying the statistical properties of m several additional assumptions on the statistical model and the parameters of the estimator are needed: I. Let m ∈ C2[0, 1]. II. 
Let K be a real valued function continuous on R and satisfying the conditions: (i) |K(x) − K(y)| ≤ L|x − y| for a constant L > 0, ∀x, y ∈ [−1, 1], (ii) support(K) = [−1, 1], K(−1) = K(1) = 0, (iii) 1 −1 xjK(x)dx =    1 j = 0, 0 j = 1, β2 = 0 j = 2. Such a function is called a kernel of order 2 and a class of these kernels is denoted as S02. III. Let h = h(n) be a sequence of nonrandom positive numbers, such that h → 0 and nh → ∞ as n → ∞. IV. lim n→∞ ∞ k=1 |ρk| < ∞, i.e., R = ∞ k=1 ρk exists, V. 1 n ∞ k=1 k|ρk| = 0. 4 Kernel Regression Model for Total Ozone Data Remark. The well-known kernels are, e.g., Epanechnikov kernel K(x) = 3 4 (1 − x2)I[−1,1], quartic kernel K(x) = 3 4 (1 − x2)2I[−1,1], triweight kernel K(x) = 35 32 (1 − x2)2I[−1,1], Gaussian kernel K(x) = 1√ 2π e −x2 2 , where I[−1,1] is an indicator function. Though the Gaussian kernel does not satisfy the assumption II.(ii), it is very popular in many applications. There is no problem with a choice of a suitable kernel. Symmetric probability density functions are commonly used (see Remark above). But choosing the smoothing parameter is a crucial problem in all kernel estimates. The literature on bandwidth selections is quite extensive in case of independent errors. It is well known that when the kernel method is used to recover m, that correlated errors trouble bandwidth selection severely (see Altman (1990), Opsomer et al. (2001)). De Brabanter et al. (2010) developed a bandwidth selection procedure based on bimodal kernels which successfully removes the error correlation without requiring any prior knowledge about its structure. The global quality of the estimate m can be expressed by means of the Mean Integrated Squared Error (Altman (1990), Opsomer et al. (2001)). However more mathematically tractable is the Asymptotic Mean Integrated Squared Error (AMISE): AMISE(m, h) = V (K) nh S AIV(m,h) + β2 2 4 h4 A2 AISB(m,h) , where V (K) = K2(x)dx, S = σ2(1 + 2 ∞ k=1 ρk) = σ2(1 + 2R), A2 = 1 0 m′′(x)2dx. The first term is called the asymptotic integrated variance (AIV) and the second one the asymptotic integrated squared bias (AISB). This decomposition provides an easier analysis and interpretation of the performance of the kernel regression estimator. Using a standard procedure of mathematical analysis one can easily find that the bandwidth hopt minimizing the AMISE is given by the formula hopt = V (K)S nβ2 2A2 1/5 = O(n−1/5 ). (4) This formula provides a good insight into an optimal bandwidth, but unfortunately it depends on the unknown S and A2. Let us explain the impact of assuming an uncorrelated model. Journal of Environmental Statistics 5 If R > 0 (error correlation is positive), then AIV(m, h) is larger than in the corresponding uncorrelated case and AMISE(m, h) is minimized by a value h that is larger than in the uncorrelated case. It means that assuming wrongly uncorrelated errors causes that the bandwidth becomes too small. If R < 0 (error correlation is negative), then AIV(m, h) is smaller and AMISE(m, h) optimal bandwidth is smaller than in the uncorrelated case. In the next section the choosing of parameters S and A2 will be treated. 2.2. Choosing the parameters There are a number of data-driven bandwidth selection methods, but it can be shown that they fail in the case of correlated errors. Among the earliest fully automatic and consistent bandwidth selectors are those based on cross-validation ideas. 
The cross-validation method employs an objective function CV (h) = 1 n n j=1 m−j(xj, h) − Yj 2 , (5) where m−j(xj, h) is the estimate of m(xj, h) with xj deleted, i.e., the leave-one-out estimator. The estimate of hopt is then hopt = arg min h∈Hn CV (h), where Hn = [an−1/5, bn−1/5], 0 < a < b < ∞. Remark. If the design points are equally spaced then a recommended interval is [ 1 n , 1). However, this ordinary method is not suitable in the case of correlated observations. As it was shown in the papers Altman (1990) and Opsomer et al. (2001), if the observations are positively correlated, then the CV method produces too small a bandwidth, and if the observations are negatively correlated, then the CV method produces a large bandwidth. We demonstrate this fact by the following example. Consider the regression model (1), where m(x) = cos (3.15πx), εi = φεi−1 + ei, ei – i.i.d. normal random variables N(0, σ2), ε1 – N(0, σ2/(1 − φ2)), φ = 0.6, σ = 0.5, i.e, the regression errors are AR(1) process. Figure 1 shows the result obtained by the CV method. It is evident, that the estimate is undersmoothed. In order to overcome this problem, modified and partitioned CV methods were proposed by H¨ardle and Vieu (1992) and Chu and Marron (1991), respectively. 6 Kernel Regression Model for Total Ozone Data 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 −3 −2 −1 0 1 2 3 4 Estimate obtained with bandwidth selected by CV Simulated data with AR(1) correlation Figure 1: The estimate of simulated data with AR(1) errors The modified cross-validation (MCV) method is a ”leave-(2l + 1)-out” version of CV (l ≥ 0). The idea consists in minimizing of the modified cross-validation score: CVl(h) = 1 n n j=1 m−j(xj, h) − Yj 2 , (6) where m−j(xj, h) is the ”leave-(2l+1)-out”estimate of m(xj, h), i.e., the observations (xj+i, Yj+i), −l ≤ i ≤ l are left out in constructing m(xj, h). Then hMCV = arg min h∈Hn CVl(h). The principle of the partitioned cross-validation method (PCV) can be described as follows. For any natural number g ≥ 1, the PCV involves splitting the observations into g groups by taking every g-th observation, calculating the ordinary cross-validation score CV0,k(h) of the k-th group of observations separately, for k = 1, 2, . . . , g, and minimizing the average of these ordinary cross-validation scores CV ∗ (h) = 1 g g k=1 CV0,k(h). (7) Let h∗ CV stand for the minimizer of CV ∗(h): h∗ CV = arg min h∈Hn CV ∗ (h). Since h∗ CV is appropriate for the sample size n/g, the partitioned cross-validated bandwidth hPCV (g) is defined to be rescaled h∗ CV : hPCV (g) = g−1/5 h∗ CV . When g = 1, the PCV is an ordinary cross-validation. Journal of Environmental Statistics 7 Remark. The number of subgroups is g and the number of observations in each group is η = n/g. If n is not a multiplier of g, then the values Yj, 1 ≤ j ≤ g[n/g] are applied and the rest of the observations are dropped out ([n/g] is the highest integer less or equal to n/g). The asymptotic behavior of hMCV (l) and hPCV (g) was studied in the paper by Chu and Marron (1991). Furthemore we focus on the PCV method. The PCV method needs to determine the factor g. A possible approach for the practical choice of g is based on an analogue of the mean squared error. Using the asymptotic variance and the asymptotic mean of hPCV (g)/hopt, the asymptotic mean squared error (AMSE) of this ratio is defined by AMSE hPCV (g)/hopt = n−1/5 VARPCV (g) + CPCV (g)/C − 1 2 , (8) where VARPCV (g), CPCV (g), C depend on γk, K, A2 (see Chu and Marron (1991)). 
Theoretically, if there exists a value g which minimizes AMSE over g ≥ 1, then this value is taken as the optimal value of g in the sense of AMSE: gopt = arg min g≥1 AMSE hPCV (g)/hopt . Unfortunately the minimization of AMSE also depends on the unknown γk and A2. As far as the estimation of the variance component S is concerned, a common approach is the following (see e.g. Herrmann et al. (1992), Hart (1991), Opsomer et al. (2001), Chu and Marron (1991)): S = ˆγ0 1 + 2 n−1 k=1 ˆρk , ˆγ0 = ˆσ2 , ˆρk = ˆγk ˆγ0 , ˆγk = 1 n − k n−k t=1 Yt − Y Yt+k − Y , k = 0, . . . , n − 1. (9) Nevertheless there is still a problem of how to estimate A2. In paper Chu and Marron (1991) a simulation study was only conducted and no idea of estimating A2 was given there. We complete this method by adding a suitable estimate of A2 and recommend to use an estimate of A2 proposed by Kol´aˇcek (2008). By means of the Fourier transformation he derived a suitable estimate A2 of A2. Therefore, A2 in the AMSE formula is replaced by A2. This approach is commonly known as a plug-in method. Plug-in methods are also commonly used for selecting the bandwidth in the kernel regression. But these methods perform badly when the errors are correlated. In the paper Herrmann et al. (1992) a modified version of an existing plug-in bandwidth selectors is proposed. This method is based on the Gasser–M¨uller estimator of the second derivative and an iterative process is constructed. It is shown that under some additional assumptions this iterative process converges to a suitable estimate of the optimal bandwidth. However we do not use this iterative method and propose to directly plug-in A2 in the formula (4). This new version of a plug-in method is denoted as PI and the bandwidth estimate takes the form: hPI = V (K)S nβ2 2A2 1/5 . 8 Kernel Regression Model for Total Ozone Data 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 −6 −5 −4 −3 −2 −1 0 1 2 3 4 Figure 2: The regression function m(x) hopt = 0.759 E(h) std(h) PCV 0.1927 0.0649 PI 0.1513 0.0083 Table 1: The estimates h We would like to point out the computational aspect of the plug-in method. It has preferable properties to classical methods, because it does not need any additional calculations such as the PCV method (see Kol´aˇcek (2008) for details). 3. Case study We conduct a simulation study to compare the PCV method and the PI method. The Epanechnikov kernel is used both in simulations and in applications. Consider the regression model (1), where m(x) = −6 sin 11x + 5 cotg(x − 7) , εi = φεi−1 + ei ei – i.i.d. normal random variables N(0, σ2) ε1 – N(0, σ2/(1 − φ2)) φ = 0.6, σ = 0.5, for i = 1, . . . , n = 100. The graph of the regression function m is presented in Figure 2. One hundred series are generated. For each data set, the optimal bandwidth is estimated by the PCV and PI method. Table 1 shows the comparison of means and standard deviations for these two methods. Journal of Environmental Statistics 9 ISE (PCV) ISE (Plug−in) 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 Figure 3: ISE(m(., h)) = 1 0 m(x, h) − m(x) 2 dx. 0 5 10 15 20 25 30 35 40 45 50 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 Autocorrelation function Figure 4: The autocorrelation function of the data set August 2004 – April 2005 The Integrated Square Error (ISE) is calculated for each estimate m(., h): ISE(m(., h)) = 1 0 m(x, h) − m(x) 2 dx for both PCV and PI methods and the results are displayed by means of the boxplots in Figure 3. 4. Results and discussion In this section we apply the methods described above to ozone data. 
We analyze data which were measured in the period August to April in years 2004–2005, 2005–2006, 2006–2007. The sample size is n = 273 days. The observations are correlated as it can be seen in Figure 4. We transform data to the interval [0,1] and use the PCV method and the PI method to get the optimal bandwidth. Then we re-transform the bandwidth to the original sample and obtain the final kernel estimate. Kernel estimates based on the PCV and PI methods are presented in Figure 6, Figure 7, or in Figure 8, respectively. 10 Kernel Regression Model for Total Ozone Data 0 50 100 150 200 250 100 150 200 250 300 350 400 450 8 9 10 11 12 1 2 3 4 Time DobsonUnits August 2004 − April 2005 PI method RLWR Figure 5: RLWR estimate with span = 40 (dashed line) and PI estimate with the bandwidth = 17.8 (solid line). 0 50 100 150 200 250 100 150 200 250 300 350 400 450 8 9 10 11 12 1 2 3 4 Time DobsonUnits August 2004 − April 2005 PI method PCV method Figure 6: PCV estimate with the bandwidth = 20.9 (dashed line) and PI estimate with the bandwidth = 17.8 (solid line). In paper Kalvov´a and Dubrovsk´y (1995) the robust locally wighted regression (RLWR) is employed for data processing of TOC. They recommended to optimize h subjectively. This approach needs an experience and a special knowledge of the given data sets. The advantage of our methods consists in more complex approach. These methods are general and they allow to choose the value of h automatically. We used their methodology for data April 2004 - August 2005 and the comparison of the estimate obtained by the PI method and by the robust locally weighted regression can be seen in Figure 5. The PI method yields a rather oversmoothed estimate. Our experience shows that both methods could be considered as a suitable tool for the choice of the bandwidth. But it seems that the PI method is sufficiently reliable and less time consuming than the PCV method. Presented methods can be applied to other time series not only in environmetrics but also in economics or other fields. Journal of Environmental Statistics 11 0 50 100 150 200 250 100 150 200 250 300 350 400 450 8 9 10 11 12 1 2 3 4 Time DobsonUnits August 2005 − April 2006 PI method PCV method Figure 7: PCV estimate with the bandwidth = 20.4 (dashed line) and PI estimate with the bandwidth = 21.9 (solid line). 0 50 100 150 200 250 100 150 200 250 300 350 400 450 8 9 10 11 12 1 2 3 4 Time DobsonUnits August 2006 − April 2007 PI method PCV method Figure 8: PCV estimate with the bandwidth = 17.2 (dashed line) and PI estimate with the bandwidth = 22.3 (solid line). 12 Kernel Regression Model for Total Ozone Data Acknowledgments The research was supported by The Jaroslav H´ajek Center for Theoretical and Applied Statistics (MˇSMT LC 06024). The work was supported by the Student Project Grant at Masaryk university, rector’s programme no. MUNI/A/1001/2009. References Altman N (1990). “Kernel Smoothing of Data With Correlated Errors.” Journal of the American Statistical Association, 85, 749–759. Chu CK, Marron JS (1991). “Choosing a Kernel Regression Estimator.” Statistical Science, 6(4), 404–419. ISSN 08834237. Cleveland WS (1979). “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the American Statistical Association, 74(368), 829–836. ISSN 01621459. De Brabanter K, De Brabanter J, Suykens J, De Moor B (2010). “Kernel Regression with Correlated Errors.” Computer Applications in Biotechnology, pp. 13–18. Gasser T, M¨uller HG (1979). 
“Kernel estimation of regression functions.” In T Gasser, M Rosenblatt (eds.), Smoothing Techniques for Curve Estimation, volume 757 of Lecture Notes in Mathematics, pp. 23–68. Springer Berlin / Heidelberg. H¨ardle W (1990). Applied Nonparametric Regression. 1st edition. Cambridge University Press, Cambridge. H¨ardle W, Vieu P (1992). “Kernel Regression Smoothing of Time Series.” Journal of Time Series Analysis, 13(3), 209–232. Hart JD (1991). “Kernel Regression Estimation with Time Series Errors.” Journal of the Royal Statistical Society, 53, 173–187. Herrmann E, Gasser T, Kneip A (1992). “Choice of Bandwidth for Kernel Regression when Residuals are Correlated.” Biometrika, 79, 783–795. Kalvov´a J, Dubrovsk´y M (1995). “Assessment of the Limits Between Which Daily Average Values of Total Ozone Can Normally Vary.” Meteorol. Bulletin, 48, 9–17. Kol´aˇcek J (2008). “Plug-in Method for Nonparametric Regression.” Computational Statistics, 23(1), 63–78. ISSN 0943-4062. L´aska K, Proˇsek P, Bud´ık L, Bud´ıkov´a M, Milinevsky G (2009). “Prediction of Erythemally Effective UVB Radiation by Means of Nonlinear Regression Model.” Environmetrics, 20(6), 633–646. Lejeune M (1985). “Estimation Non-param´etrique par Noyaux: R´egression Polynomiale Mobile.” Revue de Statistique Appliqu´ee, 33(3), 43–67. M¨uller HG (1987). “Weighted Local Regression and Kernel Methods for Nonparametric Curve Fitting.”Journal of the American Statistical Association, 82(397), 231–238. ISSN 01621459. Journal of Environmental Statistics 13 Nadaraya EA (1964). “On Estimating Regression.” Theory of Probability and its Applications, 9(1), 141–142. Opsomer J, Wang Y, Yang Y (2001). “Nonparametric Regression with Correlated Errors.” Statistical Science, 16(2), 134–153. Priestley MB, Chao MT (1972). “Non-Parametric Function Fitting.” Journal of the Royal Statistical Society. Series B (Methodological), 34(3), 385–392. ISSN 00359246. Stone CJ (1977). “Consistent Nonparametric Regression.” The Annals of Statistics, 5(4), 595–620. ISSN 00905364. UAC (2012). “World Ozone and Ultraviolet Radiation Data Centre (WOUDC) [data].” URL http://www.woudc.org. Wand M, Jones M (1995). Kernel smoothing. Chapman and Hall, London. Watson GS (1964). “Smooth Regression Analysis.” Sankhya - The Indian Journal of Statistics, Series A, 26(4), 359–372. ISSN 0581572X. Affiliation: Ivana Horov´a Masaryk University Department of Mathematics and Statistics Brno, Czech Republic E-mail: horova@math.muni.cz URL: https://www.math.muni.cz/~horova/ Journal of Environmental Statistics http://www.jenvstat.org Volume 4, Issue 2 Submitted: 2012-03-31 February 2013 Accepted: 2012-10-09 Journal of Statistics: Advances in Theory and Applications Volume 7, Number 1, 2012, Pages 1-23 2010 Mathematics Subject Classification: 62P05, 90B50. Keywords and phrases: credit scoring, quality indexes, Gini index, lift, lift ratio, integrated relative lift. Received February 14, 2012  2012 Scientific Advances Publishers LIFT-BASED QUALITY INDEXES FOR CREDIT SCORING MODELS AS AN ALTERNATIVE TO GINI AND KS MARTIN ŘEZÁČ and JAN KOLÁČEK Department of Mathematics and Statistics Masaryk University Kotláįská 2, 61137 Brno Czech Republic e-mail: mrezac@math.muni.cz Abstract Assessment of risk associated with the granting of credits is very successfully supported by techniques of credit scoring. 
To measure the quality, in the sense of the predictive power, of the scoring models, it is possible to use quantitative indexes such as the Gini index (Gini), the K-S statistic (KS), the c-statistic, and lift. They are used for comparing several developed models at the moment of development as well as for monitoring the quality of the model after deployment into real business. The paper deals with the aforementioned quality indexes, their properties and relationships. The main contribution of the paper is the proposal and discussion of indexes and curves based on lift. The curve of ideal lift is defined; lift ratio (LR) is defined as analogous to Gini index. Integrated relative lift (IRL) is defined and discussed. Finally, the presented case study shows a case when LR and IRL are much more appropriate to use than Gini and KS. MARTIN ĮEZÁČ AND JAN KOLÁČEK2 1. Introduction Banks and other financial institutions receive thousands of credit applications every day (in the case of consumer credits, it can be tens or hundreds of thousands every day). Since it is impossible to process them manually, automatic systems are widely used by these institutions for evaluating the credit reliability of individuals, who ask for credit. The assessment of the risk associated with the granting of credits has been underpinned by one of the most successful applications of statistics and operations research: credit scoring. Credit scoring is the set of predictive models and their underlying techniques that aid financial institutions in the granting of credits. These techniques decide who will get credit, how much credit they should get, and what further strategies will enhance the profitability of the borrowers to the lenders. Credit scoring techniques assess the risk in lending to a particular client. They do not identify “good” or “bad” (negative behaviour is expected, e.g., default) applications on an individual basis, but forecast the probability that an applicant with any given score will be “good” or “bad”. These probabilities or scores, along with other business considerations such as expected approval rates, profit, churn, and losses, are then used as a basis for decision making. Several methods connected to credit scoring have been introduced during last six decades. The most well-known and widely used are logistic regression, classification trees, the linear programming approach, and neural networks. The methodology of credit scoring models and some measures of their quality have been discussed in surveys including Hand and Henley [7], Thomas [14] or Crook et al. [4]. Even if ten years ago the list of books devoted to the issue of credit scoring was not extensive, the situation has improved in the last decade. In particular, this list now includes Anderson [1], Crook et al. [4], Siddiqi [11], Thomas et al. [15], and Thomas [16]. LIFT-BASED QUALITY INDEXES FOR CREDIT … 3 The aim of this paper is to give an overview of widely used techniques used to assess the quality of credit scoring models, to discuss the properties of these techniques, and to extend some known results. We review widely used quality indexes, their properties and relationships. The main part of the paper is devoted to lift. The curve of ideal lift is defined; lift ratio is defined as analogous to Gini index. Integrated relative lift is defined and discussed. 2. 
Measuring the Quality We can consider two basic types of quality indexes: first, indexes based on a cumulative distribution function like the KolmogorovSmirnov statistic, Gini index or lift; second, indexes based on a likelihood density function like the mean difference (Mahalanobis distance) or informational statistic. For further available measures and appropriate remarks, see Wilkie [17], Giudici [6] or Siddiqi [11]. Assume that the realization Rs ∈ of a random variable S (score) is available for each client and put the following markings:    = otherwise.,0 good,isclient,1 D (1) Distribution functions, respectively, their empirical forms, of the scores of good (bad) clients are given by ( ) ( ),1 1 1 . =∧≤= ∑= DasI n aF i N i GOODn ( ) ( ) [ ],,,0 1 1 . HLaDasI m aF i N i BADm ∈=∧≤= ∑= (2) where is is the score of i-th client, n is the number of good clients, m is the number of bad clients, and I is the indicator function, where ( ) 1true =I and ( ) .0false =I L is the minimum value of a given score, H is the maximum value. The empirical distribution function of the scores of all clients is given by MARTIN ĮEZÁČ AND JAN KOLÁČEK4 ( ) ( ) [ ],,, 1 1 . HLaasI N aF i N i ALLN ∈≤= ∑= (3) where mnN += is the number of all clients. We denote the proportion of bad (good) clients by ., mn n p mn m p GB + = + = (4) An often-used characteristic in describing the quality of the model (scoring function) is the Kolmogorov-Smirnov statistic (K-S or KS). It is defined as [ ] ( ) ( ) .max .. , aFaFKS GOODnBADm HLa −= ∈ (5) It takes values from 0 to 1. Value 0 corresponds to a random model, value 1 corresponds to the ideal model. The higher the KS, the better the scoring model. The Lorenz curve (LC), sometimes called the ROC curve (receiver operating characteristic curve), can also be successfully used to show the discriminatory power of a scoring function, i.e., the ability to identify good and bad clients. The curve is given parametrically by ( ),. aFx BADm= ( ) [ ].,,. HLaaFy GOODn ∈= (6) Each point of the curve represents some value of a given score. If we consider this value as a cut-off value, we can read the proportion of rejected bad and good clients. An example of a Lorenz curve is given in Figure 1. We can see that by rejecting 20% of good clients, we also reject 50% of bad clients at the same time. LIFT-BASED QUALITY INDEXES FOR CREDIT … 5 Figure 1. Lorenz curve (ROC). The LC for a random scoring model is represented by the diagonal line from [ ]0,0 to [ ].1,1 It is the polyline from [ ]0,0 through [ ]0,1 to [ ]1,1 in the case of an ideal model. It is obvious that the closer the curve is to the bottom right corner, the better is the model. The definition and name (LC) is consistent with Müller and Rönz [8]. One can find the same definition of the curve, but called ROC, in Thomas et al. [15]. Siddiqi [11] used the name ROC for a curve with reversed axes and LC for a curve with the CDF of bad clients on the vertical axis and the CDF of all clients on the horizontal axis. This curve is also called the CAP (cumulative accuracy profile) or lift curve, see Sobehart et al. [12] or Thomas [16]. Furthermore, it is called a gains chart in the field of marketing; see Berry and Linoff [2]. An example of CAP is displayed in Figure 2. The ideal model is now represented by a polyline from [ ]0,0 MARTIN ĮEZÁČ AND JAN KOLÁČEK6 through [ ]1,Bp to [ ].1,1 The advantage of this figure is that, one can easily read the proportion of rejected bads against the proportion of all rejected. 
For example, in the case of Figure 2, we can see that if we want to reject 70% of bads, we have to reject about 40% of all applicants. Figure 2. CAP. In connection to LC, we consider the next quality measure, the Gini index. This index describes a global quality of the scoring model. It takes values from 0 to 1 (it can take negative values for contrariwise models). The ideal model, i.e., the scoring function that perfectly separates good and bad clients, has a Gini index equal to 1. On the other hand, a model that assigns a random score to the client, has a Gini index equal to 0. It can be shown that the Gini index is greater than or equal to KS for any scoring model. Using Figure 3, it can be defined as follows: .2A BA A Gini = + = (7) LIFT-BASED QUALITY INDEXES FOR CREDIT … 7 Figure 3. Lorenz curve, Gini index. This means that, we compute the ratio of the area between the curve and the diagonal (which represents a random model) to the area between the ideal model’s curve and the diagonal. Since the axes describe a unit square, the area BA + is always equal to 0.5. Therefore, we can compute the Gini as two times the area A. Using previous markings, the computational formula of the Gini index is given by [( )1.. 2 1 − = −−= ∑ kk BADmBADm N k FFGini ( )],1.. −+× kk GOODnGOODn FF (8) where ( )kk GOODnBADm FF .. is the k-th vector value of the empirical distribution function of bad (good) clients. For further details, see Anderson [1] or Xu [18]. The Gini index is a special case of Somers’ D (Somers [13]), which is an ordinal association measure. According to Thomas [16], one can calculate the Somers’ D as MARTIN ĮEZÁČ AND JAN KOLÁČEK8 , mn bgbg D j ij i i j ij i i S ⋅ − = ∑∑∑∑ >< (9) where ( )ji bg is the number of goods (bads) in the i-th interval of scores. Furthermore, it holds that SD can be expressed by the Mann-Whitney U-statistic; see Nelsen [9] for further details. When we use CAP instead of LC, we can define the accuracy rate (AR); see Thomas [16] or Sobehart et al. [12], where it is called the accuracy ratio. Again, it is defined by the ratio of some areas. We have diagonalandCAPslmodeidealbetweenArea diagonalandcurveCAPbetweenArea ′ =AR ( ) . 10.5 diagonalandcurveCAPbetweenArea Bp− = (10) Although the ROC and CAP are not equivalent, it is true that Gini and AR are equal for any scoring model. Proof for discrete scores is given in Engelmann et al. [5]; for continuous scores, one can find it in Thomas [16]. In connection to the Gini index, the c-statistic (Siddiqi [11]) is defined as . 2 1 _ Gini statc + = (11) It represents the likelihood that a randomly selected good client has a higher score than a randomly selected bad client, i.e., ( ).01_ 2121 =∧=≥= DDssPstatc (12) It takes values from 0.5, for the random model, to 1, for the ideal model. An alternative name for the c-statistic can be found in the literature. It is known also as Harrell’s c, which is a reparameterization of Somers’ D (Newson [10]). Furthermore, it is called AUROC, e.g., in Thomas [16] or AUC, e.g., in Engelmann et al. [5]. LIFT-BASED QUALITY INDEXES FOR CREDIT … 9 3. Lift Another possible indicator of the quality of scoring model is lift, which determines the number of times that, at a given level of rejection, the scoring model is better than random selection (the random model). More precisely, the ratio is the proportion of bad clients with a score less than a (where [ ]HLa ,∈ ) to the proportion of bad clients in the general population. 
Formally, it can be expressed by ( ) ( ) ( ) ( ) ( ) ( )10 0 0 1 1 1 1 =∨= = ≤ =∧≤ == ∑ ∑ ∑ ∑ = = = = DDI DI asI DasI BadRate aCumBadRate aLift N i N i i N i i N i ( ) ( ) . 0 1 1 N m asI DasI i N i i N i ≤ =∧≤ = ∑ ∑ = = (13) It can be easily verified that the lift can be equivalently expressed as ( ) ( ) ( ) [ ].,, . . HLa aF aF aLift ALLN BADn ∈= (14) Now, we would like to discuss the form of the lift function for the case of the ideal model. This is the model for which sets of output scores of bad and good clients are disjoint. So there exists a cut-off point, for which MARTIN ĮEZÁČ AND JAN KOLÁČEK10 ( ) ( ) ( ) ( )   >=∧≤+= ≤=∧≤ =≤ .,10 ,,0 caDaSPDP caDaSP aSP (15) Thus, we can derive the form of the lift function ( ) ( )      > ≤ = ., 1 ,, 1 . ca aF ca p aLift ALLN B ideal (16) In practice, lift is computed corresponding to %100,%,20%,10 … of clients with the worst score (see Coppock [3]). Usually, it is computed by using a table with the numbers of both all and bad clients in given score bands (deciles). An example of such a table is given by Table 1. Table 1. Lift (absolute and cumulative form) computational scheme Absolutely Cumulatively Decile #Clients # Bad clients Bad rate Abs. Lift #Bad clients Bad rate Cum. Lift 1 100 35 35.0% 3.50 35 35.0% 3.50 2 100 16 16.0% 1.60 51 25.5% 2.55 3 100 8 8.0% 0.80 59 19.7% 1.97 4 100 8 8.0% 0.80 67 16.8% 1.68 5 100 7 7.0% 0.70 74 14.8% 1.48 6 100 6 6.0% 0.60 80 13.3% 1.33 7 100 6 6.0% 0.60 86 12.3% 1.23 8 100 5 5.0% 0.50 91 11.4% 1.14 9 100 5 5.0% 0.50 96 10.7% 1.07 10 100 4 4.0% 0.40 100 10.0% 1.00 All 1000 100 10.0% It is possible to compute the lift value in each decile (absolute lift in the fifth column in Table 1), but usually, and in accordance with the definition of Lift(a), the cumulative form is used. It holds that the value of lift has an upper limit of Bp/1 and tends to a value of 1 when the score tends to infinity (or to its upper limit). In our case, we can see that the LIFT-BASED QUALITY INDEXES FOR CREDIT … 11 best possible value of lift is equal to 10. We obtained the value 3.5 in the first decile, which is nothing excellent, but high enough for the model to be considered applicable in practice. Results are further illustrated in Figure 4. Figure 4. Lift value (absolute and cumulative). In the context of this approach, we define ( ) ( ( )) ( ( ))qFF qFF qLift ALLNALLN ALLNBADm 1 .. 1 .. − − =Q ( ( )) ( ],1,0, 1 1 .. ∈= − qqFF q ALLNBADm (17) where q represents the score level of %100q of the worst scores and ( )qF ALLN 1 . − can be computed as ( ) { [ ] ( ) }.,,min . 1 . qaFHLaqF ALLNALLN ≥∈=− (18) It can be easily shown that the lift function for the ideal model is now MARTIN ĮEZÁČ AND JAN KOLÁČEK12 ( ) ( ] ( ]     ∈ ∈ = .1, 1 ,,0, 1 ,B B B ideal pq q pq p qLiftQ (19) Figure 5, below, gives an example of the lift function for ideal, random, and actual models. Figure 5. QLift function, lift ratio. Using the previous Figure 5, we define lift ratio as analogous to Gini index ( ) ( ) . 1 1 1 0 1 0 − − = + = ∫ ∫ dqqLift dqqLift BA A LR idealQ Q (20) LIFT-BASED QUALITY INDEXES FOR CREDIT … 13 It is obvious that, it is a global measure of a model's quality and that it takes values from 0 to 1. Value 0 corresponds to the random model, value 1 matches the ideal model. The meaning of this index is quite simple: the higher, the better. An important feature is that lift ratio allows us to fairly compare two models developed on different data samples, which is not possible with lift. 
Since lift ratio compares areas under the lift function corresponding to actual and ideal models, the next concept is focused on the comparison of lift functions themselves. We define the relative lift function by ( ) ( ) ( ) ( ].1,0, ∈= q qLift qLift qRLift idealQ Q (21) An example of this function is presented in Figure 6. The definition domain of the function is [ ];1,0 the range is a subinterval of [ ].1,0 The graph starts at point [ ( )],, minmin qLiftpq B Q⋅ where minq is a positive number near to zero. Then, it falls to a local minimum in point [ ( )]BBB pLiftpp Q⋅, and then rises up to point [ ].1,1 It is obvious that the graph of relative lift function for a better model is closer to the top line, which represents the function for the ideal model. MARTIN ĮEZÁČ AND JAN KOLÁČEK14 Figure 6. Relative lift function. Now, it is natural to ask what we obtain when we integrate the relative lift function. We define the integrated relative lift (IRL) by ( ) . 1 0 dqqRLiftIRL ∫= (22) It takes values from , 2 5.0 2 Bp + for the random model, to 1, for the ideal model. Again the following holds: the higher, the better. This global measure of scoring a model’s quality has an interesting connection to the c-statistic. We made a simulation with scores generated from a normal distribution. The scores of bad clients had a mean equal to 0 and a variance equal to 1. The scores of good clients had a mean and variance LIFT-BASED QUALITY INDEXES FOR CREDIT … 15 from 0.1 to 10 with a step equal 0.1. The number of samples and sample size were Bp,1000 was equal to 0.1. IRL and the c-statistic were computed for each sample and each value of the mean and variance of a good clients’ scores. Finally, means of IRL and the c-statistic were computed. The results are presented in Figure 7. Part (b) represents the contour plot of the figure in part (a). The simulation shows that IRL and the c-statistic are approximately equal when the variances of good and bad clients are equal. Furthermore, it shows that they significantly differ when the variances are different and the ratio of the mean and variance of good clients is near to 1. 4. Case Study To illustrate the advantage of the proposed indexes, we introduce a simple case study. We consider two scoring models with a score distribution given in Table 2. Furthermore, we consider the standard meaning of scores, i.e., a higher score band means better clients (clients with the lowest scores, i.e., clients in score band 1, have the highest probability of default). MARTIN ĮEZÁČ AND JAN KOLÁČEK16 (a) (b) Figure 7. Difference of IRL and c-stat (a) and its contour plot (b). LIFT-BASED QUALITY INDEXES FOR CREDIT … 17 Table 2. Score distribution and QLift of given scoring models Scoring model 1 Scoring model 2 Score band #Clients q # Bad clients Cumul. bad rate QLift #Bad clients Cumul. bad rate QLift 1 100 0.1 20 20.0% 2.00 35 35.0% 3.50 2 100 0.2 18 19.0% 1.90 16 25.5% 2.55 3 100 0.3 17 18.3% 1.83 8 19.7% 1.97 4 100 0.4 15 17.5% 1.75 8 16.8% 1.68 5 100 0.5 12 16.4% 1.64 7 14.8% 1.48 6 100 0.6 6 14.7% 1.47 6 13.3% 1.33 7 100 0.7 4 13.1% 1.31 6 12.3% 1.23 8 100 0.8 3 11.9% 1.19 5 11.4% 1.14 9 100 0.9 3 10.9% 1.09 5 10.7% 1.07 10 100 1.0 2 10.0% 1.00 4 10.0% 1.00 All 1000 100 100 The Gini index for each model is equal to 0.420. KS is equal to 0.356 for model 1 and to 0.344 for model 2. According to these numbers, one can say that both models are almost the same, maybe the first one is slightly better. 
However, if we look at the models in more detail, we find that they differ significantly. We get the first insight from their Lorenz curves in Figure 8. MARTIN ĮEZÁČ AND JAN KOLÁČEK18 Figure 8. Lorenz curves for model 1 and model 2. We can see that model 1 is stronger for higher score bands. This means that this model better separates the good from the best clients. On the other hand, model 2 is stronger for lower score bands, which means that it better separates the bad from the worst clients. We can read the same result from the figures of QLift and RLift in Figure 9. LIFT-BASED QUALITY INDEXES FOR CREDIT … 19 Figure 9. QLift and RLift for model 1 and model 2. MARTIN ĮEZÁČ AND JAN KOLÁČEK20 It is necessary to mention one computational problem at this point. In the discrete case, as in the case of Table 2, we do not know the value of QLift for q less than 0.1. Since QLift is not defined for ,0=q we need to extrapolate it somehow. According to the shape of the QLift curve, we propose using quadratic extrapolation, which yields ( ) ( ) ( ) ( ).3.02.031.030 LiftLiftLiftLift QQQQ +⋅−⋅= (23) When we have a full data set, we can use formula (17). In this case, the extrapolation is not needed. Of course, we still do not have the value QLift (0). However, if we start the computation of QLift in some positive value of q, which is sufficiently near to zero, the final result is precise enough. Overall, we can compare our two scoring models. Table 3, below, contains values of Gini indexes, K-S statistics, values of QLift(0.1), LR indexes, and IRL indexes. QLift(0.1) is a local measure of a model’s quality; model 2 was designed to be better in the first score bands, hence it is natural that the value of QLift(0.1) is significantly higher for model 2, concretely 3.5 versus 2.0. On the other hand, all remaining indexes are global measures of a model’s quality. Models were designed to have the same Gini index and similar KS. However, we can see that LR and IRL significantly differ for our models, 0.242 versus 0.372 and 0.699 versus 0.713, respectively. Table 3. Quality indexes of two assessed scoring models Scoring model 1 Scoring model 2 Gini 0.420 0.420 KS 0.356 0.344 QLift(0.1) 2.000 3.500 LR 0.242 0.372 IRL 0.699 0.713 LIFT-BASED QUALITY INDEXES FOR CREDIT … 21 Finally, if the expected reject rate is up to 40%, which is a very natural assumption, using LR and IRL, we can state that model 2 is better than model 1 although their Gini indexes are equal and even their KS are in reverse order. 5. Conclusion In Section 2, we presented widely used indexes for the assessment of credit scoring models. We focused mainly on the definitions of Lorenz curve, CAP, Gini index, AR, and lift. The Lorenz curve is sometimes confused with ROC. The discussion of their definitions is given within the paper. We suggest using the definition of the Lorenz curve given in Müller and Rönz [8], the definition of ROC given in Siddiqi [11], and the definition of CAP given in Sobehart et al. [12]. The main part of the paper, Section 3, was devoted to lift. Formulas for lift in basic and quantile form were presented as well as their forms for ideal models. These formulas allow the calculation of the value of lift for any given score and any given quantile level and comparison with the best obtainable results. Lift ratio was presented as analogous to Gini index. An important feature is that LR allows the fair comparison of two models developed on different data samples, which is not possible with lift or QLift. 
Furthermore, a relative lift function was proposed, which shows the ratio of the QLifts of the actual and ideal models. Finally, integrated relative lift was defined. The connection to the c-statistic was presented by means of a simulation by using normally distributed scores. This simulation showed that IRL and the c-statistic are approximately equal in the case when the variances of good and bad clients are equal. Despite the high popularity of the Gini index and KS, we conclude that the proposed lift based indexes are more appropriate for assessing the quality of credit scoring models. In particular, it is better to use them in the case of an asymmetric Lorenz curve. In such cases, using the Gini index or KS during the development process could lead to the selection of a weaker model. MARTIN ĮEZÁČ AND JAN KOLÁČEK22 Acknowledgement This research was supported by our department and by The Jaroslav Hájek Center for Theoretical and Applied Statistics (grant No. LC 06024). References [1] R. Anderson, The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation, Oxford University Press, Oxford, 2007. [2] M. J. A. Berry and G. S. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 2nd Edition, Wiley, Indianapolis, 2004. [3] D. S. Coppock, Why Lift? DM Review Online, (2002). [Accessed on 1 December 2009]. www.dmreview.com/news/5329-1.html [4] J. N. Crook, D. B. Edelman and L. C. Thomas, Recent developments in consumer credit risk assessment, European Journal of Operational Research 183(3) (2007), 1447-1465. [5] B. Engelmann, E. Hayden and D. Tasche, Measuring the Discriminatory Power of Rating System, (2003). [Accessed on 4 October 2010]. http://www.bundesbank.de/download/bankenaufsicht/dkp/200301dkp_b.pdf [6] P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, Wiley, Chichester, 2003. [7] D. J. Hand and W. E. Henley, Statistical classification methods in consumer credit scoring: A review, Journal of the Royal Statistical Society, Series A 160(3) (1997), 523-541. [8] M. Müller and B. Rönz, Credit Scoring using Semiparametric Methods, In: J. Franke, W. Härdle and G. Stahl (Eds.), Measuring Risk in Complex Stochastic Systems, Springer-Verlag, New York, 2000. [9] R. B. Nelsen, Concordance and Gini’s measure of association, Journal of Nonparametric Statistics 9(3) (1998), 227-238. [10] R. Newson, Confidence intervals for rank statistics: Somers’ D and extensions, The Stata Journal 6(3) (2006), 309-334. [11] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, Wiley, New Jersey, 2006. [12] J. Sobehart, S. Keenan and R. Stein, Benchmarking Quantitative Default Risk Models: A Validation Methodology, Moody’s Investors Service, (2000). [Accessed on 4 October 2010]. http://www.algorithmics.com/EN/media/pdfs/Algo-RA0301-ARQ-DefaultRiskModels.pdf LIFT-BASED QUALITY INDEXES FOR CREDIT … 23 [13] R. H. Somers, A new asymmetric measure of association for ordinal variables, American Sociological Review 27 (1962), 799-811. [14] L. C. Thomas, A survey of credit and behavioural scoring: Forecasting financial risk of lending to consumers, International Journal of Forecasting 16(2) (2000), 149-172. [15] L. C. Thomas, D. B. Edelman and J. N. Crook, Credit Scoring and its Applications, SIAM Monographs on Mathematical Modelling and Computation, Philadelphia, 2002. [16] L. C. 
Thomas, Consumer Credit Models: Pricing, Profit, and Portfolio, Oxford University Press, Oxford, 2009. [17] A. D. Wilkie, Measures for Comparing Scoring Systems, In: L. C. Thomas, D. B. Edelman and J. N. Crook (Eds.): Readings in Credit Scoring, Oxford University Press, Oxford, (2004), 51-62. [18] K. Xu, How has the literature on Gini’s index evolved in past 80 years? (2003). [Accessed on 1 December 2009]. economics.dal.ca/RePEc/dal/wparch/howgini.pdf g This article was downloaded by: [ Masarykova Univerzita v Brne] , [ Ivana Horova] On: 12 January 2012, At: 08: 02 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Communications in Statistics - Theory and Methods Publication details, including instructions for authors and subscription information: http:/ / www.tandfonline.com/ loi/ lsta20 Visualization and Bandwidth Matrix Choice Ivana Horová a , Jan Koláček a & Kamila Vopatová a a Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic Available online: 10 Jan 2012 To cite this article: Ivana Horová, Jan Koláček & Kamila Vopatová (2012): Visualization and Bandwidth Matrix Choice, Communications in Statistics - Theory and Methods, 41:4, 759-777 To link to this article: http:/ / dx.doi.org/ 10.1080/ 03610926.2010.529539 PLEASE SCROLL DOWN FOR ARTICLE Full terms and conditions of use: http: / / www.tandfonline.com/ page/ terms-and-conditions This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material. Communications in Statistics—Theory and Methods, 41: 759–777, 2012 Copyright © Taylor & Francis Group, LLC ISSN: 0361-0926 print/1532-415X online DOI: 10.1080/03610926.2010.529539 Visualization and Bandwidth Matrix Choice IVANA HOROVÁ, JAN KOLÁ ˇCEK, AND KAMILA VOPATOVÁ Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic Kernel smoothers are among the most popular nonparametric functional estimates. These estimates depend on a bandwidth that controls the smoothness of the estimate. While the literature for a bandwidth choice in a univariate density estimate is quite extensive, the progress in the multivariate case is slower. The authors focus on a bandwidth matrix selection for a bivariate kernel density estimate provided that the bandwidth matrix is diagonal. A common task is to find entries of the bandwidth matrix which minimizes the Mean Integrated Square Error (MISE). It is known that in this case there exists explicit solution of an asymptotic approximation of MISE (Wand and Jones, 1995). In the present paper we pay attention to the visualization and optimizers are presented as intersection of bivariate functional surfaces derived from this explicit solution and we develop the method based on this visualization. 
A simulation study compares the least square cross-validation method and the proposed method. Theoretical results are applied to real data. Keywords Asymptotic mean integrated square error; Bandwidth matrix; Mean integrated square error; Product kernel. Mathematics Subject Classification 62G07; 62H12. 1. Introduction Methods for a bandwidth choice in a univariate density estimate have been developed in many papers and monographs (e.g., Cao et al., 1994; Chaudhuri and Marron, 1999; Härdle et al., 2004; Horová et al., 2002; Horová and Zelinka, 2007; Silverman, 1989; Taylor, 1989; Wand and Jones, 1995). In this paper we focus on a problem of a data-driven choice of a bandwidth matrix in bivariate kernel density estimates. Bivariate kernel density estimation problem is an excellent setting for understanding aspects of multivariate kernel smoothing. This problem, despite being the simplest multivariate density estimation problem, presents many challenges when it comes to selecting the correct amount of smoothing (i.e., choosing of a bandwidth matrix H). Most of popular bandwidth Received July 19, 2010; Accepted September 28, 2010 Address correspondence to Ivana Horová, Department of Mathematics and Statistics, Masaryk University, Kotlarska 2, Brno 61137, Czech Republic; E-mail: horova@math.muni.cz 759 Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 760 Horová et al. selection methods in a univariate case (e.g., Cao et al., 1994; Härdle et al., 2004) can be transferred into multivariate settings. The least squares cross-validation, the biased cross-validation, the smoothed cross-validation, and plug-in methods in multivariate case have been developed and widely discussed (Chacón and Duong, 2009; Duong and Hazelton, 2003, 2005a,b; Sain et al., 1994; Scott, 1992; Wand and Jones, 1994). The problem of the bandwidth matrix selection can be simplified by imposing constraints on H (Wand and Jones, 1995). A common approach to the multivariate smoothing is to first rescale the data so the sample variances are equal in each dimension—this approach is called scaling or sphering the data so the sample covariance matrix is the identity (e.g., Duong, 2007; Wand and Jones, 1993). The aim of the present paper is to propose methods for the bandwidth matrix choice in bivariate case without using any pretransformations of the data. It is well known that a visualization is an important component of a nonparametric data analysis (e.g., Chaudhuri and Marron, 1999; Godtliebsen et al., 2002). We use this effective strategy to clarify the process of the bandwidth matrix choice by using bivariate functional surfaces. The proposed method uses an optimally balanced relation between bias squared and variance and a suitable estimate of the asymptotic approximation of Mean Integrated Square Error (MISE). The paper is organized as follows: In Section 2 we describe the basic properties of the multivariate density estimates. Section 3 is devoted to the mean integrated square error and its minimization. In Section 4 we deal with asymptotic MISE (AMISE) and its minimization. In Section 5 we describe the idea of our method and the theoretical results are explain by means of bivariate functional surfaces. In Section 6 we conduct a simulation study comparing the least squares crossvalidation (LSCV) method and the proposed method. In Section 7 the theoretical results are applied to real data. 2. Kernel Density Estimation Consider a d-variate random sample X1 Xn coming from an unknown density f. 
We denote Xi1 Xid the components of Xi and a generic vector x ∈ d has the representation x = x1 xd T . For a d-variate random sample X1 Xn drawn from the density f the kernel density estimator is defined ˆf x H = 1 n n i=1 KH x − Xi (1) where H is a symmetric positive definite d × d matrix called the bandwidth matrix, and KH x = H −1/2 K H−1/2 x , where H stands for the determinant of H, and K is a d-variate kernel function. The kernel function K is often taken to be a d-variate probability density function. There are two types of multivariate kernels created from a symmetric univariate kernel k—a product kernel KP and a spherically symmetric kernel KS : KP x = d i=1 k xi KS x = ckk √ xT x Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 Visualization and Bandwidth Choice 761 where c−1 k = k √ xT x dx. The choice of a kernel does not influence the estimate as significantly as the bandwidth matrix. The choice of the smoothing matrix H is of a crucial importance. This matrix controls the amount and the direction of the multivariate smoothing. Let ℋℱ denote the class of symmetric, positive definite d × d matrices. The matrix H ∈ ℋℱ has 1 2 d d + 1 independent entries which have to be chosen. A simplification can be obtained by imposing the restriction H ∈ ℋ , where ℋ ⊂ ℋℱ is the subclass of diagonal positive definite matrices: H = diag h2 1 h2 d . A further simplification follows from the restriction H ∈ ℋ where ℋ = h2 Id h > 0 , Id is d × d identity matrix and leads to the single bandwidth estimator (Wand and Jones, 1995). Using the single bandwidth matrix parametrization class ℋ is not advised for data which have different dispersions in the coordinate directions (Wand and Jones, 1993). On the other hand, the bandwidth selectors in the general ℋℱ class are able to handle differently dispersed data but are computationally intensive. So the ℋ diagonal matrix class is a compromise between computational speed with sufficient flexibility. For this reason we turn our attention to the bivariate kernel density estimate provided that the bandwidth matrix is diagonal (i.e., H = diag h2 1 h2 2 ). First, let us make some notation: • will be shorthand for and dx will be shorthand for dx1dx2, V K = K2 x dx, and • f stands for the gradient and 2 f for the Hessian matrix. f =   f x x1 f x x2   2 f =    2f x x2 1 2f x x1 x2 2f x x1 x2 2f x x2 2    For the next steps we need a few assumptions about the kernel function K, the bandwidth matrix H, and the density f: (A1) K is a product bivariate kernel function satisfying K x dx = 1 xK x dx = 0 xxT K x dx = 2 K I2 (A2) H = Hn is a sequence of diagonal bandwidth matrices such that n−1 h1h2 −1 and h2 1 and h2 2 approach zero as n → . (A3) Each entry of the Hessian matrix 2 f is piecewise continuous and square integrable. 3. MISE and Its Minimization The quality of the estimate (1) can be expressed in terms of MISE (Wand and Jones, 1995) MISE H = E ˆf x H − f x 2 dx = var ˆf x H dx + bias2 ˆf x H dx Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 762 Horová et al. that is, MISE H = 1 nh1h2 V K + o nh1h2 −1 + 1 4 2 2 K h4 1 4 0 + 2h2 1h2 2 2 2 + h4 2 0 4 + o h2 1 + h2 2 2 where k ℓ = 2 f x2 1 k/2 2 f x2 2 ℓ/2 dx k ℓ = 0 2 4 k + ℓ = 4 Let HMISE be a minimizer of MISE with respect to H, that is, HMISE = arg min H∈ℋ MISE The well known method of estimating HMISE is the LSCV method (Duong and Hazelton, 2005b; Wand and Jones, 1995). 
The LSCV objective function is LSCV H = ˆf x H 2 dx − 2 n n i=1 ˆf−i Xi H ˆf−i Xi H = 1 n − 1 n j=1 j=i KH Xi − Xj This function can be written in terms of convolutions f ∗ g x = f u g x − u du (Duong and Hazelton, 2005b): LSCV H = n−2 n i j=1 KH ∗ KH − 2KH Xi − Xj + 2n−1 KH 0 Moreover, HLSCV = arg minH∈ℋ LSCV is an unbiased estimate of H in the sense E LSCV H = MISE ˆf · H − f2 x dx 4. AMISE and Its Minimization Since MISE is not mathematically tractable, we employ an AMISE, which can be written as a sum of an asymptotic integrated variance and an asymptotic integrated square bias: AMISE H = V K nh1h2 AIVar + 1 4 2 K 2 h4 1 4 0 + 2h2 1h2 2 2 2 + h4 2 0 4 AIBias2 (2) and HAMISE stands for minimum of AMISE HAMISE = arg min H∈ℋ AMISE Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 Visualization and Bandwidth Choice 763 First, we summarize properties of AMISE and HAMISE. As a multivariate analogue of the functional, which minimization yields optimal kernels, we consider the functional W K = V K 2/3 2 K 2/3 Moreover, we define as a canonical factor 3 = V K 2 K 2 Making some calculations we arrive at the following lemma. Lemma 4.1. AMISE(H) can be expressed in the form AMISE H = W K nh1h2 + 1 4 2 h4 1 4 0 + 2h2 1h2 2 2 2 + h4 2 0 4 (3) It can be shown (Wand and Jones, 1995) that the entries of HAMISE are equal to h2 1 AMISE =   3/4 0 4 V K n 2 K 2 3/4 4 0 2 2 + 1/2 0 4 1/2 4 0   1/3 (4) h2 2 AMISE =   3/4 4 0 V K n 2 K 2 3/4 0 4 2 2 + 1/2 0 4 1/2 4 0   1/3 Thus h2 i AMISE = O n−1/3 i = 1 2. Inserting these quantities into the formula (2), we arrive at the following lemma. Lemma 4.2. Let HAMISE ∈ ℋ be a minimizator of AMISE with entries given by formula (4). Then varˆf x HAMISE dx AIVar = 2 biasˆf x HAMISE 2 dx AIBias2 (5) This relation is of great importance because it serves as a basis for a method we are going to present. It means that minimization of AMISE is equivalent to seeking for HAMISE such that (5) is satisfied. Further, the use of formulas (4) in the relation (3) yields AMISE HAMISE = 3 2 n−2/3 W K 2 2 + 1/2 0 4 1/2 4 0 1/3 (6) that is, AMISE HAMISE = O n−2/3 . It is easy to show that h2 AMISE h1 AMISE = 4 0 0 4 1/4 (7) Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 764 Horová et al. and 2 2 + 1/2 0 4 1/2 4 0 1/3 = n1/3h1 AMISE h2 AMISE Then substituting 2 2 + 1/2 0 4 1/2 4 0 1/3 into (6) we obtain AMISE HAMISE = 3W K 2nh1 AMISEh2 AMISE This formula allows to separate kernel effects from bandwidth matrix effects in AMISE and thus offers a possibility to choose the kernel and the bandwidth matrix in some automatic and optimal way. For a univariate case an automatic procedure for simultaneous choice of a bandwidth, a kernel, and an order of the kernel was proposed previously (Horová et al., 2002). Remark. The biased cross-validation methods and smoothed cross-validation method for estimating HAMISE have been widely discussed previously (Duong and Hazelton, 2005b; Sain et al., 1994; Wand and Jones, 1994). 5. Proposed Methods Our method is based on formula (5) and on a suitable estimate of AMISE. In Horová et al. (2008) a suitable estimate of AMISE was used and the extension of the method for a univariate case was presented in Horová and Zelinka (2007). Here, we briefly describe this method and provide theoretical results. 
Let AMISE H = varˆf x H dx + biasˆf x H 2 dx where varˆf x H dx = 1 n K2 H x − y ˆf y H dy dx = 1 n H −1/2 K2 z ˆf x − H1/2 z H dzdx = 1 n H −1/2 V K ˆf x H dx = 1 n H −1/2 V K and biasˆf x H 2 dx = KH x − y ˆf y H dy − ˆf x H 2 dx = K z ˆf x − H1/2 z H dz − ˆf x H 2 dx Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 Visualization and Bandwidth Choice 765 = 1 n2 n i j=1 KH ∗ KH ∗ KH ∗ KH − 2KH ∗ KH ∗ KH + KH ∗ KH Xi − Xj Here a connection of the estimated squared bias term with the bootstrap method of Taylor (1989) can be seen. Hereinafter, HAMISE = diag ˆh2 1 AMISE ˆh2 2 AMISE is the minimizer of AMISE over the class of diagonal bandwidth matrices ℋ (i.e., HAMISE = arg minH∈ℋ AMISE). Let g h1 h2 stand for the sum of convolutions in the form biasˆf x H 2 dx, that is, g h1 h2 = n i=1 n j=1 KH ∗ KH ∗ KH ∗ KH − 2KH ∗ KH ∗ KH + KH ∗ KH Xi − Xj The idea of our method is based on Lemma 4.2. Thus, we are seeking for ˆh1, ˆh2 such that 1 n 1 ˆh1 ˆh2 V K = 2 1 n2 g ˆh1 ˆh2 that is, nV K = 2ˆh1 ˆh2g ˆh1 ˆh2 (8) It means that minimization of AMISE could be achieved through the solving Eq. (8). But (8) is the nonlinear equation for two variables and thus we need another relation between h1 and h2. This problem will be dealt with in the next section. Now we explain the rationale of the proposed method. Theorem 5.1. Let assumptions (A1), (A2), (A3) be satisfied and let the density f have continuous partial derivatives of the fourth order. Then E KH x − y ˆf y H dy = f x + 2 K tr H 2 f x + 1 4 2 K 2 tr H 2 f H 2 f x + o trH The proof is given in the Appendix. Corollary 5.1. Under assumptions of Theorem 5.1, the relation E biasˆf x H = biasˆf x H + o trH is valid. The last relation confirms that the solution of Eq. (8) may be expected to be reasonably close to HAMISE. Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 766 Horová et al. Figure 1. Optimal values of h1 and h2 lie on the curve h1 h2 = 2h1h2g h1 h2 − nV K = 0, which is an intersection of the surface h1 h2 (light gray) and the coordinate plane z = 0 (white). Remark. Jones et al. (1991) was treated of the properties of the estimated square bias for a univariate case. Remark. Wand and Jones (1995) reminded of solve-the-equation (STE) univariate selectors, which require solving nonlinear equation with respect to h. But their idea is different from that which we present. Figure 1 shows the shape of the functional h1 h2 = 2h1h2g h1 h2 − nV K and the point we are seeking lies on curve h1 h2 = 0. Obviously, it has not a unique solution, and thus we need another relationship between h1 and h2 to get the unique solution. We propose two possibilities how to find this relationship. 5.1. M1 Method Using Scott’s rule (Scott, 1992) ˆhi = ˆin−1/6 for i = 1 2 gives the other relationship between h1 and h2. It is easy to see that h2 = ˆch1 ˆc = ˆ2 ˆ1 and ˆ can be estimated by a sample standard deviation, or by some robust method (e.g., a median deviation). Now, the system of two equations for two unknowns h1, h2 has to be solved: M1    2h1h2g h1 h2 = nV K h2 = ˆch1 (9) Figure 2 demonstrates the solution of the system (9) as an intersection of the functional and planes. As it will be shown in a simulation study, the method is rather inappropriate because the entries of covariance matrix are often not able to take into account the curvature of f and its orientation. Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 Visualization and Bandwidth Choice 767 Figure 2. 
5.1. M1 Method

Using Scott's rule (Scott, 1992), $\hat h_i = \hat\sigma_i\,n^{-1/6}$, $i = 1, 2$, gives the other relationship between $h_1$ and $h_2$. It is easy to see that $h_2 = \hat c\,h_1$ with $\hat c = \hat\sigma_2/\hat\sigma_1$, where $\hat\sigma_i$ can be estimated by a sample standard deviation or by some robust method (e.g., a median deviation). Now the system of two equations in the two unknowns $h_1$, $h_2$ has to be solved:
\[
\mathrm{M1}: \quad \begin{cases} 2h_1h_2\,g(h_1,h_2) = nV(K),\\ h_2 = \hat c\,h_1.\end{cases} \tag{9}
\]
Figure 2 demonstrates the solution of the system (9) as an intersection of the functional and planes. As will be shown in the simulation study, the method is rather inappropriate in some cases, because the entries of the covariance matrix are often not able to take into account the curvature of f and its orientation.

Figure 2. M1 method: the point $(\hat h_1, \hat h_2)$ we are looking for is an intersection of the plane $h_2 - \hat c h_1 = 0$ (dark gray), the surface $\Psi(h_1,h_2)$ (light gray), and the coordinate plane $z = 0$ (white).

5.2. M2 Method

The second method can be considered a hybrid of the biased cross-validation method (Duong and Hazelton, 2005b; Sain et al., 1994) and the plug-in method (Wand and Jones, 1994). We are concerned with fact (7), that is,
\[
h_{2,\mathrm{AMISE}}^4\,\psi_{04} = h_{1,\mathrm{AMISE}}^4\,\psi_{40}, \tag{10}
\]
where $\psi_{04}$ and $\psi_{40}$ are as in (2). For the sake of simplicity, the notation $h_1 = h_{1,\mathrm{AMISE}}$, $h_2 = h_{2,\mathrm{AMISE}}$ is used in what follows. Relation (10) means that $h_1$, $h_2$ should be such that this equation is satisfied; at this step, estimates of $\psi_{04}$ and $\psi_{40}$ are needed. Since we assume that K is a product kernel, we can express them as
\[
\hat\psi_{04} = n^{-2}\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_2^2} * \frac{\partial^2 K_H}{\partial x_2^2}\right)(X_i - X_j), \qquad
\hat\psi_{40} = n^{-2}\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_1^2} * \frac{\partial^2 K_H}{\partial x_1^2}\right)(X_i - X_j),
\]
where, instead of a pilot bandwidth matrix G as in the plug-in method, the bandwidth matrix H itself is used (i.e., $\hat\psi_{04}$, $\hat\psi_{40}$ estimate the density curvature in both directions). Now relation (10) yields
\[
h_2^4\, n^{-2}\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_2^2} * \frac{\partial^2 K_H}{\partial x_2^2}\right)(X_i - X_j) = h_1^4\, n^{-2}\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_1^2} * \frac{\partial^2 K_H}{\partial x_1^2}\right)(X_i - X_j). \tag{11}
\]
Hence we have the second equation for $h_1$, $h_2$. The proposed method is described by the system
\[
\mathrm{M2}: \quad \begin{cases} 2h_1h_2\,g(h_1,h_2) = nV(K),\\[1mm] h_2^4\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_2^2}*\frac{\partial^2 K_H}{\partial x_2^2}\right)(X_i-X_j) = h_1^4\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_1^2}*\frac{\partial^2 K_H}{\partial x_1^2}\right)(X_i-X_j).\end{cases} \tag{12}
\]
The solution $(\hat h_1,\hat h_2)$ of this nonlinear system is an estimate of $(h_{1,\mathrm{AMISE}}, h_{2,\mathrm{AMISE}})$. The system can be solved by Newton's method.

Table 1. Target densities. Here $N(\mu_1,\mu_2;\sigma_1^2,\sigma_2^2;\rho)$ denotes the bivariate normal density with mean $(\mu_1,\mu_2)^T$, variances $\sigma_1^2$, $\sigma_2^2$ and correlation $\rho$; products denote densities with independent coordinates.

Normal I      N(0, 0; 1/4, 1; 0)
Normal II     (1/2) N(-3/2, 0; 1/16, 1; 0) + (1/2) N(3/2, 0; 1/16, 1; 0)
Normal III    (1/2) N(0, 0; 1, 1; 0) + (1/2) N(3, 0; 1, 1/2; 0)
Normal IV     (1/3) N(0, 0; 1, 1; 0) + (1/3) N(0, 4; 1, 4; 0) + (1/3) N(4, 0; 4, 1; 0)
Normal V      (1/4) N(0, 0; 1, 1; 0) + (3/4) N(4, 3; 4, 3; 0)
Normal VI     (1/5) N(0, 0; 1, 1; 0) + (1/5) N(1/2, 1/2; 4/9, 4/9; 0) + (3/5) N(13/12, 13/12; 25/81, 25/81; 0)
Normal VII    (1/3) N(0, -3; 1, 1/16; 0) + (1/3) N(0, 0; 1, 1/16; 0) + (1/3) N(0, 3; 1, 1/16; 0)
Normal VIII   (1/3) N(0, -3; 1, 1/16; 0) + (1/3) N(0, 0; 1/2, 1/16; 0) + (1/3) N(0, 3; 1/8, 1/16; 0)
Normal IX     (1/3) N(-6/5, 0; 9/16, 9/16; 7/10) + (1/3) N(0, 0; 9/16, 9/16; -7/10) + (1/3) N(6/5, 0; 9/16, 9/16; 7/10)
Beta Beta     B(2, 4) x B(2, 6)
Beta Weibull  B(2, 4) x W(2, 3)
Gamma Beta    Gamma(2, 1) x B(2, 6)
LogNormal     LN(0, 0; 1, 1; 0)

Figure 4. Contour plots of the normal target densities.

Figure 5. Contour plots of the nonnormal target densities.

In Fig. 3 the graphs of the surfaces $\Psi(h_1,h_2) = 2h_1h_2\,g(h_1,h_2) - nV(K)$ and $\Phi(h_1,h_2)$, the difference of the two sides of (11), are presented; the solution of the system yields the estimates $\hat h_1$ and $\hat h_2$.

Figure 3. The searched point $(\hat h_1,\hat h_2)$ is an intersection of the surface $\Psi(h_1,h_2)$ (light gray), the coordinate plane $z = 0$ (white), and the surface $\Phi(h_1,h_2)$ (dark gray).

Remark. It is clear that $\hat h_{i,\mathrm{AMISE}}^2 = O(n^{-1/3})$. Asymptotic properties and the rate of convergence of $\hat H_{\mathrm{AMISE}}$ to $H_{\mathrm{AMISE}}$ can be treated in a similar way as in Duong and Hazelton (2005a,b); Duong and Hazelton (2005a) showed that the discrepancy between $H_{\mathrm{AMISE}}$ and $H_{\mathrm{MISE}}$ is asymptotically negligible.
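Before turning to the simulations, here is a minimal Matlab sketch of the M1 selector: substituting $h_2 = \hat c h_1$ from (9) into $\Psi$ reduces the problem to a univariate root search (psi8 is the sketch given earlier; in practice fzero may need a bracketing interval on which $\Psi$ changes sign).

% M1: solve Psi(h1, c*h1) = 0 for h1, with c estimated from the data.
c    = std(X(:,2)) / std(X(:,1));
h10  = std(X(:,1)) * size(X,1)^(-1/6);      % Scott's-rule starting value
h1   = fzero(@(h) psi8([h, c*h], X), h10);
H_M1 = diag([h1, c*h1].^2);                 % selected bandwidth matrix

The M2 system (12) can be handled analogously with fsolve once the convolutions of the kernel's second derivatives are coded.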
6. Simulation Study

In this section we conduct a simulation study comparing the LSCV method with the M1 and M2 methods. Samples of size n = 100 were drawn from the densities listed in Table 1; bandwidth matrices were selected for 100 random samples generated from each density. Contour plots of the target densities are displayed in Figures 4 and 5. As a criterion for the comparison of data-driven bandwidth matrix selectors, the average of the integrated square errors,
\[
\overline{\mathrm{ISE}} = \mathrm{avg}_{\hat H}\int\left(\hat f(x,\hat H) - f(x)\right)^2 dx, \tag{13}
\]
is used, where the average is taken over the simulated realizations. The criterion $\overline{\mathrm{IAE}} = \mathrm{avg}_{\hat H}\int\left|\hat f(x,\hat H) - f(x)\right|dx$ could also be considered. Table 2 brings the results of this comparison.

Table 2. $\overline{\mathrm{ISE}}$: the average of ISE, with a standard error in parentheses; all entries are multiples of $10^{-2}$.

Density        LSCV             M1               M2
Normal I       1.58 (0.150)     0.91 (0.041)     0.92 (0.042)
Normal II      1.82 (0.068)     3.59 (0.045)     1.39 (0.043)
Normal III     0.62 (0.040)     0.47 (0.016)     0.49 (0.017)
Normal IV      0.28 (0.024)     0.20 (0.007)     0.23 (0.008)
Normal V       0.23 (0.013)     0.18 (0.005)     0.18 (0.005)
Normal VI      1.55 (0.110)     1.00 (0.045)     1.01 (0.045)
Normal VII     1.23 (0.063)     5.51 (0.146)     1.11 (0.075)
Normal VIII    2.92 (0.126)     5.52 (0.144)     2.76 (0.124)
Normal IX      1.98 (0.084)     1.91 (0.044)     1.81 (0.048)
Beta Beta      30.7 (1.94)      19.3 (0.71)      19.9 (1.04)
Beta Weibull   5.92 (0.420)     3.72 (0.151)     4.12 (0.248)
Gamma Beta     5.93 (0.324)     4.05 (0.128)     4.27 (0.221)
LogNormal      2.49 (0.060)     2.51 (0.065)     2.81 (1.601)

Figures 6, 7 and 8 show the distributions of the entries $\hat h_1$ and $\hat h_2$ of the bandwidth matrices $\hat H_{\mathrm{AMISE}}$ in the $(h_1, h_2)$ coordinate plane. We observe that the LSCV estimates of $H_{\mathrm{AMISE}}$ suffer from large variability; this could be explained by the fact that the $\mathrm{MISE}(H)$ surface is rather flat near $H_{\mathrm{MISE}}$. The M1 and M2 methods perform very similarly; however, the M1 estimator fails for the densities Normal II, Normal VII and Normal VIII, because Scott's (1992) rule does not fully account for the curvature of f. The same problem occurs in the application to real data shown in the next section. The advantage of the M1 method lies in its simplicity.

Figure 6. Distribution of $\hat h_1$ and $\hat h_2$ (normal densities).

Figure 7. Distribution of $\hat h_1$ and $\hat h_2$ (normal densities, continued).

Figure 8. Distribution of $\hat h_1$ and $\hat h_2$ (nonnormal densities).

On the other hand, relative to M1, the LSCV method performs rather well for the normal mixtures Normal VII and Normal VIII. The M2 method seems to be sufficiently reliable and easy to implement (using the product kernel). This is also confirmed by examining these methods on real data sets in the next section.
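For reference, a minimal Matlab sketch of one summand of criterion (13): the ISE of a single estimate, computed by midpoint quadrature on a rectangular grid with the Gaussian product kernel. Here f is the target density as a function handle, and X and h = [h1 h2] are one simulated sample with its selected bandwidths (all three are assumed given); the grid limits are arbitrary.

% ISE of one kernel estimate, Eq. (13), by midpoint quadrature.
[x1, x2] = meshgrid(linspace(-6, 6, 201));
fhat = zeros(size(x1));
for i = 1:size(X, 1)                       % accumulate the KDE on the grid
    fhat = fhat + exp(-0.5*(((x1 - X(i,1))/h(1)).^2 + ...
                            ((x2 - X(i,2))/h(2)).^2)) / (2*pi*h(1)*h(2));
end
fhat = fhat / size(X, 1);
dA  = (x1(1,2) - x1(1,1)) * (x2(2,1) - x2(1,1));   % grid cell area
ISE = sum(sum((fhat - f(x1, x2)).^2)) * dA;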
7. Application to Real Data

We applied the proposed methods to the plasma lipid data, a bivariate data set consisting of the concentrations of plasma cholesterol and plasma triglycerides taken on 320 patients with chest pain in a heart disease study (Scott, 1992). A scatterplot of the data is shown in Fig. 9a. Figures 9c and 9d represent the reconstructed probability density functions using the bandwidth matrices $\hat H_{\mathrm{M1}} = \mathrm{diag}(5.37^2, 11.63^2)$ and $\hat H_{\mathrm{M2}} = \mathrm{diag}(14.99^2, 25.58^2)$, respectively. They can be compared with the reconstruction using $\hat H_{\mathrm{LSCV}} = \mathrm{diag}(42.31^2, 31.86^2)$ shown in Fig. 9b. The authors of the original case study (Scott et al., 1978) found two primary clusters in this data set, which is also what the M2 method finds; see also Ćwik and Koronacki (1997), Sain et al. (1994), Silverman (1989), and Wand and Jones (1995). Interestingly, while the LSCV and M1 estimates fail to recognize the bimodality of the density, the M2 estimate is clearly bimodal.

Figure 9. Kernel estimate of the plasma lipid data.

8. Conclusion

The advantage of the proposed methods lies in their flexibility and in the fact that they are very easy to implement, especially for product kernels. Thanks to the fast computation of the convolutions, these methods are also less time consuming. Simulations show that the M2 estimator provides a sufficiently reliable way of estimating arbitrary densities. We emphasize that we restrict ourselves to the Epanechnikov product kernel, because it has an optimality property (Wand and Jones, 1995) and the corresponding integrals can easily be evaluated by means of convolutions. On the other hand, this kernel does not satisfy the smoothness conditions required by biased cross-validation methods and the plug-in method; the simulation study therefore compares the proposed methods with the LSCV method. Moreover, the proposed methods essentially minimize the MISE, as LSCV does. Further assessment of their practical performance and a comparison with other matrix bandwidth selectors in a large-scale simulation study would be important further research.

Appendix

Proof of Theorem 5.1. The proof requires some notation: for an m x n matrix A, vec is the vector operator (i.e., vec A is the mn x 1 vector of stacked columns of A), and $A \otimes B$ denotes the Kronecker product of the matrices A and B. Let us denote
\[
I(x) = \mathrm{E}\int K_H(x-y)\hat f(y,H)\,dy = \mathrm{E}\int K(z)\,\hat f(x - H^{1/2}z, H)\,dz = \int K(z)\,\mathrm{E}\hat f(x - H^{1/2}z, H)\,dz,
\]
and now compute
\[
I_1(z) = \mathrm{E}\hat f(x - H^{1/2}z, H) = \int K_H(x - H^{1/2}z - y)\,f(y)\,dy.
\]
Substitutions yield
\[
I_1(z) = \int K(w - z)\,f(x - H^{1/2}w)\,dw = \int K(u)\,f(x - H^{1/2}u - H^{1/2}z)\,du.
\]
We use the Taylor expansion in the form
\[
f(x - H^{1/2}u - H^{1/2}z) = f(x - H^{1/2}z) - (H^{1/2}u)^T\nabla f(x - H^{1/2}z) + \frac{1}{2}(H^{1/2}u)^T\nabla^2 f(x - H^{1/2}z)(H^{1/2}u) + o(\mathrm{tr}\,H).
\]
Hence, using properties (A1) of the kernel,
\[
I_1(z) = f(x - H^{1/2}z) + \frac{1}{2}\int (H^{1/2}u)^T\nabla^2 f(x - H^{1/2}z)(H^{1/2}u)\,K(u)\,du + o(\mathrm{tr}\,H) = f(x - H^{1/2}z) + \frac{1}{2}\beta_2(K)\,\mathrm{tr}\!\left(H\nabla^2 f(x - H^{1/2}z)\right) + o(\mathrm{tr}\,H).
\]
Further, $\mathrm{tr}\!\left(H\nabla^2 f(x - H^{1/2}z)\right) = (\mathrm{vec}\,H)^T\,\mathrm{vec}\,\nabla^2 f(x - H^{1/2}z)$ (Magnus and Neudecker, 2007).
Now we need the Taylor expansion of $\mathrm{vec}\,\nabla^2 f(x - H^{1/2}z)$:
\[
\mathrm{vec}\,\nabla^2 f(x - H^{1/2}z) = \mathrm{vec}\,\nabla^2 f(x) - (\nabla \otimes \nabla^2 f)(x)\,H^{1/2}z + \frac{1}{2}(\nabla^2 \otimes \nabla^2 f)(x)\,\mathrm{vec}\!\left(H^{1/2}zz^TH^{1/2}\right) + O\!\left(\|\mathrm{vec}\,H\|^2\right).
\]
Thus
\[
I_1(z) = f(x - H^{1/2}z) + \frac{1}{2}\beta_2(K)\,(\mathrm{vec}\,H)^T\left[\mathrm{vec}\,\nabla^2 f(x) - (\nabla\otimes\nabla^2 f)(x)H^{1/2}z + \frac{1}{2}(\nabla^2\otimes\nabla^2 f)(x)\,\mathrm{vec}\!\left(H^{1/2}zz^TH^{1/2}\right)\right] + O\!\left(\|\mathrm{vec}\,H\|^2\right) + o(\mathrm{tr}\,H).
\]
Hence
\[
I(x) = \int K(z)I_1(z)\,dz = \int K(z)f(x - H^{1/2}z)\,dz + \frac{1}{2}\beta_2(K)(\mathrm{vec}\,H)^T\mathrm{vec}\,\nabla^2 f(x) + \frac{1}{4}\beta_2^2(K)(\mathrm{vec}\,H)^T(\nabla^2\otimes\nabla^2 f)(x)\,\mathrm{vec}\,H + o(\mathrm{tr}\,H)
\]
\[
= \mathrm{E}\hat f(x,H) + \frac{1}{2}\beta_2(K)\,\mathrm{tr}\!\left(H\nabla^2 f(x)\right) + \frac{1}{4}\beta_2^2(K)\,\mathrm{tr}\!\left(H\nabla^2 f(x)\,H\nabla^2 f(x)\right) + o(\mathrm{tr}\,H),
\]
where we again use a result from Magnus and Neudecker (2007): for square matrices A, B, C, D,
\[
\mathrm{tr}(ABCD) = \left(\mathrm{vec}\,D^T\right)^T (A \otimes C^T)\,\mathrm{vec}\,B^T.
\]
In our case $D = B = H$ and $A = C = \nabla^2 f(x)$; all the matrices are symmetric, and the last expression follows immediately. Since
\[
\mathrm{E}\hat f(x,H) = f(x) + \frac{1}{2}\beta_2(K)\,\mathrm{tr}\!\left(H\nabla^2 f(x)\right) + o(\mathrm{tr}\,H),
\]
the statement of Theorem 5.1 is valid.

Proof of Corollary 5.1.
\[
\mathrm{E}\,\widehat{\mathrm{bias}}\,\hat f(x,H) = \mathrm{E}\left[\int K_H(x-y)\hat f(y,H)\,dy - \hat f(x,H)\right] = f(x) + \beta_2(K)\,\mathrm{tr}(H\nabla^2 f(x)) + \frac{1}{4}\beta_2^2(K)\,\mathrm{tr}(H\nabla^2 f(x)H\nabla^2 f(x)) + o(\mathrm{tr}\,H) - \mathrm{E}\hat f(x,H)
\]
\[
= \frac{1}{2}\beta_2(K)\,\mathrm{tr}(H\nabla^2 f(x)) + \frac{1}{4}\beta_2^2(K)\,\mathrm{tr}(H\nabla^2 f(x)H\nabla^2 f(x)) + o(\mathrm{tr}\,H).
\]
Further, $\mathrm{E}\hat f(x,H) - f(x) = \frac{1}{2}\beta_2(K)\,\mathrm{tr}(H\nabla^2 f(x)) + o(\mathrm{tr}\,H)$, so
\[
\mathrm{E}\,\widehat{\mathrm{bias}}\,\hat f(x,H) = \mathrm{bias}\,\hat f(x,H) + \frac{1}{4}\beta_2^2(K)\,\mathrm{tr}(H\nabla^2 f(x)H\nabla^2 f(x)) + o(\mathrm{tr}\,H) = \mathrm{bias}\,\hat f(x,H) + o(\mathrm{tr}\,H).
\]

Acknowledgment

This research was supported by the Ministry of Education, Youth and Sports of the Czech Republic under the project LC06024 and by Masaryk University under the Student Project Grant MUNI/A/1001/2009. The authors would like to thank José E. Chacón for his very helpful and constructive comments and suggestions.

References

Cao, R., Cuevas, A., González Manteiga, W. (1994). A comparative study of several smoothing methods in density estimation. Comput. Statist. Data Anal. 17:153-176.
Chacón, J. E., Duong, T. (2009). Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. Test 19:375-398.
Chaudhuri, P., Marron, J. S. (1999). SiZer for exploration of structure in curves. J. Amer. Statist. Assoc. 94:807-823.
Ćwik, J., Koronacki, J. (1997). A combined adaptive-mixtures/plug-in estimator of multivariate probability densities. Comput. Statist. Data Anal. 26:199-218.
Duong, T. (2007). ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R. J. Stat. Soft. 21:1-16.
Duong, T., Hazelton, M. L. (2003). Plug-in bandwidth matrices for bivariate kernel density estimation. J. Nonparametr. Stat. 15:17-30.
Duong, T., Hazelton, M. L. (2005a). Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. J. Multivariate Anal. 93:417-433.
Duong, T., Hazelton, M. L. (2005b). Cross-validation bandwidth matrices for multivariate kernel density estimation. Scand. J. Statist. 32:485-506.
Godtliebsen, F., Marron, J. S., Chaudhuri, P. (2002). Significance in scale space for density estimation. J. Comput. Graph. Statist. 11:1-21.
Härdle, W., Müller, M., Sperlich, S., Werwatz, A. (2004). Nonparametric and Semiparametric Models. Retrieved from http://fedc.wiwi.hu-berlin.de/xplore/ebooks/html/spm/
Horová, I., Vieu, P., Zelinka, J. (2002). Optimal choice of nonparametric estimates of a density and of its derivatives. Statistics & Decisions 20:355-378.
Horová, I., Koláček, J., Zelinka, J., Vopatová, K. (2008). Bandwidth choice for kernel density estimates. Proc. IASC, 542-551.
Horová, I., Zelinka, J. (2007). Contribution to the bandwidth choice for kernel density estimates. Comput. Statist. 22:31-47.
Jones, M. C., Marron, J. S., Park, B. U. (1991). A simple root n bandwidth selector. Ann. Statist. 19:1919-1932.
Magnus, J. R., Neudecker, H. (2007). Matrix Differential Calculus with Applications in Statistics and Econometrics. Chichester: Wiley.
Sain, S. R., Baggerly, K. A., Scott, D. W. (1994). Cross-validation of multivariate densities. J. Amer. Statist. Assoc. 89:807-817.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. New York: Wiley.
Scott, D. W., Gorry, G. A., Hoffman, R. G., Barboriak, J. J., Gotto, A. M. (1978). A new approach for evaluating risk factors in coronary artery disease: a study of lipid concentrations and severity of disease in 1847 males. Circulation 62:477-484.
Silverman, B. W. (1989). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall.
Taylor, C. C. (1989). Bootstrap choice of the smoothing parameter in kernel density estimation. Biometrika 76:705-712.
Wand, M. P., Jones, M. C. (1993). Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Amer. Statist. Assoc. 88:520-528.
Wand, M. P., Jones, M. C. (1994). Multivariate plug-in bandwidth selection. Comput. Statist. 9:97-116.
Wand, M. P., Jones, M. C. (1995). Kernel Smoothing. London: Chapman & Hall.

Journal of Statistics: Advances in Theory and Applications, Volume 8, Number 2, 2012, Pages 91-103. 2010 Mathematics Subject Classification: 62G08. Keywords and phrases: kernel regression, bandwidth selection, iterative method. Received November 1, 2012. © 2012 Scientific Advances Publishers.

ITERATIVE BANDWIDTH METHOD FOR KERNEL REGRESSION

JAN KOLÁČEK and IVANA HOROVÁ
Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic
e-mail: kolacek@math.muni.cz

Abstract

The aim of this contribution is to extend the idea of the iterative method known for kernel density estimation to kernel regression. The method is based on a suitable estimate of the mean integrated square error. This approach leads to an iterative, quadratically convergent process. We conduct a simulation study comparing the proposed method with the well-known cross-validation method. Results are implemented in Matlab.

1. Univariate Kernel Density Estimator

Let $X_1, \ldots, X_n$ be independent real random variables having the same continuous density f. The symbol $\hat f$ will be used to denote whatever density estimate is currently being considered.

Definition 1.1. Let k be an even nonnegative integer and K a real-valued function continuous on $\mathbb{R}$ satisfying the conditions:
(i) $|K(x) - K(y)| \le L|x - y|$ for a constant $L > 0$ and all $x, y \in [-1, 1]$,
(ii) $\mathrm{support}(K) = [-1, 1]$, $K(-1) = K(1) = 0$,
(iii) $\displaystyle\int_{-1}^{1} x^j K(x)\,dx = \begin{cases} 0, & 0 < j < k,\\ 1, & j = 0,\\ \beta_k \neq 0, & j = k.\end{cases}$
Such a function is called a kernel of order k, and the class of these kernels is denoted $S_{0k}$.

Remark 1.2. Well-known kernels are, e.g.,
(a) the Epanechnikov kernel: $K(x) = \frac{3}{4}(1 - x^2)\,I_{[-1,1]}(x)$,
(b) the quartic kernel: $K(x) = \frac{15}{16}(1 - x^2)^2\,I_{[-1,1]}(x)$,
(c) the triweight kernel: $K(x) = \frac{35}{32}(1 - x^2)^3\,I_{[-1,1]}(x)$,
(d) the Gaussian kernel: $K(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$,
where $I_{[-1,1]}$ is the indicator function. Though the Gaussian kernel does not satisfy assumption (ii), it is very popular in many applications.
For $K \in S_{0k}$, set $K_h(\cdot) = \frac{1}{h}K\!\left(\frac{\cdot}{h}\right)$, $h > 0$. The parameter h is called a bandwidth. The kernel estimator of f at a point $x \in \mathbb{R}$ is defined as
\[
\hat f(x, h) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right).
\]
The problem of choosing the smoothing parameter is of crucial importance and will be treated in the next sections. Our analysis requires an appropriate error criterion, both for the error when estimating the density at a single point and for the error over the whole real line. A useful criterion at a single point is the mean square error (MSE),
\[
\mathrm{MSE}\{\hat f(x,h)\} = \mathrm{E}\left\{\hat f(x,h) - f(x)\right\}^2.
\]
As a global criterion, we consider the mean integrated square error,
\[
\mathrm{MISE}\{\hat f(\cdot,h)\} = \int \mathrm{E}\left\{\hat f(x,h) - f(x)\right\}^2 dx.
\]
Since the MISE is not mathematically tractable, we employ the asymptotic mean integrated square error (AMISE), which can be written as a sum of the asymptotic integrated variance and the asymptotic integrated square bias,
\[
\mathrm{AMISE}\{\hat f(\cdot,h)\} = \underbrace{\frac{V(K)}{nh}}_{\mathrm{AIV}\hat f} + \underbrace{\frac{\beta_k^2}{(k!)^2}\,h^{2k}\,V\!\left(f^{(k)}\right)}_{\mathrm{AISB}\hat f}, \tag{1.1}
\]
where $V(g) = \int g^2(x)\,dx$. Minimizing (1.1) with respect to h, we obtain the AMISE-optimal bandwidth $h_{\mathrm{opt},k} = \arg\min_h \mathrm{AMISE}\{\hat f(\cdot,h)\}$, which takes the form
\[
h_{\mathrm{opt},k}^{2k+1} = \frac{V(K)\,(k!)^2}{2k\,\beta_k^2\,V\!\left(f^{(k)}\right)n}.
\]
For more details, see, e.g., [9], [14].

2. Iterative Method for Kernel Density Estimation

The problem of choosing how much to smooth, i.e., how to choose the bandwidth, is the crucial common problem in kernel smoothing. Methods for the bandwidth choice have been developed in many papers and monographs, see, e.g., [1, 2, 5, 7, 8, 11, 12, 14] and many others; however, no universally accepted approach to this serious problem exists yet. The iterative method is based on the relation
\[
\mathrm{AIV}\{\hat f(\cdot, h_{\mathrm{opt},k})\} = 2k\,\mathrm{AISB}\{\hat f(\cdot, h_{\mathrm{opt},k})\}, \tag{2.1}
\]
with the estimates of AIV and AISB
\[
\widehat{\mathrm{AIV}}\{\hat f(\cdot,h)\} = \frac{V(K)}{nh}
\]
and
\[
\widehat{\mathrm{AISB}}\{\hat f(\cdot,h)\} = \int\left(\int K(y)\,\hat f(x - hy, h)\,dy - \hat f(x,h)\right)^2 dx = \frac{1}{n^2 h}\sum_{i,j=1}^{n}\Lambda\!\left(\frac{X_i - X_j}{h}\right),
\]
where $\Lambda(z) = (K*K*K*K - 2K*K*K + K*K)(z)$ and $*$ denotes the convolution, i.e., $(K*K)(u) = \int K(t)K(u - t)\,dt$. The bandwidth estimate $\hat h_{\mathrm{IT},k}$ is a solution of the equation
\[
\frac{V(K)}{nh} - \frac{2k}{n^2 h}\sum_{i,j=1}^{n}\Lambda\!\left(\frac{X_i - X_j}{h}\right) = 0. \tag{2.2}
\]
In the paper [8], this nonlinear equation was solved by Steffensen's method. But the equation can be rewritten as
\[
\frac{2k}{n}\sum_{\substack{i,j=1\\ i\neq j}}^{n}\Lambda\!\left(\frac{X_i - X_j}{h}\right) - V(K) = 0. \tag{2.3}
\]
Since the first derivative of the function on the left-hand side of this equation is easy to compute by using convolutions, Newton's method can be used. For more details, see [9].
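For the Gaussian kernel the convolutions in $\Lambda$ are Gaussian densities with variances 2, 3 and 4, so Eq. (2.3) becomes a short root-finding problem. A minimal Matlab sketch with k = 2 follows; the Gaussian kernel is swapped in here only for its closed forms (the paper itself works with kernels from $S_{0k}$), and the normal-reference starting value is our convenience choice.

% Iterative bandwidth for density estimation: solve Eq. (2.3) with k = 2.
n   = numel(x);
D   = x(:) - x(:)';  D = D(~eye(n));             % off-diagonal differences
phi = @(z, v) exp(-z.^2/(2*v)) ./ sqrt(2*pi*v);  % N(0, v) density
Lam = @(z) phi(z, 4) - 2*phi(z, 3) + phi(z, 2);  % K*K*K*K - 2*K*K*K + K*K
VK  = 1 / (2*sqrt(pi));                          % V(K) for the Gaussian
h0  = 1.06 * std(x) * n^(-1/5);                  % normal-reference start
hIT = fzero(@(h) (4/n)*sum(Lam(D/h)) - VK, h0);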
3. Univariate Kernel Regression

Consider a standard regression model of the form
\[
Y_i = m(x_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{3.1}
\]
where m is an unknown regression function and $Y_1, \ldots, Y_n$ are observable data variables at the design points $x_1, \ldots, x_n$. The residuals $\varepsilon_1, \ldots, \varepsilon_n$ are independent identically distributed random variables with $\mathrm{E}(\varepsilon_i) = 0$ and $\mathrm{var}(\varepsilon_i) = \sigma^2 > 0$, $i = 1, \ldots, n$. The aim of kernel smoothing is to find a suitable approximation $\hat m$ of the unknown function m.

To avoid boundary effects, the estimate is obtained by applying the kernel to the extended series $\tilde Y_i$, $i = -n+1, -n+2, \ldots, 2n$, where $\tilde Y_{j\pm n} = Y_j$ for $j = 1, \ldots, n$; similarly, $x_i = i/n$ for $i = -n+1, -n+2, \ldots, 2n$. The assumption of this cyclic model leads to the kernel regression estimator
\[
\hat m(x_j, h) = \frac{1}{C}\sum_{i=-n+1}^{2n} K_h(x_j - x_i)\,\tilde Y_i, \qquad j = 1, \ldots, n, \tag{3.2}
\]
where $C = \sum_{i=-n+1}^{n-1} K_h(x_i)$. For more details about this estimator, see [9] and [10]. The quality of a kernel regression estimator can be described locally by the mean square error (MSE) or by the global criterion of the mean integrated square error (MISE). For the same reasons as in kernel density estimation, we employ the asymptotic mean integrated square error (AMISE), which can be written as a sum of the asymptotic integrated variance and the asymptotic integrated square bias,
\[
\mathrm{AMISE}\{\hat m(\cdot,h)\} = \underbrace{\frac{\sigma^2 V(K)}{nh}}_{\mathrm{AIV}} + \underbrace{\frac{\beta_k^2}{(k!)^2}\,h^{2k} A_k}_{\mathrm{AISB}}, \tag{3.3}
\]
where $A_k = \int\left(m^{(k)}(x)\right)^2 dx$. The optimal bandwidth considered here is $h_{\mathrm{opt},k}$, the minimizer of (3.3), i.e.,
\[
h_{\mathrm{opt},k} = \arg\min_{h \in H_n}\mathrm{AMISE}\{\hat m(\cdot,h)\},
\]
where $H_n = \left[a\,n^{-1/(2k+1)},\ b\,n^{-1/(2k+1)}\right]$ for some $0 < a < b < \infty$. The calculation gives
\[
h_{\mathrm{opt},k}^{2k+1} = \frac{\sigma^2 V(K)\,(k!)^2}{2kn\,\beta_k^2 A_k}. \tag{3.4}
\]
In nonparametric regression estimation, as in density estimation, a critical and inevitable step is the choice of the smoothing parameter (bandwidth) controlling the smoothness of the curve estimate; this parameter considerably affects the features of the estimated curve. One of the most widespread procedures for bandwidth selection is the cross-validation ("leave-one-out") method. It is based on the modified regression smoother (3.2) in which one observation, say the j-th, is left out:
\[
\hat m_{-j}(x_j, h) = \frac{1}{C_{-j}}\sum_{\substack{i=-n+1\\ i\neq j}}^{2n} K_h(x_j - x_i)\,\tilde Y_i, \qquad j = 1, \ldots, n,
\]
with $C_{-j}$ the correspondingly modified normalizing constant. Using these modified smoothers, the error function to be minimized takes the form
\[
\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^{n}\left\{\hat m_{-i}(x_i, h) - Y_i\right\}^2. \tag{3.5}
\]
The function CV(h) is commonly called the "cross-validation" function. Let $\hat h_{\mathrm{CV}}$ stand for its minimizer, i.e., $\hat h_{\mathrm{CV}} = \arg\min_{h\in H_n}\mathrm{CV}(h)$. The literature on this criterion is quite extensive, e.g., [3, 4, 6, 13].

4. Iterative Method for Kernel Regression

The proposed method is based on a relation similar to that in kernel density estimation. It is easy to show that the following equation holds:
\[
\mathrm{AIV}\{\hat m(\cdot, h_{\mathrm{opt},k})\} = 2k\,\mathrm{ASB}\{\hat m(\cdot, h_{\mathrm{opt},k})\}, \tag{4.1}
\]
where
\[
\mathrm{AIV}\{\hat m(\cdot,h)\} = \frac{\sigma^2 V(K)}{nh} \quad\text{and}\quad \mathrm{ASB}\{\hat m(\cdot,h)\} = \frac{1}{n}\sum_{i=1}^{n}\left\{\mathrm{E}\,\hat m(x_i, h) - m(x_i)\right\}^2.
\]
For estimating AIV and ASB in (4.1), we use
\[
\widehat{\mathrm{AIV}}\{\hat m(\cdot,h)\} = \frac{\hat\sigma^2 V(K)}{nh}, \quad\text{with}\quad \hat\sigma^2 = \frac{1}{2(n-1)}\sum_{i=1}^{n-1}\left(Y_{i+1} - Y_i\right)^2,
\]
and
\[
\widehat{\mathrm{ASB}}\{\hat m(\cdot,h)\} = \frac{1}{n}\sum_{j=1}^{n}\left(\frac{1}{C}\sum_{i=-n+1}^{2n} K_h(x_j - x_i)\left[\frac{1}{C}\sum_{l=-n+1}^{2n} K_h(x_i - x_l)\,\tilde Y_l\right] - \frac{1}{C}\sum_{i=-n+1}^{2n} K_h(x_j - x_i)\,\tilde Y_i\right)^2,
\]
i.e., the smoother is applied once more to the fitted values and compared with the original fit, in analogy with the convolution structure of $\Lambda$ in the density case. To find the bandwidth estimate $\hat h_{\mathrm{IT},k}$, we solve the equation
\[
h = \frac{\hat\sigma^2\,V(K)}{2kn\,\widehat{\mathrm{ASB}}\{\hat m(\cdot,h)\}}. \tag{4.2}
\]
We use Steffensen's iterative method with the starting approximation $h_0 = k/n$. This approach leads to an iterative, quadratically convergent process.
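A minimal Matlab sketch of the whole procedure of Section 4 for k = 2 and the Epanechnikov kernel. Plain fixed-point iteration is used in place of Steffensen's acceleration, and $\widehat{\mathrm{ASB}}$ is computed by smoothing the fitted values once more; both simplifications are ours, and convergence of the plain iteration is not guaranteed in general.

% Iterative bandwidth for cyclic kernel regression, Eq. (4.2), k = 2.
n  = numel(y);
xs = (1:n)'/n;  xe = (-n+1:2*n)/n;            % design points and extension
ye = [y(:); y(:); y(:)];                      % cyclically extended data
K  = @(u) 0.75*max(1 - u.^2, 0);              % Epanechnikov kernel
VK = 3/5;                                     % V(K) = int K^2
s2 = sum(diff(y(:)).^2) / (2*(n - 1));        % difference-based sigma^2
smooth = @(h, v) (K((xs - xe)/h) * v) ./ sum(K((xs - xe)/h), 2);
h = 2/n;                                      % starting value h0 = k/n
for it = 1:25
    m1  = smooth(h, ye);                      % fit at the design points
    m2  = smooth(h, [m1; m1; m1]);            % smooth the fit once more
    ASB = mean((m2 - m1).^2);                 % estimate of ASB(h)
    h   = s2 * VK / (4*n*ASB);                % right-hand side of (4.2)
end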
5. Simulation Study

We carry out two simulation studies to compare the performance of the bandwidth estimates. The comparison is done in the following way. The observations $Y_i$, $i = 1, \ldots, n = 100$, are obtained by adding independent Gaussian random variables with mean zero and variance $\sigma^2$ to a known regression function. Both regression functions used in our simulations are illustrated in Figure 1. They were not chosen randomly: the first is suitable for the extension to the cyclic model, while the second does not satisfy the assumptions of the cyclic model.

Figure 1. Regression functions.

One hundred series are generated. For each data set, we estimate the optimal bandwidth by both methods, i.e., for each method we obtain 100 estimates. Since we know the optimal bandwidth, we compare it with the mean of the estimates and look at their standard deviation, which describes the variability of each method. The Epanechnikov kernel $K(x) = \frac{3}{4}(1 - x^2)\,I_{[-1,1]}(x)$ is used in all cases.

5.1. Simulation 1. In this case, we use the regression function
\[
m(x) = \cos(20x) + 2\sin\!\left(4\left(x - \tfrac{6}{5}\right)\right) + 5,
\]
with $\sigma^2 = 0.3$. Table 1 summarizes the sample means and sample standard deviations of the bandwidth estimates; $\mathrm{E}(\hat h)$ is the average of all 100 values and $\mathrm{std}(\hat h)$ their standard deviation. Figure 2 illustrates the histogram of the results of all 100 experiments.

Table 1. Means and standard deviations ($h_{\mathrm{opt},2} = 0.0560$)

Method   E(h)     std(h)
CV       0.0550   0.0120
IT       0.0556   0.0048

Figure 2. Distribution of $\hat h$ for both methods.

As we see, the standard deviation of the results obtained by the proposed method is smaller than that of the cross-validation method, and the mean of these results is also a little closer to the theoretical optimal bandwidth. The reason is that the regression function is smooth and satisfies the conditions for the extension to the cyclic design, so the proposed method works very well in this case.

5.2. Simulation 2. In the second example, we use a regression function that does not satisfy the assumptions for the extension to the cyclic model (the second function shown in Figure 1), with $\sigma^2 = 0.05$. Table 2 summarizes the sample means and sample standard deviations of the bandwidth estimates, and Figure 3 illustrates the histogram of the results of all 100 experiments.

Table 2. Means and standard deviations ($h_{\mathrm{opt},2} = 0.0707$)

Method   E(h)     std(h)
CV       0.1466   0.0443
IT       0.0592   0.0112

Figure 3. Distribution of $\hat h$ for both methods.

It is evident that better results are obtained by the proposed method; it is successful despite the fact that the regression function does not meet the assumptions for the extension to the cyclic model. The cross-validation method often results in unsuitable bandwidths, and the variance of this criterion is also significant.

Acknowledgement

This research was supported by Masaryk University under the project MUNI/A/1001/2009.

References

[1] R. Cao, A. Cuevas and W. González Manteiga, A comparative study of several smoothing methods in density estimation, Computational Statistics and Data Analysis 17(2) (1994), 153-176.
[2] P. Chaudhuri and J. S. Marron, SiZer for exploration of structures in curves, Journal of the American Statistical Association 94(447) (1999), 807-823.
[3] P. Craven and G. Wahba, Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation, Numerische Mathematik 31(4) (1979), 377-403.
[4] B. Droge, Some Comments on Cross-Validation, Technical Report 1994-7, Humboldt-Universität zu Berlin, 1996.
[5] J. Fan and I. Gijbels, Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation, Journal of the Royal Statistical Society, Series B 57(2) (1995), 371-394.
[6] W. Härdle, Applied Nonparametric Regression, 1st Edition, Cambridge University Press, Cambridge, 1990.
[7] W. Härdle, M. Müller, S. Sperlich and A. Werwatz, Nonparametric and Semiparametric Models, 1st Edition, Springer, Heidelberg, 2004.
[8] I. Horová and J. Zelinka, Contribution to the bandwidth choice for kernel density estimates, Computational Statistics 22(1) (2007), 31-47.
[9] I. Horová, J. Koláček and J. Zelinka, Kernel Smoothing in MATLAB, World Scientific, Singapore, 2012.
[10] J. Koláček, Plug-in method for nonparametric regression, Computational Statistics 23(1) (2008), 63-78.
[11] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, Wiley, New York, 1992.
[12] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986.
[13] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, Series B 36(2) (1974), 111-147.
[14] M. P. Wand and M. C. Jones, Kernel Smoothing, Chapman and Hall, London, 1995.

Journal of Applied Probability and Statistics, Vol. 6, No. 1&2, pp. 73-85. © ISOSS Publications 2012

A GENERALIZED REFLECTION METHOD FOR KERNEL DISTRIBUTION AND HAZARD FUNCTIONS ESTIMATION

Jan Koláček, Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic. Email: kolacek@math.muni.cz
Rohana J. Karunamuni, Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada. Email: R.J.Karunamuni@ualberta.ca

Summary

In this paper we focus on kernel estimates of cumulative distribution and hazard functions (rates) when the observed random variables are nonnegative. It is well known that kernel distribution estimators are not consistent when estimating a distribution function near the point x = 0. This fact is clearly visible in many applications, for example in kernel ROC curve estimation [10]. In order to avoid this problem, we propose a bias-reducing technique that is a kind of generalized reflection method, based on the ideas of [8] and [19] developed for boundary correction in kernel density estimation. The proposed estimators are compared with the traditional kernel estimator and with the estimator based on the "classical" reflection method using simulation studies.

Keywords and phrases: kernel estimation, reflection, distribution function, hazard function.
AMS Classification: 30C40, 62G30.

1 Introduction

The most commonly used nonparametric estimate of a cumulative distribution function F is the empirical distribution function $F_n(x) = n^{-1}\sum_{i=1}^{n} I[X_i \le x]$, with $X_1, \ldots, X_n$ being the observations. But $F_n$ is a step function even when F is continuous. Another type of nonparametric estimator for F is derived from kernel smoothing methods. Kernel smoothing is widely used because it is easy to apply and produces estimators with good small-sample and asymptotic properties. Kernel smoothing has received a lot of attention in density estimation; good references in this area are [3], [16] and [17]. However, results on kernel distribution function estimation are relatively few. Theoretical properties of the kernel distribution function estimator have been investigated by [12], [14] and [1].
Although there is a vast literature on boundary correction in the density estimation context, the boundary effects problem in the distribution function context has been studied much less, and the same can be said about estimation of hazard functions (rates). In this paper, we develop a new kernel-type estimator of the cumulative distribution and hazard rates that removes boundary effects near the endpoints of the support. Our estimator is based on a new boundary-corrected kernel estimator of the distribution function and on the ideas of [6], [7], [8] and [19] developed for boundary correction in kernel density estimation. The basic construction of the proposed estimator is a kind of generalized reflection method involving reflecting a transformation of the observed data; in fact, the proposed method generates a class of boundary-corrected estimators. We derive expressions for the bias and variance of the proposed estimators. Furthermore, the proposed estimators are compared with the traditional estimator and with the estimator based on the "classical" reflection method using simulation studies. We observe that the proposed estimators successfully remove boundary effects and perform considerably better than the other two.

Kernel smoothing in distribution function estimation and boundary effects are discussed in the next section. The proposed estimator of distribution functions is given in Section 3. Section 4 discusses estimation of hazard functions (rates). Simulation results are given in Section 5, and our results are applied to real data in Section 6. Finally, some concluding remarks are given in Section 7.

2 Kernel distribution estimator and boundary effects

Let f denote a continuous density function with support [0, a], $0 < a \le \infty$, and consider nonparametric estimation of the cumulative distribution function F of f based on a random sample $X_1, \ldots, X_n$ from f. Suppose that $F^{(j)}$, the j-th derivative of F, exists and is continuous on [0, a], j = 0, 1, 2, with $F^{(0)} = F$ and $F^{(1)} = f$. Then the traditional kernel estimator of F is given by
\[
\hat F_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n} W\!\left(\frac{x - X_i}{h}\right), \qquad W(x) = \int_{-1}^{x} K(t)\,dt, \tag{2.1}
\]
where K is a unimodal symmetric density function with support [-1, 1] and h is the bandwidth ($h \to 0$ as $n \to \infty$). Set $\beta_2 = \int_{-1}^{1} t^2 K(t)\,dt$. The basic properties of $\hat F_{h,K}(x)$ at interior points are well known (e.g., [11]); under some smoothness assumptions, for $h \le x \le a - h$,
\[
\mathrm{E}\hat F_{h,K}(x) - F(x) = \frac{1}{2}\beta_2\, f^{(1)}(x)\,h^2 + o(h^2),
\]
\[
n\,\mathrm{Var}\,\hat F_{h,K}(x) = F(x)(1 - F(x)) + hf(x)\int_{-1}^{1} W(t)(W(t) - 1)\,dt + o(h).
\]
The performance of $\hat F_{h,K}(x)$ at boundary points, i.e., for $x \in [0, h) \cup (a - h, a]$, differs from that at interior points due to the so-called "boundary effects" that occur in nonparametric curve estimation problems: the bias of $\hat F_{h,K}(x)$ is of order O(h) instead of O(h²) at boundary points, while the variance is of the same order. This can be clearly seen by examining the behavior of $\hat F_{h,K}$ inside the left boundary region [0, h]. Let x be a point in the left boundary, i.e., $x \in [0, h]$, and write x = ch, $0 \le c \le 1$. It can be shown that the bias and variance of $\hat F_{h,K}(x)$ at x = ch are of the form
\[
\mathrm{E}\hat F_{h,K}(x) - F(x) = hf(0)\int_{-1}^{-c}W(t)\,dt + h^2 f^{(1)}(0)\left\{\frac{c^2}{2} + c\int_{-1}^{-c}W(t)\,dt - \int_{-1}^{c}tW(t)\,dt\right\} + o(h^2), \tag{2.2}
\]
\[
n\,\mathrm{Var}\,\hat F_{h,K}(x) = F(x)(1 - F(x)) + hf(0)\left\{\int_{-1}^{c}W^2(t)\,dt - c\right\} + o(h). \tag{2.3}
\]
From expression (2.2) it is now clear that the bias of $\hat F_{h,K}(x)$ is of order O(h) instead of O(h²). To remove this boundary effect in kernel distribution estimation, we investigate a new class of estimators in the next section.
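For the Epanechnikov kernel, the integrated kernel W has the closed form $W(x) = \frac{1}{2} + \frac{3}{4}x - \frac{1}{4}x^3$ on [-1, 1], so the traditional estimator (2.1) and its boundary behavior are easy to reproduce. A minimal Matlab sketch (the names are ours):

% Traditional kernel CDF estimator (2.1), Epanechnikov kernel.
W    = @(u) (u > 1) + (abs(u) <= 1).*(0.5 + 0.75*u - 0.25*u.^3);
Fhat = @(x, X, h) mean(W((x - X(:)) / h), 1);

% Example near the boundary: for X ~ Exp(1) the estimate overestimates
% F(0) = 0, since E Fhat(0) = h*f(0)*int_{-1}^{0} W(t) dt + O(h^2) > 0.
X = -log(rand(500, 1));               % Exp(1) sample
Fhat(0, X, 0.5)                       % noticeably above 0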
3 The proposed estimator

In this section we propose a class of estimators of the distribution function F of the form
\[
\tilde F_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n}\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(\frac{-x - g_2(X_i)}{h}\right)\right], \tag{3.1}
\]
where h is the bandwidth, W is the cumulative kernel defined in (2.1), and $g_1$ and $g_2$ are two transformations that need to be determined. We assume that the $g_i$, i = 1, 2, are nonnegative, continuous and monotonically increasing functions defined on $[0, \infty)$. Further assume that $g_i^{-1}$ exists, $g_i(0) = 0$, $g_i^{(1)}(0) = 1$, and that $g_i^{(2)}$ exists and is continuous on $[0, \infty)$, where $g_i^{(j)}$ denotes the j-th derivative of $g_i$, with $g_i^{(0)} = g_i$ and $g_i^{-1}$ denoting the inverse function of $g_i$. We will choose $g_1$ and $g_2$ so that $\tilde F_{h,K}(x) \ge 0$ everywhere. Note that the i-th term of the sum in (3.1) can be expressed as
\[
W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(\frac{-x - g_2(X_i)}{h}\right) = \int_{\frac{-x + g_1(X_i)}{h}}^{\frac{x + g_2(X_i)}{h}} K(t)\,dt.
\]
The preceding integral is non-negative provided the inequality $\frac{-x + g_1(X_i)}{h} \le \frac{x + g_2(X_i)}{h}$ holds. Since $x \ge 0$, this inequality is satisfied if $g_1(X_i) \le g_2(X_i)$ for $i = 1, \ldots, n$; thus we assume that $g_1$ and $g_2$ are chosen such that $g_1(x) \le g_2(x)$ for $x \in [0, \infty)$. Now we can obtain the bias and variance of (3.1) at x = ch, $0 \le c \le 1$, as
\[
\mathrm{E}\tilde F_{h,K}(x) - F(x) = h^2\Big\{f^{(1)}(0)\Big[\frac{c^2}{2} + 2c\int_{-1}^{-c}W(t)\,dt - \int_{-c}^{c}tW(t)\,dt\Big] - f(0)\,g_1^{(2)}(0)\int_{-1}^{c}(c - t)W(t)\,dt - f(0)\,g_2^{(2)}(0)\int_{-1}^{-c}(c + t)W(t)\,dt\Big\} + o(h^2), \tag{3.2}
\]
\[
n\,\mathrm{Var}\,\tilde F_{h,K}(x) = F(x)(1 - F(x)) + hf(0)\Big\{\int_{-1}^{c}W^2(t)\,dt - 2\int_{-1}^{c}W(t)W(t - 2c)\,dt + \int_{-1}^{-c}W^2(t)\,dt\Big\} + o(h). \tag{3.3}
\]
The proofs of (3.2) and (3.3) are given in [10]. Similarly, we can express the bias and variance of (3.1) at "interior" points x = ch, c > 1. Note that the contribution of $g_2$ to the bias vanishes as $c \to 1$. Comparing expressions (2.2), (3.2), (2.3) and (3.3) at boundary points, we see that the variances are of the same order, whereas the bias of $\hat F_{h,K}(x)$ is of order O(h) while the bias of $\tilde F_{h,K}(x)$ is of order O(h²). So our estimator removes boundary effects in kernel distribution estimation, since the bias at boundary points is of the same order as the bias at interior points.

It is clear that various choices are available for the pair $(g_1, g_2)$. However, we will choose $g_1$ and $g_2$ so that the condition $\tilde F_{h,K}(0) = 0$ is satisfied, because F(0) = 0. A sufficient (but not necessary) condition for this is that $g_1$ and $g_2$ be equal; thus we need to construct a single transformation function g such that $g = g_1 = g_2$. Another important property desirable in the estimator $\tilde F_{h,K}$ is local adaptivity, that is, the transformation function g depends on c; we write $g_c$. Some discussion of the choice of $g_c$ and of other improvements that can be made is appropriate here. The trivial choice is $g_c(y) = y$, which represents the "classical" reflection method estimator. However, it is possible to construct functions $g_c$ that improve the bias further under some additional conditions. For instance, if one examines the right-hand side of the bias expansion (3.2), it is not difficult to see that the terms inside the braces (i.e., the coefficient of h²) can be made equal to zero if $g_c$ is appropriately chosen. Set
\[
A_c = \begin{cases} d_1\,\dfrac{\frac{c^2}{2} + 2cI_1 - I_2}{c^2 + 2cI_1 - I_2}, & 0 \le c < 1,\\[3mm] d_1\,\dfrac{\beta_2}{c^2 + \beta_2}, & c > 1,\end{cases}
\]
where $d_1 = \dfrac{f^{(1)}(0)}{f(0)}$, $I_1 = \displaystyle\int_{-1}^{-c}W(t)\,dt$, $I_2 = \displaystyle\int_{-c}^{c}tW(t)\,dt$.
If $g_c$ is chosen such that $g_c^{(2)}(0) = A_c$, then the bias of $\tilde F_{h,K}(x)$ is theoretically of order O(h³). For such a function $g_c$, the second derivative at zero depends on the ratio $d_1 = f^{(1)}(0)/f(0)$, so the problem of estimating $d_1$ naturally arises, as in the papers [6], [7], [8] and [9]. For example, in [9] the ratio $d_1$ is estimated as the first derivative of the natural logarithm of f at zero; for the exact formula for $\hat d_1$ and for its statistical properties, in particular the asymptotic convergence rate, see that paper. Summarizing all the assumptions, it is clear now that $g_c$ should satisfy the following conditions:
(i) $g_c: [0, \infty) \to [0, \infty)$, $g_c$ is continuous, monotonically increasing, and $g_c^{(i)}$ exists, i = 1, 2;
(ii) $g_c^{-1}(0) = 0$, $g_c^{(1)}(0) = 1$;
(iii) $g_c^{(2)}(0) = A_c$.
Functions satisfying conditions (i)-(iii) are easy to construct. We will consider the following transformation: for $y \ge 0$, let
\[
g_c(y) = y + \frac{1}{2}\hat A_c\, y^2 + \lambda\,\hat A_c^2\, y^3, \tag{3.4}
\]
where $\hat A_c$ is an estimator of $A_c$ based on an estimator $\hat d_1$ of $d_1$, and $\lambda$ is a positive constant such that $\lambda > \frac{1}{12}$; this condition on $\lambda$ is necessary for $g_c(y)$ to be an increasing function of y. Based on extensive simulations, we find that this transformation adapts well to various shapes of distribution functions with the setting $\lambda = 0.1$.
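A minimal Matlab sketch of the transformation (3.4) and of the resulting estimator (3.1) with $g_1 = g_2 = g_c$, evaluated at a boundary point x (so that a single value of $\hat A_c$ applies); the construction of $\hat A_c$ via $\hat d_1$ is assumed available and is not shown, and W is the integrated Epanechnikov kernel from the earlier sketch.

% Boundary-corrected CDF estimate with transformation (3.4).
lambda = 0.1;
g      = @(y, A) y + 0.5*A*y.^2 + lambda*A^2*y.^3;       % Eq. (3.4)
Ftilde = @(x, X, h, A) mean( W(( x - g(X(:), A)) / h) ...
                           - W((-x - g(X(:), A)) / h), 1);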
4 Estimation of hazard rates

Given a distribution F with probability density function f, the hazard rate is defined by
\[
z(t) = \frac{f(t)}{1 - F(t)}. \tag{4.1}
\]
The hazard rate is also called the age-specific or conditional failure rate. It is useful particularly in the context of reliability theory and survival analysis, and hence in fields as diverse as engineering and medical statistics; see [2] for a discussion of the role of the hazard rate in understanding and modeling survival data, and [16] for a survey of methods for nonparametric hazard estimation. Given a sample $X_1, \ldots, X_n$ from the density f, a natural nonparametric estimator of the hazard rate is $\hat z(t) = \hat f(t)/(1 - \hat F(t))$, where $\hat f$ is a suitable density estimator based on $X_1, \ldots, X_n$ and $\hat F(t) = \int_{-\infty}^{t}\hat f(x)\,dx$ estimates F(t). If $\hat f$ is the traditional kernel estimator with kernel K and bandwidth h, then $\hat F(t)$ can be obtained as $\hat F(t) = n^{-1}\sum_{i=1}^{n} K_1((t - X_i)/h)$, where $K_1(u) = \int_{-\infty}^{u} K(t)\,dt$. [18] introduced and discussed $\hat z(t)$ and various alternative nonparametric estimators of z(t); for further properties of $\hat z(t)$, kernel and other related estimators, see, e.g., [15], [13] (Section 4.3) and [16] (Section 6.5).

It has been observed that, to a first approximation, the main contribution to the error of $\hat z$ is due to its numerator, i.e., to the estimator $\hat f$; see, e.g., [16] (Section 6.5). Thus, to obtain the best possible estimate of the hazard rate, one should aim to minimize the error in the estimation of the density f. If the support of f is the interval [0, a], $0 < a \le \infty$, which is usually the case for survival and reliability data, then the traditional kernel estimators of f suffer from boundary effects. It is therefore advisable to use boundary-adjusted estimators of the density f and the distribution F in this context. For this purpose we implement a boundary-adjusted kernel density estimator similar to the one proposed in [6], together with the boundary-adjusted distribution function estimator $\tilde F_{h,K}$ given above. Thus, the proposed estimator of the hazard rate z(t) is given by, for t = ch, $c \ge 0$,
\[
\tilde z(t) = \frac{\tilde f(t)}{1 - \tilde F_{h,K}(t)}, \tag{4.2}
\]
where $\tilde F_{h,K}$ is defined by (3.1) and $\tilde f$ is defined by
\[
\tilde f(t) = \frac{1}{nh}\sum_{i=1}^{n}\left[K\!\left(\frac{t - g_{1,c}(X_i)}{h}\right) + K\!\left(\frac{t + g_{1,c}(X_i)}{h}\right)\right], \tag{4.3}
\]
where
\[
g_{1,c}(x) = x + \frac{1}{2}\hat d_1 k_c\, x^2 + \lambda_0\left(\hat d_1 k_c\right)^2 x^3, \tag{4.4}
\]
with $\hat d_1$ as defined in [9], $\lambda_0$ a positive constant such that $12\lambda_0 > 1$, and $k_c$ given by, for $c \ge 0$,
\[
k_c = \frac{2\int_c^1 (u - c)K(u)\,du}{c + 2\int_c^1 (u - c)K(u)\,du}. \tag{4.5}
\]

Theorem 1. The mean squared error (MSE) of $\tilde z(t)$ is given by, for t = ch, $c \ge 0$,
\[
\mathrm{E}\left(\tilde z(t) - z(t)\right)^2 = \left(\frac{1 - F(t)}{w_1 w_2}\right)^2 \frac{f(0)}{nh}\left[2\int_c^1 K(u)K(2c - u)\,du + V(K)\right] + o\!\left(\frac{1}{nh}\right), \tag{4.6}
\]
where $w_1$, $w_2$ are finite constants satisfying $1 - \tilde F_{h,K}(t) \ge w_1 > 0$ and $1 - F(t) \ge w_2 > 0$, and $V(K) = \int_{-1}^{1} K^2(x)\,dx$.
Proof. For a detailed proof, see the Appendix.

5 A simulation study

To test the effectiveness of our estimator, we simulated its performance against the classical reflection method. The simulation is based on 1000 replications. In each replication, random variables X ~ Exp(1) were generated and the estimate of the hazard function was computed; note that the true hazard function in this case is constant, equal to one. In all replications the sample size n = 100 was used. In this case, the actual global optimal bandwidth (see [1]) for F is $h_F = 0.8479$, and for f it is $h_f = 0.7860$ (see [16]). For the kernel estimation of both functions (distribution and density), we used the Epanechnikov kernel $K(x) = \frac{3}{4}(1 - x^2)\,I_{[-1,1]}(x)$, where $I_A$ is the indicator function of the set A. For each estimated hazard function we calculated the mean integrated squared error (MISE) on the interval $[0, h_F]$ over all 1000 replications and displayed the results as a boxplot in Figure 1; the variance of each estimator can be gauged by the whiskers of the plot. The means and standard deviations of MISE for each method are given in Table 1. The reflection method gives smaller values of MISE than the classical estimator, but its variance is not as small; from this point of view the proposed estimator seems better.

Figure 1. MISE for the estimates of z(t): the classical estimator with boundary effects (1), the reflection method (2), and the proposed method (3).

Table 1. Means and STDs for MISE

Method      Mean     STD
Classical   0.1265   0.0376
Reflection  0.0273   0.0209
Proposed    0.0142   0.0185

To get more detailed information about the estimators, we calculated the mean squared error (MSE) at four points in the boundary region, $x = ch_F$, c = 0, 0.25, 0.5, 0.75. The boxplots of MSE for each estimator over all 1000 replications are illustrated in Figure 2, and the means and standard deviations at each point are given in Table 2. These values describe the performance of the proposed method with respect to MSE when compared with the classical and reflection-method estimators. Both the mean and the variance were smallest for the proposed estimator, which is explained by its local adaptivity; the classical and reflection-method estimators, on the other hand, are not locally adaptive.

Figure 2. MSE at the points $x = ch_F$, c = 0, 0.25, 0.5, 0.75, for the classical estimator with boundary effects, the reflection method, and the proposed method.

Table 2. Means and STDs for MSE at $x = ch_F$

        Classical          Reflection         Proposed
c       Mean     STD       Mean     STD       Mean     STD
0.00    0.3103   0.0591    0.0582   0.0369    0.0149   0.0195
0.25    0.1398   0.0528    0.0229   0.0261    0.0144   0.0194
0.50    0.0421   0.0346    0.0140   0.0198    0.0137   0.0198
0.75    0.0140   0.0183    0.0139   0.0210    0.0139   0.0210

From the figures and tables it is clear that the proposed estimator performed best among the three compared. It captures the features of the distribution and hazard functions correctly, with minimal bias, while holding on to a low variance.
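A minimal Matlab sketch assembling the hazard estimator (4.2) used in this study from the boundary-corrected density (4.3) and the corrected distribution estimate: ftilde implements (4.3) for a given transformation g1c, and Ftilde is the sketch given after Section 3 (the construction of g1c from $\hat d_1$ and $k_c$ is assumed; for simplicity, a single value of A is used for all evaluation points t).

% Hazard rate estimate (4.2) at points t (row vector).
K      = @(u) 0.75*max(1 - u.^2, 0);                 % Epanechnikov kernel
ftilde = @(t, X, h, g1c) mean( K((t - g1c(X(:))) / h) ...
                             + K((t + g1c(X(:))) / h), 1) / h;
zhat   = @(t, X, hf, hF, g1c, A) ftilde(t, X, hf, g1c) ...
                               ./ (1 - Ftilde(t, X, hF, A));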
6 Real data

In this section we apply our results to a real data set, the suicide data from [16]. The proposed hazard rate estimate is given in Figure 3: the solid line represents our proposed estimator (4.2), and the dashed line is the traditional kernel estimator (with boundary effects). For choosing the optimal bandwidths for the density and distribution function estimation, we used the iterative methods described in [5] and [4]; the optimal bandwidths for the density and the distribution function were estimated as $\hat h_f = 132.01$ and $\hat h_F = 144.83$, respectively. The proposed estimator again captures the proper features of the actual hazard rate, while the traditional estimator dips near the left endpoint due to boundary effects.

Figure 3. Hazard rate estimates constructed from the suicide data.

7 Conclusion

In this paper we proposed new kernel-type estimators of the distribution and hazard functions that avoid boundary effects near the endpoints of the support. The technique implemented is a kind of generalized reflection method involving reflecting a transformation of the data; the method generates a class of boundary-corrected estimators and is based on the ideas of boundary correction for kernel density estimators presented in [6], [7] and [8]. We showed some good properties of the proposed method (e.g., local adaptivity), and it was shown that the bias of the proposed estimator is better than that of the "classical" one. The proposed estimators performed well in the numerical studies compared to the classical and reflection-method estimators.

8 Acknowledgements

The research was supported by The Jaroslav Hájek Center for Theoretical and Applied Statistics (MŠMT LC 06024).

Appendix: Proof of Theorem 1

Theorem 1. The mean squared error (MSE) of $\tilde z(t)$ is given by, for t = ch, $c \ge 0$,
\[
\mathrm{E}\left(\tilde z(t) - z(t)\right)^2 = \left(\frac{1 - F(t)}{w_1 w_2}\right)^2 \frac{f(0)}{nh}\left[2\int_c^1 K(u)K(2c - u)\,du + V(K)\right] + o\!\left(\frac{1}{nh}\right),
\]
where $w_1$, $w_2$ are finite constants satisfying $1 - \tilde F_{h,K}(t) \ge w_1 > 0$, $1 - F(t) \ge w_2 > 0$, and $V(K) = \int_{-1}^{1} K^2(x)\,dx$.

Proof. The difference $\tilde z(t) - z(t)$ equals, for t = ch, $c \ge 0$,
\[
\tilde z(t) - z(t) = \frac{\tilde f(t)}{1 - \tilde F_{h,K}(t)} - \frac{f(t)}{1 - F(t)} = \frac{\tilde f(t)\,(1 - F(t)) - f(t)\,(1 - \tilde F_{h,K}(t))}{(1 - \tilde F_{h,K}(t))(1 - F(t))}.
\]
Since we are only concerned with the behavior of $\tilde z(t)$ near the left boundary, i.e., t = ch, $c \ge 0$, we only need to study this difference near the left endpoint 0. For t = ch, $c \ge 0$, we can assume that $1 - \tilde F_{h,K}(t) \ge w_1 > 0$ and $1 - F(t) \ge w_2 > 0$, where $w_1$, $w_2$ are finite constants. These conditions are reasonable, since $\tilde F_{h,K}(0) = 0$, F(0) = 0, and $\tilde F_{h,K}$ and F are continuous functions. Therefore, we obtain
\[
\left(\tilde z(t) - z(t)\right)^2 \le (w_1 w_2)^{-2}\left(\tilde f(t)\,(1 - F(t)) - f(t)\,(1 - \tilde F_{h,K}(t))\right)^2.
\]
To get the formula for the MSE of $\tilde z(t)$, we need to express $\mathrm{E}\left(\tilde f(t)\,(1 - F(t)) - f(t)\,(1 - \tilde F_{h,K}(t))\right)^2$.
\[
\mathrm{E}\left(\tilde f(t)(1 - F(t)) - f(t)(1 - \tilde F_{h,K}(t))\right)^2 = (1 - F(t))^2\,\mathrm{E}\tilde f^2(t) + f^2(t)\,\mathrm{E}(1 - \tilde F_{h,K}(t))^2 - 2f(t)(1 - F(t))\,\mathrm{E}\,\tilde f(t)(1 - \tilde F_{h,K}(t))
\]
\[
= (1 - F(t))^2\left[\mathrm{var}\,\tilde f(t) + (\mathrm{E}\tilde f(t))^2\right] + f^2(t)\left[\mathrm{var}\,\tilde F_{h,K}(t) + (1 - \mathrm{E}\tilde F_{h,K}(t))^2\right] - 2f(t)(1 - F(t))\left[\mathrm{E}\tilde f(t)\,(1 - \mathrm{E}\tilde F_{h,K}(t))\right] + o\!\left(\frac{1}{nh}\right)
\]
\[
= (1 - F(t))^2\left\{\frac{f(0)}{nh}\left[2\int_c^1 K(u)K(2c - u)\,du + V(K)\right] + o\!\left(\frac{1}{nh}\right) + f^2(t) + o(h)\right\}
\]
\[
\quad + f^2(t)\left\{\frac{1}{n}F(t)(1 - F(t)) + \frac{hf(0)}{n}\left[\int_{-1}^{c}W^2(u)\,du - 2\int_{-1}^{c}W(u)W(u - 2c)\,du + \int_{-1}^{-c}W^2(u)\,du\right] + o(h) + (1 - F(t))^2 + o(h^2)\right\}
\]
\[
\quad - 2f(t)(1 - F(t))\left[f(t) + o(h)\right]\left[1 - F(t) + o(h^2)\right] + o\!\left(\frac{1}{nh}\right)
\]
\[
= (1 - F(t))^2\,\frac{f(0)}{nh}\left[2\int_c^1 K(u)K(2c - u)\,du + V(K)\right] + o\!\left(\frac{1}{nh}\right).
\]

References

[1] A. Azzalini (1981). A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika 68, 326-328.
[2] D. Cox and D. Oakes (1984). Analysis of Survival Data. London: Chapman and Hall.
[3] T. Gasser, H. Müller and V. Mammitzsch (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, Series B 47, 238-252.
[4] I. Horová, J. Koláček, J. Zelinka and A. H. El-Shaarawi (2008). Smooth estimates of distribution functions with application in environmental studies. Advanced Topics on Mathematical Biology and Ecology, pp. 122-127.
[5] I. Horová and J. Zelinka (2007). Contribution to the bandwidth choice for kernel density estimates. Computational Statistics 22, 31-47.
[6] R. Karunamuni and T. Alberts (2005a). A generalized reflection method of boundary correction in kernel density estimation. Canad. J. Statist. 33, 497-509.
[7] R. Karunamuni and T. Alberts (2005b). On boundary correction in kernel density estimation. Statistical Methodology 2, 191-212.
[8] R. Karunamuni and T. Alberts (2006). A locally adaptive transformation method of boundary correction in kernel density estimation. J. Statist. Plann. Inference 136, 2936-2960.
[9] R. Karunamuni and S. Zhang (2007). Some improvements on a boundary corrected kernel density estimator. Statistics & Probability Letters 78, 497-507.
[10] J. Koláček and R. Karunamuni (2009). On boundary correction in kernel estimation of ROC curves. Austrian Journal of Statistics 38, 17-32.
[11] M. Lejeune and P. Sarda (1992). Smooth estimators of distribution and density functions. Computational Statistics & Data Analysis 14, 457-471.
[12] E. Nadaraya (1964). Some new estimates for distribution functions. Theory Probab. Appl. 15, 497-500.
[13] B. Prakasa Rao (1983). Nonparametric Functional Estimation. Academic Press.
[14] R. Reiss (1981). Nonparametric estimation of smooth distribution functions. Scandinavian Journal of Statistics 8, 116-119.
[15] J. Rice and M. Rosenblatt (1976). Estimation of the log survivor function and hazard function. Sankhya 38, 60-78.
[16] B. W. Silverman (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
[17] M. Wand and M. Jones (1995). Kernel Smoothing. London: Chapman and Hall.
[18] G. Watson and M. Leadbetter (1964). Hazard analysis I. Biometrika 51, 175-184.
[19] S. Zhang, R. Karunamuni and M. Jones (1999). An improved estimator of the density function at the boundary. J. Amer. Statist. Assoc. 94, 1231-1241.

AUSTRIAN JOURNAL OF STATISTICS, Volume 38 (2009), Number 1, 17-32

On Boundary Correction in Kernel Estimation of ROC Curves

Jan Koláček¹ and Rohana J. Karunamuni²
¹ Dept. of Mathematics and Statistics, Brno
² Dept. of Mathematical and Statistical Sciences, University of Alberta

Abstract: The Receiver Operating Characteristic (ROC) curve is a statistical tool for evaluating the accuracy of diagnostic tests. The empirical ROC curve (a step function) is the most commonly used nonparametric estimator of the ROC curve; on the other hand, kernel smoothing methods have been used to obtain smooth ROC curves. This process is based on kernel estimates of the distribution functions. It has been observed that kernel distribution estimators are not consistent when estimating a distribution function near the boundary of its support, a problem due to the "boundary effects" that occur in nonparametric functional estimation. To avoid these difficulties, we propose a generalized reflection method of boundary correction for the estimation problem of ROC curves. The proposed method generates a class of boundary-corrected estimators.

Keywords: Reflection, Distribution Estimation.

1 Introduction

The Receiver Operating Characteristic (ROC) curve describes the performance of a diagnostic test which classifies subjects into either a group without condition, G0, or a group with condition, G1, by means of a continuous discriminant score X: a subject is classified as G1 if X ≥ d and as G0 otherwise, for a given cutoff point d ∈ R. The ROC is defined as a plot of the probability of false classification of subjects from G1 versus the probability of true classification of subjects from G0 across all possible cutoff values of X. Specifically, let F0 and F1 denote the distribution functions of X in the groups G0 and G1, respectively. Then the ROC curve can be written as
\[
R(p) = 1 - F_1\!\left(F_0^{-1}(1 - p)\right), \qquad 0 < p < 1,
\]
where p is the false positive rate in (0, 1) as the corresponding cutoff point ranges from $-\infty$ to $+\infty$, and $F_0^{-1}$ denotes the inverse function of F0. A simple nonparametric estimator of R(p) uses the empirical distribution functions for F0 and F1; the resulting ROC curve is a step function and is called the empirical ROC curve. Another type of nonparametric estimator of R(p) is derived from kernel smoothing methods. Kernel smoothing is widely used mainly because it is easy to derive and has good asymptotic and small-sample properties. It has received considerable attention in the density estimation context; see, for example, the monographs of Silverman (1986) and Wand and Jones (1995). However, applications of kernel smoothing to distribution function estimation are relatively few.
Some theoretical properties of a kernel distribution function estimator have been investigated by Nadaraya (1964), Reiss (1981), and Azzalini (1981). Lloyd (1998) proposed a nonparametric estimator of the ROC curve by using kernel estimators for the distribution functions F0 and F1, and Lloyd and Yong (1999) showed that Lloyd's estimator has better mean squared error properties than the empirical ROC curve estimator. However, his estimator has some drawbacks: for example, it is unreliable near the endpoints of the support of the ROC curve due to the so-called "boundary effects" that occur in nonparametric functional estimation. Although there is a vast literature on boundary correction in the density estimation context, the boundary effects problem in the distribution function context has been studied much less. In this paper, we develop a new kernel-type estimator of the ROC curve that removes boundary effects near the endpoints of the support. Our estimator is based on a new boundary-corrected kernel estimator of distribution functions and on the ideas of Karunamuni and Alberts (2005a, 2005b, 2006), Zhang and Karunamuni (1998, 2000), Karunamuni and Zhang (2008), and Zhang, Karunamuni, and Jones (1999) developed for boundary correction in kernel density estimation. The basic technique of construction of the proposed estimator is a kind of generalized reflection method involving reflecting a transformation of the observed data; in fact, the proposed method generates a class of boundary-corrected estimators. We derive expressions for the bias and variance of the proposed estimator. Furthermore, the proposed estimator is compared with the "classical estimator" using simulation studies, and we observe that it successfully removes boundary effects and performs considerably better.

Kernel smoothing in distribution function and ROC curve estimation is discussed in the next section. The proposed estimator is given in Section 3. Simulation results are given in Section 4, a real data example is analyzed in Section 5, and some concluding remarks are given in Section 6.

2 Kernel Smoothing

2.1 Kernel ROC Estimator

Suppose that independent samples $X_{01}, \ldots, X_{0n_0}$ and $X_{11}, \ldots, X_{1n_1}$ are available from two unknown distributions F0 and F1, respectively, where $F_0 \in \mathcal{G}_0$, $F_1 \in \mathcal{G}_1$, and $\mathcal{G}_0$ and $\mathcal{G}_1$ denote two groups of continuous distribution functions. Then a simple nonparametric estimator of the ROC curve $R(p) = 1 - F_1(F_0^{-1}(1 - p))$, 0 < p < 1, is the empirical ROC curve
\[
\hat R_E(p) = 1 - \hat F_1\!\left(\hat F_0^{-1}(1 - p)\right), \qquad 0 \le p \le 1,
\]
where $\hat F_0$ and $\hat F_1$ denote the empirical distribution functions of F0 and F1 based on the data $X_{01}, \ldots, X_{0n_0}$ and $X_{11}, \ldots, X_{1n_1}$, respectively; that is,
\[
\hat F_0(x) = \frac{1}{n_0}\sum_{i=1}^{n_0} I(X_{0i} \le x), \qquad \hat F_1(x) = \frac{1}{n_1}\sum_{i=1}^{n_1} I(X_{1i} \le x).
\]
Note that $\hat R_E$ is not a continuous function; in fact, it is a step function on the interval [0, 1]. This is a notable weakness of the empirical ROC curve. Since the ROC curve is a smooth function of p, we would expect to have an estimator that is smooth as well. Lloyd (1998) proposed a smooth estimator using kernel smoothing techniques; his idea is to replace the unknown distributions F0 and F1 by two smooth kernel estimators.
Specifically, he employed the following kernel estimators of $F_0$ and $F_1$:
$$\widetilde{F}_0(x) = \frac{1}{n_0}\sum_{i=1}^{n_0} W\!\left(\frac{x - X_{0i}}{h_0}\right), \qquad \widetilde{F}_1(x) = \frac{1}{n_1}\sum_{i=1}^{n_1} W\!\left(\frac{x - X_{1i}}{h_1}\right),$$
where $W(x) = \int_{-1}^{x} K(t)\,dt$, $h_0$ and $h_1$ denote bandwidths ($h_0 \to 0$ and $h_1 \to 0$ as $n_0 \to \infty$ and $n_1 \to \infty$, respectively), and $K$ is a unimodal symmetric density function with support $[-1,1]$. The corresponding estimator of the ROC curve $R(p)$ is then given by
$$\widetilde{R}(p) = 1 - \widetilde{F}_1\bigl(\widetilde{F}_0^{-1}(1-p)\bigr), \quad 0 \le p \le 1.$$
An example of a smooth estimate of $R(p)$ using $\widetilde{R}(p)$ is illustrated in Figure 1.

Figure 1: Smooth estimate of $R(p)$ (TPR plotted against FPR).

When $G_0$ and $G_1$ contain distributions with finite support, the estimator $\widetilde{R}$ exhibits boundary effects near the endpoints of the support, due to the same boundary effects that occur in the uncorrected kernel estimators $\widetilde{F}_0$ and $\widetilde{F}_1$. The main purpose of this article is to improve the kernel distribution estimators and thereby to avoid boundary effects of smooth kernel ROC estimators. Details of the boundary problem with $\widetilde{F}_0$ and $\widetilde{F}_1$ are described in the next section.

2.2 Kernel Distribution Estimator and Boundary Effects

Let $f$ denote a continuous density function with support $[0,a]$, $0 < a \le \infty$, and consider nonparametric estimation of the cumulative distribution function $F$ of $f$ based on a random sample $X_1, \ldots, X_n$ from $f$. Suppose that $F^{(j)}$, the $j$-th derivative of $F$, exists and is continuous on $[0,a]$, $j = 0, 1, 2$, with $F^{(0)} = F$ and $F^{(1)} = f$. Then the traditional kernel estimator of $F$ is given by
$$\widehat{F}_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n} W\!\left(\frac{x - X_i}{h}\right), \qquad W(x) = \int_{-1}^{x} K(t)\,dt,$$
where $K$ is a symmetric density function with support $[-1,1]$ and $h$ is the bandwidth ($h \to 0$ as $n \to \infty$). The basic properties of $\widehat{F}_{h,K}(x)$ at interior points are well known (e.g. Lejeune and Sarda, 1992); under some smoothness assumptions these include, for $h \le x \le a - h$,
$$E\widehat{F}_{h,K}(x) - F(x) = \tfrac{1}{2}\beta_2 f^{(1)}(x)h^2 + o(h^2),$$
$$n\,\mathrm{var}\,\widehat{F}_{h,K}(x) = F(x)\bigl(1 - F(x)\bigr) + hf(x)\int_{-1}^{1} W(t)\bigl(W(t) - 1\bigr)\,dt + o(h).$$
The performance of $\widehat{F}_{h,K}(x)$ at boundary points, i.e., for $x \in [0,h) \cup (a-h,a]$, however, differs from that at interior points due to the so-called "boundary effects" that occur in nonparametric curve estimation problems. More specifically, the bias of $\widehat{F}_{h,K}(x)$ is of order $O(h)$ instead of $O(h^2)$ at boundary points, while the variance of $\widehat{F}_{h,K}(x)$ is of the same order. This fact can be clearly seen by examining the behavior of $\widehat{F}_{h,K}$ inside the left boundary region $[0,h]$. Let $x$ be a point in the left boundary, i.e., $x \in [0,h]$. Then we can write $x = ch$, $0 \le c \le 1$. The bias and variance of $\widehat{F}_{h,K}(x)$ at $x = ch$ are of the form
$$E\widehat{F}_{h,K}(x) - F(x) = hf(0)\int_{-1}^{-c} W(t)\,dt + h^2 f^{(1)}(0)\left[\frac{c^2}{2} + c\int_{-1}^{-c} W(t)\,dt - \int_{-1}^{c} tW(t)\,dt\right] + o(h^2), \tag{1}$$
$$n\,\mathrm{var}\,\widehat{F}_{h,K}(x) = F(x)\bigl(1 - F(x)\bigr) + hf(0)\left[\int_{-1}^{c} W^2(t)\,dt - c\right] + o(h). \tag{2}$$
From expression (1) it is now clear that the bias of $\widehat{F}_{h,K}(x)$ is of order $O(h)$ instead of $O(h^2)$. To remove this boundary effect in kernel distribution estimation we investigate a new class of estimators in the next section.
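Lloyd's construction can be sketched in a few lines of Python. This is our code, not the authors'; the quartic kernel (anticipating Section 4) and the numerical inversion of $\widetilde{F}_0$ on a grid are our implementation assumptions:

```python
import numpy as np

def W_quartic(x):
    """Integrated quartic kernel: W(x) = int_{-1}^x (15/16)(1 - t^2)^2 dt."""
    x = np.clip(x, -1.0, 1.0)
    return (15.0 / 16.0) * (x - 2.0 * x**3 / 3.0 + x**5 / 5.0 + 8.0 / 15.0)

def kernel_cdf(x, data, h):
    """Classical kernel distribution estimator: mean of W((x - X_i) / h)."""
    return W_quartic((np.atleast_1d(x)[:, None] - data[None, :]) / h).mean(axis=1)

def smooth_roc(p, x0, x1, h0, h1, grid_size=2000):
    """Lloyd-type smooth ROC estimate, inverting F0_tilde numerically."""
    lo = min(x0.min(), x1.min()) - max(h0, h1)
    hi = max(x0.max(), x1.max()) + max(h0, h1)
    grid = np.linspace(lo, hi, grid_size)
    F0 = kernel_cdf(grid, x0, h0)                  # nondecreasing in x
    idx = np.minimum(np.searchsorted(F0, 1.0 - np.atleast_1d(p)), grid_size - 1)
    return 1.0 - kernel_cdf(grid[idx], x1, h1)
```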
3 The Proposed Estimator

In this section we propose a class of estimators of the distribution function $F$ of the form
$$\widetilde{F}_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n}\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right], \tag{3}$$
where $h$ is the bandwidth, $K$ is a symmetric density function with support $[-1,1]$, and $g_1$ and $g_2$ are two transformations that need to be determined. The same type of estimator in the density estimation case has been discussed in Zhang et al. (1999). As in that paper, we assume that $g_i$, $i = 1, 2$, are nonnegative, continuous and monotonically increasing functions defined on $[0,\infty)$. Further assume that $g_i^{-1}$ exists, $g_i(0) = 0$, $g_i^{(1)}(0) = 1$, and that $g_i^{(2)}$ exists and is continuous on $[0,\infty)$, where $g_i^{(j)}$ denotes the $j$-th derivative of $g_i$, with $g_i^{(0)} = g_i$, and $g_i^{-1}$ denotes the inverse function of $g_i$, $i = 1, 2$. We will choose $g_1$ and $g_2$ such that $\widetilde{F}_{h,K}(x) \ge 0$ everywhere. Note that the $i$-th term of the sum in (3) can be expressed as
$$W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right) = \int_{(-x+g_1(X_i))/h}^{(x+g_2(X_i))/h} K(t)\,dt.$$
The preceding integral is non-negative provided the inequality $-x + g_1(X_i) \le x + g_2(X_i)$ holds. Since $x \ge 0$, the preceding inequality will be satisfied if $g_1$ and $g_2$ are such that $g_1(X_i) \le g_2(X_i)$ for $i = 1, \ldots, n$. Thus we will assume that $g_1$ and $g_2$ are chosen such that $g_1(x) \le g_2(x)$ for $x \in [0,\infty)$ for our proposed estimator. Now we can obtain the bias and variance of (3) at $x = ch$, $0 \le c \le 1$, as
$$E\widetilde{F}_{h,K}(x) - F(x) = h^2\Bigl[f^{(1)}(0)\Bigl(\frac{c^2}{2} + 2c\int_{-1}^{-c} W(t)\,dt - \int_{-c}^{c} tW(t)\,dt\Bigr) - f(0)g_1^{(2)}(0)\int_{-1}^{c}(c-t)W(t)\,dt - f(0)g_2^{(2)}(0)\int_{-1}^{-c}(c+t)W(t)\,dt\Bigr] + o(h^2), \tag{4}$$
$$n\,\mathrm{var}\,\widetilde{F}_{h,K}(x) = F(x)\bigl(1-F(x)\bigr) + hf(0)\Bigl[\int_{-1}^{c} W^2(t)\,dt - c - 2\int_{-1}^{c} W(t)W(t-2c)\,dt + \int_{-1}^{-c} W^2(t)\,dt\Bigr] + o(h). \tag{5}$$
The proofs of (4) and (5) are given in the Appendix. Note that the contribution of $g_2$ to the bias vanishes as $c \to 1$. By comparing expressions (1), (4), (2), and (5) at boundary points, we can see that the variances are of the same order, while the bias of $\widehat{F}_{h,K}(x)$ is of order $O(h)$ whereas the bias of $\widetilde{F}_{h,K}(x)$ is of order $O(h^2)$. So our proposed estimator removes boundary effects in kernel distribution estimation, since the bias at boundary points is of the same order as the bias at interior points.

It is clear that there are various possible choices available for the pair $(g_1, g_2)$. However, we will choose $g_1$ and $g_2$ so that the condition $\widetilde{F}_{h,K}(0) = 0$ is satisfied, because of the fact that $F(0) = 0$. A sufficient (but not necessary) condition for this requirement is that $g_1$ and $g_2$ be equal. Thus we need to construct a single transformation function $g$ such that $g = g_1 = g_2$. Other important properties that are desirable in the estimator $\widetilde{F}_{h,K}$ are local adaptivity (i.e., the transformation function $g$ depends on $c$) and that $\widetilde{F}_{h,K}(x)$ equals the usual kernel estimator $\widehat{F}_{h,K}(x)$ at interior points. For the latter, $g$ must satisfy $g(y) \to y$ as $c \to 1$. In order to display the dependence of $g$ on $c$, $0 \le c \le 1$, we shall denote $g$ by $g_c$ in what follows. Summarizing all the assumptions, it is clear now that $g_c$ should satisfy the conditions

(i) $g_c : [0,\infty) \to [0,\infty)$; $g_c$ is continuous and monotonically increasing, and $g_c^{(i)}$ exists, $i = 1, 2$;
(ii) $g_c^{-1}(0) = 0$ and $g_c^{(1)}(0) = 1$;
(iii) $g_c(y) \to y$ for $c \to 1$.

Functions satisfying conditions (i) to (iii) are easy to construct. The trivial choice is $g_c(y) = y$, which represents the "classical" reflection method estimator. Based on extensive simulations, we observed that the following transformation adapts well to various shapes of distributions:
$$g_c(y) = y + \tfrac{1}{2}\, I_c\, y^2, \tag{6}$$
for $y \ge 0$ and $0 \le c \le 1$, where $I_c = \int_{-1}^{-c} W(t)\,dt$.
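A minimal sketch of the proposed estimator (3) with the single transformation $g_c$ from (6), reusing W_quartic from the sketch above; the quadrature for $I_c$ and the local choice $c = \min(x/h, 1)$ are our implementation assumptions:

```python
def I_c(c, n_grid=400):
    """I_c = int_{-1}^{-c} W(t) dt, by trapezoidal quadrature."""
    t = np.linspace(-1.0, -c, n_grid)
    return np.trapz(W_quartic(t), t)

def g_transform(y, c):
    """Transformation (6): g_c(y) = y + (1/2) I_c y^2."""
    return y + 0.5 * I_c(c) * y**2

def corrected_kernel_cdf(x, data, h):
    """Proposed estimator (3) with g1 = g2 = g_c, applied locally with
    c = min(x/h, 1); for c = 1, I_c = 0 and the estimator reduces to the
    plain reflection estimator (and to the classical one at interior x)."""
    out = []
    for xj in np.atleast_1d(x):
        c = min(xj / h, 1.0)
        gX = g_transform(data, c)
        out.append(np.mean(W_quartic((xj - gX) / h)
                           - W_quartic(-(xj + gX) / h)))
    return np.array(out)
```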
Remark: Some discussion of the above choice of $g_c$, and of various other improvements that can be made, is appropriate here. It is possible to construct functions $g_c$ that improve the bias further under some additional conditions. For instance, if one examines the right-hand side of the bias expansion (4), then it is not difficult to see that the term inside the brackets (i.e., the coefficient of $h^2$) can be made equal to zero if $g_c$ is appropriately chosen. Indeed, if $g_c$ is chosen such that
$$f(0)g_c^{(2)}(0)\left[\int_{-1}^{c}(c-t)W(t)\,dt + \int_{-1}^{-c}(c+t)W(t)\,dt\right] = f^{(1)}(0)\left[\frac{c^2}{2} + 2c\int_{-1}^{-c}W(t)\,dt - \int_{-c}^{c} tW(t)\,dt\right],$$
then the bias of $\widetilde{F}_{h,K}(x)$ would theoretically be of order $O(h^3)$. For such a function $g_c$, the second derivative at zero, $g_c^{(2)}(0)$, will depend on the ratio $d_1 = f^{(1)}(0)/f(0)$. In this case, the function $g_c$ would probably be some cubic polynomial; see e.g. Karunamuni and Alberts (2005a, 2005b, 2006). Then the problem of estimating $d_1$ naturally arises, as in the papers cited above. Another problem one would face is that the second derivative $g_c^{(2)}(0)$ may not go to 0 as $c \to 1$, as it does in the density estimation context. Thus one may not be able to find any function $g_c$ which satisfies condition (iii), and hence the estimator $\widetilde{F}_{h,K}$ loses the property of "natural extension" to the classical estimator outside the boundary points. These are the main reasons why we decided to implement the quadratic function defined in (6) as our choice of transformation.

4 Simulation

To test the effectiveness of our estimator, we simulated its performance against the reflection method. The simulation is based on 1000 replications. In each replication, the random variables $X_0 \sim \mathrm{Exp}(2)$ and $X_1 \sim \mathrm{Gamma}(3,2)$ were generated and the estimate of the ROC curve was computed. The probability distributions of both groups $G_0$ and $G_1$ are illustrated in Figure 2. In all replications, sample sizes of $n_0 = n_1 = 50$ were used. In this case, the actual global optimal bandwidths (see Azzalini, 1981) for $F_0$ and $F_1$ are $h_{F_0} = 2.9149$ and $h_{F_1} = 5.8298$, respectively. For the kernel estimation of the cumulative distributions we used the quartic kernel
$$K(x) = \frac{15}{16}(1 - x^2)^2\, I_{[-1,1]}(x),$$
where $I_A$ is the indicator function of the set $A$. In our experience, the quality of the curve estimated with this kernel is not too sensitive to the optimal bandwidth choice; hence we used this kernel also in the next section. For each ROC curve we calculated the mean integrated squared error (MISE) on the interval $[0,1]$ over all 1000 replications and displayed the results in a boxplot in Figure 3. The variance of each estimator can be gauged by the whiskers of the plot. The values of the means and standard deviations of the MISE for each method are given in Table 1. We also obtained 10 typical realizations of each estimator and displayed them in Figure 4 for comparison with the theoretical ROC curve; the solid line represents the theoretical ROC curve and the dotted lines illustrate the 10 realizations.

Figure 2: The probability distributions of groups $G_0$ and $G_1$.

Table 1: Means and standard deviations of the MISE.

Method      Mean     STD
Proposed    0.0053   0.0047
Reflection  0.0065   0.0050
Classical   0.0084   0.0054

The final estimate of the ROC curve depends on estimates of the cumulative distribution functions $F_0$ and $F_1$. While boundary effects cause problems when estimating $F_0$ and $F_1$ inside the left boundary region, the quality of the final estimate of the ROC can also be influenced by these effects near the right boundary of the interval $[0,1]$.
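This experiment can be reproduced roughly as follows. The sketch below is our code; in particular, the parametrizations (exponential with scale 2, gamma with shape 3 and scale 2) are our reading of $\mathrm{Exp}(2)$ and $\mathrm{Gamma}(3,2)$, and scipy is used only to evaluate the true ROC:

```python
from scipy import stats

def mise_roc(est, n_rep=1000, n0=50, n1=50, h0=2.9149, h1=5.8298, seed=0):
    """Monte Carlo MISE over [0, 1] of a ROC estimator est(p, x0, x1, h0, h1)."""
    rng = np.random.default_rng(seed)
    F0, F1 = stats.expon(scale=2.0), stats.gamma(a=3.0, scale=2.0)
    p = np.linspace(0.001, 0.999, 500)
    true_roc = 1.0 - F1.cdf(F0.ppf(1.0 - p))
    ise = []
    for _ in range(n_rep):
        x0 = rng.exponential(2.0, n0)
        x1 = rng.gamma(3.0, 2.0, n1)
        ise.append(np.trapz((est(p, x0, x1, h0, h1) - true_roc) ** 2, p))
    return np.mean(ise), np.std(ise)

# e.g. mise_roc(smooth_roc) for the classical kernel ROC estimator
```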
As can be seen in Figure 4, the biggest difference between the above-mentioned methods is in the second half of the interval $[0,1]$. Table 1 describes the performance of our proposed method with respect to the MISE. The values of the mean and the standard deviation of the MISE were smallest in the case of our proposed estimator. Although the theoretical bias of our estimator is of the same order as that of the reflection method, the numerical results for the estimated ROC curves in the simulation were better for our estimator. In our opinion, this is due to the fact that our estimator is locally adaptive.

Figure 3: Boxplots of the MISE over $[0,1]$ for our proposed method (1), the reflection method (2), and the classical estimator with boundary effects (3).

Figure 4: Estimates of the ROC for our proposed method (1), the reflection method (2), and the classical estimator with boundary effects (3).

5 Consumer Loans Data

In this example we used some (unspecified) scoring function to predict the solidity of a client. The goal here is to determine which clients are able to pay their loans. We considered a test set of 332 clients; 309 paid their loans (group $G_0$) and 22 had problems with payments or did not pay (group $G_1$). We used the ROC curve to assess the discrimination between clients with and without good solidity. It is of interest to know whether our scoring function is a good predictor of solidity. Estimates of the ROC are illustrated in Figure 5; the dashed line represents the estimate obtained by our proposed method and the solid line the kernel ROC with boundary effects. When choosing the optimal bandwidths for distribution function estimation, we used the method described in Horová, Koláček, Zelinka, and El-Shaarawi (2008). A somewhat similar method for density estimation is given in Sheather and Jones (1991). The optimal bandwidths for the distribution functions $F_0$ and $F_1$ were estimated as $\hat{h}_{F_0} = 0.0068$ and $\hat{h}_{F_1} = 0.0286$, respectively.

Figure 5: The estimate of the ROC for the consumer loans data.

From the estimates of the ROC one can see that the scoring function is not a good predictor of the solidity of a client. This fact could also be affected by the different sizes of the two groups: when group $G_1$ is too small, it causes larger boundary effects. It is clearly visible that the estimate of the ROC obtained by the classical estimator (solid line) takes some values under the diagonal of the unit square, a situation which cannot occur theoretically. Thus boundary effects have a substantial influence on the quality of the final estimates of the ROC.

6 Conclusion

In this paper we proposed a new kernel-type distribution estimator to avoid the difficulties near the boundary. The technique implemented is a kind of generalized reflection method involving reflecting a transformation of the data.
The proposed method generates a class of boundary corrected estimators, and it is based on the ideas of boundary correction for kernel density estimators presented in Karunamuni and Alberts (2005a, 2005b, 2006). We showed some good properties of our proposed method (e.g., local adaptivity). Furthermore, it was shown that the bias of the proposed estimator is smaller than that of the "classical" one.

Acknowledgements

The research was supported by the Jaroslav Hájek Center for Theoretical and Applied Statistics (grant No. LC 06024). The second author's research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Appendix

Proof of (4). For $x = ch$, $0 \le c \le 1$, using the property $W(t) = 1 - W(-t)$ we obtain
$$E\widetilde{F}_{h,K}(x) = E\,W\!\left(\frac{x - g_1(X_i)}{h}\right) - E\,W\!\left(-\frac{x + g_2(X_i)}{h}\right) = \int_0^\infty W\!\left(\frac{x - g_1(y)}{h}\right) f(y)\,dy - \int_0^\infty W\!\left(-\frac{x + g_2(y)}{h}\right) f(y)\,dy$$
$$= h\int_{-1}^{c} W(t)\,\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt - h\int_{-1}^{-c} W(t)\,\frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\,dt$$
$$= h\int_{-1}^{-c} W(t)\left[\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)} - \frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\right]dt + h\int_{-c}^{c}\bigl(1 - W(-t)\bigr)\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt$$
$$= h\int_{-1}^{-c} W(t)\left[\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)} - \frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\right]dt + F\bigl(g_1^{-1}(2ch)\bigr) - h\int_{-c}^{c} W(t)\,\frac{f\bigl(g_1^{-1}((c+t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c+t)h)\bigr)}\,dt.$$
Using a Taylor expansion of order 2 of the function $F\bigl(g_1^{-1}(\cdot)\bigr)$ we have
$$F\bigl(g_1^{-1}(2ch)\bigr) = F(0) + f(0)\,2ch + \bigl[f^{(1)}(0) - f(0)g_1^{(2)}(0)\bigr]\,2c^2h^2 + o(h^2).$$
By the existence and continuity of $F^{(2)}(\cdot)$ near 0, we obtain for $x = ch$
$$F(0) = F(x) - f(x)ch + \tfrac{1}{2}f^{(1)}(x)c^2h^2 + o(h^2), \qquad f(x) = f(0) + f^{(1)}(0)ch + o(h), \qquad f^{(1)}(x) = f^{(1)}(0) + o(1).$$
Therefore,
$$F\bigl(g_1^{-1}(2ch)\bigr) = F(x) + f(0)ch + \Bigl[\tfrac{3}{2}f^{(1)}(0) - 2f(0)g_1^{(2)}(0)\Bigr]c^2h^2 + o(h^2). \tag{7}$$
Now, (7) and a Taylor expansion of order 1 of the functions $f\bigl(g_1^{-1}(\cdot)\bigr)/g_1^{(1)}\bigl(g_1^{-1}(\cdot)\bigr)$ and $f\bigl(g_2^{-1}(\cdot)\bigr)/g_2^{(1)}\bigl(g_2^{-1}(\cdot)\bigr)$ give
$$E\widetilde{F}_{h,K}(x) - F(x) = h\int_{-1}^{-c} W(t)\Bigl[2f^{(1)}(0)ch - f(0)h\bigl((c-t)g_1^{(2)}(0) + (c+t)g_2^{(2)}(0)\bigr) + o(h)\Bigr]dt$$
$$\quad + f(0)ch + \Bigl[\tfrac{3}{2}f^{(1)}(0) - 2f(0)g_1^{(2)}(0)\Bigr]c^2h^2 + o(h^2) - h\int_{-c}^{c} W(t)\Bigl[f(0) + \bigl(f^{(1)}(0) - f(0)g_1^{(2)}(0)\bigr)(c+t)h + o(h)\Bigr]dt$$
$$= h\Bigl[f(0)c - f(0)\int_{-c}^{c} W(t)\,dt\Bigr] + h^2\Bigl[\tfrac{3}{2}f^{(1)}(0)c^2 + 2f^{(1)}(0)c\int_{-1}^{-c} W(t)\,dt - 2f(0)g_1^{(2)}(0)c^2$$
$$\quad - f(0)g_1^{(2)}(0)\int_{-1}^{-c}(c-t)W(t)\,dt - f(0)g_2^{(2)}(0)\int_{-1}^{-c}(c+t)W(t)\,dt - \bigl(f^{(1)}(0) - f(0)g_1^{(2)}(0)\bigr)\int_{-c}^{c}(c+t)W(t)\,dt\Bigr] + o(h^2).$$
From the symmetry of $K$ and the definition of $W$, one can write $W(x) = \tfrac{1}{2} + b(x)$, where $b(x) = -b(-x)$ for all $x$ with $|x| \le 1$. Thus $\int_{-c}^{c} W(t)\,dt = c$, and therefore the coefficient of $h$ is zero. So after some algebra we obtain the bias expression
$$E\widetilde{F}_{h,K}(x) - F(x) = h^2\Bigl[f^{(1)}(0)\Bigl(\frac{c^2}{2} + 2c\int_{-1}^{-c} W(t)\,dt - \int_{-c}^{c} tW(t)\,dt\Bigr) - f(0)g_1^{(2)}(0)\int_{-1}^{c}(c-t)W(t)\,dt - f(0)g_2^{(2)}(0)\int_{-1}^{-c}(c+t)W(t)\,dt\Bigr] + o(h^2). \qquad \Box$$

Proof of (5). Observe that for $x = ch$, $0 \le c \le 1$, we have
$$n\,\mathrm{var}\,\widetilde{F}_{h,K}(x) = \frac{1}{n}\,\mathrm{var}\sum_{i=1}^{n}\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right]$$
$$= E\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right]^2 - \left(E\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right]\right)^2 = A_1 - A_2,$$
where
$$A_1 = \int_0^\infty\left[W\!\left(\frac{x - g_1(y)}{h}\right) - W\!\left(-\frac{x + g_2(y)}{h}\right)\right]^2 f(y)\,dy$$
$$= \int_0^\infty\left[W^2\!\left(\frac{x - g_1(y)}{h}\right) + W^2\!\left(-\frac{x + g_2(y)}{h}\right)\right] f(y)\,dy - \int_0^\infty 2\,W\!\left(\frac{x - g_1(y)}{h}\right) W\!\left(-\frac{x + g_2(y)}{h}\right) f(y)\,dy$$
$$= h\int_{-1}^{-c} W^2(t)\left[\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)} + \frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\right]dt + h\int_{-c}^{c} W^2(t)\,\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt$$
$$\quad - \int_0^\infty 2\,W\!\left(\frac{x - g_1(y)}{h}\right) W\!\left(-\frac{x + g_2(y)}{h}\right) f(y)\,dy =: A_{1,1} + A_{1,2} - A_{1,3}.$$
Using a Taylor expansion as in the preceding proof, it can be shown that
$$A_{1,1} = h\int_{-1}^{-c} W^2(t)\left[\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)} + \frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\right]dt = h\int_{-1}^{-c} W^2(t)\bigl(2f(0) + o(1)\bigr)\,dt.$$
For $A_{1,2}$ we use the identity $W(t) = 1 - W(-t)$, and similarly as in the preceding proof we get
$$A_{1,2} = h\int_{-c}^{c} W^2(t)\,\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt = h\int_{-c}^{c}\bigl(1 - 2W(-t) + W^2(-t)\bigr)\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt$$
$$= h\int_{-c}^{c}\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt - 2h\int_{-c}^{c} W(t)\,\frac{f\bigl(g_1^{-1}((c+t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c+t)h)\bigr)}\,dt + h\int_{-c}^{c} W^2(t)\,\frac{f\bigl(g_1^{-1}((c+t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c+t)h)\bigr)}\,dt$$
$$= F\bigl(g_1^{-1}(2ch)\bigr) - 2h\int_{-c}^{c} W(t)\bigl(f(0) + o(1)\bigr)\,dt + h\int_{-c}^{c} W^2(t)\bigl(f(0) + o(1)\bigr)\,dt = F(x) - f(0)ch + hf(0)\int_{-c}^{c} W^2(t)\,dt + o(h).$$
Using the continuity of $g_i^{(2)}$, $g_i(0) = 0$, and $g_i^{(1)}(0) = 1$, $i = 1, 2$, and a Taylor expansion of order 2 of $g_2\bigl(g_1^{-1}(\cdot)\bigr)$, we have
$$g_2\bigl(g_1^{-1}((c-t)h)\bigr) = g_2\bigl(g_1^{-1}(0)\bigr) + \frac{g_2^{(1)}\bigl(g_1^{-1}(0)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}(0)\bigr)}(c-t)h + o(h) = (c-t)h + o(h).$$
With the preceding expansion we obtain
$$A_{1,3} = \int_0^\infty 2\,W\!\left(\frac{x - g_1(y)}{h}\right)W\!\left(-\frac{x + g_2(y)}{h}\right)f(y)\,dy = 2h\int_{-1}^{c} W(t)\,W\!\left(-\frac{x}{h} - \frac{g_2\bigl(g_1^{-1}((c-t)h)\bigr)}{h}\right)\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt$$
$$= 2h\int_{-1}^{c} W(t)\,W\!\left(\frac{-ch - (c-t)h - o(h)}{h}\right)\bigl(f(0) + o(1)\bigr)\,dt = 2hf(0)\int_{-1}^{c} W(t)W(t - 2c)\,dt + o(h).$$
Now we can express $A_1$ as
$$A_1 = A_{1,1} + A_{1,2} - A_{1,3} = 2hf(0)\int_{-1}^{-c} W^2(t)\,dt + F(x) - f(0)ch + hf(0)\int_{-c}^{c} W^2(t)\,dt - 2hf(0)\int_{-1}^{c} W(t)W(t-2c)\,dt + o(h)$$
$$= F(x) + hf(0)\left[2\int_{-1}^{-c} W^2(t)\,dt - c + \int_{-c}^{c} W^2(t)\,dt - 2\int_{-1}^{c} W(t)W(t-2c)\,dt\right] + o(h).$$
With the expression obtained for the bias, we obtain for $A_2$
$$A_2 = \left(E\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right]\right)^2 = \bigl(E\widetilde{F}_{h,K}(x)\bigr)^2 = F^2(x) + o(h).$$
Finally, we obtain the variance of the estimator as
$$n\,\mathrm{var}\,\widetilde{F}_{h,K}(x) = A_1 - A_2 = F(x)\bigl(1 - F(x)\bigr) + hf(0)\left[2\int_{-1}^{-c} W^2(t)\,dt - c + \int_{-c}^{c} W^2(t)\,dt - 2\int_{-1}^{c} W(t)W(t-2c)\,dt\right] + o(h). \qquad \Box$$

References

Azzalini, A. (1981). A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika, 68, 326-328.
Horová, I., Koláček, J., Zelinka, J., and El-Shaarawi, A. H. (2008). Smooth estimates of distribution functions with application in environmental studies. Advanced Topics on Mathematical Biology and Ecology, 122-127.
Karunamuni, R. J., and Alberts, T. (2005a). A generalized reflection method of boundary correction in kernel density estimation. Canadian Journal of Statistics, 33, 497-509.
Karunamuni, R. J., and Alberts, T. (2005b). On boundary correction in kernel density estimation. Statistical Methodology, 2, 191-212.
Karunamuni, R. J., and Alberts, T. (2006). A locally adaptive transformation method of boundary correction in kernel density estimation. Journal of Statistical Planning and Inference, 136, 2936-2960.
Karunamuni, R. J., and Zhang, S. (2008). Some improvements on a boundary corrected kernel density estimator. Statistics & Probability Letters, 78, 497-507.
Lejeune, M., and Sarda, P. (1992). Smooth estimators of distribution and density functions. Computational Statistics & Data Analysis, 14, 457-471.
Lloyd, C. J. (1998). The use of smoothed ROC curves to summarise and compare diagnostic systems. Journal of the American Statistical Association, 93, 1356-1364.
Lloyd, C. J., and Yong, Z. (1999). Kernel estimators of the ROC curve are better than empirical. Statistics and Probability Letters, 44, 221-228.
Nadaraya, E. A. (1964). Some new estimates for distribution functions. Theory of Probability and its Applications, 15, 497-500.
Reiss, R. D. (1981). Nonparametric estimation of smooth distribution functions. Scandinavian Journal of Statistics, 8, 116-119.
Sheather, S. J., and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Wand, M. P., and Jones, M. C. (1995). Kernel Smoothing. London: Chapman and Hall.
Zhang, S., and Karunamuni, R. J. (1998). On kernel density estimation near endpoints. Journal of Statistical Planning and Inference, 70, 301-316.
Zhang, S., and Karunamuni, R. J. (2000). On nonparametric density estimation at the boundary. Nonparametric Statistics, 12, 197-221.
Zhang, S., Karunamuni, R. J., and Jones, M. C. (1999). An improved estimator of the density function at the boundary. Journal of the American Statistical Association, 94, 1231-1241.

Authors' addresses:

Jan Koláček, Department of Mathematics and Statistics, Faculty of Science, Kotlářská 2, 611 37 Brno, Czech Republic. E-mail: kolacek@math.muni.cz

Rohana J. Karunamuni, Department of Mathematical and Statistical Sciences, University of Alberta, T6G 2G1 Edmonton, Canada. E-mail: R.J.Karunamuni@ualberta.ca

Computational Statistics (2008) 23:63-78, DOI 10.1007/s00180-007-0068-6

ORIGINAL PAPER

Plug-in method for nonparametric regression

Jan Koláček

Accepted: 5 October 2006 / Published online: 25 September 2007. © Springer-Verlag 2007

Abstract: The problem of bandwidth selection for non-parametric kernel regression is considered. We focus in particular on the Nadaraya-Watson and local linear estimators. The circular design is assumed in this work to avoid the difficulties caused by boundary effects. Most bandwidth selectors are based on the residual sum of squares (RSS). It is often observed in simulation studies that these selectors are biased toward undersmoothing. This leads to the consideration of a procedure which stabilizes the RSS by modifying the periodogram of the observations. As a result of this procedure, we obtain an estimate of the unknown parameters of the average mean squared error (AMSE) function. This process is known as a plug-in method. Simulation studies suggest that the plug-in method could have preferable properties to the classical one.

Keywords: Bandwidth selection, Fourier transform, Kernel estimation, Nonparametric regression

1 Introduction

In nonparametric regression estimation, a critical and inevitable step is to choose the smoothing parameter (bandwidth) to control the smoothness of the curve estimate. The smoothing parameter considerably affects the features of the estimated curve. Although in practice one can try several bandwidths and choose a bandwidth subjectively, automatic (data-driven) selection procedures can be useful in many situations; see Silverman (1985) for more examples. (Supported by the MSMT: LC 06024. J. Koláček, Faculty of Science, Masaryk University, Janáčkovo nám. 2a, Brno, Czech Republic; e-mail: kolacek@math.muni.cz)

Several automatic bandwidth selectors have been proposed and studied in Craven and Wahba (1979), Härdle (1990), Härdle et al. (1988), Droge (1996), and the references given therein. It is well recognized that these bandwidth estimates are subject to large sample variation.
The kernel estimates based on the bandwidths selected by these procedures can have very different appearances. Due to the large sample variation, classical bandwidth selectors might not be very useful in practice. In the simulation study of Chiu (1990), it was observed that Mallows' criterion gives smaller bandwidths more frequently than predicted by the asymptotic theorems. Chiu (1990) provided an explanation of the cause and suggested a procedure to overcome the difficulty. By applying this procedure, we introduce a new method for bandwidth selection which gives much more stable bandwidth estimates.

2 Kernel regression

Consider a standard regression model of the form
$$Y_t = m(x_t) + \varepsilon_t, \quad t = 0, \ldots, T-1, \ T \in \mathbb{N},$$
where $m$ is an unknown regression function, $x_t$ are design points, $Y_t$ are measurements and $\varepsilon_t$ are independent random variables for which $E(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = \sigma^2 > 0$, $t = 0, \ldots, T-1$. The aim of kernel smoothing is to find a suitable approximation $\widehat{m}$ of the unknown function $m$. In what follows we assume:

1. The design points $x_t$ are equidistantly distributed on the interval $[0,1]$, that is, $x_t = t/T$, $t = 0, \ldots, T-1$.
2. We use a "cyclic design", that is, we suppose $m(x)$ is a smooth periodic function and the estimate is obtained by applying the kernel to the extended series $Y_t$, where $Y_{t+kT} = Y_t$ for $k \in \mathbb{Z}$. Similarly $x_t = t/T$, $t \in \mathbb{Z}$.

$\mathrm{Lip}[a,b]$ denotes the class of continuous functions satisfying the inequality $|g(x) - g(y)| \le L|x - y|$ for all $x, y \in [a,b]$, where $L > 0$ is a constant.

Definition. Let $\kappa$ be a nonnegative even integer, $\kappa \ge 2$. A function $K \in \mathrm{Lip}[-1,1]$ with $\mathrm{support}(K) = [-1,1]$, satisfying the conditions

(i) $K(-1) = K(1) = 0$,
(ii) $\int_{-1}^{1} x^j K(x)\,dx = \begin{cases} 0, & 0 < j < \kappa \\ 1, & j = 0 \\ \beta_\kappa \ne 0, & j = \kappa, \end{cases}$

is called a kernel of order $\kappa$, and the class of all these kernels is denoted $S_{0\kappa}$. These kernels are used for the estimation of the regression function (see Wand and Jones, 1995). Let $K \in S_{0\kappa}$ and set $K_h(\cdot) = \frac{1}{h}K(\frac{\cdot}{h})$, $h \in (0,1)$. The parameter $h$ is called a bandwidth. Commonly used non-parametric methods for estimating $m(x)$ are the kernel estimators

1. the Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964)
$$\widehat{m}_{NW}(x;h) = \frac{\sum_{k=-T}^{2T-1} K_h(x_k - x)\,Y_k}{\sum_{k=-T}^{2T-1} K_h(x_k - x)},$$
2. the local linear estimator (Stone, 1977; Cleveland, 1979)
$$\widehat{m}_{LL}(x;h) = \frac{1}{T}\,\frac{\sum_{k=-T}^{2T-1}\bigl\{\hat{s}_2(x;h) - \hat{s}_1(x;h)(x_k - x)\bigr\} K_h(x_k - x)\,Y_k}{\hat{s}_2(x;h)\hat{s}_0(x;h) - \hat{s}_1(x;h)^2},$$
where $\hat{s}_r(x;h) = \frac{1}{T}\sum_{k=-T}^{2T-1}(x_k - x)^r K_h(x_k - x)$.

In the cyclic design, the kernel estimators can generally be expressed as
$$\widehat{m}(x;h) = \sum_{k=-T}^{2T-1} W_k^{(j)}(x)\,Y_k,$$
where the weights $W_k^{(j)}(x)$, $j \in \{NW, LL\}$, correspond to the weights of the estimators $\widehat{m}_{NW}$, $\widehat{m}_{LL}$. The assumption of the circular model leads to the fact that the weights of the Nadaraya-Watson and local linear estimators are identical at the design points, that is,
$$W_k^{(LL)}(x_t) = W_k^{(NW)}(x_t), \quad k \in \{-T, -T+1, \ldots, 2T-1\}, \ t \in \{0, 1, \ldots, T-1\},$$
so in what follows we write only $W_k(x_t)$ without the upper index. Let $K \in S_{0\kappa}$, $h \in (0,1)$, $t \in \{0, \ldots, T-1\}$. Then the sum $\sum_{k=-T}^{2T-1} K_h(x_k - x_t) = \sum_{k=-T+1}^{T-1} K_h(x_k)$ is independent of $t$. Set $C_T := \sum_{k=-T+1}^{T-1} K_h(x_k)$. We can then simply write the value of the weight functions at the design points $x_t$, $t = 0, \ldots, T-1$, as
$$W_k(x_t) = \frac{1}{C_T} K_h(x_k - x_t).$$
The optimal bandwidth considered here is $h_{opt}$, the minimizer of the average mean squared error (AMSE)
$$R_T(h) = \frac{1}{T}\,E\sum_{t=0}^{T-1}\bigl\{m(x_t) - \widehat{m}(x_t;h)\bigr\}^2.$$
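The cyclic Nadaraya-Watson fit at the design points can be sketched in a few lines of Python (our code, not the paper's; the kernel of order 2 from Table 1 below is assumed):

```python
import numpy as np

def epanechnikov(x):
    """Kernel of order 2: K(x) = -3/4 (x^2 - 1) on [-1, 1], 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x**2), 0.0)

def nw_cyclic(Y, h, kernel=epanechnikov):
    """Nadaraya-Watson fit at the design points x_t = t/T under the cyclic
    design: the series is extended periodically to indices -T, ..., 2T-1."""
    Y = np.asarray(Y, dtype=float)
    T = len(Y)
    t = np.arange(T) / T
    xk = np.concatenate([t - 1.0, t, t + 1.0])     # x_{-T}, ..., x_{2T-1}
    Yk = np.tile(Y, 3)
    Kmat = kernel((xk[None, :] - t[:, None]) / h) / h
    return (Kmat @ Yk) / Kmat.sum(axis=1)
```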
Let $K \in S_{0\kappa}$. Under some mild conditions, the AMSE converges to
$$R_T(h) = \frac{\sigma^2 V(K)}{Th} + \frac{h^{2\kappa}}{(\kappa!)^2}\,\beta_\kappa^2\, A_\kappa, \tag{1}$$
where $V(K) = \int_{-1}^{1} K^2(x)\,dx$, $\beta_\kappa = \int_{-1}^{1} x^\kappa K(x)\,dx$, and $A_\kappa = \int_0^1 \bigl(m^{(\kappa)}(x)\bigr)^2\,dx$. This function has a unique minimum
$$h_{opt} = \left(\frac{\sigma^2 V(K)(\kappa!)^2}{2\kappa T \beta_\kappa^2 A_\kappa}\right)^{\frac{1}{2\kappa+1}} \tag{2}$$
(for more details, see Wand and Jones, 1995). There exist many estimators of this error function which are asymptotically equivalent and asymptotically unbiased (see Härdle, 1990; Chiu, 1990, 1991). However, in simulation studies it is often observed that most selectors are biased toward undersmoothing and give smaller bandwidths more frequently than predicted by asymptotic results. Most bandwidth selectors are based on the residual sum of squares (RSS)
$$\mathrm{RSS}_T(h) = \frac{1}{T}\sum_{t=0}^{T-1}\bigl\{Y_t - \widehat{m}(x_t;h)\bigr\}^2.$$
For example, Rice (see Rice, 1984) considered
$$\widehat{R}_T(h) = \mathrm{RSS}_T(h) - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0, \tag{3}$$
where $\hat{\sigma}^2$ is an estimate of $\sigma^2$,
$$\hat{\sigma}^2 = \frac{1}{2T-2}\sum_{t=1}^{T-1}(Y_t - Y_{t-1})^2. \tag{4}$$
The estimate $\hat{h}_{opt}$ of the optimal bandwidth is defined as $\hat{h}_{opt} = \arg\min \widehat{R}_T(h)$.

3 Use of the Fourier transformation

Let $M_t = m(x_t)$, $t = 0, \ldots, T-1$. The periodogram of the vector of observations $\boldsymbol{Y}$ is defined by
$$I_{Y_\lambda} = |Y_\lambda^-|^2 / (2\pi T),$$
where $Y_\lambda^- = \sum_{k=0}^{T-1} Y_k e^{-\frac{i2\pi k\lambda}{T}}$ is the finite Fourier transform of the vector $\boldsymbol{Y}$. This transformation is denoted by $\boldsymbol{Y}^- = \mathrm{DFT}^-(\boldsymbol{Y})$. The periodograms and Fourier transforms of the series $\boldsymbol{\varepsilon}$ and $\boldsymbol{M}$ are defined similarly. Under mild conditions, the periodogram ordinates $I_{\varepsilon_t}$ at the Fourier frequencies $\frac{2\pi t}{T}$, for $t = 1, \ldots, N = \bigl[\frac{T-1}{2}\bigr]$, are approximately independently and exponentially distributed with means $\frac{\sigma^2}{2\pi}$. Here $[x]$ denotes the greatest integer less than or equal to $x$.

Definition. Let $\boldsymbol{x} = (x_0, \ldots, x_{T-1})$, $\boldsymbol{y} = (y_0, \ldots, y_{T-1}) \in \mathbb{C}^T$, and let $z_t = \sum_{k=0}^{T-1} x_{\langle t-k\rangle_T}\, y_k$, where $\langle t-k\rangle_T$ denotes $(t-k) \bmod T$. Then $\boldsymbol{z} = (z_0, \ldots, z_{T-1})$ is called the discrete cyclic convolution of the vectors $\boldsymbol{x}$ and $\boldsymbol{y}$; we write $\boldsymbol{z} = \boldsymbol{x} \circledast \boldsymbol{y}$.

Let us define a vector $\boldsymbol{w} := (w_0, w_1, \ldots, w_{T-1})$, where $w_t = W_0(x_t - 1) + W_0(x_t) + W_0(x_t + 1)$. Let $h \in (0,1)$, $K \in S_{0\kappa}$, $t \in \{0, \ldots, T-1\}$. Then we can write $\widehat{m}(x_t;h)$ as a discrete cyclic convolution of the vectors $\boldsymbol{w}$ and $\boldsymbol{Y}$:
$$\widehat{m}(x_t;h) = \sum_{k=0}^{T-1} w_{\langle t-k\rangle_T}\, Y_k. \tag{5}$$
Applying Parseval's formula yields
$$\mathrm{RSS}_T(h) = \frac{4\pi}{T}\sum_{t=1}^{N} I_{Y_t}\bigl(1 - w_t^-\bigr)^2, \tag{6}$$
where $w_t^- = \sum_{k=-T+1}^{T-1} W_0(x_k)\, e^{-\frac{i2\pi kt}{T}}$ is the finite Fourier transform of $\boldsymbol{w}$ (see Chiu, 1990, for details). From (3) and (6) we arrive at the equivalent expression for $\widehat{R}_T(h)$:
$$\widehat{R}_T(h) = \frac{4\pi}{T}\sum_{t=1}^{N} I_{Y_t}\{1 - w_t^-\}^2 - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0. \tag{7}$$
Similarly,
$$R_T(h) = \frac{4\pi}{T}\sum_{t=1}^{N}\left(I_{M_t} + \frac{\sigma^2}{2\pi}\right)\{1 - w_t^-\}^2 - \sigma^2 + 2\sigma^2 w_0. \tag{8}$$
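Rice's criterion (3) can be minimized directly over a bandwidth grid. The following sketch is our code (reusing nw_cyclic from above); it computes $\hat{\sigma}^2$ from (4) and the diagonal weight $w_0$, which for $h < 1$ reduces to $W_0(x_0) = K_h(0)/C_T$:

```python
def rice_selector(Y, h_grid, kernel=epanechnikov):
    """Bandwidth minimizing Rice's criterion (3):
    R_hat(h) = RSS_T(h) - sigma2 + 2 * sigma2 * w0."""
    Y = np.asarray(Y, dtype=float)
    T = len(Y)
    sigma2 = np.sum(np.diff(Y) ** 2) / (2.0 * T - 2.0)    # estimate (4)
    t = np.arange(T) / T
    xk = np.concatenate([t - 1.0, t, t + 1.0])
    scores = []
    for h in h_grid:
        C = kernel(xk / h).sum()          # proportional to C_T
        w0 = kernel(0.0) / C              # weight the fit puts on Y_t itself
        rss = np.mean((Y - nw_cyclic(Y, h, kernel)) ** 2)
        scores.append(rss - sigma2 + 2.0 * sigma2 * w0)
    return h_grid[int(np.argmin(scores))]
```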
4 The motivation and the plug-in method

Let $D(h) = \widehat{R}_T(h) - R_T(h)$. From the previous expressions we obtain
$$D(h) = \frac{4\pi}{T}\sum_{t=1}^{N}\left(I_{Y_t} - I_{M_t} - \frac{\sigma^2}{2\pi}\right)\{1 - w_t^-\}^2. \tag{9}$$
The periodogram ordinates $I_{M_t}$ decrease rapidly for smooth $m(x)$, so the $I_{Y_t}$ do not contain much information about $I_{M_t}$ at high frequencies (for a rigorous proof see Rice, 1984). This leads to the consideration of the procedure proposed by Chiu (1991). The main idea is to modify the RSS to make it less variable. We find the first index $J_1$ such that $I_{Y_{J_1}} < c\,\hat{\sigma}^2/2\pi$ for some constant $c > 1$, where $\hat{\sigma}^2$ is an estimate of $\sigma^2$. The constant $c$ sets a threshold; in our experience, setting $1 < c < 3$ yields good results. The modified residual sum of squares is defined by
$$\mathrm{MRSS}_T(h) = \frac{2\pi}{T}\sum_{t=0}^{T-1}\tilde{I}_{Y_t}\{1 - w_t^-\}^2, \qquad \tilde{I}_{Y_t} = \begin{cases} I_{Y_t}, & t < J_1 \\ \hat{\sigma}^2/2\pi, & t \ge J_1 \end{cases}$$
(see Figures 1 and 2). Thus, the proposed selector is
$$\widetilde{R}_T(h) = \mathrm{MRSS}_T(h) - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0 \tag{10}$$
and the new estimate of the optimal bandwidth is $\hat{h}_{opt} = \arg\min \widetilde{R}_T(h)$ [for more details see Chiu (1990, 1991)]. To simplify the discussion below, set $c = 2$ and rewrite (10) into the formula in the next lemma.

Lemma 1. Let $J_1$ be the least index such that $I_{Y_{J_1}} < \hat{\sigma}^2/\pi T$. Then
$$\widetilde{R}_T(h) = \frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}(w_t^-)^2 + \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2.$$

Fig. 1: The periodogram ordinates $I_{Y_t}$ as a function of $t$; the threshold $a = 2\frac{\hat{\sigma}^2}{2\pi}$ is marked.

Fig. 2: The modified periodogram ordinates $\tilde{I}_{Y_t}$ as a function of $t$, with $a = 2\frac{\hat{\sigma}^2}{2\pi}$ and $b = \frac{\hat{\sigma}^2}{2\pi}$.

The main idea of the plug-in method is to estimate the unknown parameters $\sigma^2$ and $A_\kappa$ in the expression (2) for the optimal bandwidth $h_{opt}$, which is the minimum of
$$R_T(h) = \frac{\sigma^2 V(K)}{Th} + \frac{h^{2\kappa}}{(\kappa!)^2}\,\beta_\kappa^2 A_\kappa.$$
As an estimate of $\sigma^2$ we can use (4), but for $A_\kappa$ the situation is more complicated. From the previous considerations we can replace the error function $R_T(h)$ by the selector $\widetilde{R}_T(h)$ expressed in Lemma 1. If we compare these two error functions, we arrive at the results described in the next theorems.

Theorem 1. Let $\boldsymbol{w}^-$ be the discrete Fourier transform of the vector $\boldsymbol{w}$. Then
$$\sum_{t=0}^{T-1}(w_t^-)^2 = \frac{1}{h}V(K) + O(T^{-1}). \tag{11}$$

The previous theorem implies that the first term of $\widetilde{R}_T(h)$ estimates the first term of $R_T(h)$, that is,
$$\frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}(w_t^-)^2 = \frac{\sigma^2 V(K)}{Th} + O(T^{-2}).$$
Next, we compare the second terms of these error functions to obtain an estimator of $A_\kappa$. Let $\varepsilon > 0$ and $h \in (0,1)$, and let $J_2$ be the last index from $\{0, \ldots, T-1\}$ for which
$$J_2 \le \frac{\sqrt[\kappa+1]{\varepsilon(\kappa+1)!}}{2\pi h}.$$
Let us remark that the parameter $\varepsilon$ is the error of the Taylor approximation used in the proof of Theorem 2, and the parameter $h$ is some "starting" approximation of $h_{opt}$. In our experience, setting $\varepsilon = 10^{-3}$ and $h = \frac{\kappa}{T}$ yields good results. Since we require that the conditions on the indexes $J_1$ and $J_2$ hold at the same time, we define the index
$$J = \min\{J_1, J_2 + 1\}. \tag{12}$$

Theorem 2. Let $J$ be the index defined by (12). Then for all $j \in \mathbb{N}$, $1 \le j \le J-1$, it holds that
$$\frac{1}{(2\pi j)^\kappa}\bigl(1 - w_j^-\bigr) = (-1)^{\frac{\kappa}{2}+1}\,\frac{h^\kappa}{\kappa!}\,\beta_\kappa + c + O(T^{-1}), \tag{13}$$
where $c$ is a constant satisfying $|c| < \varepsilon$.

Using the result of this theorem we can deduce an estimator of the unknown parameter $A_\kappa$.

Definition. Let $J$ be the index defined by (12). Then the estimator of the parameter $A_\kappa$ is of the form
$$\widehat{A}_\kappa = \frac{4\pi}{T}\sum_{j=1}^{J-1}(2\pi j)^{2\kappa}\left(I_{Y_j} - \frac{\hat{\sigma}^2}{2\pi}\right).$$
So we can estimate the error function (1) by
$$\widehat{R}_T(h) = \frac{\hat{\sigma}^2 V(K)}{Th} + \frac{h^{2\kappa}}{(\kappa!)^2}\,\beta_\kappa^2\,\widehat{A}_\kappa \tag{14}$$
and its minimum by
$$\hat{h}_{opt} = \left(\frac{\hat{\sigma}^2 V(K)(\kappa!)^2}{2\kappa T \beta_\kappa^2 \widehat{A}_\kappa}\right)^{\frac{1}{2\kappa+1}}. \tag{15}$$
The parameter $\hat{h}_{opt}$ given by (15) is the estimator of the theoretical optimal bandwidth $h_{opt}$ obtained by the plug-in method. We would like to point out the computational aspect of the plug-in method: it has preferable properties to classical methods, because no numerical minimization of an error function is needed. Also, the sample size necessary to compute the estimate is far smaller than for classical methods. On the other hand, a small disadvantage is the fact that we need some "starting" approximation of the unknown parameter $h$.

Table 1: Kernels of class $S_{0\kappa}$.

κ = 2:  $-\frac{3}{4}(x^2 - 1)$
κ = 4:  $\frac{15}{32}(x^2 - 1)(7x^2 - 3)$
κ = 6:  $-\frac{105}{256}(x^2 - 1)(33x^4 - 30x^2 + 5)$

Table 2: Summary of sample means and standard deviations of the bandwidth estimates.

          κ = 2 (h_opt = 0.1374)   κ = 4 (h_opt = 0.3521)   κ = 6 (h_opt = 0.5783)
          E(ĥ_opt)  std(ĥ_opt)     E(ĥ_opt)  std(ĥ_opt)     E(ĥ_opt)  std(ĥ_opt)
Rice      0.1269    0.0402         0.3354    0.0938         0.4432    0.1078
Plug-in   0.1383    0.0074         0.3422    0.0348         0.5604    0.0623
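The plug-in estimate (15) can be assembled directly from the periodogram. The following sketch is our code for $\kappa = 2$ only; the FFT index conventions and the $c = 2$ threshold $c\,\hat\sigma^2/2\pi$ are our assumptions, and $V(K) = 3/5$, $\beta_2 = 1/5$ for the kernel of order 2 from Table 1:

```python
import math

def plugin_bandwidth(Y, kappa=2, eps=1e-3):
    """Plug-in estimate (15) of h_opt under the cyclic design (kappa = 2)."""
    Y = np.asarray(Y, dtype=float)
    T = len(Y)
    sigma2 = np.sum(np.diff(Y) ** 2) / (2.0 * T - 2.0)     # estimate (4)
    I_Y = np.abs(np.fft.fft(Y)) ** 2 / (2.0 * np.pi * T)   # periodogram
    N = (T - 1) // 2
    below = np.nonzero(I_Y[1:N + 1] < sigma2 / np.pi)[0]   # c = 2 threshold
    J1 = int(below[0]) + 1 if below.size else N
    h0 = kappa / T                                         # starting value
    J2 = int((eps * math.factorial(kappa + 1)) ** (1.0 / (kappa + 1))
             / (2.0 * np.pi * h0))
    J = min(J1, J2 + 1)                                    # definition (12)
    j = np.arange(1, J)
    # estimator of A_kappa; a sketch: assumes the low-frequency signal
    # dominates so that A_hat > 0
    A_hat = 4.0 * np.pi / T * np.sum(
        (2.0 * np.pi * j) ** (2 * kappa) * (I_Y[j] - sigma2 / (2.0 * np.pi)))
    V_K, beta = 3.0 / 5.0, 1.0 / 5.0
    num = sigma2 * V_K * math.factorial(kappa) ** 2
    den = 2.0 * kappa * T * beta ** 2 * A_hat
    return (num / den) ** (1.0 / (2 * kappa + 1))
```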
5 A simulation study

We carried out a small simulation study to compare the performance of the bandwidth estimates. The observations $Y_t$, for $t = 0, \ldots, T-1 = 74$, were obtained by adding independent Gaussian random variables with mean zero and variance $\sigma^2 = 0.2$ to the function $m(x) = \sin(2\pi x)$. Table 1 describes the kernels used in our simulation study, and the theoretical optimal bandwidths (see Wand and Jones, 1995; Koláček, 2005) for these cases are given in Table 2. Two hundred series were generated. Table 2 summarizes the sample means and the sample standard deviations of the bandwidth estimates; $E(\hat{h})$ is the average of all 200 values and $\mathrm{std}(\hat{h})$ is their standard deviation. Figure 3 shows the histogram of the results of all 200 experiments for $\kappa = 2$. As we can see, the standard deviation of the results obtained by the plug-in method is smaller than in the case of Rice's selector, and the mean of these results is also closer to the theoretical optimal bandwidth.

Fig. 3: Histogram of the results of all 200 experiments obtained by Rice's selector (grey) and by the plug-in method (black).

6 Examples

In this section we solve some practical examples. We used data from Eurostat (see http://epp.eurostat.cec.eu.int) and followed the count of marriages in Austria and Switzerland in May in 1950-2003. We transformed the data to the interval $[0,1]$ and used two selectors to get the optimal bandwidth. First, we found the optimal bandwidth by Rice's selector $\widehat{R}_T(h)$, which is the classical bandwidth selector. Then we used our proposed selector $\widetilde{R}_T(h)$. We made estimates of the regression function with both bandwidths using the kernel of order (0, 4)
$$K(x) = \begin{cases} \frac{15}{16}\bigl(\frac{7}{2}x^4 - 5x^2 + \frac{3}{2}\bigr), & |x| \le 1 \\ 0, & |x| > 1, \end{cases}$$
and used the Nadaraya-Watson estimator to obtain the final result.

Marriages in Switzerland. In this example we followed the count of marriages in Switzerland in May in 1950-2003. In this case, the bandwidth obtained by Rice's selector is too small and the final curve is undersmoothed (Figures 4, 5).

Fig. 4: Estimate of the regression function (solid line); the parameter h = 0.0740 was found by Rice's selector $\widehat{R}_T(h)$.

Fig. 5: Estimate of the regression function (solid line); the parameter h = 0.2180 was found by the plug-in method.

Marriages in Austria. In this example we followed the count of marriages in Austria in May in 1950-2003. In this case, we think that the value of the bandwidth obtained by Rice's selector is too large and the final curve is oversmoothed (Figures 6, 7). If we compare the results of both examples, we can see that the plug-in method is more stable than the classical one.

Fig. 6: Estimate of the regression function (solid line); the parameter h = 0.4084 was found by Rice's selector $\widehat{R}_T(h)$.
Fig. 7: Estimate of the regression function (solid line); the parameter h = 0.2945 was found by the plug-in method.

7 Conclusion

The problem of bandwidth selection for non-parametric kernel regression is considered. In many studies it has been observed that classical methods give smaller bandwidths more frequently than predicted by the asymptotic theorems. Chiu (1990) provided an explanation of the cause and suggested a procedure to overcome the difficulty. By applying this procedure, we introduced a new approach to estimating the unknown parameters of the average mean squared error (AMSE) function (this process is known as a plug-in method). Let us remark that Chiu's procedure was proposed for the Priestley-Chao estimator and for a special class of symmetric probability density functions from $S_{02}$ as kernels. We followed the Nadaraya-Watson and local linear estimators in particular and extended the procedure to these estimators; it was shown that they are identical in the circular model (see Koláček, 2005). In this paper, this approach has been generalized to kernels from the class $S_{0\kappa}$, $\kappa$ even. The main result of this work is Theorem 2 and the resulting definition, where the unknown parameter $A_\kappa$ is estimated. The simulation study and practical examples suggest that our proposed method could have preferable properties to the classical one. We remark that the proposed method is developed for a rather limited case: circular design and equally spaced design points. Further research is required for more general situations.

8 Appendix

Lemma 1. Let $J_1$ be the least index such that $I_{Y_{J_1}} < \hat{\sigma}^2/\pi T$. Then
$$\widetilde{R}_T(h) = \frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}(w_t^-)^2 + \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2.$$
Proof.
$$\widetilde{R}_T(h) = \frac{4\pi}{T}\sum_{t=1}^{N}\tilde{I}_{Y_t}\{1 - w_t^-\}^2 - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0 = \frac{4\pi}{T}\sum_{t=1}^{J_1-1} I_{Y_t}\{1 - w_t^-\}^2 + \frac{4\pi}{T}\sum_{t=J_1}^{N}\frac{\hat{\sigma}^2}{2\pi}\{1 - w_t^-\}^2 - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0$$
$$= \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2 + \frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}\{1 - w_t^-\}^2 - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0$$
$$= \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2 + \frac{\hat{\sigma}^2}{T}\left[T - 2Tw_0 + \sum_{t=0}^{T-1}(w_t^-)^2\right] - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0$$
$$= \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2 + \frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}(w_t^-)^2. \qquad \Box$$

Lemma 2. Let $t \in \{0, \ldots, T-1\}$; then $W_0(x_t) = \frac{1}{T}K_h(x_t) + O(T^{-2})$.

Proof. $W_0(x_t) = \frac{1}{TC_T}K_h(x_t)$, where $C_T = \frac{1}{T}\sum_{k=-T+1}^{T-1}K_h(x_k)$. We can express this constant in another way:
$$C_T = \int_{-1}^{1} K(x)\,dx + O(T^{-1}) = 1 + O(T^{-1}),$$
and after substitution we arrive at the result
$$W_0(x_t) = \frac{1}{T\bigl(1 + O(T^{-1})\bigr)}K_h(x_t) = \frac{1}{T}K_h(x_t) + O(T^{-2}). \qquad \Box$$

Theorem 1. Let $\boldsymbol{w}^-$ be the discrete Fourier transform of the vector $\boldsymbol{w}$. Then
$$\sum_{t=0}^{T-1}(w_t^-)^2 = \frac{1}{h}V(K) + O(T^{-1}).$$
Proof.
$$\sum_{t=0}^{T-1}(w_t^-)^2 = \sum_{t=0}^{T-1}|w_t^-|^2 = \sum_{t=0}^{T-1} w_t^-\,\overline{w_t^-} = \sum_{t=0}^{T-1}\sum_{j=-T+1}^{T-1}\sum_{k=-T+1}^{T-1} W_0(x_j)W_0(x_k)\,e^{\frac{i2\pi(k-j)t}{T}}$$
$$= \sum_{j=-T+1}^{T-1}\sum_{k=-T+1}^{T-1} W_0(x_j)W_0(x_k)\sum_{t=0}^{T-1}e^{\frac{i2\pi(k-j)t}{T}} = T\sum_{k=-T+1}^{T-1} W_0^2(x_k)$$
$$= \sum_{k=-T+1}^{T-1}\frac{1}{T}K_h^2(x_k) + O(T^{-1}) = \int_{-1}^{1}K_h^2(u)\,du + O(T^{-1}) = \frac{1}{h}\int_{-1}^{1}K^2(x)\,dx + O(T^{-1}). \qquad \Box$$

Theorem 2. Let $J$ be the index defined by (12). Then for all $j \in \mathbb{N}$, $1 \le j \le J-1$, it holds that
$$\frac{1}{(2\pi j)^\kappa}\bigl(1 - w_j^-\bigr) = (-1)^{\frac{\kappa}{2}+1}\frac{h^\kappa}{\kappa!}\beta_\kappa + c + O(T^{-1}), \tag{16}$$
where $c$ is a constant satisfying $|c| < \varepsilon$.

Proof.
$$\frac{1}{(2\pi j)^\kappa}\bigl(1 - w_j^-\bigr) = \frac{1}{(2\pi j)^\kappa}\left[1 - 2\sum_{t=0}^{T-1} W_0(x_t)\cos\frac{2\pi t j}{T}\right] = \frac{1}{(2\pi j)^\kappa}\left[1 - 2\sum_{t=0}^{T-1}\frac{1}{T}K_h(x_t)\cos\frac{2\pi t j}{T}\right] + O(T^{-1})$$
$$= \frac{1}{(2\pi j)^\kappa}\left[1 - 2\int_0^1 K_h(u)\cos(2\pi j u)\,du\right] + O(T^{-1}) = \frac{1}{(2\pi j)^\kappa}\left[\int_{-1}^{1}K_h(u)\,du - \int_{-1}^{1}K_h(u)\cos(2\pi j u)\,du\right] + O(T^{-1})$$
$$= \frac{1}{(2\pi j)^\kappa}\int_{-1}^{1}\bigl\{1 - \cos(2\pi j u)\bigr\}K_h(u)\,du + O(T^{-1}).$$
We can replace the function $1 - \cos(2\pi j u)$ by its Taylor polynomial of degree $\kappa$.
Let $R_\kappa$ be the error of this approximation. Then
$$\frac{1}{(2\pi j)^\kappa}\bigl(1 - w_j^-\bigr) = \frac{1}{(2\pi j)^\kappa}\int_{-1}^{1}\left[\frac{(2\pi j u)^2}{2} - \frac{(2\pi j u)^4}{24} + \cdots + (-1)^{\frac{\kappa}{2}+1}\frac{(2\pi j u)^\kappa}{\kappa!}\right]K_h(u)\,du + \frac{R_\kappa}{(2\pi j)^\kappa} + O(T^{-1})$$
$$= \frac{(-1)^{\frac{\kappa}{2}+1}}{\kappa!}\int_{-1}^{1} u^\kappa K_h(u)\,du + \frac{R_\kappa}{(2\pi j)^\kappa} + O(T^{-1}) = (-1)^{\frac{\kappa}{2}+1}\frac{h^\kappa}{\kappa!}\int_{-1}^{1} x^\kappa K(x)\,dx + \frac{R_\kappa}{(2\pi j)^\kappa} + O(T^{-1}).$$
The last two terms are negligible, because $O(T^{-1})$ tends to zero as $T \to \infty$, and from the assumptions on the index $j$ it holds that
$$\left|\frac{R_\kappa}{(2\pi j)^\kappa}\right| \le \frac{\varepsilon}{(2\pi)^\kappa}$$
for any $\varepsilon > 0$. $\Box$

References

Chiu, S. T. (1990). Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika, 77, 222-226.
Chiu, S. T. (1991). Some stabilized bandwidth selectors for nonparametric regression. Annals of Statistics, 19, 1528-1546.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatter plots. Journal of the American Statistical Association, 74, 829-836.
Craven, P., and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377-403.
Droge, B. (1996). Some comments on cross-validation. Statistical Theory and Computational Aspects of Smoothing, 178-199.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Härdle, W., Hall, P., and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83, 86-95.
Koláček, J. (2005). Kernel Estimation of the Regression Function. PhD thesis, Brno.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications, 10, 186-190.
Rice, J. (1984). Bandwidth choice for nonparametric regression. Annals of Statistics, 12, 1215-1230.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47, 1-52.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5, 595-645.
Wand, M. P., and Jones, M. C. (1995). Kernel Smoothing. London: Chapman & Hall.
Watson, G. S. (1964). Smooth regression analysis. Sankhya, Series A, 26, 359-372.

AUSTRIAN JOURNAL OF STATISTICS, Volume 35 (2006), Number 2&3, 281-288

A Comparative Study of Boundary Effects for Kernel Smoothing

Jan Koláček and Jitka Poměnková
Masaryk University, Brno, Czech Republic

Abstract: The problem of boundary effects for nonparametric kernel regression is considered. We follow in particular the problem of bandwidth selection for the Gasser-Müller estimator. Two ways of avoiding the difficulties caused by boundary effects are considered in this work. The first one is to assume a circular design; this idea is effective mainly for smooth periodic regression functions. The second method presented is the reflection method for kernels of the second order; the reflection method has an influence on the estimate beyond the edge points. The method of penalizing functions is used as a bandwidth selector. This work compares both techniques in a simulation study.

Keywords: Bandwidth Selection, Kernel Estimation, Nonparametric Regression.

1 Basic Terms and Definitions

Consider a standard regression model of the form
$$Y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \ n \in \mathbb{N},$$
where $m$ is an unknown regression function, $x_i$ are design points, $Y_i$ are measurements and $\varepsilon_i$ are independent random variables for which $E(\varepsilon_i) = 0$, $\mathrm{var}(\varepsilon_i) = \sigma^2 > 0$, $i = 1, \ldots, n$. (This work was supported by the GACR: 402/04/1308.) The aim of kernel smoothing is to find a suitable approximation $\widehat{m}$ of the unknown function $m$. In what follows we assume the design points $x_i$ are equidistantly distributed on the interval $[0,1]$, that is, $x_i = (i-1)/n$, $i = 1, \ldots, n$. $\mathrm{Lip}[a,b]$ denotes the class of continuous functions satisfying the inequality $|g(x) - g(y)| \le L|x - y|$ for all $x, y \in [a,b]$, where $L > 0$ is a constant.

Definition. Let $\kappa$ be a nonnegative even integer, $\kappa \ge 2$.
A function $K \in \mathrm{Lip}[-1,1]$ with $\mathrm{support}(K) = [-1,1]$, satisfying the conditions

1. $K(-1) = K(1) = 0$,
2. $\int_{-1}^{1} x^j K(x)\,dx = \begin{cases} 0, & 0 < j < \kappa \\ 1, & j = 0 \\ \beta_\kappa \ne 0, & j = \kappa, \end{cases}$

is called a kernel of order $\kappa$, and the class of all these kernels is denoted $S_{0\kappa}$. These kernels are used for the estimation of the regression function (see Wand and Jones, 1995). Let $K \in S_{0\kappa}$ and set $K_h(\cdot) = \frac{1}{h}K(\frac{\cdot}{h})$, $h \in (0,1)$. The parameter $h$ is called a bandwidth.

2 Kernel Estimation of the Regression Function

A commonly used non-parametric method for estimating $m(x)$ is the Gasser-Müller estimator (1979)
$$\widehat{m}_{GM}(x;h) = \sum_{i=1}^{n} Y_i\int_{s_{i-1}}^{s_i} K_h(t - x)\,dt, \qquad s_i = \frac{x_i + x_{i+1}}{2}, \ i = 1, \ldots, n-1, \quad s_0 = 0, \ s_n = 1.$$
The kernel estimators can generally be expressed as
$$\widehat{m}(x;h) = \sum_{i=1}^{n} W_i(x)\,Y_i,$$
where the weights $W_i(x)$ correspond to the weights of the estimator $\widehat{m}_{GM}$. The quality of the estimated curve is affected by the smoothing parameter $h$, the bandwidth. The optimal bandwidth considered here is $h_{opt}$, the minimizer of the average mean squared error (AMSE)
$$R_n(h) = \frac{1}{n}\,E\sum_{i=1}^{n}\bigl(m(x_i) - \widehat{m}(x_i;h)\bigr)^2.$$
Let $K \in S_{0\kappa}$. There exist many estimators of this error function which are asymptotically equivalent and asymptotically unbiased (see Chiu, 1991, 1990; Härdle, 1990). Most of them are based on the residual sum of squares (RSS)
$$\mathrm{RSS}_n(h) = \frac{1}{n}\sum_{i=1}^{n}\bigl[Y_i - \widehat{m}(x_i;h)\bigr]^2.$$
We will use the method of penalizing functions (see Koláček, 2005, 2002) for choosing the smoothing parameter. The prediction error $\mathrm{RSS}_n(h)$ is thus adjusted by a penalizing function $\Xi(n^{-1}W_i(x_i))$, that is, modified to
$$\widehat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n}\bigl[\widehat{m}(x_i;h) - Y_i\bigr]^2\cdot\Xi\bigl(n^{-1}W_i(x_i)\bigr).$$
The reason for this adjustment is that the correction function $\Xi(n^{-1}W_i(x_i))$ penalizes values of $h$ that are too low. For example, Rice (see Rice, 1984) considered
$$\Xi_R(u) = \frac{1}{1 - 2u}.$$
This penalizing function will be used.

Figure 1: Demonstration of boundary effects; the effective window $[x-h, x+h]$ around a point $x$ near the edge.
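Before turning to the boundary problem, the Gasser-Müller estimator itself can be evaluated through the antiderivative of the kernel. The following sketch is our code (Epanechnikov kernel of order 2, no boundary treatment, which is exactly the situation in which the boundary effects below arise):

```python
import numpy as np

def gasser_muller(x_eval, x, Y, h):
    """Gasser-Mueller estimate: each Y_i is weighted by the integral of
    K_h over its cell (s_{i-1}, s_i]."""
    x, Y = np.asarray(x, dtype=float), np.asarray(Y, dtype=float)
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    s = np.concatenate([[0.0], (x[:-1] + x[1:]) / 2.0, [1.0]])  # cell edges
    out = np.empty(len(x_eval))
    for j, xe in enumerate(x_eval):
        u = np.clip((s - xe) / h, -1.0, 1.0)
        # antiderivative of the Epanechnikov kernel 3/4 (1 - t^2) on [-1, 1]
        antider = 0.75 * (u - u**3 / 3.0 + 2.0 / 3.0)
        out[j] = np.sum(Y * np.diff(antider))
    return out
```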
3 Boundary Effects

In the finite sample situation, the quality of the estimate in the boundary region $[0,h] \cup [1-h,1]$ is affected, since the effective window $[x-h, x+h]$ is no longer contained in $[0,1]$, so that the finite equivalent of the moment conditions on the kernel function does not apply any more. There are several methods to avoid the difficulties caused by boundary effects.

3.1 Cyclic Model

One possible way to solve the problem of boundary effects is to use a cyclic design. That is, suppose $m(x)$ is a smooth periodic function and the estimate is obtained by applying the kernel to the extended series $Y_i$, where $Y_{i+kn} = Y_i$ for $k \in \mathbb{Z}$; similarly $x_i = (i-1)/n$, $i \in \mathbb{Z}$. In the cyclic design, the kernel estimators can generally be expressed as
$$\widehat{m}(x;h) = \sum_{i=-n+1}^{2n} W_i(x)\,Y_i,$$
where the weights $W_i(x)$ correspond to the weights of the estimator $\widehat{m}_{GM}$,
$$W_i(x) = \int_{s_{i-1}}^{s_i} K_h(t - x)\,dt, \qquad s_i = \frac{x_i + x_{i+1}}{2}, \ i = -n+1, \ldots, 2n-1, \quad s_{-n} = -1, \ s_{2n} = 2.$$
Let us define a vector $\boldsymbol{w} := (w_1, \ldots, w_n)$, where $w_i = W_1(x_i - 1) + W_1(x_i) + W_1(x_i + 1)$. Let $h \in (0,1)$, $K \in S_{0\kappa}$, $i \in \{1, \ldots, n\}$. Then we can write $\widehat{m}(x_i;h)$ as a discrete cyclic convolution of the vectors $\boldsymbol{w}$ and $\boldsymbol{Y}$:
$$\widehat{m}(x_i;h) = \sum_{k=1}^{n} w_{\langle i-k\rangle_n}\,Y_k, \tag{1}$$
where $\langle i-k\rangle_n$ denotes $(i-k) \bmod n$. We write $\widehat{\boldsymbol{m}} = \boldsymbol{w} \circledast \boldsymbol{Y}$, where $\widehat{\boldsymbol{m}} = (\widehat{m}(x_1;h), \ldots, \widehat{m}(x_n;h))$.

As the bandwidth selector, the method of Rice's penalizing function will be used. In the case of the cyclic model we can simplify the error function $\widehat{R}_n(h)$, because the weights $W_i(x_i)$ are independent of $i$. Set
$$I(h) := \int_{-1/2n}^{1/2n} K_h(x)\,dx.$$
Then we can express $\widehat{R}_n(h)$ as
$$\widehat{R}_n(h) = \frac{n}{n - 2I(h)}\,\mathrm{RSS}_n(h) \tag{2}$$
and the estimate $\hat{h}_{opt}$ of the optimal bandwidth is defined as $\hat{h}_{opt} = \arg\min_{h\in(0,1)}\widehat{R}_n(h)$.

3.2 Reflection Technique

Consider observations $(x_i, Y_i)$, $i = 1, \ldots, n$, the regression model described in Section 1, and design points $x_i \in [0,1]$ such that $0 = a \le x_1 \le \cdots \le x_n \le b = 1$. Now the technique of design point reflection will be discussed. We begin by estimating the function $m$ at the edge points $a$ and $b$, with corresponding bandwidths $h_a$ and $h_b$ and edge kernels $K_L, K_R \in S_{02}$:
$$\widehat{m}(a) = \frac{1}{h_a}\sum_{i=1}^{n} Y_i\int_{s_{i-1}}^{s_i} K_L\!\left(\frac{a-u}{h_a}\right)du, \qquad \widehat{m}(b) = \frac{1}{h_b}\sum_{i=1}^{n} Y_i\int_{s_{i-1}}^{s_i} K_R\!\left(\frac{b-u}{h_b}\right)du.$$
For the choice of the bandwidths $h_a$, $h_b$ and the edge kernels $K_L$, $K_R$ for $\widehat{m}(a)$, $\widehat{m}(b)$, see Poměnková (2005). Next, data reflection is made. We proceed from the original data set $(x_i, Y_i)$, $i = 1, \ldots, n$. To obtain the left mirrors, the point $(a, \widehat{m}(a))$ and the relations
$$x_{Li} = 2a - x_i, \qquad Y_{Li} = 2\widehat{m}(a) - Y_i$$
are used. To obtain the right mirrors, the point $(b, \widehat{m}(b))$ and the relations
$$x_{Ri} = 2b - x_{n-i+1}, \qquad Y_{Ri} = 2\widehat{m}(b) - Y_{n-i+1}$$
are used. The original data set $(x_i, Y_i)$ is then connected with the left mirrors $(x_{Li}, Y_{Li})$ and with the right mirrors $(x_{Ri}, Y_{Ri})$. By this connection a new data set is obtained, called pseudodata and denoted $(\bar{x}_j, \bar{Y}_j)$, $j = 1, \ldots, 3n$. How to find the bandwidth for an estimate on the pseudodata at the design points is described below. Finally, the function $m$ at the design points, including the points $a$ and $b$, is estimated using the pseudodata. Let $K \in S_{02}$ be a symmetric second-order kernel with support $[-1,1]$. The final estimate of the function $m$ at the points $x_i$, $i = 0, \ldots, n+1$, where $x_0 = a$, $x_{n+1} = b$, on the pseudodata $\bar{x}_j$, $j = 1, \ldots, 3n$, with kernel $K$ and bandwidth $h$, is defined as
$$\widehat{\bar{m}}(x) = \frac{1}{h}\sum_{j=1}^{3n}\bar{Y}_j\int_{\bar{s}_{j-1}}^{\bar{s}_j} K\!\left(\frac{x-u}{h}\right)du, \qquad \bar{s}_j = \frac{\bar{x}_j + \bar{x}_{j+1}}{2}, \ j = 1, \ldots, 3n-1, \quad \bar{s}_0 = -1, \ \bar{s}_{3n} = 2.$$

Bandwidth selection for pseudodata. In this part, an estimate of the bandwidth for the pseudodata is sought. Note that the estimates at the edge points, $\widehat{m}(a)$ and $\widehat{m}(b)$, are functions of $h$. Therefore, for any chosen value $h \in H = [1/n, 2]$, the values $\widehat{m}(a)$, $\widehat{m}(b)$ have to be evaluated, then the data reflection is made and the pseudodata are obtained. Thereafter, the minimum of the error function on these pseudodata is sought. We propose to find the value of $h$ using Rice's penalizing function. Consider the pseudodata $(\bar{x}_j, \bar{Y}_j)$, $j = 1, \ldots, 3n$, $\bar{x}_j \in [-1,2]$, and $\widehat{\bar{m}}(x)$ defined as above. Then
$$\widehat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n}\bigl[\widehat{\bar{m}}(x_i;h) - Y_i\bigr]^2\cdot\Xi_R\bigl(n^{-1}\bar{W}_i(x_i)\bigr),$$
where $\bar{W}_i(x_i)$ denote the corresponding weights on the pseudodata. The resulting bandwidth $\hat{h} = \hat{h}_{opt}$ is the value of $h$ that corresponds to the minimum of the function $\widehat{R}_n(h)$, i.e.,
$$\hat{h}_{opt} = \arg\min_{h\in H}\widehat{R}_n(h). \tag{3}$$
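The pseudodata construction itself takes only a few lines. The following sketch is our code; the edge estimates m_a = m̂(a) and m_b = m̂(b) are assumed to have been computed beforehand with the edge kernels $K_L$, $K_R$:

```python
def reflect_pseudodata(x, Y, m_a, m_b, a=0.0, b=1.0):
    """Build the 3n pseudodata: original data plus left and right mirrors
    through the edge estimates (a, m_a) and (b, m_b)."""
    x, Y = np.asarray(x, dtype=float), np.asarray(Y, dtype=float)
    xL, YL = 2.0 * a - x, 2.0 * m_a - Y               # left mirrors
    xR, YR = 2.0 * b - x[::-1], 2.0 * m_b - Y[::-1]   # right mirrors
    xs = np.concatenate([xL, x, xR])
    Ys = np.concatenate([YL, Y, YR])
    order = np.argsort(xs)                            # sort over [-1, 2]
    return xs[order], Ys[order]
```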
4 A Simulation Study

We carried out a small simulation study to compare the performance of the bandwidth estimates. The observations $Y_i$, for $i = 1, \ldots, n = 75$, were obtained by adding independent Gaussian random variables with mean zero and variance $\sigma^2 = 0.2$ to the function
$$m(x) = \cos(9x - 7) - (3 + x^{12})/6 + 8x - 1.$$
We made estimates of the regression function using the kernel of order 2
$$K(x) = \begin{cases} -\frac{3}{4}(x^2 - 1), & |x| \le 1 \\ 0, & |x| > 1. \end{cases}$$
In this case, $\hat{h} = 0.0367$ was selected using an estimate without any elimination of boundary effects (Figure 2). Second, $\hat{h} = 0.0867$ was selected using the method of the cyclic model (Figure 3), and third, $\hat{h} = 0.2036$ was selected using the reflection method (Figure 4). From the figures it can be seen that both the cyclic model and the reflection method are very useful for removing the problems caused by boundary effects.

Figure 2: Graph of the smoothed function with bandwidth h = 0.0367: the real regression function m and an estimate of m.
Figure 3: Graph of the smoothed function with bandwidth h = 0.0867: the real regression function m and an estimate of m.
Figure 4: Graph of the smoothed function with bandwidth h = 0.2036: the real regression function m and an estimate of m.

5 A Practical Example

We carried out a short real application to compare the performance of the bandwidth estimates. The observations $Y_i$, for $i = 1, \ldots, n = 230$, were average spring temperatures measured in Prague between 1771 and 2000. The data were obtained from the Department of Geography, Masaryk University. We made estimates of the regression function using the kernel of order 2
$$K(x) = \begin{cases} -\frac{3}{4}(x^2 - 1), & |x| \le 1 \\ 0, & |x| > 1. \end{cases}$$
In this case, $\hat{h} = 0.0671$ was selected using an estimate without any elimination of boundary effects (Figure 5). Second, $\hat{h} = 0.0671$ was selected using the method of the cyclic model (Figure 6), and third, $\hat{h} = 0.2211$ was selected using the reflection method (Figure 7). These figures show that both the cyclic model and the reflection method are very useful for removing the problems caused by boundary effects.

Figure 5: Graph of the smoothed function with bandwidth h = 0.0671: an estimate of m.
Figure 6: Graph of the smoothed function with bandwidth h = 0.0671: an estimate of m.
Figure 7: Graph of the smoothed function with bandwidth h = 0.2211: an estimate of m.

References

Chiu, S. (1990). Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika, 77, 222-226.
Chiu, S. (1991). Some stabilized bandwidth selectors for nonparametric regression. Annals of Statistics, 19, 1528-1546.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Koláček, J. (2002). Kernel estimation of the regression function - bandwidth selection. Summer School DATASTAT'01 Proceedings FOLIA, 1, 129-138.
Koláček, J. (2005). Kernel Estimators of the Regression Function. Brno: PhD thesis.
Poměnková, J. (2005). Some Aspects of Regression Function Smoothing (in Czech). Ostrava: PhD thesis.
Rice, J. (1984). Bandwidth choice for nonparametric regression. The Annals of Statistics, 12, 1215-1230.
Wand, M., and Jones, M. (1995). Kernel Smoothing. London: Chapman & Hall.

Authors' address:

Jan Koláček, Jitka Poměnková
Masaryk University in Brno
Department of Applied Mathematics
Janáčkovo náměstí 2a
CZ-602 00 Brno
Czech Republic
E-mail: kolacek@math.muni.cz