MASARYK UNIVERSITY
FACULTY OF SCIENCE
DEPARTMENT OF MATHEMATICS AND STATISTICS

Theory and Practice of Kernel Smoothing

Habilitation Thesis

Jan Koláček

Brno 2014

Contents
Abstract 2
Preface 3
1 Introduction 4
2 Assumptions and notations 6
2.1 The univariate case 7
2.2 The multivariate case 7
3 Kernel estimation of a regression function 8
3.1 Choosing the shape of the kernel 9
3.2 Choosing the optimal bandwidth 9
3.2.1 Plug-in method 10
3.2.2 Iterative method 10
3.3 Kernel regression for correlated data 11
4 Boundary effects in kernel estimation 12
4.1 Boundary effects in kernel regression 12
4.2 Boundary effects in kernel estimation of a distribution function 12
5 Kernel estimation and reliability assessment 13
6 Multivariate kernel density estimation 14
7 The monograph 15
8 Conclusion and further research 16
References 17
Reprints of articles 24

Abstract

This habilitation thesis is a collection of the articles [2 – 10] published in international journals; four of them are indexed in the Web of Science database. Most of these articles have co-authors, namely Ivana Horová, Kamila Vopatová, R. J. Karunamuni and others, and the contributions of all authors to the joint articles are equivalent. The thesis also refers to the book [1], published by World Scientific in 2012, which summarizes the obtained results and provides their practical application in MATLAB.

Our research area is the theory of kernel smoothing, which has undergone an unprecedented expansion over the last twenty years. Kernel smoothing now belongs among the standard nonparametric techniques used in data processing and modeling. The foundations of the theory are described in the monographs [48, 74, 79]. The controlling factor in kernel smoothing is the smoothing parameter, called the bandwidth in the univariate case and the bandwidth matrix in the multivariate case. Our research has therefore focused primarily on the choice of this smoothing parameter.

For kernel estimates of a regression function, two new methods were proposed. The first assumes a cyclic design, in which the data repeat periodically, and was published in [10]. The second method was introduced in [7], and its statistical properties were derived in [3], which was accepted for publication last year. In connection with regression estimates we also studied boundary effects (see [25]) and kernel regression estimates for correlated data (see [4]).

A no less interesting topic in this area is the boundary effects that arise in kernel estimation. We focused in particular on boundary effects in kernel estimates of a distribution function. In [9] we dealt with suppressing these effects in estimates of the ROC curve, and in [8] we studied their influence and suppression in estimates of the hazard function. We also dealt with applications of kernel estimates in finance, specifically with estimating the indices and curves that describe the quality of scoring models (see [5]).

A very important part of our research is the generalization of the principles of univariate kernel estimation to the multivariate setting.
We first focused on kernel density estimates. In [6], an iterative method for finding the optimal bandwidth matrix was introduced, in particular its graphical interpretation in the special case of a two-dimensional space under the assumption of a diagonal matrix. The statistical properties of this method and its generalization to a full bandwidth matrix were derived in [2].

Preface

The thesis is a collection of the articles [2 – 10]. Four of them have been published in international journals indexed by Web of Science. The paper [3] was accepted in December 2013. The thesis also refers to the book [1], which is a summary of all results in our research area.

Our main research interest lies in the theory of kernel smoothing. Kernel methods are well known and intensively used by the nonparametric statistics community because they are a useful tool for local weighting. Kernel estimators combine two main advantages: a simple expression and ease of implementation. It is well known that the most important factor in kernel estimation is the choice of smoothing parameters. This choice is particularly important because of its role in controlling both the amount and the direction of smoothing. The problem has been widely discussed in many monographs and papers.

The following overview starts with a motivation of the theory of kernel smoothing and then briefly describes the main contributions of the book [1] and the papers [2 – 10]. In order to make the presentation more compact, the thesis consists of the author's selected papers in the area. In References one can find the list of other related publications of the author [11 – 28].

Pronouncement

Almost all papers included in this thesis have co-authors, namely I. Horová, K. Vopatová, R. J. Karunamuni, J. Zelinka, M. Řezáč and D. Lajdová. In all cases, the contributions of all authors were equivalent, since the results were based on common discussions. Formally, the author's contribution to the paper [10] was 100%, the author's contribution to the papers [3, 5, 7, 8, 9] was 50% and the author's contribution to the monograph [1] and the papers [2, 4, 6] was 33%.

Acknowledgement

I wish to thank all the co-authors for their friendly and always very helpful collaboration. I would like to express my gratitude to my colleague Prof. Ivana Horová for our numerous interesting discussions. And most importantly, I would like to thank my wife Veronika. Her support, encouragement, patience and love were the bedrock upon which the past eight years of my life have been built.

1 Introduction

Kernel smoothing belongs to a general category of techniques for nonparametric curve estimation, including nonparametric regression, density and hazard function estimators. These estimates depend on a smoothing parameter called a bandwidth, which controls the smoothness of the estimate, and on a kernel, which plays the role of a weight function. As far as the kernel function is concerned, a key parameter is its order, which is related both to the number of its vanishing moments and to the number of existing derivatives of the underlying curve to be estimated. The bandwidth choice is the crucial problem in kernel smoothing and the main topic of our research.

The first part of our research includes a methodology for nonparametric regression analysis, complemented with practical applications. In nonparametric regression estimation, a critical and inevitable step is to choose the smoothing parameter (bandwidth) to control the smoothness of the curve estimate.
The smoothing parameter considerably affects the features of the estimated curve. Although in practice one can try several bandwidths and choose one subjectively, automatic (data-driven) selection procedures could be useful in many situations; see [73] for more examples. Several automatic bandwidth selectors were proposed and studied in [37], [50], [49], [38] and the references therein. It is well recognized that these bandwidth estimates are subject to large sample variation; kernel estimates based on the bandwidths selected by these procedures can have very different appearances. Due to this large sample variation, classical bandwidth selectors might not be very useful in practice. This fact has motivated us to look for new bandwidth selection methods which give much more stable bandwidth estimates.

In connection with kernel regression analysis we have to mention one essential fact: the regression model assumes no correlation in measurements. In the case of independent observations, the literature on bandwidth selection methods is quite extensive. Nevertheless, if an autocorrelation structure of errors occurs in the data, classical bandwidth selectors do not always provide applicable results (see [35]). Many real data sets (especially time series) show autocorrelation. This has led us to study possibilities for overcoming the effect of dependence on the bandwidth selection.

The next part of our research is focused on the study of boundary effects in kernel estimation. In practical processing we encounter data which are bounded in some interval. The quality of the estimate in the boundary region is affected, since the "effective" window does not belong to this interval and therefore the finite equivalent of the moment conditions on the kernel function no longer applies. This phenomenon is called the boundary effect. Although there is a vast literature on boundary correction in the density estimation context, the boundary effects problem in the cumulative distribution function and regression function context has been studied less. Thus, we have focused our research on these areas of kernel smoothing.

As we have already mentioned, kernel smoothing is widely used in many statistical research areas. One of them is focused on studying discrimination measures used to determine how well models separate the two classes in a binary classification system. There are many possible ways to measure the performance of classification rules. It is often very helpful to have a method for displaying and summarizing performance over a wide range of conditions. This aim is fulfilled, e.g., by the ROC (Receiver Operating Characteristic) curve, the Information value curve, Lift, the Kolmogorov–Smirnov statistic and others. There are many problems in the estimation of these curves in practice, and the kernel smoothing approach seems to be very helpful. Thus, our research has been directed also to this area.

An important part of our research is devoted to the extension of the univariate kernel density estimate to the multivariate setting. As we have already explained, the typical question, motivated by the origins of this research area, asks to determine the optimal smoothing parameter (matrix). Some "classical" methods in the multivariate case were developed and widely discussed in the papers [31], [42], [41], [68], [39]. Tarn Duong's PhD thesis ([39]) provided a comprehensive survey of bandwidth matrix selection methods for kernel density estimation.
The papers [32], [40] investigated general density derivative estimators, i.e., kernel estimators of multivariate density derivatives using general (or unconstrained) bandwidth matrix selectors. We have followed these papers and proposed a new data-driven bandwidth matrix selection method. Similar ideas have been applied to kernel estimates of multivariate regression functions.

We would like to emphasize the great interest in and usefulness of all the mentioned problems in many fields of applied sciences (environmetrics, chemometrics, biometrics, medicine, econometrics, . . . ). Thus our works deal not only with the theoretical background of the considered problems but also with the application to real data. For example, see [4], where the utility of the proposed method was illustrated through an application to a time series of ozone data. For applications of smoothing methods in medicine see [14]. A wide range of applications in finance can be found in [5, 16, 17, 20, 21, 18]. The use of some proposed methods for modeling in environmetrics was described in [22]. See the list in "Other Publications of the Author" at the end of the thesis for more references.

Author's Contribution

Our interest is focused on an outstanding open problem: the optimal bandwidth matrix selection in the multivariate case. Although there exist several classical approaches, it is problematic to implement them in practice because of their computational difficulty. Our results concerning this problem are described in Section 6. The author considers these results to be the most valuable part of the thesis, since they can potentially constitute a significant step towards a more effectively computable solution of the problem.

In Section 3 we overview our results concerning two other related problems. The main part describes results concerning optimal bandwidth selection for univariate kernel regression, and the remaining part deals with the problem of autocorrelated data in kernel regression. Our investigations of boundary effects in kernel smoothing (Section 4) serve as a supporting ground for new techniques in reliability assessment (Section 5), and the results obtained there could be beneficial for applications in other research areas. Finally, Section 7 presents a monograph where all results of our research are summarized. An integral part of the book is a special toolbox in MATLAB. The toolbox is described in the book in detail and provides a practical implementation of the presented methods.

2 Assumptions and notations

In this section, we introduce a definition of the kernel and present the notation and general assumptions used in our research.

Definition 1. Let ν, k be nonnegative integers, 0 ≤ ν < k. Let K be a real valued function satisfying $K \in S_{\nu,k}$, where

$$S_{\nu,k} = \left\{ K \in \mathrm{Lip}[-1,1],\ \mathrm{support}(K) = [-1,1],\ \int_{-1}^{1} x^j K(x)\,dx = \begin{cases} 0, & 0 \le j < k,\ j \neq \nu, \\ (-1)^{\nu}\,\nu!, & j = \nu, \\ \beta_k \neq 0, & j = k. \end{cases} \right\} \qquad (1)$$

Such a function is called a kernel of order k. The integral conditions are often called moment conditions. A commonly used kernel function is the Gaussian kernel

$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

Nevertheless, this kernel has an unbounded support and thus it does not belong to the class $S_{\nu,k}$.
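As a concrete example, the Epanechnikov kernel $K(x) = \frac{3}{4}(1-x^2)$ on $[-1,1]$ belongs to $S_{0,2}$: for $\nu = 0$, $k = 2$ the conditions in (1) require $\int K = 1$, $\int xK = 0$ and $\beta_2 = \int x^2 K \neq 0$. The following MATLAB fragment is a minimal illustrative check of these moments; it is not part of the toolbox [59].

```matlab
% Numerical check of the moment conditions (1) for the Epanechnikov
% kernel K(x) = 3/4*(1 - x^2) on [-1,1], a kernel of order k = 2.
K  = @(x) 0.75*(1 - x.^2);
m0 = integral(K, -1, 1);                % j = nu = 0: should equal (-1)^0*0! = 1
m1 = integral(@(x) x.*K(x), -1, 1);     % 0 <= j < k, j ~= nu: should equal 0
m2 = integral(@(x) x.^2.*K(x), -1, 1);  % j = k: beta_2, nonzero (here 1/5)
fprintf('m0 = %g, m1 = %g, beta_2 = %g\n', m0, m1, m2);
```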
2.1 The univariate case

Let us consider a univariate function f (a density function or a regression function) which is to be estimated. We present a short overview of the notation and assumptions used in our research.

(N1) The positive number h is a smoothing parameter, also called a bandwidth. The bandwidth h depends on n, h = h(n): $\{h(n)\}_{n=1}^{\infty}$ is a nonrandom sequence of positive numbers.

(N2) $K_h(t) = \frac{1}{h} K\!\left(\frac{t}{h}\right)$, $K \in S_{0,k}$, k even, h > 0.

(N3) $V(\rho) = \int_{\mathbb{R}} \rho^2(x)\,dx$ for any square integrable scalar valued function ρ.

(A1) $K \in S_{0,k} \cap C^{\nu}[-1,1]$, $K^{(j)}(-1) = K^{(j)}(1) = 0$, $j = 0, 1, \dots, \nu$, $\nu \in \mathbb{N}$, i.e., $K^{(\nu)} \in S_{\nu,k+\nu}$ (see [46, 60]).

(A2) $f \in C^{k_0}$, $\nu + k \le k_0$, $f^{(\nu+k)}$ is square integrable.

(A3) $\lim_{n\to\infty} h = 0$, $\lim_{n\to\infty} n h^{2\nu+1} = \infty$.

2.2 The multivariate case

This part is devoted to the extension of the assumptions for the univariate case to the multivariate setting. Let us consider a d-dimensional space as the domain of the estimated function f.

(N1) $\mathcal{H}$ denotes a class of d × d symmetric positive definite matrices.

(N2) $V(g) = \int_{\mathbb{R}^d} g(x) g^T(x)\,dx$ for any square integrable vector valued function g.

(A1) The kernel function K satisfies the moment conditions
$$\int K(x)\,dx = 1, \quad \int x K(x)\,dx = \mathbf{0}, \quad \int x x^T K(x)\,dx = \beta_2 I_d,$$
where $I_d$ is the d × d identity matrix.

(A2) $H \in \mathcal{H}$, $H = H_n$ is a sequence of bandwidth matrices such that $n^{-1/2}|H|^{-1/2}(H^{-1})^j$, $j = 0, 1, \dots, \nu$, $\nu \in \mathbb{N}$, and the entries of H approach zero ($(H^{-1})^0$ is considered as equal to 1).

(A3) Each partial derivative of f of order j + 2, j = 0, 1, . . . , ν, is continuous and square integrable.

3 Kernel estimation of a regression function

Our research interests include the methodology of nonparametric regression analysis, combined with practical applications. The aim of regression analysis is to produce a reasonable analysis of an unknown regression function m. By reducing the observational errors it allows the interpretation to concentrate on important details of the mean dependence of Y on X. Kernel regression estimates are among the most popular nonparametric estimates.

Let us consider a standard regression model of the form

$$Y_i = m(x_i) + \varepsilon_i, \quad i = 1, \dots, n, \qquad (2)$$

where m is an unknown regression function and $Y_1, \dots, Y_n$ are observable data variables with respect to the design points $x_1, \dots, x_n$. The residuals $\varepsilon_1, \dots, \varepsilon_n$ are independent identically distributed random variables with $E(\varepsilon_i) = 0$, $\mathrm{var}(\varepsilon_i) = \sigma^2 > 0$, $i = 1, \dots, n$. We suppose a fixed equally spaced design, i.e., the design variables are not random and $x_i = i/n$, $i = 1, \dots, n$. In the case of a random design, where the design points $X_1, \dots, X_n$ are random variables with the same density f, all considerations are similar to the fixed design. A more detailed description of the random design can be found, e.g., in [79].

The most popular regression estimator was proposed by Nadaraya and Watson ([64] and [80]) and is defined as

$$\hat m_{NW}(x,h) = \frac{\sum_{i=1}^{n} K_h(x_i - x)\, Y_i}{\sum_{i=1}^{n} K_h(x_i - x)}. \qquad (3)$$

In order to complete the overview of commonly used nonparametric methods for estimating m(x), we mention these estimators:

• local linear estimator ([76, 36])
$$\hat m_{LL}(x,h) = \frac{1}{n} \sum_{i=1}^{n} \frac{\{\hat s_2(x,h) - \hat s_1(x,h)(x_i - x)\}\, K_h(x_i - x)\, Y_i}{\hat s_2(x,h)\,\hat s_0(x,h) - \hat s_1(x,h)^2}, \qquad (4)$$
where $\hat s_r(x,h) = \frac{1}{n} \sum_{i=1}^{n} (x_i - x)^r K_h(x_i - x)$, r = 0, 1, 2,

• Priestley–Chao estimator ([66])
$$\hat m_{PCH}(x,h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x_i - x)\, Y_i, \qquad (5)$$

• Gasser–Müller estimator ([44])
$$\hat m_{GM}(x,h) = \sum_{i=1}^{n} Y_i \int_{s_{i-1}}^{s_i} K_h(t - x)\,dt, \qquad (6)$$
where $s_i = \frac{x_i + x_{i+1}}{2} = \frac{2i+1}{2n}$, $i = 1, \dots, n-1$, $s_0 = 0$, $s_n = 1$.

One can see from these formulas that kernel estimators can generally be expressed as

$$\hat m(x,h) = \sum_{i=1}^{n} W_i^{(j)}(x,h)\, Y_i, \qquad (7)$$

where the weights $W_i^{(j)}(x,h)$, $j \in \{NW, LL, PCH, GM\}$, correspond to the weights of the estimators $\hat m_{NW}$, $\hat m_{LL}$, $\hat m_{PCH}$ and $\hat m_{GM}$ defined above; a minimal implementation of the Nadaraya–Watson weights is sketched below.
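The following MATLAB fragment is a minimal sketch of the Nadaraya–Watson estimator (3) with the Epanechnikov kernel on the fixed design $x_i = i/n$; it is illustrative only and does not reproduce the toolbox [59] code. Note that the factor 1/h of $K_h$ cancels in the ratio (3), so the unscaled kernel can be used.

```matlab
% Nadaraya-Watson estimate (3) on the fixed design x_i = i/n.
% Minimal illustrative sketch; not the toolbox [59] implementation.
n  = 100;
xi = (1:n)'/n;                             % design points
Y  = sin(2*pi*xi) + 0.2*randn(n,1);        % data from model (2)
h  = 0.08;                                 % bandwidth, chosen ad hoc here
x  = (0:0.01:1)';                          % evaluation grid
K  = @(u) 0.75*(1 - u.^2).*(abs(u) <= 1);  % Epanechnikov kernel
W  = K((xi' - x)/h);                       % numel(x)-by-n weight matrix
mNW = (W*Y)./sum(W, 2);                    % estimator (3); 1/h cancels
plot(xi, Y, '.', x, mNW, '-')
```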
In the univariate case, these estimators depend on a bandwidth, which is a smoothing parameter controlling the smoothness of the estimated curve, and on a kernel, which is considered as a weight function.

3.1 Choosing the shape of the kernel

The choice of the kernel does not influence the asymptotic behavior of the estimate as significantly as the bandwidth does. We assume $K \in S_{0,k}$ with the additional assumption that k is even, k > 0. More detailed procedures for choosing the optimal kernel are described in [1].

3.2 Choosing the optimal bandwidth

The choice of the smoothing parameter is a crucial problem in kernel regression. The literature on bandwidth selection is quite extensive, e.g., the monographs [79, 74, 75] and the papers [48, 33, 34, 67, 77, 37, 38, 58, 10]. Although in practice one can try several bandwidths and choose one subjectively, automatic (data-driven) selection procedures could be useful in many situations; see [73] for more examples. Most of these procedures are based on estimating the Average Mean Square Error. They are asymptotically equivalent and asymptotically unbiased (see [48, 33, 34]). However, in simulation studies ([58]) it is often observed that most selectors are biased toward undersmoothing and yield smaller bandwidths more frequently than predicted by asymptotic results. As a part of our research we developed two methods for optimal bandwidth selection.

3.2.1 Plug-in method

In the simulation study of [33], it was observed that standard criteria give smaller bandwidths more frequently than predicted by the asymptotic theorems. [33] provided an explanation for the cause and suggested a procedure to overcome the difficulty. By applying this procedure, we have introduced a method for bandwidth selection which gives much more stable bandwidth estimates (see [10]). As a result, we have obtained a type of plug-in method. Our ideas are based on the assumption of a "cyclic design", that is, we suppose m to be a smooth periodic function and the estimate is obtained by applying the kernel to the extended series $Y_i$, $i = -n+1, -n+2, \dots, 2n$, where generally $Y_{j+ln} = Y_j$ for $j = 1, \dots, n$ and $l \in \mathbb{Z}$. Similarly $x_i = i/n$, $i = -n+1, -n+2, \dots, 2n$. The main result of the paper [10] is the plug-in estimator of the optimal bandwidth h

$$\hat h_{PI} = \left( \frac{\hat\sigma^2\, V(K)\, (k!)^2}{2kn\, \beta_k^2\, \hat A_k} \right)^{\frac{1}{2k+1}}, \qquad (8)$$

where $\hat A_k$ is an estimate of the functional $\int \big(m^{(k)}(x)\big)^2\,dx$.

We would like to point out the computational aspect of the proposed estimator. It has preferable properties compared to the classical methods because there is no need to minimize any error function. Also, the sample size necessary for computing the estimate is far smaller than for classical methods. On the other hand, a minor disadvantage could be the fact that we need a "starting" approximation of the unknown parameter h. We should also note that the proposed method was developed for a rather limited case: the cyclic design.
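Continuing the regression sketch above, the plug-in bandwidth (8) is a closed-form expression once $\hat\sigma^2$ and $\hat A_k$ are available. The fragment below is a hedged illustration for k = 2 and the Epanechnikov kernel ($V(K) = 3/5$, $\beta_2 = 1/5$); it uses a simple difference-based variance estimate and a placeholder value for $\hat A_2$, not the estimators actually constructed in [10].

```matlab
% Plug-in bandwidth (8) for k = 2 and the Epanechnikov kernel,
% where V(K) = 3/5 and beta_2 = 1/5. Sketch only: sigma2 is a simple
% difference-based variance estimate and A2hat is a placeholder for an
% estimate of int (m''(x))^2 dx; [10] builds its own estimators.
k = 2; VK = 3/5; beta2 = 1/5;
sigma2 = sum(diff(Y).^2)/(2*(n-1));   % difference-based estimate of sigma^2
A2hat  = 8*pi^4;                      % true value of int (m'')^2 for m = sin(2*pi*x)
hPI = (sigma2*VK*factorial(k)^2 ./ ...
      (2*k*n*beta2^2*A2hat)).^(1/(2*k+1));
```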
3.2.2 Iterative method

Successful approaches to bandwidth selection in kernel density estimation can be transferred to the case of kernel regression. An iterative method for kernel density estimation was developed and widely discussed in [54]. The ideas of this paper were extended to the regression case. The obtained selector was introduced in [7] and its statistical properties were derived in [3]. The proposed method is based on an optimally balanced relation between the integrated variance and the integrated square bias

$$AIV\{\hat m(\cdot, h_{opt})\} - 2k\, AISB\{\hat m(\cdot, h_{opt})\} = 0, \qquad (9)$$

where

$$AIV\{\hat m(\cdot, h)\} = \frac{\sigma^2 V(K)}{nh} \quad \text{and} \quad AISB\{\hat m(\cdot, h)\} = \frac{1}{n} \sum_{i=1}^{n} \big( E\hat m(x_i, h) - m(x_i) \big)^2.$$

The main idea consists in finding a fixed point of the equation

$$h = \frac{\hat\sigma^2\, V(K)}{2kn\, \widehat{AISB}\{\hat m(\cdot, h)\}}. \qquad (10)$$

We use Steffensen's iterative method with the starting approximation $\hat h_0 = 2/n$. This approach leads to an iterative quadratically convergent process (see [54]).
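Equation (10) is of fixed-point form $h = g(h)$, and Steffensen's method accelerates the iteration without derivatives. The sketch below shows the Steffensen step itself; the stand-in function g is only a placeholder for the right-hand side of (10), whose $\widehat{AISB}$ term from [7] is not reproduced here.

```matlab
% Steffensen iteration for a fixed-point equation h = g(h). In the
% method of Section 3.2.2, g(h) would be the right-hand side of (10);
% the g below is only a stand-in so that the sketch runs.
g = @(h) 0.5*(h + 0.1./sqrt(h));      % placeholder contraction, not from [7]
h = 2/n;                              % starting approximation h_0 = 2/n
for it = 1:50
    g1 = g(h);  g2 = g(g1);
    denom = g2 - 2*g1 + h;
    if abs(denom) < eps, break, end
    hNew = h - (g1 - h)^2/denom;      % Steffensen (Aitken) update
    if abs(hNew - h) < 1e-10, h = hNew; break, end
    h = hNew;
end
hIT = h;                              % iterative bandwidth estimate
```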
3.3 Kernel regression for correlated data

As mentioned above, the literature on bandwidth selection methods is quite extensive in the case of independent observations. Nevertheless, if an autocorrelation structure of errors occurs in the data, classical bandwidth selectors do not always provide applicable results (see [35]). There exist several possibilities for overcoming the effect of dependence on the bandwidth selection. In the paper [4] we used the results of [35] and [10] and developed a new flexible plug-in approach for estimating the optimal smoothing parameter. The utility of the method was illustrated through a simulation study and an application to the time series of ozone data obtained from the Vernadsky station in Antarctica.

4 Boundary effects in kernel estimation

In practical processing we encounter data which are bounded in some interval. The quality of the estimate in the boundary region is affected, since the "effective" window [x − h, x + h] does not belong to this interval, so the finite equivalent of the moment conditions on the kernel function no longer applies. This phenomenon is called the boundary effect. There are several methods to cope with boundary effects. One of them is based on the construction of special boundary kernels. Their construction was described in detail, for instance, in [63] or [51]. These kernels can be used successfully in kernel regression, but their use in density or distribution function estimation often gives inappropriate results. Although there is a vast literature on boundary correction in the density estimation context, the boundary effects problem in the distribution function and regression function context has been studied less. Thus we focused our research on these areas of kernel smoothing.

4.1 Boundary effects in kernel regression

If the support of the true regression curve is bounded, then most nonparametric methods give estimates that are severely biased in regions near the endpoints. To be specific, the bias of $\hat m(x)$ is of order $O(h)$ rather than $O(h^2)$ for $x \in [0, h] \cup [1-h, 1]$. This boundary problem affects the global performance visually and also in terms of a slower rate of convergence in the usual asymptotic analysis. It has been recognized as a serious problem and many works are devoted to reducing the effects. [44, 45, 46] and [63] discussed boundary kernel methods. Another approach to the boundary problem is reflection methods, which generally consist in reflecting the data about the boundary points and then estimating the regression function. These methods were discussed, e.g., in [69, 47]. The reflection principles used in kernel density estimation can also be adapted to kernel regression. The regression estimator with the assumption of the "cyclic" model described in [10] can also be considered a special case of a reflection technique. A short comparative study of methods for eliminating boundary effects was given in [25].

4.2 Boundary effects in kernel estimation of a distribution function

We have also focused on boundary correction in kernel estimation of a cumulative distribution function (CDF), which is important for other applications – especially for kernel estimation of ROC curves and hazard functions. In the paper [9], we developed a new kernel type estimator of the ROC curve that removes boundary effects near the endpoints of the support. The estimator is based on a new boundary corrected kernel estimator of distribution functions and builds on the ideas of [56, 57], developed for boundary correction in kernel density estimation. The basic technique of construction of the proposed estimator is a type of generalized reflection method involving reflecting a transformation of the observed data. In fact, the proposed method generates a class of boundary corrected estimators. We have derived expressions for the bias and variance of the proposed estimator. Furthermore, the proposed estimator has been compared with the "classical estimator" using simulation studies. Using similar ideas as in [9], we have developed a new kernel estimator of the hazard function. The method was proposed in [8]; it successfully removes boundary effects and performs considerably better than classical estimators.

5 Kernel estimation and reliability assessment

The following part of our research is focused on studying discrimination measures used for assessing how well models separate the two classes in a binary classification system. There are many possible ways of measuring the performance of classification rules. It is often very helpful to know a way of displaying and summarizing performance over a wide range of conditions. This aim is fulfilled by the ROC (Receiver Operating Characteristic) curve. It is a single curve summarizing the distribution functions of the scores of the two classes. In our research, we have followed the financial sphere, where the discrimination power of scoring models is evaluated. However, most of the studied indices have wide application in many other areas where models with binary output are used, like biology, medicine, engineering and so on. References on this topic are quite extensive, see, e.g., [72, 29, 78]. In [5], we summarized the most important quality measures and gave some alternatives to them. All of the mentioned indices are based on the density or on the distribution function; therefore one can suggest the technique of kernel smoothing for their estimation. More detailed studies on all indices can also be found, e.g., in [20, 21]. Finally, a new conservative approach to quality assessment was proposed in [18].

6 Multivariate kernel density estimation

An important part of our research is devoted to the extension of the univariate kernel density estimate to the multivariate setting. Let a d-variate random sample $X_1, \dots, X_n$ be drawn from a density f. The kernel density estimator $\hat f$ at the point $x \in \mathbb{R}^d$ is defined as

$$\hat f(x, H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i), \qquad (11)$$

where K is a kernel function, which is often taken to be a d-variate symmetric probability density function, H is a d × d symmetric positive definite matrix, and $K_H$ is the scaled kernel function

$$K_H(x) = |H|^{-1/2} K(H^{-1/2} x),$$

with |H| the determinant of the matrix H. In the univariate case, kernel estimates depend on a bandwidth, which is a smoothing parameter controlling the smoothness of the estimated curve, and on a kernel, which is considered as a weight function.
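As an illustration of (11), the following MATLAB sketch evaluates a bivariate estimate with a full bandwidth matrix. For the Gaussian kernel $K = \varphi_I$ one has $K_H = \varphi_H$, the $N(0, H)$ density, since $|H|^{-1/2}\varphi_I(H^{-1/2}x) = \varphi_H(x)$. The fragment assumes the Statistics Toolbox functions mvnrnd and mvnpdf and is not the toolbox [59] code.

```matlab
% Bivariate kernel density estimate (11) with a full bandwidth matrix H.
% For the Gaussian kernel, K_H equals the N(0,H) density phi_H.
X = mvnrnd([0 0], [1 0.7; 0.7 1], 200);   % sample: n = 200, d = 2
H = [0.09 0.03; 0.03 0.09];               % a symmetric positive definite H
fhat = @(x) mean(mvnpdf(X, x, H));        % estimator (11) at a point x (1-by-2)
[u, v] = meshgrid(-3:0.1:3);
z = arrayfun(@(a, b) fhat([a b]), u, v);  % evaluate on a grid
contour(u, v, z)
```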
The choice of the smoothing parameter is a crucial problem in kernel density estimation. The literature on bandwidth selection is quite extensive, e.g., the monographs [79], [74], [75] and the papers [61], [65], [71], [55], [30]. As far as the kernel estimation of density derivatives is concerned, this problem has received significantly less attention. In the paper [50], an adaptation of the least squares cross-validation method was proposed for the bandwidth choice in kernel density derivative estimation. In the paper [52], an automatic procedure for the simultaneous choice of the bandwidth, the kernel and its order for kernel density and density derivative estimates was proposed. But this procedure can only be applied when the explicit minimum of the Asymptotic Mean Integrated Square Error of the estimate is available. It is known that this minimum exists only for d = 2 and a diagonal matrix H. In the paper [6], the basic formula for the corresponding procedure was given.

The need for nonparametric density estimates for recovering the structure in multivariate data is greater, since parametric modelling is more difficult than in the univariate case. The extension of the univariate kernel methodology is not without problems. The most general smoothing parameterization of the kernel estimator in d dimensions requires the specification of all entries of the d × d positive definite bandwidth matrix. The multivariate kernel density estimator we have dealt with is a direct extension of the univariate estimator (see, e.g., [79]).

Successful approaches to univariate bandwidth selection can be transferred to the multivariate setting. The least squares cross-validation and plug-in methods in the multivariate case were developed and widely discussed in the papers [31], [42], [41], [68], [39]. Some papers (e.g., [23], [6], [19]) focused on constrained parameterizations of the bandwidth matrix, such as a diagonal matrix. It is a well-known fact that visualization is an important component of nonparametric data analysis. In the paper [6], this effective strategy was used to clarify the process of the bandwidth matrix choice using bivariate functional surfaces. The paper [53] brought a short communication on a kernel gradient estimator. Tarn Duong's PhD thesis ([39]) provided a comprehensive survey of bandwidth matrix selection methods for kernel density estimation. The papers [32], [40] investigated general density derivative estimators, i.e., kernel estimators of multivariate density derivatives using general (or unconstrained) bandwidth matrix selectors. They defined the kernel estimator of the multivariate density derivative and provided results for the Mean Integrated Square Error convergence asymptotically and for finite samples. Moreover, the relationship between the convergence rate and the bandwidth matrix was established there. They also developed estimates for the class of normal mixture densities.

We have followed the mentioned papers and in [2] we proposed a new data-driven bandwidth matrix selection method. This method is based on an optimally balanced relation between the integrated variance and the integrated squared bias, see [54]. Similar ideas have been applied to kernel estimates of regression functions (see [7] or [3]). We have discussed the statistical properties and relative rates of convergence of the proposed method as well.
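For comparison with the selectors discussed above, the least squares cross-validation criterion for the density itself can be written, with the Gaussian kernel, as $CV(H) = n^{-2}\sum_{i,j}\varphi_{2H}(X_i - X_j) - \frac{2}{n(n-1)}\sum_{i\neq j}\varphi_H(X_i - X_j)$, using $K_H * K_H = \varphi_{2H}$. A minimal MATLAB sketch follows (again assuming the Statistics Toolbox and the sample X from the sketch above); for brevity it searches only over scalar multiples of the sample covariance, whereas the full selectors search over all of $\mathcal{H}$.

```matlab
% Least squares cross-validation for the density estimate (11) with the
% Gaussian kernel. Sketch only: H is restricted to c*cov(X), c > 0.
S = cov(X);
copt = fminbnd(@(c) lscv(X, c*S), 1e-3, 1);
Hcv = copt*S;                                    % selected bandwidth matrix

function val = lscv(X, H)
    n = size(X, 1);  d = size(X, 2);
    T1 = 0;  T2 = 0;
    for i = 1:n
        T1 = T1 + sum(mvnpdf(X, X(i,:), 2*H));   % includes the i = j terms
        T2 = T2 + sum(mvnpdf(X, X(i,:), H));
    end
    T2 = T2 - n*mvnpdf(zeros(1,d), zeros(1,d), H);  % drop the i = j terms
    val = T1/n^2 - 2*T2/(n*(n-1));
end
```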
7 The monograph

The knowledge obtained in our research on kernel smoothing theory has resulted in the monograph [1]. The book provides a brief comprehensive overview of the statistical theory. We do not concentrate on details, since there exist a number of excellent monographs developing the statistical theory ([79, 48, 62, 74, 75, 70] etc.). Instead, the emphasis is put on the implementation of the presented methods in MATLAB. All created programs are included in a special toolbox which is an integral part of the book. This toolbox contains many MATLAB scripts useful for kernel smoothing of a density, distribution function, regression function, hazard function and multivariate density, and also for kernel estimation in reliability assessment. The toolbox can be downloaded from a public web page (see [59]). The toolbox is divided into six parts according to the chapters of the book. All scripts are included in a user interface which is easy to work with. Each chapter of the book contains a detailed help for the related part of the toolbox.

The monograph is intended for newcomers to the field of smoothing techniques, and it is also appropriate for a wide audience: advanced graduate and PhD students and researchers from both statistical science and interface disciplines.

8 Conclusion and further research

The previous text summarizes all our results in kernel smoothing, which belongs to a general category of techniques for nonparametric curve estimation. We have studied several parts of kernel smoothing theory. The most interesting theoretical results were obtained in multivariate kernel estimation and in the choice of the optimal smoothing parameter. We have also paid attention to the use of our results in many fields of applied sciences like environmetrics, biometrics, medicine or econometrics. Thus our works deal not only with the theoretical background of the considered problems but also with the application to real data.

In further research we would like to aim at extending our previous results to modeling functional data sets. A functional data set can be defined as observations of a random variable which takes values in an infinite dimensional space (a functional space). Thus the analysis of functional data seems to be a natural extension of our ideas. For more about functional data analysis see, e.g., [43].

References

Publications Included in the Thesis

[1] I. Horová, J. Koláček, and J. Zelinka, Kernel Smoothing in MATLAB: Theory and Practice of Kernel Smoothing. Singapore: World Scientific Publishing Co. Pte. Ltd., 2012.

[2] I. Horová, J. Koláček, and K. Vopatová, “Full bandwidth matrix selectors for gradient kernel density estimate,” Computational Statistics & Data Analysis, vol. 57, no. 1, pp. 364–376, 2013.

[3] J. Koláček and I. Horová, “Selection of bandwidth for kernel regression,” Communications in Statistics – Theory and Methods, to appear.

[4] I. Horová, J. Koláček, and D. Lajdová, “Kernel regression model for total ozone data,” Journal of Environmental Statistics, vol. 4, no. 2, pp. 1–12, 2013.

[5] M. Řezáč and J. Koláček, “Lift-based quality indexes for credit scoring models as an alternative to Gini and KS,” Journal of Statistics: Advances in Theory and Applications, vol. 7, no. 1, pp. 1–23, 2012.

[6] I. Horová, J. Koláček, and K. Vopatová, “Visualization and bandwidth matrix choice,” Communications in Statistics – Theory and Methods, vol. 41, no. 4, pp. 759–777, 2012.

[7] J. Koláček and I. Horová, “Iterative bandwidth method for kernel regression,” Journal of Statistics: Advances in Theory and Applications, vol. 8, no. 2, pp. 91–103, 2012.
[8] J. Koláček and R. J. Karunamuni, “A generalized reflection method for kernel distribution and hazard functions estimation,” Journal of Applied Probability and Statistics, vol. 6, no. 2, pp. 73–85, 2011.

[9] J. Koláček and R. J. Karunamuni, “On boundary correction in kernel estimation of ROC curves,” Austrian Journal of Statistics, vol. 38, no. 1, pp. 17–32, 2009.

[10] J. Koláček, “Plug-in method for nonparametric regression,” Computational Statistics, vol. 23, no. 1, pp. 63–78, 2008.

Other Publications of the Author

[11] K. Vopatová, I. Horová, and J. Koláček, “Bandwidth matrix selectors for multivariate kernel density estimation,” in Theoretical and Applied Issues in Statistics and Demography, pp. 123–130, Barcelona: International Society for the Advancement of Science and Technology (ISAST), 2013.

[12] K. Konečná, I. Horová, and J. Koláček, “Conditional density estimations,” in Theoretical and Applied Issues in Statistics and Demography, pp. 39–45, Barcelona: International Society for the Advancement of Science and Technology (ISAST), 2013.

[13] D. Lajdová, J. Koláček, and I. Horová, “Kernel regression model with correlated errors,” in Theoretical and Applied Issues in Statistics and Demography, pp. 81–88, Barcelona: International Society for the Advancement of Science and Technology (ISAST), 2013.

[14] M. Trhlík, R. Soumarová, P. Bartoš, M. Těžká, J. Koláček, K. Vopatová, I. Horová, and P. Šupíková, “Neoadjuvant chemotherapy for primary advanced ovarian cancer,” in The International Journal of Gynecological Cancer – October 2012, vol. 22, issue 8, supplement 3, E517, 2013.

[15] I. Horová, J. Koláček, K. Vopatová, and J. Zelinka, “Contribution to bandwidth matrix choice for multivariate kernel density estimate,” in Proceedings of the 58th World Statistics Congress, ISI 2011, 2011.

[16] M. Řezáč and J. Koláček, “Adjusted empirical estimate of information value for credit scoring models,” in Proceedings ASMDA 2011, (Rome), pp. 1162–1169, Edizioni ETS, 2011.

[17] J. Koláček and M. Řezáč, “Quality measures for predictive scoring models,” in Proceedings ASMDA 2011 (R. Manca and C. H. Skiadas, eds.), (Rome, Italy), pp. 720–727, Edizioni ETS, 2011.

[18] J. Koláček and M. Řezáč, “A conservative approach to assessment of discriminatory models,” in Workshop of the Jaroslav Hájek Center and Financial Mathematics in Practice I, Book of Short Papers (I. Horová and J. Zelinka, eds.), (Brno), pp. 30–36, Masaryk University, 2011.

[19] K. Vopatová, I. Horová, and J. Koláček, “Bandwidth matrix choice for bivariate kernel density derivative,” in Proceedings of the 25th International Workshop on Statistical Modelling, (Glasgow, UK), pp. 561–564, 2010.

[20] J. Koláček and M. Řezáč, “Assessment of scoring models using information value,” in 19th International Conference on Computational Statistics, Paris, France, August 22–27, 2010: Keynote, Invited and Contributed Papers, (Paris), pp. 1191–1198, SpringerLink, 2010.

[21] M. Řezáč and J. Koláček, “On aspects of quality indexes for scoring models,” in 19th International Conference on Computational Statistics, Paris, France, August 22–27, 2010: Keynote, Invited and Contributed Papers, (Paris), pp. 1517–1524, SpringerLink, 2010.

[22] I. Horová, J. Koláček, J. Zelinka, and A. H. El-Shaarawi, “Smooth estimates of distribution functions with application in environmental studies,” in Advanced Topics on Mathematical Biology and Ecology, (Mexico), pp. 122–127, WSEAS Press, 2008.
[23] I. Horová, J. Koláček, J. Zelinka, and K. Vopatová, “Bandwidth choice for kernel density estimates,” in Proceedings IASC, (Yokohama), pp. 542–551, IASC, 2008.

[24] J. Koláček, “An improved estimator for removing boundary bias in kernel cumulative distribution function estimation,” in Proceedings in Computational Statistics COMPSTAT’08, (Porto), pp. 549–556, Physica-Verlag, 2008.

[25] J. Koláček and J. Poměnková, “A comparative study of boundary effects for kernel smoothing,” Austrian Journal of Statistics, vol. 35, no. 2, pp. 281–289, 2006.

[26] J. Koláček, “Use of Fourier transformation for kernel smoothing,” in Proceedings in Computational Statistics COMPSTAT’04, pp. 1329–1336, Springer, 2004.

[27] J. Koláček, “Some stabilized bandwidth selectors for nonparametric regression,” Journal of Electrical Engineering, vol. 54, no. 12, pp. 65–68, 2003.

[28] J. Koláček, “Problems of automatic data-driven bandwidth selectors for nonparametric regression,” Journal of Electrical Engineering, vol. 53, no. 12, pp. 48–51, 2002.

Other References

[29] R. Anderson. The credit scoring toolkit: theory and practice for retail credit risk management and decision automation. Oxford University Press, 2007.

[30] R. Cao, A. Cuevas, and W. González Manteiga. A comparative study of several smoothing methods in density estimation. Computational Statistics and Data Analysis, 17(2):153–176, 1994.

[31] J. E. Chacón and T. Duong. Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. Test, 19(2):375–398, 2010.

[32] J. E. Chacón, T. Duong, and M. P. Wand. Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica, 21(2):807–840, 2011.

[33] S. Chiu. Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika, 77(1):222–226, 1990.

[34] S. Chiu. Some stabilized bandwidth selectors for nonparametric regression. Annals of Statistics, 19(3):1528–1546, 1991.

[35] C. K. Chu and J. S. Marron. Choosing a kernel regression estimator. Statistical Science, 6(4):404–419, 1991.

[36] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.

[37] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1979.

[38] B. Droge. Some comments on cross-validation. Technical Report 1994-7, Humboldt Universität Berlin, 1996.

[39] T. Duong. Bandwidth selectors for multivariate kernel density estimation. PhD thesis, School of Mathematics and Statistics, University of Western Australia, October 2004.

[40] T. Duong, A. Cowling, I. Koch, and M. P. Wand. Feature significance for multivariate kernel density estimation. Computational Statistics & Data Analysis, 52(9):4225–4242, 2008.

[41] T. Duong and M. Hazelton. Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. Journal of Multivariate Analysis, 93(2):417–433, 2005.

[42] T. Duong and M. Hazelton. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3):485–506, 2005.

[43] F. Ferraty and P. Vieu. Nonparametric functional data analysis: theory and practice. Springer, 2006.

[44] T. Gasser and H.-G. Müller. Kernel estimation of regression functions. In T. Gasser and M. Rosenblatt, editors, Smoothing Techniques for Curve Estimation, volume 757 of Lecture Notes in Mathematics, pages 23–68. Springer Berlin / Heidelberg, 1979.
[45] T. Gasser, H.-G. Müller, and V. Mammitzsch. Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, Series B (Methodological), 47(2):238–252, 1985.

[46] B. Granovsky and H.-G. Müller. Optimizing kernel methods: a unifying variational principle. International Statistical Review, 59(3):373–388, 1991.

[47] P. Hall and T. E. Wehrly. A geometrical method for removing edge effects from kernel-type nonparametric regression estimators. Journal of the American Statistical Association, 86(415):665–672, 1991.

[48] W. Härdle. Applied Nonparametric Regression. Cambridge University Press, Cambridge, 1st edition, 1990.

[49] W. Härdle, P. Hall, and J. Marron. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83(401):86–95, 1988.

[50] W. Härdle, J. S. Marron, and M. P. Wand. Bandwidth choice for density derivatives. Journal of the Royal Statistical Society, Series B (Methodological), 52(1):223–232, 1990.

[51] I. Horová. Boundary kernels. In Summer Schools MATLAB 94, 95, pages 17–24. Brno: Masaryk University, 1997.

[52] I. Horová, P. Vieu, and J. Zelinka. Optimal choice of nonparametric estimates of a density and of its derivatives. Statistics & Decisions, 20(4):355–378, 2002.

[53] I. Horová and K. Vopatová. Kernel gradient estimate. In F. Ferraty, editor, Recent Advances in Functional Data Analysis and Related Topics, pages 177–182. Springer-Verlag Berlin Heidelberg, 2011.

[54] I. Horová and J. Zelinka. Contribution to the bandwidth choice for kernel density estimates. Computational Statistics, 22(1):31–47, 2007.

[55] M. C. Jones and R. F. Kappenman. On a class of kernel density estimate bandwidth selectors. Scandinavian Journal of Statistics, 19(4):337–349, 1991.

[56] R. Karunamuni and T. Alberts. A generalized reflection method of boundary correction in kernel density estimation. Canadian Journal of Statistics, 33:497–509, 2005.

[57] R. Karunamuni and S. Zhang. Some improvements on a boundary corrected kernel density estimator. Statistics & Probability Letters, 78:497–507, 2008.

[58] J. Koláček. Kernel Estimation of the Regression Function (in Czech). PhD thesis, Masaryk University, Brno, February 2005.

[59] J. Koláček and J. Zelinka. MATLAB toolbox, 2012.

[60] J. S. Marron and D. Nolan. Canonical kernels for density estimation. Statistics & Probability Letters, 7(3):195–199, 1988.

[61] J. S. Marron and D. Ruppert. Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological), 56(4):653–671, 1994.

[62] H.-G. Müller. Nonparametric regression analysis of longitudinal data. Springer, New York, 1988.

[63] H.-G. Müller. Smooth optimum kernel estimators near endpoints. Biometrika, 78(3):521–530, 1991.

[64] E. A. Nadaraya. On estimating regression. Theory of Probability and its Applications, 9(1):141–142, 1964.

[65] B. Park and J. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85(409):66–72, 1990.

[66] M. B. Priestley and M. T. Chao. Non-parametric function fitting. Journal of the Royal Statistical Society, Series B (Methodological), 34(3):385–392, 1972.

[67] J. Rice. Bandwidth choice for nonparametric regression. Annals of Statistics, 12(4):1215–1230, 1984.

[68] S. Sain, K. Baggerly, and D. Scott. Cross-validation of multivariate densities. Journal of the American Statistical Association, 89(427):807–817, 1994.

[69] E. Schuster. Incorporating support constraints into nonparametric estimators of densities. Communications in Statistics – Theory and Methods, 14(5):1123–1136, 1985.
[70] D. W. Scott. Multivariate density estimation: theory, practice, and visualization. Wiley, 1992.

[71] D. W. Scott and G. R. Terrell. Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association, 82(400):1131–1146, 1987.

[72] N. Siddiqi. Credit risk scorecards: developing and implementing intelligent credit scoring. Wiley and SAS Business Series. Wiley, 2006.

[73] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B (Methodological), 47:1–52, 1985.

[74] B. W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, London, 1986.

[75] J. S. Simonoff. Smoothing Methods in Statistics. Springer-Verlag, New York, 1996.

[76] C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.

[77] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), 36(2):111–147, 1974.

[78] L. Thomas. Consumer credit models: pricing, profit, and portfolios. Oxford University Press, 2009.

[79] M. Wand and M. Jones. Kernel smoothing. Chapman and Hall, London, 1995.

[80] G. S. Watson. Smooth regression analysis. Sankhya: The Indian Journal of Statistics, Series A, 26(4):359–372, 1964.

Reprints of articles

Computational Statistics and Data Analysis 57 (2013) 364–376, doi:10.1016/j.csda.2012.07.006

Full bandwidth matrix selectors for gradient kernel density estimate

Ivana Horová (a), Jan Koláček (a), Kamila Vopatová (b)
(a) Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic
(b) Department of Econometrics, University of Defence, Brno, Czech Republic

Article history: Received 4 July 2011; Received in revised form 2 July 2012; Accepted 5 July 2012; Available online 10 July 2012

Keywords: Asymptotic mean integrated square error; Multivariate kernel density; Unconstrained bandwidth matrix

Abstract: The most important factor in multivariate kernel density estimation is a choice of a bandwidth matrix. This choice is particularly important, because of its role in controlling both the amount and the direction of multivariate smoothing. Considerable attention has been paid to constrained parameterization of the bandwidth matrix such as a diagonal matrix or a pre-transformation of the data. A general multivariate kernel density derivative estimator has been investigated. Data-driven selectors of full bandwidth matrices for a density and its gradient are considered. The proposed method is based on an optimally balanced relation between the integrated variance and the integrated squared bias. The analysis of statistical properties shows the rationale of the proposed method. In order to compare this method with cross-validation and plug-in methods the relative rate of convergence is determined. The utility of the method is illustrated through a simulation study and real data applications. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

Kernel density estimates are one of the most popular nonparametric estimates.
In a univariate case, these estimates depend on a bandwidth, which is a smoothing parameter controlling smoothness of an estimated curve, and a kernel which is considered as a weight function. The choice of the smoothing parameter is a crucial problem in the kernel density estimation. The literature on bandwidth selection is quite extensive, e.g., monographs Wand and Jones (1995), Silverman (1986) and Simonoff (1996), papers Marron and Ruppert (1994), Park and Marron (1990), Scott and Terrell (1987), Jones and Kappenman (1991) and Cao et al. (1994).

As far as the kernel estimate of density derivatives is concerned, this problem has received significantly less attention. In the paper Härdle et al. (1990), an adaptation of the least squares cross-validation method is proposed for the bandwidth choice in the kernel density derivative estimation. In the paper Horová et al. (2002), an automatic procedure for the simultaneous choice of the bandwidth, the kernel and its order for kernel density and density derivative estimates was proposed. But this procedure can only be applied when the explicit minimum of the Asymptotic Mean Integrated Square Error of the estimate is available. It is known that this minimum exists only for d = 2 and a diagonal matrix H. In the paper Horová et al. (2012), the basic formula for the corresponding procedure is given.

The need for nonparametric density estimates for recovering structure in multivariate data is greater since a parametric modeling is more difficult than in the univariate case. The extension of the univariate kernel methodology is not without its problems. The most general smoothing parameterization of the kernel estimator in d dimensions requires the specification of all entries of the d × d positive definite bandwidth matrix. The multivariate kernel density estimator we are going to deal with is a direct extension of the univariate estimator (see, e.g., Wand and Jones (1995)).

Successful approaches to the univariate bandwidth selection can be transferred to the multivariate settings. The least squares cross-validation and plug-in methods in the multivariate case have been developed and widely discussed in the papers Chacón and Duong (2010), Duong and Hazelton (2005b,a), Sain et al. (1994) and Duong (2004). Some papers (e.g., Horová et al. (2008, 2012) and Vopatová et al. (2010)) have been focused on constrained parameterization of the bandwidth matrix such as a diagonal matrix. It is a well-known fact that a visualization is an important component of the nonparametric data analysis. In the paper Horová et al. (2012), this effective strategy was used to clarify the process of the bandwidth matrix choice using bivariate functional surfaces. The paper Horová and Vopatová (2011) brings a short communication on a kernel gradient estimator. Tarn Duong's PhD thesis (Duong, 2004) provides a comprehensive survey of bandwidth matrix selection methods for kernel density estimation. Papers Chacón et al. (2011) and Duong et al.
(2008) investigated general density derivative estimators, i.e., kernel estimators of multivariate density derivatives using general (or unconstrained) bandwidth matrix selectors. They defined the kernel estimator of the multivariate density derivative and provided results for the Mean Integrated Square Error convergence asymptotically and for finite samples. Moreover, the relationship between the convergence rate and the bandwidth matrix has been established here. They also developed estimates for the class of normal mixture densities.

The paper is organized as follows: In Section 2 we describe kernel estimates of a density and its gradient and give a form of the Mean Integrated Square Error and the exact MISE calculation for a d-variate normal kernel as well. The next sections are devoted to a data-driven bandwidth matrix selection method. This method is based on an optimally balanced relation between the integrated variance and the integrated squared bias, see Horová and Zelinka (2007a). Similar ideas were applied to kernel estimates of hazard functions (see Horová et al. (2006) or Horová and Zelinka (2007b)). It seems that the basic idea can be also extended to a kernel regression and we are going to investigate this possibility. We discuss the statistical properties and relative rates of convergence of the proposed method as well. Section 5 brings a simulation study and in the last section the developed theory is applied to real data sets.

2. Estimates of a density and its gradient

Let a d-variate random sample $X_1, \dots, X_n$ be drawn from a density f. The kernel density estimator $\hat f$ at the point $x \in \mathbb{R}^d$ is defined as

$$\hat f(x, H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i), \qquad (1)$$

where K is a kernel function, which is often taken to be a d-variate symmetric probability function, H is a d × d symmetric positive definite matrix and $K_H$ is the scaled kernel function $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$ with |H| the determinant of the matrix H. The kernel estimator of the gradient Df at the point $x \in \mathbb{R}^d$ is

$$\widehat{Df}(x, H) = \frac{1}{n} \sum_{i=1}^{n} DK_H(x - X_i), \qquad (2)$$

where $DK_H(x) = |H|^{-1/2} H^{-1/2} DK(H^{-1/2} x)$ and DK is the column vector of the partial derivatives of K. Since we aim to investigate both the density itself and its gradient in a similar way, we introduce the notation

$$\widehat{D^r f}(x, H) = \frac{1}{n} \sum_{i=1}^{n} D^r K_H(x - X_i), \quad r = 0, 1, \qquad (3)$$

where $D^0 f = f$, $D^1 f = Df$. We make some additional assumptions and notations:

(A1) The kernel function K satisfies the moment conditions
$$\int K(x)\,dx = 1, \quad \int x K(x)\,dx = \mathbf{0}, \quad \int x x^T K(x)\,dx = \beta_2 I_d,$$
where $I_d$ is the d × d identity matrix.

(A2) $H = H_n$ is a sequence of bandwidth matrices such that $n^{-1/2}|H|^{-1/2}(H^{-1})^r$, $r = 0, 1$, and the entries of H approach zero ($(H^{-1})^0$ is considered as equal to 1).

(A3) Each partial density derivative of order r + 2, r = 0, 1, is continuous and square integrable.

(N1) $\mathcal{H}$ is a class of d × d symmetric positive definite matrices.

(N2) $V(\rho) = \int_{\mathbb{R}} \rho^2(x)\,dx$ for any square integrable scalar valued function ρ.

(N3) $V(g) = \int_{\mathbb{R}^d} g(x) g^T(x)\,dx$ for any square integrable vector valued function g. In the rest of the text, $\int$ stands for $\int_{\mathbb{R}^d}$ unless it is stated otherwise.

(N4) $DD^T = D^2$ is a Hessian operator. Expressions like $DD^T = D^2$ involve ‘‘multiplications’’ of differentials in the sense that $\frac{\partial}{\partial x_i} \frac{\partial}{\partial x_j} = \frac{\partial^2}{\partial x_i \partial x_j}$. This means that $(D^2)^m$, $m \in \mathbb{N}$, is a matrix of the 2m-th order partial differential operators.

(N5) vecH is a $d^2 \times 1$ vector obtained by stacking the columns of H.
(N6) Let $d^* = d(d+1)/2$; vechH is the $d^* \times 1$ vector-half obtained from vecH by eliminating each of the above-diagonal entries.

(N7) The matrix $D_d$ of size $d^2 \times d^*$ of ones and zeros such that $D_d \mathrm{vech}H = \mathrm{vec}H$ is called the duplication matrix of order d.

(N8) $J_d$ denotes the d × d matrix of ones.

The quality of the estimate $\widehat{D^r f}$ can be expressed in terms of the Mean Integrated Square Error

$$\mathrm{MISE}_r\{\widehat{D^r f}(\cdot, H)\} = E \int \|\widehat{D^r f}(x, H) - D^r f(x)\|^2\,dx,$$

with ∥·∥ standing for the Euclidean norm, i.e., $\|v\|^2 = v^T v = \mathrm{tr}(v v^T)$. For the sake of simplicity we write the argument of $\mathrm{MISE}_r$ as H. This error function can also be expressed through the standard decomposition

$$\mathrm{MISE}_r(H) = \mathrm{IV}_r(H) + \mathrm{ISB}_r(H),$$

where $\mathrm{IV}_r(H) = \int \mathrm{Var}\{\widehat{D^r f}(x, H)\}\,dx$ is the integrated variance and

$$\mathrm{ISB}_r(H) = \int \|E\widehat{D^r f}(x, H) - D^r f(x)\|^2\,dx = \int \Big\| \int K(z)\, D^r f(x - H^{1/2} z)\,dz - D^r f(x) \Big\|^2 dx = \int \|(K_H * D^r f)(x) - D^r f(x)\|^2\,dx$$

is the integrated square bias (the symbol ∗ denotes convolution). Since $\mathrm{MISE}_r$ is not mathematically tractable, we employ the Asymptotic Mean Integrated Square Error. The $\mathrm{AMISE}_r$ theorem has been proved (e.g., in Duong et al. (2008)) and reads as follows:

Theorem 1. Let assumptions (A1)–(A3) be satisfied. Then $\mathrm{MISE}_r(H) \simeq \mathrm{AMISE}_r(H)$, where

$$\mathrm{AMISE}_r(H) = \underbrace{n^{-1} |H|^{-1/2}\, \mathrm{tr}\{(H^{-1})^r V(D^r K)\}}_{\mathrm{AIV}_r} + \underbrace{\frac{\beta_2^2}{4}\, \mathrm{vech}^T H\, \Psi_{4+2r}\, \mathrm{vech}H}_{\mathrm{AISB}_r}. \qquad (4)$$

The term $\Psi_{4+2r}$ involves higher order derivatives of f, and its subscript 4 + 2r, r = 0, 1, indicates the order of the derivatives used. It is a $d^* \times d^*$ symmetric matrix. It can be shown that

$$\int \|\{\mathrm{tr}(H D^2) D^r\} f(x)\|^2\,dx = \mathrm{vech}^T H\, \Psi_{4+2r}\, \mathrm{vech}H.$$

Then (4) can be rewritten as

$$\mathrm{AMISE}_r(H) = n^{-1} |H|^{-1/2}\, \mathrm{tr}\{(H^{-1})^r V(D^r K)\} + \frac{\beta_2^2}{4} \int \|\{\mathrm{tr}(H D^2) D^r\} f(x)\|^2\,dx, \quad r = 0, 1. \qquad (5)$$

Let $K = \varphi_I$ be the d-variate normal kernel and suppose that f is the normal mixture density

$$f(x) = \sum_{l=1}^{k} w_l\, \varphi_{\Sigma_l}(x - \mu_l),$$

where for each $l = 1, \dots, k$, $\varphi_{\Sigma_l}$ is the d-variate $N(0, \Sigma_l)$ normal density and $w = (w_1, \dots, w_k)^T$ is a vector of positive numbers summing to one. In this case, the exact formula for $\mathrm{MISE}_r$ was derived in Chacón et al. (2011). For r = 0, 1 it takes the form

$$\mathrm{MISE}_r(H) = 2^{-r} n^{-1} (4\pi)^{-d/2} |H|^{-1/2} (\mathrm{tr}\,H^{-1})^r + w^T \left[ (1 - n^{-1})\Omega_2 - 2\Omega_1 + \Omega_0 \right] w, \qquad (6)$$

where

$$(\Omega_c)_{ij} = (-1)^r\, \varphi_{cH+\Sigma_{ij}}(\mu_{ij}) \left[ \mu_{ij}^T (cH + \Sigma_{ij})^{-2} \mu_{ij} - 2\,\mathrm{tr}(cH + \Sigma_{ij})^{-1} \right]^r$$

with $\Sigma_{ij} = \Sigma_i + \Sigma_j$, $\mu_{ij} = \mu_i - \mu_j$.

3. Bandwidth matrix selection

The most important factor in multivariate kernel density estimates is the bandwidth matrix H. Because of its role in controlling both the amount and the direction of smoothing, this choice is particularly important. Let $H_{(A)MISE,r}$ stand for a bandwidth matrix minimizing $\mathrm{(A)MISE}_r$, i.e.,

$$H_{MISE,r} = \arg\min_{H \in \mathcal{H}} \mathrm{MISE}_r(H) \quad \text{and} \quad H_{AMISE,r} = \arg\min_{H \in \mathcal{H}} \mathrm{AMISE}_r(H).$$

As has been mentioned in former works (see, e.g., Duong and Hazelton (2005a,b)), the discrepancy between $H_{MISE,r}$ and $H_{AMISE,r}$ is asymptotically negligible in comparison with the random variation in the bandwidth matrix selectors that we consider. The problems of estimating $H_{MISE,r}$ and $H_{AMISE,r}$ are equivalent for most practical purposes. If we denote $D_H = \frac{\partial}{\partial\,\mathrm{vech}H}$, then using matrix differential calculus yields

$$D_H \mathrm{AMISE}_r(H) = -(2n)^{-1} |H|^{-1/2}\, \mathrm{tr}\{(H^{-1})^r V(D^r K)\}\, D_d^T \mathrm{vec}H^{-1} + n^{-1} |H|^{-1/2}\, r \left\{ -D_d^T \mathrm{vec}(H^{-1} V(D^r K) H^{-1}) \right\} + \frac{\beta_2^2}{2}\, \Psi_{4+2r}\, \mathrm{vech}H.$$

Unfortunately, there is no explicit solution of the equation

$$D_H \mathrm{AMISE}_r(H) = 0 \qquad (7)$$

(with the exception of d = 2, r = 0 and a diagonal bandwidth matrix H, see, e.g., Wand and Jones (1995)). But nevertheless the following lemma holds.
Lemma 2. AIVr (HAMISE,r ) = 4 d + 2r AISBr (HAMISE,r ). (8) Proof. See Complements for the proof. It can be shown (Chacón et al., 2011) that HAMISE,r = C0,r n−2/(d+2r+4) = O(n−2/(d+2r+4) Jd) and then AMISEr (HAMISE,r ) is of order n−4/(d+2r+4) . Since HAMISE,r resp. HMISE,r cannot be found in practice, the data-driven methods for selection of H have been proposed in papers Chacón and Duong (2010), Duong (2004), Duong and Hazelton (2005b), Sain et al. (1994) and Wand and Jones (1994) etc.. The performance of bandwidth matrix selectors can be assessed by its relative rate of convergence. We generalize the definition for the relative rate of convergence for the univariate case to the multivariate one. Let Hr be a data-driven bandwidth matrix selector. We say that Hr converges to HAMISE,r with relative rate n−α if vech(Hr − HAMISE,r ) = Op(Jd∗ n−α )vechHAMISE,r . (9) This definition was introduced by Duong (2004). Now, we remind cross-validation methods CVr (H) (Duong and Hazelton, 2005b; Chacón and Duong, 2012) which aim to estimate MISEr . CVr (H) is an unbiased estimate of MISEr (H) − trV(Dr f ) and CVr (H) = (−1)r tr    1 n2 n i,j=1 D2r (KH ∗ KH)(Xi − Xj) − 2 n(n − 1) n i,j=1 i̸=j D2r KH(Xi − Xj)    , (10) HCVr = arg min H∈H CVr (H). It can be shown that the relative rate of convergence to HMISE,r is n−d/(2d+4r+8) (Chacón and Duong, 2012) and to HAMISE,r is n− min{d,4}/(2d+4r+8) (see Duong and Hazelton (2005b) for r = 0). Plug-in methods for the bandwidth matrix selection were generalized to the multivariate case in Wand and Jones (1994). The idea consists of estimating the unknown matrix Ψ4+2r . The relative rate of convergence to HMISE,r and HAMISE,r is the same n−2/(d+2r+6) when d ≥ 2 (see, e.g., Chacón (2010) and Chacón and Duong (2012)). In papers Horová et al. (2008, 2012), a special method for bandwidth matrix selection for a bivariate density for the case of diagonal bandwidth matrix has been developed and the rationale of this method has been explained. This method is based on formula (8). As concerns the bandwidth matrix selection for the kernel gradient estimator, the aforementioned method was extended to this case in Vopatová et al. (2010) and Horová and Vopatová (2011). Because the problem of the bandwidth matrix choice both for density itself and its gradient are closely related one to each other, we address the problem of these choices together. 368 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 4. Proposed method and its statistical properties As mentioned, our method is based on Eq. (8) in the sense that a solution of DHAMISEr (H) = 0 is equivalent to solving Eq. (8). But AISBr (H) depends on the unknown density. Thus we adapt the similar idea as in the univariate case (Horová and Zelinka (2007a)) and use a suitable estimate of AISBr . Eq. (8) can be rewritten as (d + 2r)n−1 |H|−1/2 tr  (H−1 )r V(Dr K)  − β2 2  ∥{tr(HD2 )Dr }f (x)∥2 dx = 0. (11) Let us denote Λ(z) = (K ∗ K ∗ K ∗ K − 2K ∗ K ∗ K + K ∗ K)(z), ΛH(z) = |H|−1/2 Λ(H−1/2 z). Then the estimate of AISBr (H) can be considered as AISBr (H) =  ∥(KH ∗ Dr f )(x, H) − Dr f (x, H)∥2 dx. This estimate involves non-stochastic terms, therefore, according to Taylor (1989), Jones and Kappenman (1991) and Jones et al. (1991), we eliminated these terms and propose an (asymptotically unbiased) estimate AISBr (H) = tr    (−1)r n2 n i,j=1 i̸=j D2r ΛH(Xi − Xj)    . Now, instead of Eq. 
(11) we aim to solve the equation (d + 2r)n−1 |H|−1/2 tr  (H−1 )r V(Dr K)  − 4tr    (−1)r n2 n i,j=1 i̸=j D2r ΛH(Xi − Xj)    = 0. (12) Remark 1. The bandwidth matrix selection method based on Eq. (12) is called the Iterative method (IT method) and the bandwidth estimate is denoted HITr . Remark 2. In the following we assume that K is the standard normal density φI. Thus Λ(z) = φ4I(z) − 2φ3I(z) + φ2I(z) and β2 = 1. We are going to discuss statistical properties of the Iterative method which will show its rationality. Let Γr (H) stand for the left hand side of (11) and Γr (H) for the left hand side of (12). Theorem 3. Let the assumptions (A1) –(A3) be satisfied and K = φI. Then E(Γr (H)) = Γr (H) + o(∥vecH∥5/2 ), Var(Γr (H)) = 32n−2 |H|−1/2 ∥vecH∥−2r V(vecD2r Λ)V(f ) + o(n−2 |H|−1/2 ∥vecH∥−2r ). Proof. For the proof see Complements. As far as the convergence rate of the IT method is concerned, we are inspired with AMSE lemma (Duong, 2004; Duong and Hazelton, 2005a). The following theorem takes place. Theorem 4. Let the assumptions (A1) –(A3) be satisfied and K = φI. Then MSE{vechHITr } = O  n− min{d,4}/(d+2r+4) Jd∗  × vechHAMISE,r vechT HAMISE,r . Proof. Proof of theorem can be found in Complements. Corollary 5. The convergence rate to HAMISE,r is n− min{d,4}/(2d+4r+8) for the IT method. Remark 3. For the r-th derivative the cross-validation method is of order n− min{d,4}/(2d+4r+8) and the plug-in method is of order n−2/(d+2r+6) (with respect to HAMISE,r ). I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 369 5. Computational aspects and simulations Eq. (12) can be rewritten as |HITr |1/2 4 tr    (−1)r n n i,j=1 i̸=j D2r ΛHITr (Xi − Xj)    = (d + 2r)tr  (H−1 ITr )r V  Dr K  . This equation represents a nonlinear equation for d∗ unknown entries of HITr . In order to find all these entries we need additional d∗ − 1 equations. Below, we describe a possibility of obtaining these equations. We adopt a similar idea as in the case of the diagonal matrix (see also Terrell (1990), Scott (1992), Duong et al. (2008) and Horová and Vopatová (2011)). We explain this approach for the case d = 2 with the matrix HITr =  ˆh11,r ˆh12,r ˆh12,r ˆh22,r  . Let Σ be a sample covariance matrix Σ =  ˆσ2 11 ˆσ12 ˆσ12 ˆσ2 22  . The initial estimates of entries of HITr can be chosen as ˆh11,r = ˆh2 1,r = ( ˆσ2 11)(12+r)/12 n(r−4)/12 , ˆh22,r = ˆh2 2,r = ( ˆσ2 22)(12+r)/12 n(r−4)/12 , ˆh12,r = sign ˆσ12| ˆσ12|(12+r)/12 n(r−4)/12 . For details see Horová and Vopatová (2011). Hence ˆh22,r =  ˆσ2 22 ˆσ2 11 (12+r)/12 ˆh11,r , (13) ˆh2 12,r =  ˆσ2 12 ˆσ2 11 (12+r)/12 ˆh11,r (14) and further |HITr | = ˆh2 11,r  ( ˆσ11 ˆσ22)(12+r)/6 − ˆσ (12+r)/6 12   ˆσ (12+r)/3 11 = ˆh2 11,r S( ˆσij). Thus we arrive at the equation for the unknown ˆh11,r 4ˆh11,r  S( ˆσij)tr    (−1)r n n i,j=1 i̸=j D2r ΛHITr (Xi − Xj)    = (d + 2r)tr  (H−1 ITr )r V  Dr K  . (15) This approach is very important for computational aspects of solving Eq. (12). Putting Eqs. (13)–(15) forms one nonlinear equation for the unknown ˆh11,r and it can be solved by means of an appropriate iterative numerical method. This procedure gives the name of the proposed method. Evidently, this approach is computationally much faster than a general minimization process. To test the effectiveness of our estimator, we simulated its performance against the least squares cross-validation method. All simulations and computations were done in MATLAB. 
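To make the computation concrete, the following MATLAB sketch solves Eq. (12) for r = 0 with the standard normal kernel. It is a simplified illustration, not the authors' full procedure: the bandwidth matrix is restricted to the scaled identity H = h^2*I instead of the parametrization (13)-(15), the root is found with fzero rather than with an iterative scheme, and the data, the bracket and all variable names are our own choices.

rng(1);                                    % for reproducibility
n = 100; d = 2;
X = randn(n, d);                           % toy sample; replace with real data

% N(0, cI) density in R^d as a function of the squared norm z2 = ||z||^2
phi = @(z2, c) exp(-z2/(2*c)) / (2*pi*c)^(d/2);
Lam = @(z2) phi(z2,4) - 2*phi(z2,3) + phi(z2,2);   % Lambda = phi_{4I} - 2 phi_{3I} + phi_{2I}

D2  = max(sum(X.^2,2) + sum(X.^2,2).' - 2*(X*X.'), 0);  % squared pairwise distances
off = ~eye(n);                             % i ~= j pairs
VK  = (4*pi)^(-d/2);                       % V(K) for the standard normal kernel

% left-hand side of Eq. (12) for r = 0 with H = h^2*I, so |H|^{-1/2} = h^{-d}
Gamma0 = @(h) d*VK/(n*h^d) - (4/n^2)*sum(Lam(D2(off)/h^2))/h^d;

hIT = fzero(Gamma0, [0.05 2]);             % bracket may need widening for other data
fprintf('h_IT = %.3f\n', hIT);

% kernel density estimate (1) with the selected matrix H = h_IT^2 * I at a point x
fhat = @(x) mean(phi(sum((x - X).^2, 2), hIT^2));
fprintf('fhat([0 0]) = %.4f\n', fhat([0 0]));

The last two lines evaluate the density estimate (1) with the selected matrix; for a full bandwidth matrix, Eqs. (13) and (14) reduce the problem to the single unknown entry in the same way.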
The simulation is based on 100 replications of 6 bivariate normal mixture densities, labeled A–F. Means and covariance matrices of these distributions were generated randomly. Table 1 brings the list of the normal mixture densities. Densities A and B are unimodal, C and D are bimodal and E and F are trimodal. Their contour plots are displayed in Fig. 1. The sample size of n = 100 was used in all replications. We calculated the Integrated Square Error (ISE) ISEr { Dr f (·, H)} =  ∥ Dr f (x, H) − Dr f (x)∥2 dx for each estimated density and its derivative over all 100 replications. The logarithm of results is displayed in Tables 2 and 3 and in Fig. 2. Here ‘‘ITER’’ denotes the results for our proposed method, ‘‘LSCV’’ stands for the results of the Least Squares Cross-validation method (10) and ‘‘MISE’’ is a tag for the results obtained by minimizing (6). Finally, we compared computational times of all methods. Results are listed in Table 4. 370 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 Table 1 Normal mixture densities. Density Formula N(vecT µ, vecT Σ) A N  (−0.2686, −1.7905), (7.9294, −10.0673; −10.0673, 22.1150)  B N  (−0.6847, 2.6963), (16.9022, 9.8173; 9.8173, 6.0090)  C 1 2 N  (0.3151, −1.6877), (0.1783, −0.1821; −0.1821, 1.0116)  + 1 2 N  (1.1768, 0.3731), (0.2414, −0.8834; −0.8834, 4.2934)  D 1 2 N  (1.8569, 0.1897), (1.5023, −0.9259; −0.9259, 0.8553)  + 1 2 N  (0.3349, −0.2397), (2.3050, 0.8895; 0.8895, 1.2977)  E 1 3 N  (0.0564, −0.9041), (0.9648, −0.8582; −0.8582, 0.9332)  + 1 3 N  (−0.7769, 1.6001), (2.8197, −1.4269; −1.4269, 0.9398)  + 1 3 N  (1.0132, 0.4508), (3.9982, −3.7291; −3.7291, 5.5409)  F 1 3 N  (2.2337, −2.9718), (0.6336, −0.9279; −0.9279, 3.1289)  + 1 3 N  (−4.3854, 0.5678), (2.1399, −0.6208; −0.6208, 0.7967)  + 1 3 N  (1.5513, 2.2186), (1.1207, 0.8044; 0.8044, 1.0428)  Fig. 1. Contour plots for target densities. Table 2 Logarithm of ISE0 for bandwidth matrices. Target density A B C D E F ITER Mean −7.562 −6.345 −4.319 −4.918 −4.779 −5.103 Std 0.459 0.448 0.264 0.274 0.203 0.180 LSCV Mean −7.110 −5.781 −4.332 −4.957 −4.917 −5.138 Std 0.531 0.610 0.407 0.518 0.385 0.325 MISE Mean −7.865 −4.256 −4.168 −3.521 −2.763 −3.903 Std 0.397 0.418 0.188 0.340 0.237 0.164 6. Application to real data An important question arising in application to real data is which observed features – such as a local extremes – are really there. Chaudhuri and Marron (1999) introduced the SiZer (Significant Zero) method for finding structure in smooth data. Duong et al. (2008) proposed a framework for feature significance in d-dimensional data which combines kernel density derivative estimators and hypothesis tests for modal regions. Distributional properties are given for the gradient and curvature estimators, and pointwise tests extend the two-dimensional feature significance ideas of Godtliebsen et al. (2002). I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 371 Table 3 Logarithm of ISE1 for bandwidth matrices. Target density A B C D E F ITER Mean −7.618 −4.005 −0.888 −2.698 −1.991 −3.203 Std 0.289 0.405 0.055 0.099 0.030 0.032 LSCV Mean −5.364 −0.210 0.503 −1.214 −0.501 −1.544 Std 2.638 2.960 2.364 2.437 2.373 1.914 MISE Mean −7.939 −4.314 −1.813 −3.544 −2.732 −3.864 Std 0.391 0.443 0.311 0.359 0.241 0.172 Fig. 2. Box plots for log(ISE). Table 4 Average computational times (in seconds). 
Target density r A B C D E F ITER 0 0.0826 0.0685 0.0596 0.0801 0.0754 0.0591 1 0.8295 0.8201 0.8542 0.8605 0.8538 0.8786 LSCV 0 0.5486 0.5732 0.5182 0.4844 0.5004 0.5004 1 1.7936 1.6483 1.3113 1.3128 1.6495 1.5581 MISE 0 0.1927 0.1982 0.7126 0.5540 1.8881 2.4000 1 0.5236 0.3112 1.2653 1.3452 2.3089 4.1172 We started with the well-known ‘Old Faithful’ data set (Simonoff, 1996), which contains characteristics of 222 eruptions of the ‘Old Faithful Geyser’ in Yellowstone National Park, USA, during August 1978 and August 1979. Kernel density and first derivative estimates using the standard normal kernel based on the following bandwidth matrices obtained by the IT method HIT0 =  0.0703 0.7281 0.7281 9.801  , HIT1 =  0.2388 3.006 3.006 50.24  are displayed in Fig. 3. The intersections of ∂f /∂x1 = 0 and ∂f /∂x2 = 0 show the existence of extremes. The second data set is taken from UNICEF—‘‘The State of the World’s Children 2003’’. It contains 72 pairs of observations for countries with a GNI less than 1000 US dollars per capita in 2001. X1 variable describes the under-five mortality rate, i.e., the probability of dying between birth and exactly five years of age expressed per 1000 live births, and X2 is a life expectancy at birth, i.e., the number of years newborn children would live if subject to the mortality risks prevailing for 372 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 Fig. 3. ‘Old Faithful’ data contour plots—estimated density ˆf (left) and estimated partial derivatives ∂f /∂x1 = 0, ∂f /∂x2 = 0 (right). Fig. 4. ‘UNICEF Children’ data contour plots—estimated density ˆf (left) and estimated partial derivatives ∂f /∂x1 = 0, ∂f /∂x2 = 0 (right). Fig. 5. Swiss bank notes data contour plots—estimated density ˆf (left) and estimated partial derivatives ∂f /∂x1 = 0, ∂f /∂x2 = 0 (right). the cross-section of population at the time of their birth (UNICEF, 2003). These data have also been analyzed in Duong and Hazelton (2005b). Bandwidth matrices for the estimated density ˆf and its gradient Df are HIT0 =  1112.0 −138.3 −138.3 24.20  and HIT1 =  2426 −253.7 −253.7 38.38  , respectively. Fig. 4 illustrates the use of the iterative bandwidth matrices for the ‘UNICEF Children’ data set. We also analyzed a Swiss bank notes data set from Simonoff (1996). It contains measurements of the bottom margin and diagonal length of 100 real Swiss bank notes and 100 forged Swiss bank notes. Contour plots in Fig. 5 represent kernel estimates of the joint distribution of the bottom margin and diagonal length of the bills using bandwidth matrices HIT0 =  0.1227 −0.0610 −0.0610 0.0781  , HIT1 =  0.6740 −0.3159 −0.3159 0.4129  . The bills with longer diagonal and shorter bottom margin correspond to real bills. The density estimate shows a bimodal structure for the forged bills (bottom right part of the plot) and it seems that the gradient estimate does not match this structure. The elements of the bandwidth matrix for the gradient estimate are bigger I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 373 in magnitude than the ones of the bandwidth matrix for density estimate, as expected from the theory. Three bumps in the tails are too small and the gradient estimator is not able to distinguish them. 7. Conclusion We restricted ourselves on the use of the standard normal kernel. This kernel satisfies smoothness conditions and provides easy computations of convolutions. 
Due to these facts it was possible to compare the IT method with the LSCV method. The simulation study and application to real data show that the IT method provides a sufficiently reliable way of estimating arbitrary density and its gradient. The IT method is also easy implementable and seems to be less time consuming (see Horová and Zelinka (2007a) for d = 1, see also Table 4 for d = 2). Further assessment of the practical performance and an extension to a curvature density estimate would be very important further research. Although the theoretical comparison also involves PI methods, they are not included in the simulation study. This would be an interesting task for further research. 8. Complements We start with introducing some facts on matrix differential calculus and on the Gaussian density (see Magnus and Neudecker (1979, 1999) and Aldershof et al. (1995)). Let A, B be d × d matrices and r = 0, 1: 1◦ . tr(AT B) = vecT AvecB 2◦ . DH|H|−1/2 = −1 2 |H|−1/2 DT d vecH−1 3◦ . DHtr(H−1 A) = −DT d vec(H−1 AH−1 ) 4◦ .  φcI(z){tr(H1/2 D2 H1/2 zzT )D2r }f (x)dz = c{tr(HD2 )D2r }f (x) φcI(z){tr2 (H1/2 D2 H1/2 zzT )D2r }f (x)dz = 3c2 {tr2 (HD2 )D2r }f (x) φcI(z){trk (H1/2 D2 H1/2 zzT )tr(H1/2 DzT )D2r }f (x)dz = 0, k ∈ N0 5◦ . Λ(z) = φ4I(z) − 2φ3I(z) + φ2I(z), then using 4◦ yields Λ(z)dz = 0 Λ(z){tr(H1/2 D2 H1/2 zzT )D2r }f (x)dz = 0 Λ(z){tr2 (H1/2 D2 H1/2 zzT )D2r }f (x)dz = 6{tr2 (HD2 )D2r }f (x) Λ(z){trk (H1/2 D2 H1/2 zzT )tr(H1/2 DzT )D2r }f (x)dz = 0, k ∈ N0 6◦ .  Dk f (x)[Dk f (x)]T dx = (−1)k  D2k f (x)f (x)dx, k ∈ N 7◦ . Taylor expansion in the form (for r = 0, 1) D2r f (x − H1/2 z) = D2r f (x) − {zT H1/2 DD2r }f (x) + 1 2! {(zT H1/2 D)2 D2r }f (x) + · · · + (−1)k k! {(zT H1/2 D)k D2r }f (x) + o(∥H1/2 z∥k Jdr ). Sketch of the proof of Lemma 2: Proof. Consider Eq. (7) and multiply it from the left by 1 2 vechT H. Then (4n)−1 |H|−1/2 vechT Htr  (H−1 )r V(Dr K)  DT d vecH−1 + (2n)−1 |H|−1/2 rvechT H  DT d vec(H−1 V(Dr K))H−1  = β2 2 4 vechT HΨ4+2r vechH. The right hand side of this equation is AISBr . Further, if we use the facts on matrix calculus, we arrive at formula (8). We only present a sketch of proofs of theorems. Detailed proofs are available on request from the first author. Sketch of the proof of Theorem 3: Proof. In order to show the validity of the relation for the expected value of Γr (H), we evaluate E(AISBr (H)) and start with E tr  D2r ΛH(X1 − X2)  = tr  D2r ΛH(x − y)f (x)f (y)dxdy = tr  ΛH(x − y)f (x)D2r f (y)dxdy = tr  Λ(z)D2r f (x − H1/2 z)f (x)dzdx. 374 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 Taylor expansion, defined in 7◦ , and using 5◦ yields = tr  Λ(z)  5 i=0 (−1)i i! {(zT H1/2 D)i D2r }f (x) + o(∥H1/2 z∥5 )Jdr  f (x)dzdx = tr  Λ(z)  1 4! {(zT H1/2 D)4 D2r }f (x) + o(∥H1/2 z∥5 )Jdr  f (x)dzdx = 1 4! tr  Λ(z){tr2 (H1/2 D2 H1/2 zzT )D2r }f (x)f (x)dzdx + o(∥vecH∥5/2 ), using properties 5◦ and 6◦ we arrive at = 1 4 tr  {tr2 (HD2 )D2r }f (x)f (x)dx + o(∥vecH∥5/2 ) = (−1)r 4  ∥{tr(HD2 )Dr }f (x)∥2 dx + o(∥vecH∥5/2 ). To prove the second part of the Theorem it is sufficient to derive Var(AISBr (H)) Var(AISBr (H)) = Var    4 n2 n i,j=1 i̸=j trD2r ΛH(Xi − Xj)    . Since trD2r ΛH is symmetric about zero, we can use U-statistics, e.g., Wand and Jones (1995). In our case Var 4 n2 n i,j=1 i̸=j trD2r ΛH(Xi − Xj) = 32n−3 (n − 1)Var trD2r ΛH(X1 − X2) + 64n−3 (n − 1)(n − 2) × Cov{trD2r ΛH(X1 − X2), trD2r ΛH(X1 − X3)}. 
Most of terms are asymptotically negligible, therefore the formula written above reduces to 32n−2 E(trD2r ΛH(X1 − X2))2    ξ2 −64n−1 E2 trD2r ΛH(X1 − X2)    ξ0 + 64n−1 E(trD2r ΛH(X1 − X2)trD2r ΛH(X1 − X3))    ξ1 . (16) Let us express ξ0, ξ1 and ξ2. From previous computations of the expected value one can see that ξ0 is of order o(∥vecH∥3 ). ξ1 =  trD2r ΛH(x − y)trD2r ΛH(x − z)f (x)f (y)f (z)dxdydz =  Λ(u)Λ(v)f (x)trD2r f (x − H1/2 u)trD2r f (x − H1/2 v)dxdudv =  Λ(u)Λ(v)f (x)tr  5 i=0 (−1)i i! {D2r ai }f (x) + o(∥H1/2 u∥5 )Jdr  × tr  5 i=0 (−1)i i! {D2r bi }f (x) + o(∥H1/2 v∥5 )Jdr  dxdudv, where a = uT H1/2 D, b = vT H1/2 D =  Λ(u)Λ(v)f (x) 1 4!4! tr{D2r a4 }f (x)tr{D2r b4 }f (x)dxdudv + o(∥vecH∥4 ) = 1 4!4!  f (x)  Λ(z){tr2 (H1/2 D2 H1/2 zzT )D2r }f (x)dz 2 dx + o(∥vecH∥4 ) = 1 16  {tr2 (HD2 )D2r }f (x){tr2 (HD2 )D2r }f (x)f (x)dx + o(∥vecH∥4 ). Thus ξ1 is of order o(∥vecH∥3 ) and is negligible. I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 375 Finally ξ2 =  trD2r ΛH(x − y)trD2r ΛH(x − y)f (x)f (y)dxdy = |H|−1/2  tr2 (H−r D2r Λ(z))f (x)f (x − H1/2 z)dxdz = |H|−1/2 ∥vecH∥−2r V(vecD2r Λ)V(f ) + o(|H|−1/2 ∥vecH∥−2r ), which completes the proof of Theorem 3. Sketch of the proof of Theorem 4: Proof. Since Γr (H) P → Γr (H) then HITr P → HAMISE,r as n → ∞ and we can adopt ideas of AMSE lemma (Duong, 2004). We expand Γr (HITr ) = (Γr − Γr )(HITr ) + Γr (HITr ) = (1 + o(1))(Γr − Γr )(HAMISE,r ) + Γr (HAMISE,r ) + (1 + o(1))DT HΓr (HAMISE,r )vech(HITr − HAMISE,r ). We multiply the equation by vechHAMISE,r from the left side and remove all negligible terms. Then we obtain 0 = vechHAMISE,r (Γr − Γr )(HAMISE,r ) + vechHAMISE,r DT HΓr (HAMISE,r )vech(HITr − HAMISE,r ). It is easy to see that DT HΓr (HAMISE,r ) = aT n−2/(d+2r+4) and vechHAMISE,r = bn−2/(d+2r+4) for constant vectors a and b, which implies vech(HITr − HAMISE,r ) = −(baT )−1    C n4/(d+2r+4) vechHAMISE,r (Γr − Γr )(HAMISE,r ). Let us note that the matrix baT can be singular in some cases (e.g., for a diagonal bandwidth matrix) and thus the matrix C = −(baT )−1 does not exist. But this fact does not take any effect for the rate of convergence. Using results of Theorem 3 we express the convergence rate of MSE  (Γr − Γr )(HAMISE,r )  = Bias2 (Γr (HAMISE,r )) + Var(Γr (HAMISE,r )) = (o(∥vecHAMISE,r ∥5/2 ))2 + O(n−2 |HAMISE,r |−1/2 ∥vecHAMISE,r ∥−2r ) = (O(∥vecHAMISE,r ∥3 ))2 + O(n−2 |HAMISE,r |−1/2 ∥vecHAMISE,r ∥−2r ) = O(n−12/(d+2r+4) ) + O(n−(d+8)/(d+2r+4) ) = O(n− min{d+8,12}/(d+2r+4) ). Then MSE{vechHITr } = MSE  (Γr − Γr )(HAMISE,r )  C vechHAMISE,r vechT HAMISE,r CT n8/(d+2r+4) = O  n− min{d+8,12}/(d+2r+4)  O  n8/(d+2r+4) Jd∗  vechHAMISE,r vechT HAMISE,r = O  n− min{d,4}/(d+2r+4) Jd∗  vechHAMISE,r vechT HAMISE,r . Acknowledgments The research was supported by The Jaroslav Hájek Center for Theoretical and Applied Statistics (MŠMT LC 06024). K. Vopatová has been supported by the University of Defence through the Institutional development project UO FEM ‘‘Economics Laboratory’’. The authors thank the anonymous referees for their helpful comments and are also grateful to J.E. Chacón for a valuable discussion which contributed to improvement of this paper. References Aldershof, B., Marron, J., Park, B., Wand, M., 1995. Facts about the Gaussian probability density function. Applicable Analysis 59, 289–306. Cao, R., Cuevas, A., González Manteiga, W., 1994. A comparative study of several smoothing methods in density estimation. 
Computational Statistics and Data Analysis 17, 153–176. Chacón, J.E., Duong, T., 2012. Bandwidth selection for multivariate density derivative estimation, with applications to clustering and bump hunting. e-prints. http://arxiv.org/abs/1204.6160. Chacón, J.E., 2010. Multivariate kernel estimation, lecture. Masaryk University, Brno. Chacón, J.E., Duong, T., 2010. Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. Test 19, 375–398. Chacón, J.E., Duong, T., Wand, M.P., 2011. Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica 21, 807–840. 376 I. Horová et al. / Computational Statistics and Data Analysis 57 (2013) 364–376 Chaudhuri, P., Marron, J.S., 1999. SiZer for exploration of structures in curves. Journal of the American Statistical Association 94, 807–823. Duong, T., 2004. Bandwidth selectors for multivariate kernel density estimation. Ph.D. Thesis. School of Mathematics and Statistics. University of Western Australia. Duong, T., Cowling, A., Koch, I., Wand, M.P., 2008. Feature significance for multivariate kernel density estimation. Computational Statistics and Data Analysis 52, 4225–4242. Duong, T., Hazelton, M., 2005b. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics 32, 485–506. Duong, T., Hazelton, M., 2005a. Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. Journal of Multivariate Analysis 93, 417–433. Godtliebsen, F., Marron, J.S., Chaudhuri, P., 2002. Significance in scale space for bivariate density estimation. Journal of Computational and Graphical Statistics 11, 1–21. Horová, I., Koláček, J., Vopatová, K., 2012. Visualization and bandwidth matrix choice. Communications in Statistics—Theory and Methods 759–777. Horová, I., Koláček, J., Zelinka, J., Vopatová, K., 2008. Bandwidth choice for kernel density estimates. In: Proceedings IASC. IASC, Yokohama, pp. 542–551. Horová, I., Vieu, P., Zelinka, J., 2002. Optimal choice of nonparametric estimates of a density and of its derivatives. Statistics and Decisions 20, 355–378. Horová, I., Vopatová, K., 2011. Kernel gradient estimate. In: Ferraty, F. (Ed.), Recent Advances in Functional Data Analysis and Related Topics. SpringerVerlag, Berlin, Heidelberg, pp. 177–182. Horová, I., Zelinka, J., 2007a. Contribution to the bandwidth choice for kernel density estimates. Computational Statistics 22, 31–47. Horová, I., Zelinka, J., 2007b. Kernel estimation of hazard functions for biomedical data sets. In: Härdle, W., Mori, Y., Vieu, P. (Eds.), Statistical Methods for Biostatistics and Related Fields. In: Mathematics and Statistics, Springer-Verlag, Berlin, Heidelberg, pp. 64–86. Horová, I., Zelinka, J., Budíková, M., 2006. Kernel estimates of hazard functions for carcinoma data sets. Environmetrics 17, 239–255. Härdle, W., Marron, J.S., Wand, M.P., 1990. Bandwidth choice for density derivatives. Journal of the Royal Statistical Society, Series B (Methodological) 52, 223–232. Jones, M.C., Kappenman, R.F., 1991. On a class of kernel density estimate bandwidth selectors. Scandinavian Journal of Statistics 19, 337–349. Jones, M.C., Marron, J.S., Park, B.U., 1991. A simple root n bandwidth selector. Annals of Statistics 19, 1919–1932. Magnus, J.R., Neudecker, H., 1979. Commutation matrix—some properties and application. Annals of Statistics 7, 381–394. Magnus, J.R., Neudecker, H., 1999. 
Matrix Differential Calculus with Applications in Statistics and Econometrics, second ed. Wiley. Marron, J.S., Ruppert, D., 1994. Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological) 56, 653–671. Park, B., Marron, J., 1990. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association 85, 66–72. Sain, S., Baggerly, K., Scott, D., 1994. Cross-validation of multivariate densities. Journal of the American Statistical Association 89, 807–817. Scott, D.W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. In: Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley. Scott, D.W., Terrell, G.R., 1987. Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association 82, 1131–1146. Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London. Simonoff, J.S., 1996. Smoothing Methods in Statistics. Springer-Verlag, New York. Taylor, C.C., 1989. Bootstrap choice of the smoothing parameter in kernel density estimation. Biometrika 76, 705–712. Terrell, G.R., 1990. The maximal smoothing principle in density estimation. Journal of the American Statistical Association 85, 470–477. UNICEF, 2003. The state of the world's children 2003. http://www.unicef.org/sowc03/index.html. Vopatová, K., Horová, I., Koláček, J., 2010. Bandwidth choice for kernel density derivative. In: Proceedings of the 25th International Workshop on Statistical Modelling. Glasgow, Scotland, pp. 561–564. Wand, M., Jones, M., 1995. Kernel Smoothing. Chapman and Hall, London. Wand, M.P., Jones, M.C., 1994. Multivariate plug-in bandwidth selection. Computational Statistics 9, 97–116.

Communications in Statistics – Theory and Methods

SELECTION OF BANDWIDTH FOR KERNEL REGRESSION

JAN KOLÁČEK, IVANA HOROVÁ

Abstract. The most important factor in kernel regression is the choice of a bandwidth. Considerable attention has been paid to extending the idea of an iterative method, known for kernel density estimates, to kernel regression. Data-driven selectors of the bandwidth for kernel regression are considered. The proposed method is based on an optimally balanced relation between the integrated variance and the integrated square bias, and it leads to an iterative, quadratically convergent process. The analysis of statistical properties, in particular the consistency of the underlying estimates, shows the rationale of the proposed method. The utility of the method is illustrated through a simulation study and real data applications.

Keywords and Phrases: kernel regression, bandwidth selection, iterative method. Mathematics Subject Classification: 62G08

1. Introduction

Kernel regression estimates are among the most popular nonparametric estimates. In the univariate case, these estimates depend on a bandwidth, a smoothing parameter controlling the smoothness of the estimated curve, and on a kernel, which plays the role of a weight function. The choice of the smoothing parameter is a crucial problem in kernel regression.
The literature on bandwidth selection is quite extensive, e.g., monographs [20, 17, 18], papers [7, 2, 3, 15, 19, 4, 5, 12, 13]. Although in practice one can try several bandwidths and choose a bandwidth subjectively, automatic (data-driven) selection procedures could be useful for many situations; see [16] for more examples. Most of these procedures are based on estimating of Average Mean Square Error. They are asymptotically equivalent and asymptotically unbiased (see [7, 2, 3]). However, in simulation studies ([12]), it is often observed that most selectors are biased toward undersmoothing and yield smaller bandwidths more frequently than predicted by asymptotic results. Successful approaches to the bandwidth selection in kernel density estimation can be transferred to the case of kernel regression. The iterative method for the kernel density has been developed and widely discussed in [9]. The proposed method is based on an optimally balanced relation between the integrated variance and the integrated square bias. The paper is organized as follows: In Section 2 we describe kernel estimates of a regression function and give a form of the Mean Integrated Square Error and its asymptotic alternative. The next section is devoted to a data-driven bandwidth selection method. This method is based on an optimally balanced relation between the integrated variance and the integrated squared bias, see [9]. Similar ideas were Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic. 1 Page 1 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 2 JAN KOL´AˇCEK, IVANA HOROV´A applied to kernel estimates of hazard functions (see [11] or [10]). It seems that the basic idea can be also extended to a kernel regression and we are going to investigate this possibility. We discuss the statistical properties of the proposed method as well. Section 4 brings a simulation study and in the last section the developed theory is applied to real data sets. 2. Univariate kernel regression Consider a standard regression model of the form (2.1) Yi = m(xi) + εi, i = 1, . . . , n, where m is an unknown regression function, Y1, . . . , Yn are observable data variables with respect to the design points x1, . . . , xn. The residuals ε1, . . . , εn are independent identically distributed random variables for which E(εi) = 0, var(εi) = σ2 > 0, i = 1, . . . , n. We suppose the fixed equally spaced design, i.e., design variables are not random and xi = i/n, i = 1, . . . , n. In the case of random design, where the design points X1, . . . , Xn are random variables with the same density f, all considerations are similar as for the fixed design. More detailed description of the random design can be found, e.g., in [20]. The aim of kernel smoothing is to find a suitable approximation m of the unknown function m. We consider the estimator proposed by Pristley and Chao [14] which is defined as (2.2) m(x, h) = 1 n n i=1 Kh(x − xi)Yi, for x ∈ (0, 1). The function K is called the kernel which is assumed to be symmetric about zero and be supported on [−1, 1], be such that V (K) = K(u)2 du < ∞ and have a finite second moment (i.e., u2 K(u)du = β2 < ∞). Set Kh(.) = 1 h K( . h ), h > 0. A parameter h is called a bandwidth. 
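For concreteness, estimator (2.2) takes only a few lines of MATLAB. The sketch below uses the Epanechnikov kernel and the regression function of Simulation 1 below, with a bandwidth close to the reported hopt = 0.1188; the data generation and the evaluation grid are our own choices.

rng(0);
n = 100;
x = (1:n)'/n;                              % fixed equally spaced design x_i = i/n
m = @(t) t.^3 .* (1 - t).^3;               % regression function of Simulation 1
Y = m(x) + 0.003*randn(n, 1);              % noise level sigma = 0.003 as in Simulation 1

K = @(u) 0.75*(1 - u.^2) .* (abs(u) <= 1); % Epanechnikov kernel on [-1,1]

h  = 0.12;                                 % close to the reported h_opt = 0.1188
xg = linspace(0, 1, 200)';                 % evaluation grid

% mhat(x,h) = (1/(n*h)) * sum_i K((x - x_i)/h) * Y_i
mhat = (K((xg - x')/h) * Y) / (n*h);

plot(x, Y, '.', xg, mhat, '-');
legend('data', 'Priestley-Chao estimate');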
The quality of a kernel regression estimator can be locally described by the Mean Square Error (MSE) or by a global criterion the Mean Integrated Square Error (MISE), which can be written as a sum of the Integrated Variance (IV) and the Integrated Square Bias (ISB) MISE m(·, h) = E 1 0 [m(x, h) − m(x)]2 dx = 1 0 Var m(x, h)dx IV + 1 0 [(Kh ∗ m)(x) − m(x)]2 dx ISB +O n−1 , (2.3) where ∗ denotes a convolution. Page 2 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 3 Since the MISE is not mathematically tractable we employ the Asymptotic Mean Integrated Square Error (AMISE) (2.4) AMISE{m(·, h)} = V (K)σ2 nh AIV + β2 2 2 V (m′′ )h4 AISB , where V (m′′ ) = 1 0 (m′′ (x)) 2 dx. The optimal bandwidth considered here is hopt, the minimizer of (2.4), i.e., hopt = arg min h∈Hn AMISE{m(·, h)}, where Hn = [an−1/5 , bn−1/5 ] for some 0 < a < b < ∞. The calculation gives (2.5) hopt = σ2 V (K) nβ2 2V (m′′) 1 5 . In nonparametric regression estimation a critical and inevitable step is to choose the smoothing parameter (bandwidth) to control the smoothness of the curve estimate. The smoothing parameter considerably affects the features of the estimated curve. One of the most widespread procedures for bandwidth selection is the crossvalidation method, also known as “leave-one-out” method. The method is based on modified regression smoother (2.2) in which one, say the j-th, observation is left out: m−j(xj, h) = 1 n n i=1 i=j Kh(xi − xj)Yi, j = 1, . . . , n. With using these modified smoothers, the error function which should be minimized takes the form (2.6) CV(h) = 1 n n i=1 {m−i(xi) − Yi}2 . The function CV(h) is commonly called a “cross-validation” function. Let ˆhCV stand for minimization of CV(h), i.e., ˆhCV = arg min h∈Hn CV(h). The literature on this criterion is quite extensive, e.g., [19, 4, 7, 5]. 3. Iterative method for kernel regression The proposed method is based on the following relation. It is easy to show that the equation holds (3.1) AIV {m(·, hopt)} − 4 AISB{ m(·, hopt)} = 0, Page 3 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 4 JAN KOL´AˇCEK, IVANA HOROV´A where AIV and AISB are terms used in (2.4). For estimating of AIV and AISB in (3.1) we use AIV {m(·, h)} = ˆσ2 V (K) nh , with ˆσ2 = 1 2n − 2 n i=2 (Yi − Yi−1)2 and AISB {m(·, h)} = 1 0 [(Kh ∗ m)(x, h) − m(x, h)]2 dx = 1 4n2h n i,j=1 i=j Λ xi − xj h YiYj, where Λ(z) = (K ∗K ∗K ∗K −2K ∗K ∗K +K ∗K)(z) (see Complements for more details, for properties of Λ(z) see [8]). To find the bandwidth estimate ˆhIT we solve the equation (3.2) AIV {m(·, h)} − 4AISB {m(·, h)} = 0, which leads to finding a fixed point of the equation (3.3) h = ˆσ2 V (K) 4nhAISB {m(·, h)} . We use Steffensen’s iterative method with the starting approximation ˆh0 = 2/n. This approach leads to an iterative quadratically convergent process (see [9]). 
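Under the assumption of the Epanechnikov kernel, the whole procedure can be sketched in a few lines of MATLAB. At the root of (3.2) the outer factors of h cancel, so (3.2) is equivalent to the scalar equation n*sigma^2*V(K) = sum over i ~= j of Lambda((x_i - x_j)/h)*Y_i*Y_j. The sketch below tabulates Lambda by numerical convolution and brackets this root on a grid, whereas the paper iterates (3.3) by Steffensen's method; the test function, grid and tolerances are our choices.

rng(0);
n = 100; x = (1:n)'/n;
m = @(t) t.^3.*(1 - t).^3;                 % regression function of Simulation 1
Y = m(x) + 0.003*randn(n, 1);

K  = @(u) 0.75*(1 - u.^2).*(abs(u) <= 1);  % Epanechnikov kernel
VK = 3/5;                                  % V(K) = int K^2

% tabulate Lambda = K*K*K*K - 2*K*K*K + K*K by numerical convolution
du = 1e-3; Ku = K(-1:du:1);
K2 = conv(Ku,Ku)*du;  K3 = conv(K2,Ku)*du;  K4 = conv(K2,K2)*du;
Lam = @(z) interp1(-4:du:4,K4,z,'linear',0) ...
         - 2*interp1(-3:du:3,K3,z,'linear',0) ...
         + interp1(-2:du:2,K2,z,'linear',0);

sig2 = sum(diff(Y).^2)/(2*n - 2);          % difference-based variance estimate

D = x - x'; off = ~eye(n); YY = Y*Y';      % pairwise design differences and Y_i*Y_j

g  = @(h) sum(Lam(D(off)/h).*YY(off)) - n*sig2*VK;   % rescaled form of (3.2)
hs = linspace(2/n, 0.6, 200);              % scan for a sign change
gv = arrayfun(g, hs);
k  = find(gv(1:end-1).*gv(2:end) < 0, 1);  % widen the grid if this is empty
hIT = fzero(g, hs([k k+1]));
fprintf('h_IT = %.4f (h_opt = 0.1188 in Simulation 1)\n', hIT);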
The solution ˆhIT of the equation (3.2) can be considered as a suitable approximation of hopt as it is confirmed by the following theorem. Theorem 3.1. Let m ∈ C2 [0, 1], m′′ be square integrable, lim h n→∞ = 0, lim nh n→∞ = ∞. Let P(h) stand for the left side of (3.1) and P(h) for the left side of (3.2). Then (3.4) E(P(h)) = P(h) + O n−1 , var(P(h)) = O n−1 . Theorem 3.1 states that P(h) is a consistent estimate of P(h). This result confirms that the solution of (3.3) may be expected to be reasonably close to hopt. Proof of Theorem 3.1 can be found in Complements. 4. Simulation study We carry out two simulation studies to compare the performance of the bandwidth estimates. The comparison is done in the following way. The observations, Yi, for i = 1, . . . , n = 100, are obtained by adding independent Gaussian random variables with mean zero and variance σ2 to some known regression function. Both regression functions used in our simulations are illustrated in Fig. 1. One hundred series are generated. For each data set, we estimate the optimal bandwidth by both mentioned methods, i.e., for each method we obtain 100 estimates. Since we know the optimal bandwidth, we compare it with the mean of estimates and look at their standard deviation, which describes the variability of methods. The Epanechnikov kernel K(x) = 3 4 (1 − x2 )I[−1,1] is used in all cases. Page 4 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 5 0 0.2 0.4 0.6 0.8 1 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0 0.2 0.4 0.6 0.8 1 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 Figure 1. Regression functions. Finally, we calculate the Integrated Square Error (ISE) ISE{m(·, h)} = 1 0 (m(x, h) − m(x)) 2 dx for each estimated regression function over all 100 replications. The logarithm of results are displayed in Tables 2, 4 and in Figures 3, 5. Here “IT” denotes the results for our proposed method, “CV” stands for the results of the cross-validation method. 4.1. Simulation 1. In this case, we use the regression function m(x) = x3 (1 − x)3 with σ2 = 0.0032 . Table 1 summarizes the sample means and the sample standard deviations of bandwidth estimates, E(ˆh) is the average of all 100 values and std(ˆh) is their standard deviation. Figure 2 illustrates the histogram of results of all 100 experiments. hopt = 0.1188 E(ˆh) std(ˆh) CV 0.1057 0.0297 IT 0.1184 0.0200 Table 1. Means and standard deviations Table 2 gives the mean and the standard deviations of log(ISE) for each method compared with log(ISE) for the regression estimate obtained with hopt. Figure 3 illustrates the histogram of log(ISE) of all 100 experiments. As we see, the standard deviation of all results obtained by the proposed method is less than the value for the case of cross-validation method and also the mean of these results is slightly closer to the theoretical optimal bandwidth. The comparison of results with respect to log(ISE) leads to the similar result. The reason is that the regression function is smooth and satisfies all the conditions supposed in the previous section. Thus the proposed method works very well in this case. 
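The log(ISE) values used in these comparisons can be reproduced by straightforward numerical integration; a minimal sketch, with our own discretization (trapezoidal rule on a 501-point grid):

rng(0);
n = 100; x = (1:n)'/n;
m = @(t) t.^3.*(1 - t).^3;
Y = m(x) + 0.003*randn(n, 1);
K = @(u) 0.75*(1 - u.^2).*(abs(u) <= 1);

xg  = linspace(0, 1, 501)';
% ISE(h) = int_0^1 (mhat(x,h) - m(x))^2 dx for the Priestley-Chao estimate
ISE = @(h) trapz(xg, ((K((xg - x')/h)*Y)/(n*h) - m(xg)).^2);
fprintf('log(ISE) at h = 0.12: %.3f\n', log(ISE(0.12)));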
Figure 2. Distribution of ĥ for both methods (histograms of the 100 estimates against hopt).

Table 2. Means and standard deviations of log(ISE).
         E(log(ISE))   std(log(ISE))
hopt     −14.4452      0.5421
IT       −14.3481      0.5193
CV       −14.2160      0.6276

4.2. Simulation 2. In the second example, we use the regression function m(x) = sin(πx) cos(3πx^5) with σ2 = 0.05. Table 3 summarizes the sample means and the sample standard deviations of the bandwidth estimates; E(ĥ) is the average of all 100 values and std(ĥ) is their standard deviation. Figure 4 illustrates the histogram of the results of all 100 experiments. Table 4 gives the means and standard deviations of log(ISE) for each method, compared with log(ISE) for the regression estimate obtained with hopt. Figure 5 illustrates the histogram of log(ISE) of all 100 experiments. Although the mean of ĥIT is not as close to hopt as the mean of ĥCV, the ISE values are better, and the variability of the proposed method is smaller in this case. We therefore conclude that the proposed method can provide better results for this regression model.

Figure 3. Logarithm of ISE.

Table 3. Means and standard deviations (hopt = 0.0585).
      E(ĥ)     std(ĥ)
CV    0.0633   0.0168
IT    0.0708   0.0072

Table 4. Means and standard deviations of log(ISE).
         E(log(ISE))   std(log(ISE))
hopt     −5.0932       0.3908
IT       −5.0560       0.3741
CV       −4.9525       0.3966

5. Application to real data

The main goal of this section is to make a comparison of the mentioned bandwidth estimators on a real data set. We use data from [1] and follow annual measurements of the level, in feet, of Lake Huron 1875–1972, i.e., the sample size is n = 98. We transform the data to the interval [0, 1] and use both selectors considered in the previous section to get the optimal bandwidth. We use the Epanechnikov kernel K(x) = (3/4)(1 − x2)I[−1,1]. All estimates of the optimal bandwidth are listed in Table 5.

Figure 4. Distribution of ĥ for both methods.

Figure 5. Logarithm of ISE.
Figure 6 illustrates the kernel regression estimate with the smoothing parameter ˆhCV = 0.0204 which was obtained by cross-validation method. Figure 7 shows the kernel regression estimate with the smoothing parameter ˆhIT = 0.0501. This value was found by our proposed method Page 8 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 9 Table 5. Optimal bandwidth estimates for Lake Huron data. iterative method ˆhIT = 0.0501 cross-validation ˆhCV = 0.0204 1860 1880 1900 1920 1940 1960 1980 5 6 7 8 9 10 11 12 Figure 6. Kernel regression estimate with ˆhCV = 0.0204. 1860 1880 1900 1920 1940 1960 1980 5 6 7 8 9 10 11 12 Figure 7. Kernel regression estimate with ˆhIT = 0.0501. Since we do not know the true regression function m(x) it is hard to assess objectively which one of kernel estimates is better. It is very important to realize the fact that the final decision about the estimate is partially subjective because the estimates of the bandwidth are only asymptotically optimal. The values summarized in the table and figures show that the estimate with the smoothing parameter Page 9 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 10 JAN KOL´AˇCEK, IVANA HOROV´A obtained by cross-validation criterion is undersmoothed. In the context of these considerations, the estimate with parameter obtained by the iterative method appears to be sufficient. 6. Conclusion A new bandwidth selector for kernel regression was proposed. The analysis of statistical properties shows the rationale of the proposed method. The advantage of the method is in computational aspects, since it makes possible to avoid the minimization process and only solves one nonlinear equation. 7. Acknowledgments This research was supported by Masaryk University, project MUNI/A/1001/2009. 8. Complements Proof of Theorem 3.1. Let us denote (8.1) P(h) = V (K)σ2 nh − h4 β2 2V (m′′ ) and let (8.2) P(h) = V (K)ˆσ2 nh − 1 n2h n i,j=1 i=j Λ xi − xj h YiYj stand for an estimate of P. The proposed method aims to solve the equation P(h) = 0. For a better clarity we use the notation for 1 0 in next. As the first step, we prove the following lemma. Lemma 8.1. For i, j = 1, . . . , n, i = j the formula holds hΛ xi − xj h YiYj = (K ∗ K) x − xi h − K x − xi h × (K ∗ K) x − xj h − K x − xj h YiYjdx. Proof. (K ∗ K) x − xi h − K x − xi h (K ∗ K) x − xj h − K x − xj h dx = (K ∗ K) x − xi h (K ∗ K) x − xj h dx − 2 (K ∗ K) x − xi h K x − xj h dx + K x − xi h K x − xj h dx. Set the three integrals in the sum as η1, η2, η3. We modify η3 by substitution t = x−xj h . Using the parity of K we get η3 = h 1−xj h −xj h K(t)K t − xi − xj h dt. Page 10 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? 
Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 11 Provided xj ∈ [0, 1] then, as h → ∞, −xj/h → −∞ and (1−xj)/h → ∞. Therefore η3 = h(K ∗ K) xi − xj h . Similarly we can obtain η2 = h(K ∗ K ∗ K) xi − xj h , η1 = h(K ∗ K ∗ K ∗ K) xi − xj h . Thus η1 − 2η2 + η3 = hΛ xi−xj h . We start with an evaluation of 1 n2h E n i,j=1 i=j Λ xi−xj h YiYj: 1 n2h E n i,j=1 i=j Λ xi − xj h YiYj L.8.1 = 1 n2h2 E n i,j=1 i=j (K ∗ K) x − xi h − K x − xi h × (K ∗ K) x − xj h − K x − xj h YiYjdx = 1 n2h2 n i,j=1 i=j (K ∗ K) x − xi h − K x − xi h × (K ∗ K) x − xj h − K x − xj h m(xi)m(xj)dx =    ∞ −∞ [(K ∗ K)(t) − K(t)] m(x − ht)dt I1    2 dx + O n−1 . Now, we approximate the integral I1 by the Taylor’s expansion of m(x − th) I1 = ∞ −∞ [(K ∗ K)(t) − K(t)] m(x) − thm′ (x) + t2 h2 2 m′′ (x) + O(t3 h3 ) dt. It is an easy exercise to see the moment conditions for (K ∗ K)(t) − K(t): ∞ −∞ (K ∗ K)(t) − K(t)dt = ∞ −∞ t2k+1 [(K ∗ K)(t) − K(t)]dt = 0, k ∈ N, ∞ −∞ t2 [(K ∗ K)(t) − K(t)]dt = 2β2. Thus I1 = h2 β2m′′ (x) + O h4 Page 11 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 12 JAN KOL´AˇCEK, IVANA HOROV´A and 1 n2h E n i,j=1 i=j Λ xi − xj h YiYj = h4 β2 2V (m′′ ) + O h6 + O n−1 . Finally EP(h) = V (K)σ2 nh − β2 2V (m′′ )h4 + O n−1 and (8.3) EP(h) = P(h) + O n−1 . Since it is assumed lim n→∞ nh = ∞ then EP(h) → P(h). Now, we derive the formula for varP(h). As the first we express varAISB = E(AISB)2 − E2 AISB. E(AISB)2 = 1 16n4h2 E    n i,j=1 i=j Λ xi − xj h YiYj    2 = 1 16n4h2 E    n i,j,k,l=1 i=j=k=l Λ xi − xj h Λ xk − xl h YiYjYkYl ζ1 + n i,j,k=1 i=j=k Λ xi − xj h Λ xi − xk h Y 2 i YjYk ζ2 + n i,j=1 i=j Λ2 xi − xj h Y 2 i Y 2 j ζ3    . Then we compute 1 16n4h2 Eζ1 = 1 16n4h2 n i,j,k,l=1 i=j=k=l Λ xi − xj h Λ xk − xl h m(xi)m(xj)m(xk)m(xl) = 1 16h2 Λ x − y h Λ u − v h m(x)m(y)m(u)m(v)dxdydudv + O n−1 = 1 16h2 Λ x − y h m(x)m(y)dxdy 2 + O n−1 = 1 16    ∞ −∞ Λ(t)m(x − th)m(x)dtdx    2 + O n−1 It is easy to see the moment conditions for Λ(z): ∞ −∞ Λ(z)dz = ∞ −∞ z2 Λ(z)dz = 0, ∞ −∞ z2k−1 Λ(z)dz = 0, k ∈ N, z4 Λ(z)dz = 6β2 2 ([8]). By using the second order Page 12 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only SELECTION OF BANDWIDTH FOR KERNEL REGRESSION 13 Taylor’s expansion of m(x − th) we obtain the result 1 16n4h2 Eζ1 = 1 32    m′′ (x)m(x)dx ∞ −∞ Λ(t)t2 h2 + O h3 dt    2 +O n−1 = O n−1 . Similarly, 1 16n4h2 Eζ2 = 1 16n4h2 n i,j,k=1 i=j=k Λ xi − xj h Λ xi − xk h m2 (xi) + σ2 m(xj)m(xk) = 1 16nh2 Λ x − y h Λ x − z h m2 (x) + σ2 m(y)m(z)dxdydz + O n−1 = 1 16n ∞ −∞ ∞ −∞ Λ(t)Λ(u) m2 (x) + σ2 m(x − th)m(x − uh)dtdudx + O n−1 = 1 64n m′′2 (x) m2 (x) + σ2 dx    ∞ −∞ Λ(t)t2 h2 dt    2 + O h6 n−1 + O n−1 . 
1 16n4h2 Eζ3 = 1 16n4h2 n i,j=1 i=j Λ2 xi − xj h m2 (xi) + σ2 m2 (xj) + σ2 = 1 16n2h2 Λ2 x − y h m2 (x) + σ2 m2 (y) + σ2 dxdy + O n−1 = 1 16n2h ∞ −∞ Λ2 (t) m2 (x) + σ2 m2 (x − th) + σ2 dtdx + O n−1 = V (Λ)V (m2 + σ2 ) 16n2h + O n−1 . By combining results for E(AISB)2 and E2 AISB we arrive at the expression varAISB = O n−1 . Since ˆσ2 is a consistent estimator of σ2 (see [6]) and varAISB is of order O n−1 , varP is a consistent estimator of varP. References [1] P.J. Brockwell and R.A. Davis. Time Series: Theory and Methods. Springer Series in Statistics. Springer, 2009. [2] S.T. Chiu. Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika, 77(1):222–226, 1990. [3] S.T. Chiu. Some stabilized bandwidth selectors for nonparametric regression. Annals of Statistics, 19(3):1528–1546, 1991. [4] P Craven and G Wahba. Smoothing noisy data with spline functions - estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1979. [5] Bernd Droge. Some comments on cross-validation. Technical Report 1994-7, Humboldt Universitaet Berlin, 1996. [6] Peter Hall, J. W. Kay, and D. M. Titterington. Asymptotically optimal difference-based estimation of variance in nonparametric regression. Biometrika, 77(3):521–528, 1990. Page 13 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ForPeerReview Only 14 JAN KOL´AˇCEK, IVANA HOROV´A [7] W. H¨ardle. Applied Nonparametric Regression. Cambridge University Press, Cambridge, 1st edition, 1990. [8] I. Horov´a, J. Kol´aˇcek, and J. Zelinka. Kernel Smoothing in MATLAB. World Scientific, Singapore, 2012. [9] I. Horov´a and J. Zelinka. Contribution to the bandwidth choice for kernel density estimates. Computational Statistics, 22(1):31–47, 2007. [10] I. Horov´a and J. Zelinka. Kernel estimation of hazard functions for biomedical data sets. In Wolfgang. H¨ardle, Yuichi. Mori, and Philippe Vieu, editors, Statistical Methods for Biostatistics and Related Fields, Mathematics and Statistics, pages 64–86. Springer-Verlag Berlin Heidelberg, 2007. [11] I. Horov´a, J. Zelinka, and M. Bud´ıkov´a. Kernel estimates of hazard funcions for carcinoma data sets. Environmetrics, 17(3):239–255, 2006. [12] Jan Kol´aˇcek. Kernel Estimation of the Regression Function (in Czech). PhD thesis, Masaryk University, Brno, feb 2005. [13] Jan Kol´aˇcek. Plug-in method for nonparametric regression. Computational Statistics, 23(1):63–78, 2008. [14] M. B. Priestley and M. T. Chao. Non-parametric function fitting. Journal of the Royal Statistical Society. Series B (Methodological), 34(3):385–392, 1972. [15] J. Rice. Bandwidth choice for nonparametric regression. Annals of Statistics, 12(4):1215– 1230, 1984. [16] Bernard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society. Series B (Methodological), 47:1–52, 1985. [17] Bernard W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, London, 1986. [18] J. S. Simonoff. Smoothing Methods in Statistics. Springer-Verlag, New York, 1996. [19] M Stone. Cross-validatory choice and assessment of statistical predictions. 
Journal of the Royal Statistical Society Series B-Statistical Methodology, 36(2):111–147, 1974. [20] M.P. Wand and M.C. Jones. Kernel smoothing. Chapman and Hall, London, 1995. Page 14 of 22 URL: http://mc.manuscriptcentral.com/lsta E-mail: comstat@univmail.cis.mcmaster.ca Communications in Statistics ? Theory and Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 jes Journal of Environmental Statistics February 2013, Volume 4, Issue 2. http://www.jenvstat.org Kernel Regression Model for Total Ozone Data Horov´a I., Kol´aˇcek J., Lajdov´a D. Department of Mathematics and Statistics Masaryk University Brno Abstract The present paper is focused on a fully nonparametric regression model for autocorrelation structure of errors in time series over total ozone data. We propose kernel methods which represent one of the most effective nonparametric methods. But there is a serious difficulty connected with them – the choice of a smoothing parameter called a bandwidth. In the case of independent observations the literature on bandwidth selection methods is quite extensive. Nevertheless, if the observations are dependent, then classical bandwidth selectors have not always provided applicable results. There exist several possibilities for overcoming the effect of dependence on the bandwidth selection. In the present paper we use the results of Chu and Marron (1991) and Kol´aˇcek (2008) and develop two methods for the bandwidth choice. We apply the above mentioned methods to the time series of ozone data obtained from the Vernadsky station in Antarctica. All discussed methods are implemented in Matlab. Keywords: total ozone, kernel, bandwidth selection. 1. Introduction Antarctica is significantly related to many environmental aspects and processes of the Earth. And thus its impact on the global climate system and water circulation in the world ocean is essential. The stratosphere ozone depletion over Antarctica was discovered at the beginning of the 1990s. The lowest total ozone contents (TOC) in Antarctica are usually observed in the first week of October. The formation of ozone depletion begins approximately in the second half of August, culminates in the first half of October, and dissolves in November. During the ozone depletion, the average ozone concentration varied at the time of its culmination in October from the original value over 300 Dobson Units (DU) in 1950s and 1960s to a level between 100 and 150 DU in 1990-2000 (see L´aska et al. (2009)). One DU is set as a 0.001 mm strong 2 Kernel Regression Model for Total Ozone Data layer of ozone under the pressure 1013 hPa and temperature 273 K. One of the issues resolved within the Czech–Ukrainian scientific cooperation implemented on the Vernadsky Station in Antarctica is the measurement of total ozone content (TOC) in the stratosphere. The Vernadsky station is located on the west coast of Antarctic peninsula (65◦S, 64◦W). These data were obtained from ground measurements predominantly taken with the Dobson No 031 spectrophotometer. Data can be found at UAC (2012). The data sets were processed as time points measuring the average daily amount of ozone. In order to analyze these data we have to take into account the autocorrelation structure of errors on such time series. We focus on kernel regression estimators of series of ozone data. 
These estimators depend on a smoothing parameter and it is well-known that selecting the correct smoothing parameter is difficult in the presence of correlated errors. There exist methods which are modifications of a classical cross-validation method for independent errors (the modified cross-validation method or the partitioned cross-validation method - see Chu and Marron (1991), H¨ardle and Vieu (1992)). In the present paper we develop a new flexible plug-in approach for estimating the optimal smoothing parameter. The utility of this method is illustrated through a simulation study and application to TOC data measured in periods August to April 2004-2005, 2005-2006, 2006-2007. 2. Procedure Development 2.1. Kernel regression model In nonparametric regression problems we are interested in estimating the mean function E(Y |x) = m(x) from a set of observations (xi, Yi), i = 1, . . . , n. Many methods such as kernel methods, regression splines and wavelet methods are currently available. The papers in this filed have been mostly focused on case where an unknown function m is hidden by a certain amount of a white noise. The aim of a regression analysis is to remove the white noise and produce a reasonable approximation to the unknown function m. Consider now the case when the noise is no longer white and instead contains a certain amount of a structure in the form of correlation. In particular, if data sets have been recorded over time from one object under a study, it is very likely that another response of the object will depend on its previous response. In this context we will be dealing with a time series case, where design points are fixed and equally spaced and thus our model takes the form Yi =m(i/n)+εi, i = 1, . . . , n, (1) and εi is an unknown ARMA process, i.e., E(εi) =0, var(εi) = σ2 , i = 1, . . . , n, cov(εi, εj) =γ|i−j| = σ2 ρ|i−j|, corr(εi, εj) = ρ|i−j| (2) and the stationary process γ0 = σ2 , ρt = γt γ0 , Journal of Environmental Statistics 3 where ρt is an autocorrelation function and γt is an autocovariance function. We consider the simplest situation (Opsomer et al. (2001), Chu and Marron (1991)) ρt/n = ρt. Simple and the most widely used regression smoothers are based on kernel methods (see e.g. monographs M¨uller (1987), H¨ardle (1990), Wand and Jones (1995)). These methods are local weighted averages of the response Y . They depend on a kernel which plays the role of a weighted function, and a smoothing parameter called a bandwidth which controls the smoothness of the estimate. Appropriate kernel regression estimators were proposed by Priestley and Chao (1972), Nadaraya (1964) and Watson (1964), Stone (1977), Cleveland (1979) and Gasser and M¨uller (1979). These estimators were shown to be asymptotically equivalent (Lejeune (1985), M¨uller (1987), Wand and Jones (1995)) and without the lost of generality we consider the Nadaraya–Watson (NW) estimators m of m. The NW estimator of m at the point x ∈ (0, 1) is defined as m(x, h) = n i=1 Kh(xi − x)Yi n i=1 Kh(xi − x) , (3) for a kernel function K, where Kh(.) = 1 h K( . h ), and h is a nonrandom positive number h = h(n) called the bandwidth. Before studying the statistical properties of m several additional assumptions on the statistical model and the parameters of the estimator are needed: I. Let m ∈ C2[0, 1]. II. 
Let K be a real valued function continuous on R and satisfying the conditions: (i) |K(x) − K(y)| ≤ L|x − y| for a constant L > 0, ∀x, y ∈ [−1, 1], (ii) support(K) = [−1, 1], K(−1) = K(1) = 0, (iii) 1 −1 xjK(x)dx =    1 j = 0, 0 j = 1, β2 = 0 j = 2. Such a function is called a kernel of order 2 and a class of these kernels is denoted as S02. III. Let h = h(n) be a sequence of nonrandom positive numbers, such that h → 0 and nh → ∞ as n → ∞. IV. lim n→∞ ∞ k=1 |ρk| < ∞, i.e., R = ∞ k=1 ρk exists, V. 1 n ∞ k=1 k|ρk| = 0. 4 Kernel Regression Model for Total Ozone Data Remark. The well-known kernels are, e.g., Epanechnikov kernel K(x) = 3 4 (1 − x2)I[−1,1], quartic kernel K(x) = 3 4 (1 − x2)2I[−1,1], triweight kernel K(x) = 35 32 (1 − x2)2I[−1,1], Gaussian kernel K(x) = 1√ 2π e −x2 2 , where I[−1,1] is an indicator function. Though the Gaussian kernel does not satisfy the assumption II.(ii), it is very popular in many applications. There is no problem with a choice of a suitable kernel. Symmetric probability density functions are commonly used (see Remark above). But choosing the smoothing parameter is a crucial problem in all kernel estimates. The literature on bandwidth selections is quite extensive in case of independent errors. It is well known that when the kernel method is used to recover m, that correlated errors trouble bandwidth selection severely (see Altman (1990), Opsomer et al. (2001)). De Brabanter et al. (2010) developed a bandwidth selection procedure based on bimodal kernels which successfully removes the error correlation without requiring any prior knowledge about its structure. The global quality of the estimate m can be expressed by means of the Mean Integrated Squared Error (Altman (1990), Opsomer et al. (2001)). However more mathematically tractable is the Asymptotic Mean Integrated Squared Error (AMISE): AMISE(m, h) = V (K) nh S AIV(m,h) + β2 2 4 h4 A2 AISB(m,h) , where V (K) = K2(x)dx, S = σ2(1 + 2 ∞ k=1 ρk) = σ2(1 + 2R), A2 = 1 0 m′′(x)2dx. The first term is called the asymptotic integrated variance (AIV) and the second one the asymptotic integrated squared bias (AISB). This decomposition provides an easier analysis and interpretation of the performance of the kernel regression estimator. Using a standard procedure of mathematical analysis one can easily find that the bandwidth hopt minimizing the AMISE is given by the formula hopt = V (K)S nβ2 2A2 1/5 = O(n−1/5 ). (4) This formula provides a good insight into an optimal bandwidth, but unfortunately it depends on the unknown S and A2. Let us explain the impact of assuming an uncorrelated model. Journal of Environmental Statistics 5 If R > 0 (error correlation is positive), then AIV(m, h) is larger than in the corresponding uncorrelated case and AMISE(m, h) is minimized by a value h that is larger than in the uncorrelated case. It means that assuming wrongly uncorrelated errors causes that the bandwidth becomes too small. If R < 0 (error correlation is negative), then AIV(m, h) is smaller and AMISE(m, h) optimal bandwidth is smaller than in the uncorrelated case. In the next section the choosing of parameters S and A2 will be treated. 2.2. Choosing the parameters There are a number of data-driven bandwidth selection methods, but it can be shown that they fail in the case of correlated errors. Among the earliest fully automatic and consistent bandwidth selectors are those based on cross-validation ideas. 
The cross-validation method employs an objective function CV (h) = 1 n n j=1 m−j(xj, h) − Yj 2 , (5) where m−j(xj, h) is the estimate of m(xj, h) with xj deleted, i.e., the leave-one-out estimator. The estimate of hopt is then hopt = arg min h∈Hn CV (h), where Hn = [an−1/5, bn−1/5], 0 < a < b < ∞. Remark. If the design points are equally spaced then a recommended interval is [ 1 n , 1). However, this ordinary method is not suitable in the case of correlated observations. As it was shown in the papers Altman (1990) and Opsomer et al. (2001), if the observations are positively correlated, then the CV method produces too small a bandwidth, and if the observations are negatively correlated, then the CV method produces a large bandwidth. We demonstrate this fact by the following example. Consider the regression model (1), where m(x) = cos (3.15πx), εi = φεi−1 + ei, ei – i.i.d. normal random variables N(0, σ2), ε1 – N(0, σ2/(1 − φ2)), φ = 0.6, σ = 0.5, i.e, the regression errors are AR(1) process. Figure 1 shows the result obtained by the CV method. It is evident, that the estimate is undersmoothed. In order to overcome this problem, modified and partitioned CV methods were proposed by H¨ardle and Vieu (1992) and Chu and Marron (1991), respectively. 6 Kernel Regression Model for Total Ozone Data 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 −3 −2 −1 0 1 2 3 4 Estimate obtained with bandwidth selected by CV Simulated data with AR(1) correlation Figure 1: The estimate of simulated data with AR(1) errors The modified cross-validation (MCV) method is a ”leave-(2l + 1)-out” version of CV (l ≥ 0). The idea consists in minimizing of the modified cross-validation score: CVl(h) = 1 n n j=1 m−j(xj, h) − Yj 2 , (6) where m−j(xj, h) is the ”leave-(2l+1)-out”estimate of m(xj, h), i.e., the observations (xj+i, Yj+i), −l ≤ i ≤ l are left out in constructing m(xj, h). Then hMCV = arg min h∈Hn CVl(h). The principle of the partitioned cross-validation method (PCV) can be described as follows. For any natural number g ≥ 1, the PCV involves splitting the observations into g groups by taking every g-th observation, calculating the ordinary cross-validation score CV0,k(h) of the k-th group of observations separately, for k = 1, 2, . . . , g, and minimizing the average of these ordinary cross-validation scores CV ∗ (h) = 1 g g k=1 CV0,k(h). (7) Let h∗ CV stand for the minimizer of CV ∗(h): h∗ CV = arg min h∈Hn CV ∗ (h). Since h∗ CV is appropriate for the sample size n/g, the partitioned cross-validated bandwidth hPCV (g) is defined to be rescaled h∗ CV : hPCV (g) = g−1/5 h∗ CV . When g = 1, the PCV is an ordinary cross-validation. Journal of Environmental Statistics 7 Remark. The number of subgroups is g and the number of observations in each group is η = n/g. If n is not a multiplier of g, then the values Yj, 1 ≤ j ≤ g[n/g] are applied and the rest of the observations are dropped out ([n/g] is the highest integer less or equal to n/g). The asymptotic behavior of hMCV (l) and hPCV (g) was studied in the paper by Chu and Marron (1991). Furthemore we focus on the PCV method. The PCV method needs to determine the factor g. A possible approach for the practical choice of g is based on an analogue of the mean squared error. Using the asymptotic variance and the asymptotic mean of hPCV (g)/hopt, the asymptotic mean squared error (AMSE) of this ratio is defined by AMSE hPCV (g)/hopt = n−1/5 VARPCV (g) + CPCV (g)/C − 1 2 , (8) where VARPCV (g), CPCV (g), C depend on γk, K, A2 (see Chu and Marron (1991)). 
Theoretically, if there exists a value g which minimizes AMSE over g ≥ 1, then this value is taken as the optimal value of g in the sense of AMSE: gopt = arg min g≥1 AMSE hPCV (g)/hopt . Unfortunately the minimization of AMSE also depends on the unknown γk and A2. As far as the estimation of the variance component S is concerned, a common approach is the following (see e.g. Herrmann et al. (1992), Hart (1991), Opsomer et al. (2001), Chu and Marron (1991)): S = ˆγ0 1 + 2 n−1 k=1 ˆρk , ˆγ0 = ˆσ2 , ˆρk = ˆγk ˆγ0 , ˆγk = 1 n − k n−k t=1 Yt − Y Yt+k − Y , k = 0, . . . , n − 1. (9) Nevertheless there is still a problem of how to estimate A2. In paper Chu and Marron (1991) a simulation study was only conducted and no idea of estimating A2 was given there. We complete this method by adding a suitable estimate of A2 and recommend to use an estimate of A2 proposed by Kol´aˇcek (2008). By means of the Fourier transformation he derived a suitable estimate A2 of A2. Therefore, A2 in the AMSE formula is replaced by A2. This approach is commonly known as a plug-in method. Plug-in methods are also commonly used for selecting the bandwidth in the kernel regression. But these methods perform badly when the errors are correlated. In the paper Herrmann et al. (1992) a modified version of an existing plug-in bandwidth selectors is proposed. This method is based on the Gasser–M¨uller estimator of the second derivative and an iterative process is constructed. It is shown that under some additional assumptions this iterative process converges to a suitable estimate of the optimal bandwidth. However we do not use this iterative method and propose to directly plug-in A2 in the formula (4). This new version of a plug-in method is denoted as PI and the bandwidth estimate takes the form: hPI = V (K)S nβ2 2A2 1/5 . 8 Kernel Regression Model for Total Ozone Data 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 −6 −5 −4 −3 −2 −1 0 1 2 3 4 Figure 2: The regression function m(x) hopt = 0.759 E(h) std(h) PCV 0.1927 0.0649 PI 0.1513 0.0083 Table 1: The estimates h We would like to point out the computational aspect of the plug-in method. It has preferable properties to classical methods, because it does not need any additional calculations such as the PCV method (see Kol´aˇcek (2008) for details). 3. Case study We conduct a simulation study to compare the PCV method and the PI method. The Epanechnikov kernel is used both in simulations and in applications. Consider the regression model (1), where m(x) = −6 sin 11x + 5 cotg(x − 7) , εi = φεi−1 + ei ei – i.i.d. normal random variables N(0, σ2) ε1 – N(0, σ2/(1 − φ2)) φ = 0.6, σ = 0.5, for i = 1, . . . , n = 100. The graph of the regression function m is presented in Figure 2. One hundred series are generated. For each data set, the optimal bandwidth is estimated by the PCV and PI method. Table 1 shows the comparison of means and standard deviations for these two methods. Journal of Environmental Statistics 9 ISE (PCV) ISE (Plug−in) 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 Figure 3: ISE(m(., h)) = 1 0 m(x, h) − m(x) 2 dx. 0 5 10 15 20 25 30 35 40 45 50 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 Autocorrelation function Figure 4: The autocorrelation function of the data set August 2004 – April 2005 The Integrated Square Error (ISE) is calculated for each estimate m(., h): ISE(m(., h)) = 1 0 m(x, h) − m(x) 2 dx for both PCV and PI methods and the results are displayed by means of the boxplots in Figure 3. 4. Results and discussion In this section we apply the methods described above to ozone data. 
We analyze data which were measured in the period August to April in years 2004–2005, 2005–2006, 2006–2007. The sample size is n = 273 days. The observations are correlated as it can be seen in Figure 4. We transform data to the interval [0,1] and use the PCV method and the PI method to get the optimal bandwidth. Then we re-transform the bandwidth to the original sample and obtain the final kernel estimate. Kernel estimates based on the PCV and PI methods are presented in Figure 6, Figure 7, or in Figure 8, respectively. 10 Kernel Regression Model for Total Ozone Data 0 50 100 150 200 250 100 150 200 250 300 350 400 450 8 9 10 11 12 1 2 3 4 Time DobsonUnits August 2004 − April 2005 PI method RLWR Figure 5: RLWR estimate with span = 40 (dashed line) and PI estimate with the bandwidth = 17.8 (solid line). 0 50 100 150 200 250 100 150 200 250 300 350 400 450 8 9 10 11 12 1 2 3 4 Time DobsonUnits August 2004 − April 2005 PI method PCV method Figure 6: PCV estimate with the bandwidth = 20.9 (dashed line) and PI estimate with the bandwidth = 17.8 (solid line). In paper Kalvov´a and Dubrovsk´y (1995) the robust locally wighted regression (RLWR) is employed for data processing of TOC. They recommended to optimize h subjectively. This approach needs an experience and a special knowledge of the given data sets. The advantage of our methods consists in more complex approach. These methods are general and they allow to choose the value of h automatically. We used their methodology for data April 2004 - August 2005 and the comparison of the estimate obtained by the PI method and by the robust locally weighted regression can be seen in Figure 5. The PI method yields a rather oversmoothed estimate. Our experience shows that both methods could be considered as a suitable tool for the choice of the bandwidth. But it seems that the PI method is sufficiently reliable and less time consuming than the PCV method. Presented methods can be applied to other time series not only in environmetrics but also in economics or other fields. Journal of Environmental Statistics 11 0 50 100 150 200 250 100 150 200 250 300 350 400 450 8 9 10 11 12 1 2 3 4 Time DobsonUnits August 2005 − April 2006 PI method PCV method Figure 7: PCV estimate with the bandwidth = 20.4 (dashed line) and PI estimate with the bandwidth = 21.9 (solid line). 0 50 100 150 200 250 100 150 200 250 300 350 400 450 8 9 10 11 12 1 2 3 4 Time DobsonUnits August 2006 − April 2007 PI method PCV method Figure 8: PCV estimate with the bandwidth = 17.2 (dashed line) and PI estimate with the bandwidth = 22.3 (solid line). 12 Kernel Regression Model for Total Ozone Data Acknowledgments The research was supported by The Jaroslav H´ajek Center for Theoretical and Applied Statistics (MˇSMT LC 06024). The work was supported by the Student Project Grant at Masaryk university, rector’s programme no. MUNI/A/1001/2009. References Altman N (1990). “Kernel Smoothing of Data With Correlated Errors.” Journal of the American Statistical Association, 85, 749–759. Chu CK, Marron JS (1991). “Choosing a Kernel Regression Estimator.” Statistical Science, 6(4), 404–419. ISSN 08834237. Cleveland WS (1979). “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the American Statistical Association, 74(368), 829–836. ISSN 01621459. De Brabanter K, De Brabanter J, Suykens J, De Moor B (2010). “Kernel Regression with Correlated Errors.” Computer Applications in Biotechnology, pp. 13–18. Gasser T, M¨uller HG (1979). 
“Kernel estimation of regression functions.” In T Gasser, M Rosenblatt (eds.), Smoothing Techniques for Curve Estimation, volume 757 of Lecture Notes in Mathematics, pp. 23–68. Springer Berlin / Heidelberg. H¨ardle W (1990). Applied Nonparametric Regression. 1st edition. Cambridge University Press, Cambridge. H¨ardle W, Vieu P (1992). “Kernel Regression Smoothing of Time Series.” Journal of Time Series Analysis, 13(3), 209–232. Hart JD (1991). “Kernel Regression Estimation with Time Series Errors.” Journal of the Royal Statistical Society, 53, 173–187. Herrmann E, Gasser T, Kneip A (1992). “Choice of Bandwidth for Kernel Regression when Residuals are Correlated.” Biometrika, 79, 783–795. Kalvov´a J, Dubrovsk´y M (1995). “Assessment of the Limits Between Which Daily Average Values of Total Ozone Can Normally Vary.” Meteorol. Bulletin, 48, 9–17. Kol´aˇcek J (2008). “Plug-in Method for Nonparametric Regression.” Computational Statistics, 23(1), 63–78. ISSN 0943-4062. L´aska K, Proˇsek P, Bud´ık L, Bud´ıkov´a M, Milinevsky G (2009). “Prediction of Erythemally Effective UVB Radiation by Means of Nonlinear Regression Model.” Environmetrics, 20(6), 633–646. Lejeune M (1985). “Estimation Non-param´etrique par Noyaux: R´egression Polynomiale Mobile.” Revue de Statistique Appliqu´ee, 33(3), 43–67. M¨uller HG (1987). “Weighted Local Regression and Kernel Methods for Nonparametric Curve Fitting.”Journal of the American Statistical Association, 82(397), 231–238. ISSN 01621459. Journal of Environmental Statistics 13 Nadaraya EA (1964). “On Estimating Regression.” Theory of Probability and its Applications, 9(1), 141–142. Opsomer J, Wang Y, Yang Y (2001). “Nonparametric Regression with Correlated Errors.” Statistical Science, 16(2), 134–153. Priestley MB, Chao MT (1972). “Non-Parametric Function Fitting.” Journal of the Royal Statistical Society. Series B (Methodological), 34(3), 385–392. ISSN 00359246. Stone CJ (1977). “Consistent Nonparametric Regression.” The Annals of Statistics, 5(4), 595–620. ISSN 00905364. UAC (2012). “World Ozone and Ultraviolet Radiation Data Centre (WOUDC) [data].” URL http://www.woudc.org. Wand M, Jones M (1995). Kernel smoothing. Chapman and Hall, London. Watson GS (1964). “Smooth Regression Analysis.” Sankhya - The Indian Journal of Statistics, Series A, 26(4), 359–372. ISSN 0581572X. Affiliation: Ivana Horov´a Masaryk University Department of Mathematics and Statistics Brno, Czech Republic E-mail: horova@math.muni.cz URL: https://www.math.muni.cz/~horova/ Journal of Environmental Statistics http://www.jenvstat.org Volume 4, Issue 2 Submitted: 2012-03-31 February 2013 Accepted: 2012-10-09 Journal of Statistics: Advances in Theory and Applications Volume 7, Number 1, 2012, Pages 1-23 2010 Mathematics Subject Classification: 62P05, 90B50. Keywords and phrases: credit scoring, quality indexes, Gini index, lift, lift ratio, integrated relative lift. Received February 14, 2012  2012 Scientific Advances Publishers LIFT-BASED QUALITY INDEXES FOR CREDIT SCORING MODELS AS AN ALTERNATIVE TO GINI AND KS MARTIN ŘEZÁČ and JAN KOLÁČEK Department of Mathematics and Statistics Masaryk University Kotláįská 2, 61137 Brno Czech Republic e-mail: mrezac@math.muni.cz Abstract Assessment of risk associated with the granting of credits is very successfully supported by techniques of credit scoring. 
To measure the quality, in the sense of the predictive power, of the scoring models, it is possible to use quantitative indexes such as the Gini index (Gini), the K-S statistic (KS), the c-statistic, and lift. They are used for comparing several developed models at the moment of development as well as for monitoring the quality of the model after deployment into real business. The paper deals with the aforementioned quality indexes, their properties and relationships. The main contribution of the paper is the proposal and discussion of indexes and curves based on lift. The curve of ideal lift is defined; lift ratio (LR) is defined as analogous to Gini index. Integrated relative lift (IRL) is defined and discussed. Finally, the presented case study shows a case when LR and IRL are much more appropriate to use than Gini and KS. MARTIN ĮEZÁČ AND JAN KOLÁČEK2 1. Introduction Banks and other financial institutions receive thousands of credit applications every day (in the case of consumer credits, it can be tens or hundreds of thousands every day). Since it is impossible to process them manually, automatic systems are widely used by these institutions for evaluating the credit reliability of individuals, who ask for credit. The assessment of the risk associated with the granting of credits has been underpinned by one of the most successful applications of statistics and operations research: credit scoring. Credit scoring is the set of predictive models and their underlying techniques that aid financial institutions in the granting of credits. These techniques decide who will get credit, how much credit they should get, and what further strategies will enhance the profitability of the borrowers to the lenders. Credit scoring techniques assess the risk in lending to a particular client. They do not identify “good” or “bad” (negative behaviour is expected, e.g., default) applications on an individual basis, but forecast the probability that an applicant with any given score will be “good” or “bad”. These probabilities or scores, along with other business considerations such as expected approval rates, profit, churn, and losses, are then used as a basis for decision making. Several methods connected to credit scoring have been introduced during last six decades. The most well-known and widely used are logistic regression, classification trees, the linear programming approach, and neural networks. The methodology of credit scoring models and some measures of their quality have been discussed in surveys including Hand and Henley [7], Thomas [14] or Crook et al. [4]. Even if ten years ago the list of books devoted to the issue of credit scoring was not extensive, the situation has improved in the last decade. In particular, this list now includes Anderson [1], Crook et al. [4], Siddiqi [11], Thomas et al. [15], and Thomas [16]. LIFT-BASED QUALITY INDEXES FOR CREDIT … 3 The aim of this paper is to give an overview of widely used techniques used to assess the quality of credit scoring models, to discuss the properties of these techniques, and to extend some known results. We review widely used quality indexes, their properties and relationships. The main part of the paper is devoted to lift. The curve of ideal lift is defined; lift ratio is defined as analogous to Gini index. Integrated relative lift is defined and discussed. 2. 
Measuring the Quality We can consider two basic types of quality indexes: first, indexes based on a cumulative distribution function like the KolmogorovSmirnov statistic, Gini index or lift; second, indexes based on a likelihood density function like the mean difference (Mahalanobis distance) or informational statistic. For further available measures and appropriate remarks, see Wilkie [17], Giudici [6] or Siddiqi [11]. Assume that the realization Rs ∈ of a random variable S (score) is available for each client and put the following markings:    = otherwise.,0 good,isclient,1 D (1) Distribution functions, respectively, their empirical forms, of the scores of good (bad) clients are given by ( ) ( ),1 1 1 . =∧≤= ∑= DasI n aF i N i GOODn ( ) ( ) [ ],,,0 1 1 . HLaDasI m aF i N i BADm ∈=∧≤= ∑= (2) where is is the score of i-th client, n is the number of good clients, m is the number of bad clients, and I is the indicator function, where ( ) 1true =I and ( ) .0false =I L is the minimum value of a given score, H is the maximum value. The empirical distribution function of the scores of all clients is given by MARTIN ĮEZÁČ AND JAN KOLÁČEK4 ( ) ( ) [ ],,, 1 1 . HLaasI N aF i N i ALLN ∈≤= ∑= (3) where mnN += is the number of all clients. We denote the proportion of bad (good) clients by ., mn n p mn m p GB + = + = (4) An often-used characteristic in describing the quality of the model (scoring function) is the Kolmogorov-Smirnov statistic (K-S or KS). It is defined as [ ] ( ) ( ) .max .. , aFaFKS GOODnBADm HLa −= ∈ (5) It takes values from 0 to 1. Value 0 corresponds to a random model, value 1 corresponds to the ideal model. The higher the KS, the better the scoring model. The Lorenz curve (LC), sometimes called the ROC curve (receiver operating characteristic curve), can also be successfully used to show the discriminatory power of a scoring function, i.e., the ability to identify good and bad clients. The curve is given parametrically by ( ),. aFx BADm= ( ) [ ].,,. HLaaFy GOODn ∈= (6) Each point of the curve represents some value of a given score. If we consider this value as a cut-off value, we can read the proportion of rejected bad and good clients. An example of a Lorenz curve is given in Figure 1. We can see that by rejecting 20% of good clients, we also reject 50% of bad clients at the same time. LIFT-BASED QUALITY INDEXES FOR CREDIT … 5 Figure 1. Lorenz curve (ROC). The LC for a random scoring model is represented by the diagonal line from [ ]0,0 to [ ].1,1 It is the polyline from [ ]0,0 through [ ]0,1 to [ ]1,1 in the case of an ideal model. It is obvious that the closer the curve is to the bottom right corner, the better is the model. The definition and name (LC) is consistent with Müller and Rönz [8]. One can find the same definition of the curve, but called ROC, in Thomas et al. [15]. Siddiqi [11] used the name ROC for a curve with reversed axes and LC for a curve with the CDF of bad clients on the vertical axis and the CDF of all clients on the horizontal axis. This curve is also called the CAP (cumulative accuracy profile) or lift curve, see Sobehart et al. [12] or Thomas [16]. Furthermore, it is called a gains chart in the field of marketing; see Berry and Linoff [2]. An example of CAP is displayed in Figure 2. The ideal model is now represented by a polyline from [ ]0,0 MARTIN ĮEZÁČ AND JAN KOLÁČEK6 through [ ]1,Bp to [ ].1,1 The advantage of this figure is that, one can easily read the proportion of rejected bads against the proportion of all rejected. 
For example, in the case of Figure 2, we can see that if we want to reject 70% of bads, we have to reject about 40% of all applicants. Figure 2. CAP. In connection to LC, we consider the next quality measure, the Gini index. This index describes a global quality of the scoring model. It takes values from 0 to 1 (it can take negative values for contrariwise models). The ideal model, i.e., the scoring function that perfectly separates good and bad clients, has a Gini index equal to 1. On the other hand, a model that assigns a random score to the client, has a Gini index equal to 0. It can be shown that the Gini index is greater than or equal to KS for any scoring model. Using Figure 3, it can be defined as follows: .2A BA A Gini = + = (7) LIFT-BASED QUALITY INDEXES FOR CREDIT … 7 Figure 3. Lorenz curve, Gini index. This means that, we compute the ratio of the area between the curve and the diagonal (which represents a random model) to the area between the ideal model’s curve and the diagonal. Since the axes describe a unit square, the area BA + is always equal to 0.5. Therefore, we can compute the Gini as two times the area A. Using previous markings, the computational formula of the Gini index is given by [( )1.. 2 1 − = −−= ∑ kk BADmBADm N k FFGini ( )],1.. −+× kk GOODnGOODn FF (8) where ( )kk GOODnBADm FF .. is the k-th vector value of the empirical distribution function of bad (good) clients. For further details, see Anderson [1] or Xu [18]. The Gini index is a special case of Somers’ D (Somers [13]), which is an ordinal association measure. According to Thomas [16], one can calculate the Somers’ D as MARTIN ĮEZÁČ AND JAN KOLÁČEK8 , mn bgbg D j ij i i j ij i i S ⋅ − = ∑∑∑∑ >< (9) where ( )ji bg is the number of goods (bads) in the i-th interval of scores. Furthermore, it holds that SD can be expressed by the Mann-Whitney U-statistic; see Nelsen [9] for further details. When we use CAP instead of LC, we can define the accuracy rate (AR); see Thomas [16] or Sobehart et al. [12], where it is called the accuracy ratio. Again, it is defined by the ratio of some areas. We have diagonalandCAPslmodeidealbetweenArea diagonalandcurveCAPbetweenArea ′ =AR ( ) . 10.5 diagonalandcurveCAPbetweenArea Bp− = (10) Although the ROC and CAP are not equivalent, it is true that Gini and AR are equal for any scoring model. Proof for discrete scores is given in Engelmann et al. [5]; for continuous scores, one can find it in Thomas [16]. In connection to the Gini index, the c-statistic (Siddiqi [11]) is defined as . 2 1 _ Gini statc + = (11) It represents the likelihood that a randomly selected good client has a higher score than a randomly selected bad client, i.e., ( ).01_ 2121 =∧=≥= DDssPstatc (12) It takes values from 0.5, for the random model, to 1, for the ideal model. An alternative name for the c-statistic can be found in the literature. It is known also as Harrell’s c, which is a reparameterization of Somers’ D (Newson [10]). Furthermore, it is called AUROC, e.g., in Thomas [16] or AUC, e.g., in Engelmann et al. [5]. LIFT-BASED QUALITY INDEXES FOR CREDIT … 9 3. Lift Another possible indicator of the quality of scoring model is lift, which determines the number of times that, at a given level of rejection, the scoring model is better than random selection (the random model). More precisely, the ratio is the proportion of bad clients with a score less than a (where [ ]HLa ,∈ ) to the proportion of bad clients in the general population. 
Formally, it can be expressed by ( ) ( ) ( ) ( ) ( ) ( )10 0 0 1 1 1 1 =∨= = ≤ =∧≤ == ∑ ∑ ∑ ∑ = = = = DDI DI asI DasI BadRate aCumBadRate aLift N i N i i N i i N i ( ) ( ) . 0 1 1 N m asI DasI i N i i N i ≤ =∧≤ = ∑ ∑ = = (13) It can be easily verified that the lift can be equivalently expressed as ( ) ( ) ( ) [ ].,, . . HLa aF aF aLift ALLN BADn ∈= (14) Now, we would like to discuss the form of the lift function for the case of the ideal model. This is the model for which sets of output scores of bad and good clients are disjoint. So there exists a cut-off point, for which MARTIN ĮEZÁČ AND JAN KOLÁČEK10 ( ) ( ) ( ) ( )   >=∧≤+= ≤=∧≤ =≤ .,10 ,,0 caDaSPDP caDaSP aSP (15) Thus, we can derive the form of the lift function ( ) ( )      > ≤ = ., 1 ,, 1 . ca aF ca p aLift ALLN B ideal (16) In practice, lift is computed corresponding to %100,%,20%,10 … of clients with the worst score (see Coppock [3]). Usually, it is computed by using a table with the numbers of both all and bad clients in given score bands (deciles). An example of such a table is given by Table 1. Table 1. Lift (absolute and cumulative form) computational scheme Absolutely Cumulatively Decile #Clients # Bad clients Bad rate Abs. Lift #Bad clients Bad rate Cum. Lift 1 100 35 35.0% 3.50 35 35.0% 3.50 2 100 16 16.0% 1.60 51 25.5% 2.55 3 100 8 8.0% 0.80 59 19.7% 1.97 4 100 8 8.0% 0.80 67 16.8% 1.68 5 100 7 7.0% 0.70 74 14.8% 1.48 6 100 6 6.0% 0.60 80 13.3% 1.33 7 100 6 6.0% 0.60 86 12.3% 1.23 8 100 5 5.0% 0.50 91 11.4% 1.14 9 100 5 5.0% 0.50 96 10.7% 1.07 10 100 4 4.0% 0.40 100 10.0% 1.00 All 1000 100 10.0% It is possible to compute the lift value in each decile (absolute lift in the fifth column in Table 1), but usually, and in accordance with the definition of Lift(a), the cumulative form is used. It holds that the value of lift has an upper limit of Bp/1 and tends to a value of 1 when the score tends to infinity (or to its upper limit). In our case, we can see that the LIFT-BASED QUALITY INDEXES FOR CREDIT … 11 best possible value of lift is equal to 10. We obtained the value 3.5 in the first decile, which is nothing excellent, but high enough for the model to be considered applicable in practice. Results are further illustrated in Figure 4. Figure 4. Lift value (absolute and cumulative). In the context of this approach, we define ( ) ( ( )) ( ( ))qFF qFF qLift ALLNALLN ALLNBADm 1 .. 1 .. − − =Q ( ( )) ( ],1,0, 1 1 .. ∈= − qqFF q ALLNBADm (17) where q represents the score level of %100q of the worst scores and ( )qF ALLN 1 . − can be computed as ( ) { [ ] ( ) }.,,min . 1 . qaFHLaqF ALLNALLN ≥∈=− (18) It can be easily shown that the lift function for the ideal model is now MARTIN ĮEZÁČ AND JAN KOLÁČEK12 ( ) ( ] ( ]     ∈ ∈ = .1, 1 ,,0, 1 ,B B B ideal pq q pq p qLiftQ (19) Figure 5, below, gives an example of the lift function for ideal, random, and actual models. Figure 5. QLift function, lift ratio. Using the previous Figure 5, we define lift ratio as analogous to Gini index ( ) ( ) . 1 1 1 0 1 0 − − = + = ∫ ∫ dqqLift dqqLift BA A LR idealQ Q (20) LIFT-BASED QUALITY INDEXES FOR CREDIT … 13 It is obvious that, it is a global measure of a model's quality and that it takes values from 0 to 1. Value 0 corresponds to the random model, value 1 matches the ideal model. The meaning of this index is quite simple: the higher, the better. An important feature is that lift ratio allows us to fairly compare two models developed on different data samples, which is not possible with lift. 
Since lift ratio compares areas under the lift function corresponding to actual and ideal models, the next concept is focused on the comparison of lift functions themselves. We define the relative lift function by ( ) ( ) ( ) ( ].1,0, ∈= q qLift qLift qRLift idealQ Q (21) An example of this function is presented in Figure 6. The definition domain of the function is [ ];1,0 the range is a subinterval of [ ].1,0 The graph starts at point [ ( )],, minmin qLiftpq B Q⋅ where minq is a positive number near to zero. Then, it falls to a local minimum in point [ ( )]BBB pLiftpp Q⋅, and then rises up to point [ ].1,1 It is obvious that the graph of relative lift function for a better model is closer to the top line, which represents the function for the ideal model. MARTIN ĮEZÁČ AND JAN KOLÁČEK14 Figure 6. Relative lift function. Now, it is natural to ask what we obtain when we integrate the relative lift function. We define the integrated relative lift (IRL) by ( ) . 1 0 dqqRLiftIRL ∫= (22) It takes values from , 2 5.0 2 Bp + for the random model, to 1, for the ideal model. Again the following holds: the higher, the better. This global measure of scoring a model’s quality has an interesting connection to the c-statistic. We made a simulation with scores generated from a normal distribution. The scores of bad clients had a mean equal to 0 and a variance equal to 1. The scores of good clients had a mean and variance LIFT-BASED QUALITY INDEXES FOR CREDIT … 15 from 0.1 to 10 with a step equal 0.1. The number of samples and sample size were Bp,1000 was equal to 0.1. IRL and the c-statistic were computed for each sample and each value of the mean and variance of a good clients’ scores. Finally, means of IRL and the c-statistic were computed. The results are presented in Figure 7. Part (b) represents the contour plot of the figure in part (a). The simulation shows that IRL and the c-statistic are approximately equal when the variances of good and bad clients are equal. Furthermore, it shows that they significantly differ when the variances are different and the ratio of the mean and variance of good clients is near to 1. 4. Case Study To illustrate the advantage of the proposed indexes, we introduce a simple case study. We consider two scoring models with a score distribution given in Table 2. Furthermore, we consider the standard meaning of scores, i.e., a higher score band means better clients (clients with the lowest scores, i.e., clients in score band 1, have the highest probability of default). MARTIN ĮEZÁČ AND JAN KOLÁČEK16 (a) (b) Figure 7. Difference of IRL and c-stat (a) and its contour plot (b). LIFT-BASED QUALITY INDEXES FOR CREDIT … 17 Table 2. Score distribution and QLift of given scoring models Scoring model 1 Scoring model 2 Score band #Clients q # Bad clients Cumul. bad rate QLift #Bad clients Cumul. bad rate QLift 1 100 0.1 20 20.0% 2.00 35 35.0% 3.50 2 100 0.2 18 19.0% 1.90 16 25.5% 2.55 3 100 0.3 17 18.3% 1.83 8 19.7% 1.97 4 100 0.4 15 17.5% 1.75 8 16.8% 1.68 5 100 0.5 12 16.4% 1.64 7 14.8% 1.48 6 100 0.6 6 14.7% 1.47 6 13.3% 1.33 7 100 0.7 4 13.1% 1.31 6 12.3% 1.23 8 100 0.8 3 11.9% 1.19 5 11.4% 1.14 9 100 0.9 3 10.9% 1.09 5 10.7% 1.07 10 100 1.0 2 10.0% 1.00 4 10.0% 1.00 All 1000 100 100 The Gini index for each model is equal to 0.420. KS is equal to 0.356 for model 1 and to 0.344 for model 2. According to these numbers, one can say that both models are almost the same, maybe the first one is slightly better. 
However, if we look at the models in more detail, we find that they differ significantly. We get the first insight from their Lorenz curves in Figure 8. MARTIN ĮEZÁČ AND JAN KOLÁČEK18 Figure 8. Lorenz curves for model 1 and model 2. We can see that model 1 is stronger for higher score bands. This means that this model better separates the good from the best clients. On the other hand, model 2 is stronger for lower score bands, which means that it better separates the bad from the worst clients. We can read the same result from the figures of QLift and RLift in Figure 9. LIFT-BASED QUALITY INDEXES FOR CREDIT … 19 Figure 9. QLift and RLift for model 1 and model 2. MARTIN ĮEZÁČ AND JAN KOLÁČEK20 It is necessary to mention one computational problem at this point. In the discrete case, as in the case of Table 2, we do not know the value of QLift for q less than 0.1. Since QLift is not defined for ,0=q we need to extrapolate it somehow. According to the shape of the QLift curve, we propose using quadratic extrapolation, which yields ( ) ( ) ( ) ( ).3.02.031.030 LiftLiftLiftLift QQQQ +⋅−⋅= (23) When we have a full data set, we can use formula (17). In this case, the extrapolation is not needed. Of course, we still do not have the value QLift (0). However, if we start the computation of QLift in some positive value of q, which is sufficiently near to zero, the final result is precise enough. Overall, we can compare our two scoring models. Table 3, below, contains values of Gini indexes, K-S statistics, values of QLift(0.1), LR indexes, and IRL indexes. QLift(0.1) is a local measure of a model’s quality; model 2 was designed to be better in the first score bands, hence it is natural that the value of QLift(0.1) is significantly higher for model 2, concretely 3.5 versus 2.0. On the other hand, all remaining indexes are global measures of a model’s quality. Models were designed to have the same Gini index and similar KS. However, we can see that LR and IRL significantly differ for our models, 0.242 versus 0.372 and 0.699 versus 0.713, respectively. Table 3. Quality indexes of two assessed scoring models Scoring model 1 Scoring model 2 Gini 0.420 0.420 KS 0.356 0.344 QLift(0.1) 2.000 3.500 LR 0.242 0.372 IRL 0.699 0.713 LIFT-BASED QUALITY INDEXES FOR CREDIT … 21 Finally, if the expected reject rate is up to 40%, which is a very natural assumption, using LR and IRL, we can state that model 2 is better than model 1 although their Gini indexes are equal and even their KS are in reverse order. 5. Conclusion In Section 2, we presented widely used indexes for the assessment of credit scoring models. We focused mainly on the definitions of Lorenz curve, CAP, Gini index, AR, and lift. The Lorenz curve is sometimes confused with ROC. The discussion of their definitions is given within the paper. We suggest using the definition of the Lorenz curve given in Müller and Rönz [8], the definition of ROC given in Siddiqi [11], and the definition of CAP given in Sobehart et al. [12]. The main part of the paper, Section 3, was devoted to lift. Formulas for lift in basic and quantile form were presented as well as their forms for ideal models. These formulas allow the calculation of the value of lift for any given score and any given quantile level and comparison with the best obtainable results. Lift ratio was presented as analogous to Gini index. An important feature is that LR allows the fair comparison of two models developed on different data samples, which is not possible with lift or QLift. 
Furthermore, a relative lift function was proposed, which shows the ratio of the QLifts of the actual and ideal models. Finally, integrated relative lift was defined. The connection to the c-statistic was presented by means of a simulation by using normally distributed scores. This simulation showed that IRL and the c-statistic are approximately equal in the case when the variances of good and bad clients are equal. Despite the high popularity of the Gini index and KS, we conclude that the proposed lift based indexes are more appropriate for assessing the quality of credit scoring models. In particular, it is better to use them in the case of an asymmetric Lorenz curve. In such cases, using the Gini index or KS during the development process could lead to the selection of a weaker model. MARTIN ĮEZÁČ AND JAN KOLÁČEK22 Acknowledgement This research was supported by our department and by The Jaroslav Hájek Center for Theoretical and Applied Statistics (grant No. LC 06024). References [1] R. Anderson, The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation, Oxford University Press, Oxford, 2007. [2] M. J. A. Berry and G. S. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 2nd Edition, Wiley, Indianapolis, 2004. [3] D. S. Coppock, Why Lift? DM Review Online, (2002). [Accessed on 1 December 2009]. www.dmreview.com/news/5329-1.html [4] J. N. Crook, D. B. Edelman and L. C. Thomas, Recent developments in consumer credit risk assessment, European Journal of Operational Research 183(3) (2007), 1447-1465. [5] B. Engelmann, E. Hayden and D. Tasche, Measuring the Discriminatory Power of Rating System, (2003). [Accessed on 4 October 2010]. http://www.bundesbank.de/download/bankenaufsicht/dkp/200301dkp_b.pdf [6] P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, Wiley, Chichester, 2003. [7] D. J. Hand and W. E. Henley, Statistical classification methods in consumer credit scoring: A review, Journal of the Royal Statistical Society, Series A 160(3) (1997), 523-541. [8] M. Müller and B. Rönz, Credit Scoring using Semiparametric Methods, In: J. Franke, W. Härdle and G. Stahl (Eds.), Measuring Risk in Complex Stochastic Systems, Springer-Verlag, New York, 2000. [9] R. B. Nelsen, Concordance and Gini’s measure of association, Journal of Nonparametric Statistics 9(3) (1998), 227-238. [10] R. Newson, Confidence intervals for rank statistics: Somers’ D and extensions, The Stata Journal 6(3) (2006), 309-334. [11] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, Wiley, New Jersey, 2006. [12] J. Sobehart, S. Keenan and R. Stein, Benchmarking Quantitative Default Risk Models: A Validation Methodology, Moody’s Investors Service, (2000). [Accessed on 4 October 2010]. http://www.algorithmics.com/EN/media/pdfs/Algo-RA0301-ARQ-DefaultRiskModels.pdf LIFT-BASED QUALITY INDEXES FOR CREDIT … 23 [13] R. H. Somers, A new asymmetric measure of association for ordinal variables, American Sociological Review 27 (1962), 799-811. [14] L. C. Thomas, A survey of credit and behavioural scoring: Forecasting financial risk of lending to consumers, International Journal of Forecasting 16(2) (2000), 149-172. [15] L. C. Thomas, D. B. Edelman and J. N. Crook, Credit Scoring and its Applications, SIAM Monographs on Mathematical Modelling and Computation, Philadelphia, 2002. [16] L. C. 
Thomas, Consumer Credit Models: Pricing, Profit, and Portfolio, Oxford University Press, Oxford, 2009. [17] A. D. Wilkie, Measures for Comparing Scoring Systems, In: L. C. Thomas, D. B. Edelman and J. N. Crook (Eds.): Readings in Credit Scoring, Oxford University Press, Oxford, (2004), 51-62. [18] K. Xu, How has the literature on Gini’s index evolved in past 80 years? (2003). [Accessed on 1 December 2009]. economics.dal.ca/RePEc/dal/wparch/howgini.pdf g This article was downloaded by: [ Masarykova Univerzita v Brne] , [ Ivana Horova] On: 12 January 2012, At: 08: 02 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Communications in Statistics - Theory and Methods Publication details, including instructions for authors and subscription information: http:/ / www.tandfonline.com/ loi/ lsta20 Visualization and Bandwidth Matrix Choice Ivana Horová a , Jan Koláček a & Kamila Vopatová a a Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic Available online: 10 Jan 2012 To cite this article: Ivana Horová, Jan Koláček & Kamila Vopatová (2012): Visualization and Bandwidth Matrix Choice, Communications in Statistics - Theory and Methods, 41:4, 759-777 To link to this article: http:/ / dx.doi.org/ 10.1080/ 03610926.2010.529539 PLEASE SCROLL DOWN FOR ARTICLE Full terms and conditions of use: http: / / www.tandfonline.com/ page/ terms-and-conditions This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material. Communications in Statistics—Theory and Methods, 41: 759–777, 2012 Copyright © Taylor & Francis Group, LLC ISSN: 0361-0926 print/1532-415X online DOI: 10.1080/03610926.2010.529539 Visualization and Bandwidth Matrix Choice IVANA HOROVÁ, JAN KOLÁ ˇCEK, AND KAMILA VOPATOVÁ Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic Kernel smoothers are among the most popular nonparametric functional estimates. These estimates depend on a bandwidth that controls the smoothness of the estimate. While the literature for a bandwidth choice in a univariate density estimate is quite extensive, the progress in the multivariate case is slower. The authors focus on a bandwidth matrix selection for a bivariate kernel density estimate provided that the bandwidth matrix is diagonal. A common task is to find entries of the bandwidth matrix which minimizes the Mean Integrated Square Error (MISE). It is known that in this case there exists explicit solution of an asymptotic approximation of MISE (Wand and Jones, 1995). In the present paper we pay attention to the visualization and optimizers are presented as intersection of bivariate functional surfaces derived from this explicit solution and we develop the method based on this visualization. 
A simulation study compares the least square cross-validation method and the proposed method. Theoretical results are applied to real data. Keywords Asymptotic mean integrated square error; Bandwidth matrix; Mean integrated square error; Product kernel. Mathematics Subject Classification 62G07; 62H12. 1. Introduction Methods for a bandwidth choice in a univariate density estimate have been developed in many papers and monographs (e.g., Cao et al., 1994; Chaudhuri and Marron, 1999; Härdle et al., 2004; Horová et al., 2002; Horová and Zelinka, 2007; Silverman, 1989; Taylor, 1989; Wand and Jones, 1995). In this paper we focus on a problem of a data-driven choice of a bandwidth matrix in bivariate kernel density estimates. Bivariate kernel density estimation problem is an excellent setting for understanding aspects of multivariate kernel smoothing. This problem, despite being the simplest multivariate density estimation problem, presents many challenges when it comes to selecting the correct amount of smoothing (i.e., choosing of a bandwidth matrix H). Most of popular bandwidth Received July 19, 2010; Accepted September 28, 2010 Address correspondence to Ivana Horová, Department of Mathematics and Statistics, Masaryk University, Kotlarska 2, Brno 61137, Czech Republic; E-mail: horova@math.muni.cz 759 Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 760 Horová et al. selection methods in a univariate case (e.g., Cao et al., 1994; Härdle et al., 2004) can be transferred into multivariate settings. The least squares cross-validation, the biased cross-validation, the smoothed cross-validation, and plug-in methods in multivariate case have been developed and widely discussed (Chacón and Duong, 2009; Duong and Hazelton, 2003, 2005a,b; Sain et al., 1994; Scott, 1992; Wand and Jones, 1994). The problem of the bandwidth matrix selection can be simplified by imposing constraints on H (Wand and Jones, 1995). A common approach to the multivariate smoothing is to first rescale the data so the sample variances are equal in each dimension—this approach is called scaling or sphering the data so the sample covariance matrix is the identity (e.g., Duong, 2007; Wand and Jones, 1993). The aim of the present paper is to propose methods for the bandwidth matrix choice in bivariate case without using any pretransformations of the data. It is well known that a visualization is an important component of a nonparametric data analysis (e.g., Chaudhuri and Marron, 1999; Godtliebsen et al., 2002). We use this effective strategy to clarify the process of the bandwidth matrix choice by using bivariate functional surfaces. The proposed method uses an optimally balanced relation between bias squared and variance and a suitable estimate of the asymptotic approximation of Mean Integrated Square Error (MISE). The paper is organized as follows: In Section 2 we describe the basic properties of the multivariate density estimates. Section 3 is devoted to the mean integrated square error and its minimization. In Section 4 we deal with asymptotic MISE (AMISE) and its minimization. In Section 5 we describe the idea of our method and the theoretical results are explain by means of bivariate functional surfaces. In Section 6 we conduct a simulation study comparing the least squares crossvalidation (LSCV) method and the proposed method. In Section 7 the theoretical results are applied to real data. 2. Kernel Density Estimation Consider a d-variate random sample X1 Xn coming from an unknown density f. 
We denote Xi1 Xid the components of Xi and a generic vector x ∈ d has the representation x = x1 xd T . For a d-variate random sample X1 Xn drawn from the density f the kernel density estimator is defined ˆf x H = 1 n n i=1 KH x − Xi (1) where H is a symmetric positive definite d × d matrix called the bandwidth matrix, and KH x = H −1/2 K H−1/2 x , where H stands for the determinant of H, and K is a d-variate kernel function. The kernel function K is often taken to be a d-variate probability density function. There are two types of multivariate kernels created from a symmetric univariate kernel k—a product kernel KP and a spherically symmetric kernel KS : KP x = d i=1 k xi KS x = ckk √ xT x Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 Visualization and Bandwidth Choice 761 where c−1 k = k √ xT x dx. The choice of a kernel does not influence the estimate as significantly as the bandwidth matrix. The choice of the smoothing matrix H is of a crucial importance. This matrix controls the amount and the direction of the multivariate smoothing. Let ℋℱ denote the class of symmetric, positive definite d × d matrices. The matrix H ∈ ℋℱ has 1 2 d d + 1 independent entries which have to be chosen. A simplification can be obtained by imposing the restriction H ∈ ℋ , where ℋ ⊂ ℋℱ is the subclass of diagonal positive definite matrices: H = diag h2 1 h2 d . A further simplification follows from the restriction H ∈ ℋ where ℋ = h2 Id h > 0 , Id is d × d identity matrix and leads to the single bandwidth estimator (Wand and Jones, 1995). Using the single bandwidth matrix parametrization class ℋ is not advised for data which have different dispersions in the coordinate directions (Wand and Jones, 1993). On the other hand, the bandwidth selectors in the general ℋℱ class are able to handle differently dispersed data but are computationally intensive. So the ℋ diagonal matrix class is a compromise between computational speed with sufficient flexibility. For this reason we turn our attention to the bivariate kernel density estimate provided that the bandwidth matrix is diagonal (i.e., H = diag h2 1 h2 2 ). First, let us make some notation: • will be shorthand for and dx will be shorthand for dx1dx2, V K = K2 x dx, and • f stands for the gradient and 2 f for the Hessian matrix. f =   f x x1 f x x2   2 f =    2f x x2 1 2f x x1 x2 2f x x1 x2 2f x x2 2    For the next steps we need a few assumptions about the kernel function K, the bandwidth matrix H, and the density f: (A1) K is a product bivariate kernel function satisfying K x dx = 1 xK x dx = 0 xxT K x dx = 2 K I2 (A2) H = Hn is a sequence of diagonal bandwidth matrices such that n−1 h1h2 −1 and h2 1 and h2 2 approach zero as n → . (A3) Each entry of the Hessian matrix 2 f is piecewise continuous and square integrable. 3. MISE and Its Minimization The quality of the estimate (1) can be expressed in terms of MISE (Wand and Jones, 1995) MISE H = E ˆf x H − f x 2 dx = var ˆf x H dx + bias2 ˆf x H dx Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 762 Horová et al. that is, MISE H = 1 nh1h2 V K + o nh1h2 −1 + 1 4 2 2 K h4 1 4 0 + 2h2 1h2 2 2 2 + h4 2 0 4 + o h2 1 + h2 2 2 where k ℓ = 2 f x2 1 k/2 2 f x2 2 ℓ/2 dx k ℓ = 0 2 4 k + ℓ = 4 Let HMISE be a minimizer of MISE with respect to H, that is, HMISE = arg min H∈ℋ MISE The well known method of estimating HMISE is the LSCV method (Duong and Hazelton, 2005b; Wand and Jones, 1995). 
The LSCV objective function is LSCV H = ˆf x H 2 dx − 2 n n i=1 ˆf−i Xi H ˆf−i Xi H = 1 n − 1 n j=1 j=i KH Xi − Xj This function can be written in terms of convolutions f ∗ g x = f u g x − u du (Duong and Hazelton, 2005b): LSCV H = n−2 n i j=1 KH ∗ KH − 2KH Xi − Xj + 2n−1 KH 0 Moreover, HLSCV = arg minH∈ℋ LSCV is an unbiased estimate of H in the sense E LSCV H = MISE ˆf · H − f2 x dx 4. AMISE and Its Minimization Since MISE is not mathematically tractable, we employ an AMISE, which can be written as a sum of an asymptotic integrated variance and an asymptotic integrated square bias: AMISE H = V K nh1h2 AIVar + 1 4 2 K 2 h4 1 4 0 + 2h2 1h2 2 2 2 + h4 2 0 4 AIBias2 (2) and HAMISE stands for minimum of AMISE HAMISE = arg min H∈ℋ AMISE Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 Visualization and Bandwidth Choice 763 First, we summarize properties of AMISE and HAMISE. As a multivariate analogue of the functional, which minimization yields optimal kernels, we consider the functional W K = V K 2/3 2 K 2/3 Moreover, we define as a canonical factor 3 = V K 2 K 2 Making some calculations we arrive at the following lemma. Lemma 4.1. AMISE(H) can be expressed in the form AMISE H = W K nh1h2 + 1 4 2 h4 1 4 0 + 2h2 1h2 2 2 2 + h4 2 0 4 (3) It can be shown (Wand and Jones, 1995) that the entries of HAMISE are equal to h2 1 AMISE =   3/4 0 4 V K n 2 K 2 3/4 4 0 2 2 + 1/2 0 4 1/2 4 0   1/3 (4) h2 2 AMISE =   3/4 4 0 V K n 2 K 2 3/4 0 4 2 2 + 1/2 0 4 1/2 4 0   1/3 Thus h2 i AMISE = O n−1/3 i = 1 2. Inserting these quantities into the formula (2), we arrive at the following lemma. Lemma 4.2. Let HAMISE ∈ ℋ be a minimizator of AMISE with entries given by formula (4). Then varˆf x HAMISE dx AIVar = 2 biasˆf x HAMISE 2 dx AIBias2 (5) This relation is of great importance because it serves as a basis for a method we are going to present. It means that minimization of AMISE is equivalent to seeking for HAMISE such that (5) is satisfied. Further, the use of formulas (4) in the relation (3) yields AMISE HAMISE = 3 2 n−2/3 W K 2 2 + 1/2 0 4 1/2 4 0 1/3 (6) that is, AMISE HAMISE = O n−2/3 . It is easy to show that h2 AMISE h1 AMISE = 4 0 0 4 1/4 (7) Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 764 Horová et al. and 2 2 + 1/2 0 4 1/2 4 0 1/3 = n1/3h1 AMISE h2 AMISE Then substituting 2 2 + 1/2 0 4 1/2 4 0 1/3 into (6) we obtain AMISE HAMISE = 3W K 2nh1 AMISEh2 AMISE This formula allows to separate kernel effects from bandwidth matrix effects in AMISE and thus offers a possibility to choose the kernel and the bandwidth matrix in some automatic and optimal way. For a univariate case an automatic procedure for simultaneous choice of a bandwidth, a kernel, and an order of the kernel was proposed previously (Horová et al., 2002). Remark. The biased cross-validation methods and smoothed cross-validation method for estimating HAMISE have been widely discussed previously (Duong and Hazelton, 2005b; Sain et al., 1994; Wand and Jones, 1994). 5. Proposed Methods Our method is based on formula (5) and on a suitable estimate of AMISE. In Horová et al. (2008) a suitable estimate of AMISE was used and the extension of the method for a univariate case was presented in Horová and Zelinka (2007). Here, we briefly describe this method and provide theoretical results. 
Let AMISE H = varˆf x H dx + biasˆf x H 2 dx where varˆf x H dx = 1 n K2 H x − y ˆf y H dy dx = 1 n H −1/2 K2 z ˆf x − H1/2 z H dzdx = 1 n H −1/2 V K ˆf x H dx = 1 n H −1/2 V K and biasˆf x H 2 dx = KH x − y ˆf y H dy − ˆf x H 2 dx = K z ˆf x − H1/2 z H dz − ˆf x H 2 dx Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 Visualization and Bandwidth Choice 765 = 1 n2 n i j=1 KH ∗ KH ∗ KH ∗ KH − 2KH ∗ KH ∗ KH + KH ∗ KH Xi − Xj Here a connection of the estimated squared bias term with the bootstrap method of Taylor (1989) can be seen. Hereinafter, HAMISE = diag ˆh2 1 AMISE ˆh2 2 AMISE is the minimizer of AMISE over the class of diagonal bandwidth matrices ℋ (i.e., HAMISE = arg minH∈ℋ AMISE). Let g h1 h2 stand for the sum of convolutions in the form biasˆf x H 2 dx, that is, g h1 h2 = n i=1 n j=1 KH ∗ KH ∗ KH ∗ KH − 2KH ∗ KH ∗ KH + KH ∗ KH Xi − Xj The idea of our method is based on Lemma 4.2. Thus, we are seeking for ˆh1, ˆh2 such that 1 n 1 ˆh1 ˆh2 V K = 2 1 n2 g ˆh1 ˆh2 that is, nV K = 2ˆh1 ˆh2g ˆh1 ˆh2 (8) It means that minimization of AMISE could be achieved through the solving Eq. (8). But (8) is the nonlinear equation for two variables and thus we need another relation between h1 and h2. This problem will be dealt with in the next section. Now we explain the rationale of the proposed method. Theorem 5.1. Let assumptions (A1), (A2), (A3) be satisfied and let the density f have continuous partial derivatives of the fourth order. Then E KH x − y ˆf y H dy = f x + 2 K tr H 2 f x + 1 4 2 K 2 tr H 2 f H 2 f x + o trH The proof is given in the Appendix. Corollary 5.1. Under assumptions of Theorem 5.1, the relation E biasˆf x H = biasˆf x H + o trH is valid. The last relation confirms that the solution of Eq. (8) may be expected to be reasonably close to HAMISE. Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 766 Horová et al. Figure 1. Optimal values of h1 and h2 lie on the curve h1 h2 = 2h1h2g h1 h2 − nV K = 0, which is an intersection of the surface h1 h2 (light gray) and the coordinate plane z = 0 (white). Remark. Jones et al. (1991) was treated of the properties of the estimated square bias for a univariate case. Remark. Wand and Jones (1995) reminded of solve-the-equation (STE) univariate selectors, which require solving nonlinear equation with respect to h. But their idea is different from that which we present. Figure 1 shows the shape of the functional h1 h2 = 2h1h2g h1 h2 − nV K and the point we are seeking lies on curve h1 h2 = 0. Obviously, it has not a unique solution, and thus we need another relationship between h1 and h2 to get the unique solution. We propose two possibilities how to find this relationship. 5.1. M1 Method Using Scott’s rule (Scott, 1992) ˆhi = ˆin−1/6 for i = 1 2 gives the other relationship between h1 and h2. It is easy to see that h2 = ˆch1 ˆc = ˆ2 ˆ1 and ˆ can be estimated by a sample standard deviation, or by some robust method (e.g., a median deviation). Now, the system of two equations for two unknowns h1, h2 has to be solved: M1    2h1h2g h1 h2 = nV K h2 = ˆch1 (9) Figure 2 demonstrates the solution of the system (9) as an intersection of the functional and planes. As it will be shown in a simulation study, the method is rather inappropriate because the entries of covariance matrix are often not able to take into account the curvature of f and its orientation. Downloadedby[MasarykovaUniverzitavBrne],[IvanaHorova]at08:0212January2012 Visualization and Bandwidth Choice 767 Figure 2. 
5.1. M1 Method

Using Scott's rule (Scott, 1992), $\hat h_i = \hat\sigma_i\,n^{-1/6}$, $i = 1, 2$, gives the other relationship between $h_1$ and $h_2$. It is easy to see that $h_2 = \hat c\,h_1$ with $\hat c = \hat\sigma_2/\hat\sigma_1$, where $\hat\sigma_i$ can be estimated by a sample standard deviation or by some robust method (e.g., a median deviation). Now the system of two equations in the two unknowns $h_1$, $h_2$ has to be solved:
\[
\mathrm{M1}: \quad \begin{cases} 2h_1h_2\,g(h_1,h_2) = nV(K),\\ h_2 = \hat c\,h_1.\end{cases} \tag{9}
\]
Figure 2 demonstrates the solution of the system (9) as an intersection of the functional and planes. As will be shown in the simulation study, the method is rather inappropriate in some cases, because the entries of the covariance matrix are often not able to take into account the curvature of f and its orientation.

Figure 2. M1 method: the point $(\hat h_1, \hat h_2)$ we are looking for is an intersection of the plane $h_2 - \hat c h_1 = 0$ (dark gray), the surface $\Psi(h_1,h_2)$ (light gray), and the coordinate plane $z = 0$ (white).

5.2. M2 Method

The second method can be considered a hybrid of the biased cross-validation method (Duong and Hazelton, 2005b; Sain et al., 1994) and the plug-in method (Wand and Jones, 1994). We are concerned with fact (7), that is,
\[
h_{2,\mathrm{AMISE}}^4\,\psi_{04} = h_{1,\mathrm{AMISE}}^4\,\psi_{40}, \tag{10}
\]
where $\psi_{04}$ and $\psi_{40}$ are as in (2). For the sake of simplicity, the notation $h_1 = h_{1,\mathrm{AMISE}}$, $h_2 = h_{2,\mathrm{AMISE}}$ is used in what follows. Relation (10) means that $h_1$, $h_2$ should be such that this equation is satisfied; at this step, estimates of $\psi_{04}$ and $\psi_{40}$ are needed. Since we assume that K is a product kernel, we can express them as
\[
\hat\psi_{04} = n^{-2}\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_2^2} * \frac{\partial^2 K_H}{\partial x_2^2}\right)(X_i - X_j), \qquad
\hat\psi_{40} = n^{-2}\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_1^2} * \frac{\partial^2 K_H}{\partial x_1^2}\right)(X_i - X_j),
\]
where, instead of a pilot bandwidth matrix G as in the plug-in method, the bandwidth matrix H itself is used (i.e., $\hat\psi_{04}$, $\hat\psi_{40}$ estimate the density curvature in both directions). Now relation (10) yields
\[
h_2^4\, n^{-2}\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_2^2} * \frac{\partial^2 K_H}{\partial x_2^2}\right)(X_i - X_j) = h_1^4\, n^{-2}\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_1^2} * \frac{\partial^2 K_H}{\partial x_1^2}\right)(X_i - X_j). \tag{11}
\]
Hence we have the second equation for $h_1$, $h_2$. The proposed method is described by the system
\[
\mathrm{M2}: \quad \begin{cases} 2h_1h_2\,g(h_1,h_2) = nV(K),\\[1mm] h_2^4\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_2^2}*\frac{\partial^2 K_H}{\partial x_2^2}\right)(X_i-X_j) = h_1^4\sum_{i,j=1}^{n}\left(\frac{\partial^2 K_H}{\partial x_1^2}*\frac{\partial^2 K_H}{\partial x_1^2}\right)(X_i-X_j).\end{cases} \tag{12}
\]
The solution $(\hat h_1,\hat h_2)$ of this nonlinear system is an estimate of $(h_{1,\mathrm{AMISE}}, h_{2,\mathrm{AMISE}})$. The system can be solved by Newton's method.

Table 1. Target densities. Here $N(\mu_1,\mu_2;\sigma_1^2,\sigma_2^2;\rho)$ denotes the bivariate normal density with mean $(\mu_1,\mu_2)^T$, variances $\sigma_1^2$, $\sigma_2^2$ and correlation $\rho$; products denote densities with independent coordinates.

Normal I      N(0, 0; 1/4, 1; 0)
Normal II     (1/2) N(-3/2, 0; 1/16, 1; 0) + (1/2) N(3/2, 0; 1/16, 1; 0)
Normal III    (1/2) N(0, 0; 1, 1; 0) + (1/2) N(3, 0; 1, 1/2; 0)
Normal IV     (1/3) N(0, 0; 1, 1; 0) + (1/3) N(0, 4; 1, 4; 0) + (1/3) N(4, 0; 4, 1; 0)
Normal V      (1/4) N(0, 0; 1, 1; 0) + (3/4) N(4, 3; 4, 3; 0)
Normal VI     (1/5) N(0, 0; 1, 1; 0) + (1/5) N(1/2, 1/2; 4/9, 4/9; 0) + (3/5) N(13/12, 13/12; 25/81, 25/81; 0)
Normal VII    (1/3) N(0, -3; 1, 1/16; 0) + (1/3) N(0, 0; 1, 1/16; 0) + (1/3) N(0, 3; 1, 1/16; 0)
Normal VIII   (1/3) N(0, -3; 1, 1/16; 0) + (1/3) N(0, 0; 1/2, 1/16; 0) + (1/3) N(0, 3; 1/8, 1/16; 0)
Normal IX     (1/3) N(-6/5, 0; 9/16, 9/16; 7/10) + (1/3) N(0, 0; 9/16, 9/16; -7/10) + (1/3) N(6/5, 0; 9/16, 9/16; 7/10)
Beta Beta     B(2, 4) x B(2, 6)
Beta Weibull  B(2, 4) x W(2, 3)
Gamma Beta    Gamma(2, 1) x B(2, 6)
LogNormal     LN(0, 0; 1, 1; 0)

Figure 4. Contour plots of the normal target densities.

Figure 5. Contour plots of the nonnormal target densities.

In Fig. 3 the graphs of the surfaces $\Psi(h_1,h_2) = 2h_1h_2\,g(h_1,h_2) - nV(K)$ and $\Phi(h_1,h_2)$, the difference of the two sides of (11), are presented; the solution of the system yields the estimates $\hat h_1$ and $\hat h_2$.

Figure 3. The searched point $(\hat h_1,\hat h_2)$ is an intersection of the surface $\Psi(h_1,h_2)$ (light gray), the coordinate plane $z = 0$ (white), and the surface $\Phi(h_1,h_2)$ (dark gray).

Remark. It is clear that $\hat h_{i,\mathrm{AMISE}}^2 = O(n^{-1/3})$. Asymptotic properties and the rate of convergence of $\hat H_{\mathrm{AMISE}}$ to $H_{\mathrm{AMISE}}$ can be treated in a similar way as in Duong and Hazelton (2005a,b); Duong and Hazelton (2005a) showed that the discrepancy between $H_{\mathrm{AMISE}}$ and $H_{\mathrm{MISE}}$ is asymptotically negligible.
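Before turning to the simulations, here is a minimal Matlab sketch of the M1 selector: substituting $h_2 = \hat c h_1$ from (9) into $\Psi$ reduces the problem to a univariate root search (psi8 is the sketch given earlier; in practice fzero may need a bracketing interval on which $\Psi$ changes sign).

% M1: solve Psi(h1, c*h1) = 0 for h1, with c estimated from the data.
c    = std(X(:,2)) / std(X(:,1));
h10  = std(X(:,1)) * size(X,1)^(-1/6);      % Scott's-rule starting value
h1   = fzero(@(h) psi8([h, c*h], X), h10);
H_M1 = diag([h1, c*h1].^2);                 % selected bandwidth matrix

The M2 system (12) can be handled analogously with fsolve once the convolutions of the kernel's second derivatives are coded.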
6. Simulation Study

In this section we conduct a simulation study comparing the LSCV method with the M1 and M2 methods. Samples of size n = 100 were drawn from the densities listed in Table 1; bandwidth matrices were selected for 100 random samples generated from each density. Contour plots of the target densities are displayed in Figures 4 and 5. As a criterion for the comparison of data-driven bandwidth matrix selectors, the average of the integrated square errors,
\[
\overline{\mathrm{ISE}} = \mathrm{avg}_{\hat H}\int\left(\hat f(x,\hat H) - f(x)\right)^2 dx, \tag{13}
\]
is used, where the average is taken over the simulated realizations. The criterion $\overline{\mathrm{IAE}} = \mathrm{avg}_{\hat H}\int\left|\hat f(x,\hat H) - f(x)\right|dx$ could also be considered. Table 2 brings the results of this comparison.

Table 2. $\overline{\mathrm{ISE}}$: the average of ISE, with a standard error in parentheses; all entries are multiples of $10^{-2}$.

Density        LSCV             M1               M2
Normal I       1.58 (0.150)     0.91 (0.041)     0.92 (0.042)
Normal II      1.82 (0.068)     3.59 (0.045)     1.39 (0.043)
Normal III     0.62 (0.040)     0.47 (0.016)     0.49 (0.017)
Normal IV      0.28 (0.024)     0.20 (0.007)     0.23 (0.008)
Normal V       0.23 (0.013)     0.18 (0.005)     0.18 (0.005)
Normal VI      1.55 (0.110)     1.00 (0.045)     1.01 (0.045)
Normal VII     1.23 (0.063)     5.51 (0.146)     1.11 (0.075)
Normal VIII    2.92 (0.126)     5.52 (0.144)     2.76 (0.124)
Normal IX      1.98 (0.084)     1.91 (0.044)     1.81 (0.048)
Beta Beta      30.7 (1.94)      19.3 (0.71)      19.9 (1.04)
Beta Weibull   5.92 (0.420)     3.72 (0.151)     4.12 (0.248)
Gamma Beta     5.93 (0.324)     4.05 (0.128)     4.27 (0.221)
LogNormal      2.49 (0.060)     2.51 (0.065)     2.81 (1.601)

Figures 6, 7 and 8 show the distributions of the entries $\hat h_1$ and $\hat h_2$ of the bandwidth matrices $\hat H_{\mathrm{AMISE}}$ in the $(h_1, h_2)$ coordinate plane. We observe that the LSCV estimates of $H_{\mathrm{AMISE}}$ suffer from large variability; this could be explained by the fact that the $\mathrm{MISE}(H)$ surface is rather flat near $H_{\mathrm{MISE}}$. The M1 and M2 methods perform very similarly; however, the M1 estimator fails for the densities Normal II, Normal VII and Normal VIII, because Scott's (1992) rule does not fully account for the curvature of f. The same problem occurs in the application to real data shown in the next section. The advantage of the M1 method lies in its simplicity.

Figure 6. Distribution of $\hat h_1$ and $\hat h_2$ (normal densities).

Figure 7. Distribution of $\hat h_1$ and $\hat h_2$ (normal densities, continued).

Figure 8. Distribution of $\hat h_1$ and $\hat h_2$ (nonnormal densities).

On the other hand, relative to M1, the LSCV method performs rather well for the normal mixtures Normal VII and Normal VIII. The M2 method seems to be sufficiently reliable and easy to implement (using the product kernel). This is also confirmed by examining these methods on real data sets in the next section.
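For reference, a minimal Matlab sketch of one summand of criterion (13): the ISE of a single estimate, computed by midpoint quadrature on a rectangular grid with the Gaussian product kernel. Here f is the target density as a function handle, and X and h = [h1 h2] are one simulated sample with its selected bandwidths (all three are assumed given); the grid limits are arbitrary.

% ISE of one kernel estimate, Eq. (13), by midpoint quadrature.
[x1, x2] = meshgrid(linspace(-6, 6, 201));
fhat = zeros(size(x1));
for i = 1:size(X, 1)                       % accumulate the KDE on the grid
    fhat = fhat + exp(-0.5*(((x1 - X(i,1))/h(1)).^2 + ...
                            ((x2 - X(i,2))/h(2)).^2)) / (2*pi*h(1)*h(2));
end
fhat = fhat / size(X, 1);
dA  = (x1(1,2) - x1(1,1)) * (x2(2,1) - x2(1,1));   % grid cell area
ISE = sum(sum((fhat - f(x1, x2)).^2)) * dA;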
7. Application to Real Data

We applied the proposed methods to the plasma lipid data, a bivariate data set consisting of the concentrations of plasma cholesterol and plasma triglycerides taken on 320 patients with chest pain in a heart disease study (Scott, 1992). A scatterplot of the data is shown in Fig. 9a. Figures 9c and 9d represent the reconstructed probability density functions using the bandwidth matrices $\hat H_{\mathrm{M1}} = \mathrm{diag}(5.37^2, 11.63^2)$ and $\hat H_{\mathrm{M2}} = \mathrm{diag}(14.99^2, 25.58^2)$, respectively. They can be compared with the reconstruction using $\hat H_{\mathrm{LSCV}} = \mathrm{diag}(42.31^2, 31.86^2)$ shown in Fig. 9b. The authors of the original case study (Scott et al., 1978) found two primary clusters in this data set, which is also what the M2 method finds; see also Ćwik and Koronacki (1997), Sain et al. (1994), Silverman (1989), and Wand and Jones (1995). Interestingly, while the LSCV and M1 estimates fail to recognize the bimodality of the density, the M2 estimate is clearly bimodal.

Figure 9. Kernel estimate of the plasma lipid data.

8. Conclusion

The advantage of the proposed methods lies in their flexibility and in the fact that they are very easy to implement, especially for product kernels. Thanks to the fast computation of the convolutions, these methods are also less time consuming. Simulations show that the M2 estimator provides a sufficiently reliable way of estimating arbitrary densities. We emphasize that we restrict ourselves to the Epanechnikov product kernel, because it has an optimality property (Wand and Jones, 1995) and the corresponding integrals can easily be evaluated by means of convolutions. On the other hand, this kernel does not satisfy the smoothness conditions required by biased cross-validation methods and the plug-in method; the simulation study therefore compares the proposed methods with the LSCV method. Moreover, the proposed methods essentially minimize the MISE, as LSCV does. Further assessment of their practical performance and a comparison with other matrix bandwidth selectors in a large-scale simulation study would be important further research.

Appendix

Proof of Theorem 5.1. The proof requires some notation: for an m x n matrix A, vec is the vector operator (i.e., vec A is the mn x 1 vector of stacked columns of A), and $A \otimes B$ denotes the Kronecker product of the matrices A and B. Let us denote
\[
I(x) = \mathrm{E}\int K_H(x-y)\hat f(y,H)\,dy = \mathrm{E}\int K(z)\,\hat f(x - H^{1/2}z, H)\,dz = \int K(z)\,\mathrm{E}\hat f(x - H^{1/2}z, H)\,dz,
\]
and now compute
\[
I_1(z) = \mathrm{E}\hat f(x - H^{1/2}z, H) = \int K_H(x - H^{1/2}z - y)\,f(y)\,dy.
\]
Substitutions yield
\[
I_1(z) = \int K(w - z)\,f(x - H^{1/2}w)\,dw = \int K(u)\,f(x - H^{1/2}u - H^{1/2}z)\,du.
\]
We use the Taylor expansion in the form
\[
f(x - H^{1/2}u - H^{1/2}z) = f(x - H^{1/2}z) - (H^{1/2}u)^T\nabla f(x - H^{1/2}z) + \frac{1}{2}(H^{1/2}u)^T\nabla^2 f(x - H^{1/2}z)(H^{1/2}u) + o(\mathrm{tr}\,H).
\]
Hence, using properties (A1) of the kernel,
\[
I_1(z) = f(x - H^{1/2}z) + \frac{1}{2}\int (H^{1/2}u)^T\nabla^2 f(x - H^{1/2}z)(H^{1/2}u)\,K(u)\,du + o(\mathrm{tr}\,H) = f(x - H^{1/2}z) + \frac{1}{2}\beta_2(K)\,\mathrm{tr}\!\left(H\nabla^2 f(x - H^{1/2}z)\right) + o(\mathrm{tr}\,H).
\]
Further, $\mathrm{tr}\!\left(H\nabla^2 f(x - H^{1/2}z)\right) = (\mathrm{vec}\,H)^T\,\mathrm{vec}\,\nabla^2 f(x - H^{1/2}z)$ (Magnus and Neudecker, 2007).
Now we need the Taylor expansion of $\mathrm{vec}\,\nabla^2 f(x - H^{1/2}z)$:
\[
\mathrm{vec}\,\nabla^2 f(x - H^{1/2}z) = \mathrm{vec}\,\nabla^2 f(x) - (\nabla \otimes \nabla^2 f)(x)\,H^{1/2}z + \frac{1}{2}(\nabla^2 \otimes \nabla^2 f)(x)\,\mathrm{vec}\!\left(H^{1/2}zz^TH^{1/2}\right) + O\!\left(\|\mathrm{vec}\,H\|^2\right).
\]
Thus
\[
I_1(z) = f(x - H^{1/2}z) + \frac{1}{2}\beta_2(K)\,(\mathrm{vec}\,H)^T\left[\mathrm{vec}\,\nabla^2 f(x) - (\nabla\otimes\nabla^2 f)(x)H^{1/2}z + \frac{1}{2}(\nabla^2\otimes\nabla^2 f)(x)\,\mathrm{vec}\!\left(H^{1/2}zz^TH^{1/2}\right)\right] + O\!\left(\|\mathrm{vec}\,H\|^2\right) + o(\mathrm{tr}\,H).
\]
Hence
\[
I(x) = \int K(z)I_1(z)\,dz = \int K(z)f(x - H^{1/2}z)\,dz + \frac{1}{2}\beta_2(K)(\mathrm{vec}\,H)^T\mathrm{vec}\,\nabla^2 f(x) + \frac{1}{4}\beta_2^2(K)(\mathrm{vec}\,H)^T(\nabla^2\otimes\nabla^2 f)(x)\,\mathrm{vec}\,H + o(\mathrm{tr}\,H)
\]
\[
= \mathrm{E}\hat f(x,H) + \frac{1}{2}\beta_2(K)\,\mathrm{tr}\!\left(H\nabla^2 f(x)\right) + \frac{1}{4}\beta_2^2(K)\,\mathrm{tr}\!\left(H\nabla^2 f(x)\,H\nabla^2 f(x)\right) + o(\mathrm{tr}\,H),
\]
where we again use a result from Magnus and Neudecker (2007): for square matrices A, B, C, D,
\[
\mathrm{tr}(ABCD) = \left(\mathrm{vec}\,D^T\right)^T (A \otimes C^T)\,\mathrm{vec}\,B^T.
\]
In our case $D = B = H$ and $A = C = \nabla^2 f(x)$; all the matrices are symmetric, and the last expression follows immediately. Since
\[
\mathrm{E}\hat f(x,H) = f(x) + \frac{1}{2}\beta_2(K)\,\mathrm{tr}\!\left(H\nabla^2 f(x)\right) + o(\mathrm{tr}\,H),
\]
the statement of Theorem 5.1 is valid.

Proof of Corollary 5.1.
\[
\mathrm{E}\,\widehat{\mathrm{bias}}\,\hat f(x,H) = \mathrm{E}\left[\int K_H(x-y)\hat f(y,H)\,dy - \hat f(x,H)\right] = f(x) + \beta_2(K)\,\mathrm{tr}(H\nabla^2 f(x)) + \frac{1}{4}\beta_2^2(K)\,\mathrm{tr}(H\nabla^2 f(x)H\nabla^2 f(x)) + o(\mathrm{tr}\,H) - \mathrm{E}\hat f(x,H)
\]
\[
= \frac{1}{2}\beta_2(K)\,\mathrm{tr}(H\nabla^2 f(x)) + \frac{1}{4}\beta_2^2(K)\,\mathrm{tr}(H\nabla^2 f(x)H\nabla^2 f(x)) + o(\mathrm{tr}\,H).
\]
Further, $\mathrm{E}\hat f(x,H) - f(x) = \frac{1}{2}\beta_2(K)\,\mathrm{tr}(H\nabla^2 f(x)) + o(\mathrm{tr}\,H)$, so
\[
\mathrm{E}\,\widehat{\mathrm{bias}}\,\hat f(x,H) = \mathrm{bias}\,\hat f(x,H) + \frac{1}{4}\beta_2^2(K)\,\mathrm{tr}(H\nabla^2 f(x)H\nabla^2 f(x)) + o(\mathrm{tr}\,H) = \mathrm{bias}\,\hat f(x,H) + o(\mathrm{tr}\,H).
\]

Acknowledgment

This research was supported by the Ministry of Education, Youth and Sports of the Czech Republic under the project LC06024 and by Masaryk University under the Student Project Grant MUNI/A/1001/2009. The authors would like to thank José E. Chacón for his very helpful and constructive comments and suggestions.

References

Cao, R., Cuevas, A., González Manteiga, W. (1994). A comparative study of several smoothing methods in density estimation. Comput. Statist. Data Anal. 17:153-176.
Chacón, J. E., Duong, T. (2009). Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. Test 19:375-398.
Chaudhuri, P., Marron, J. S. (1999). SiZer for exploration of structure in curves. J. Amer. Statist. Assoc. 94:807-823.
Ćwik, J., Koronacki, J. (1997). A combined adaptive-mixtures/plug-in estimator of multivariate probability densities. Comput. Statist. Data Anal. 26:199-218.
Duong, T. (2007). ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R. J. Stat. Soft. 21:1-16.
Duong, T., Hazelton, M. L. (2003). Plug-in bandwidth matrices for bivariate kernel density estimation. J. Nonparametr. Stat. 15:17-30.
Duong, T., Hazelton, M. L. (2005a). Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. J. Multivariate Anal. 93:417-433.
Duong, T., Hazelton, M. L. (2005b). Cross-validation bandwidth matrices for multivariate kernel density estimation. Scand. J. Statist. 32:485-506.
Godtliebsen, F., Marron, J. S., Chaudhuri, P. (2002). Significance in scale space for density estimation. J. Comput. Graph. Statist. 11:1-21.
Härdle, W., Müller, M., Sperlich, S., Werwatz, A. (2004). Nonparametric and Semiparametric Models. Retrieved from http://fedc.wiwi.hu-berlin.de/xplore/ebooks/html/spm/
Horová, I., Vieu, P., Zelinka, J. (2002). Optimal choice of nonparametric estimates of a density and of its derivatives. Statistics & Decisions 20:355-378.
Horová, I., Koláček, J., Zelinka, J., Vopatová, K. (2008). Bandwidth choice for kernel density estimates. Proc. IASC, 542-551.
Horová, I., Zelinka, J. (2007). Contribution to the bandwidth choice for kernel density estimates. Comput. Statist. 22:31-47.
Jones, M. C., Marron, J. S., Park, B. U. (1991). A simple root n bandwidth selector. Ann. Statist. 19:1919-1932.
Magnus, J. R., Neudecker, H. (2007). Matrix Differential Calculus with Applications in Statistics and Econometrics. Chichester: Wiley.
Sain, S. R., Baggerly, K. A., Scott, D. W. (1994). Cross-validation of multivariate densities. J. Amer. Statist. Assoc. 89:807-817.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. New York: Wiley.
Scott, D. W., Gorry, G. A., Hoffman, R. G., Barboriak, J. J., Gotto, A. M. (1978). A new approach for evaluating risk factors in coronary artery disease: a study of lipid concentrations and severity of disease in 1847 males. Circulation 62:477-484.
Silverman, B. W. (1989). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall.
Taylor, C. C. (1989). Bootstrap choice of the smoothing parameter in kernel density estimation. Biometrika 76:705-712.
Wand, M. P., Jones, M. C. (1993). Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Amer. Statist. Assoc. 88:520-528.
Wand, M. P., Jones, M. C. (1994). Multivariate plug-in bandwidth selection. Comput. Statist. 9:97-116.
Wand, M. P., Jones, M. C. (1995). Kernel Smoothing. London: Chapman & Hall.

Journal of Statistics: Advances in Theory and Applications, Volume 8, Number 2, 2012, Pages 91-103. 2010 Mathematics Subject Classification: 62G08. Keywords and phrases: kernel regression, bandwidth selection, iterative method. Received November 1, 2012. © 2012 Scientific Advances Publishers.

ITERATIVE BANDWIDTH METHOD FOR KERNEL REGRESSION

JAN KOLÁČEK and IVANA HOROVÁ
Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic
e-mail: kolacek@math.muni.cz

Abstract

The aim of this contribution is to extend the idea of the iterative method known for kernel density estimation to kernel regression. The method is based on a suitable estimate of the mean integrated square error. This approach leads to an iterative, quadratically convergent process. We conduct a simulation study comparing the proposed method with the well-known cross-validation method. Results are implemented in Matlab.

1. Univariate Kernel Density Estimator

Let $X_1, \ldots, X_n$ be independent real random variables having the same continuous density f. The symbol $\hat f$ will be used to denote whatever density estimate is currently being considered.

Definition 1.1. Let k be an even nonnegative integer and K a real-valued function continuous on $\mathbb{R}$ satisfying the conditions:
(i) $|K(x) - K(y)| \le L|x - y|$ for a constant $L > 0$ and all $x, y \in [-1, 1]$,
(ii) $\mathrm{support}(K) = [-1, 1]$, $K(-1) = K(1) = 0$,
(iii) $\displaystyle\int_{-1}^{1} x^j K(x)\,dx = \begin{cases} 0, & 0 < j < k,\\ 1, & j = 0,\\ \beta_k \neq 0, & j = k.\end{cases}$
Such a function is called a kernel of order k, and the class of these kernels is denoted $S_{0k}$.

Remark 1.2. Well-known kernels are, e.g.,
(a) the Epanechnikov kernel: $K(x) = \frac{3}{4}(1 - x^2)\,I_{[-1,1]}(x)$,
(b) the quartic kernel: $K(x) = \frac{15}{16}(1 - x^2)^2\,I_{[-1,1]}(x)$,
(c) the triweight kernel: $K(x) = \frac{35}{32}(1 - x^2)^3\,I_{[-1,1]}(x)$,
(d) the Gaussian kernel: $K(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$,
where $I_{[-1,1]}$ is the indicator function. Though the Gaussian kernel does not satisfy assumption (ii), it is very popular in many applications.
For $K \in S_{0k}$, set $K_h(\cdot) = \frac{1}{h}K\!\left(\frac{\cdot}{h}\right)$, $h > 0$. The parameter h is called a bandwidth. The kernel estimator of f at a point $x \in \mathbb{R}$ is defined as
\[
\hat f(x, h) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right).
\]
The problem of choosing the smoothing parameter is of crucial importance and will be treated in the next sections. Our analysis requires an appropriate error criterion, both for the error when estimating the density at a single point and for the error over the whole real line. A useful criterion at a single point is the mean square error (MSE),
\[
\mathrm{MSE}\{\hat f(x,h)\} = \mathrm{E}\left\{\hat f(x,h) - f(x)\right\}^2.
\]
As a global criterion, we consider the mean integrated square error,
\[
\mathrm{MISE}\{\hat f(\cdot,h)\} = \int \mathrm{E}\left\{\hat f(x,h) - f(x)\right\}^2 dx.
\]
Since the MISE is not mathematically tractable, we employ the asymptotic mean integrated square error (AMISE), which can be written as a sum of the asymptotic integrated variance and the asymptotic integrated square bias,
\[
\mathrm{AMISE}\{\hat f(\cdot,h)\} = \underbrace{\frac{V(K)}{nh}}_{\mathrm{AIV}\hat f} + \underbrace{\frac{\beta_k^2}{(k!)^2}\,h^{2k}\,V\!\left(f^{(k)}\right)}_{\mathrm{AISB}\hat f}, \tag{1.1}
\]
where $V(g) = \int g^2(x)\,dx$. Minimizing (1.1) with respect to h, we obtain the AMISE-optimal bandwidth $h_{\mathrm{opt},k} = \arg\min_h \mathrm{AMISE}\{\hat f(\cdot,h)\}$, which takes the form
\[
h_{\mathrm{opt},k}^{2k+1} = \frac{V(K)\,(k!)^2}{2k\,\beta_k^2\,V\!\left(f^{(k)}\right)n}.
\]
For more details, see, e.g., [9], [14].

2. Iterative Method for Kernel Density Estimation

The problem of choosing how much to smooth, i.e., how to choose the bandwidth, is the crucial common problem in kernel smoothing. Methods for the bandwidth choice have been developed in many papers and monographs, see, e.g., [1, 2, 5, 7, 8, 11, 12, 14] and many others; however, no universally accepted approach to this serious problem exists yet. The iterative method is based on the relation
\[
\mathrm{AIV}\{\hat f(\cdot, h_{\mathrm{opt},k})\} = 2k\,\mathrm{AISB}\{\hat f(\cdot, h_{\mathrm{opt},k})\}, \tag{2.1}
\]
with the estimates of AIV and AISB
\[
\widehat{\mathrm{AIV}}\{\hat f(\cdot,h)\} = \frac{V(K)}{nh}
\]
and
\[
\widehat{\mathrm{AISB}}\{\hat f(\cdot,h)\} = \int\left(\int K(y)\,\hat f(x - hy, h)\,dy - \hat f(x,h)\right)^2 dx = \frac{1}{n^2 h}\sum_{i,j=1}^{n}\Lambda\!\left(\frac{X_i - X_j}{h}\right),
\]
where $\Lambda(z) = (K*K*K*K - 2K*K*K + K*K)(z)$ and $*$ denotes the convolution, i.e., $(K*K)(u) = \int K(t)K(u - t)\,dt$. The bandwidth estimate $\hat h_{\mathrm{IT},k}$ is a solution of the equation
\[
\frac{V(K)}{nh} - \frac{2k}{n^2 h}\sum_{i,j=1}^{n}\Lambda\!\left(\frac{X_i - X_j}{h}\right) = 0. \tag{2.2}
\]
In the paper [8], this nonlinear equation was solved by Steffensen's method. But the equation can be rewritten as
\[
\frac{2k}{n}\sum_{\substack{i,j=1\\ i\neq j}}^{n}\Lambda\!\left(\frac{X_i - X_j}{h}\right) - V(K) = 0. \tag{2.3}
\]
Since the first derivative of the function on the left-hand side of this equation is easy to compute by using convolutions, Newton's method can be used. For more details, see [9].
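For the Gaussian kernel the convolutions in $\Lambda$ are Gaussian densities with variances 2, 3 and 4, so Eq. (2.3) becomes a short root-finding problem. A minimal Matlab sketch with k = 2 follows; the Gaussian kernel is swapped in here only for its closed forms (the paper itself works with kernels from $S_{0k}$), and the normal-reference starting value is our convenience choice.

% Iterative bandwidth for density estimation: solve Eq. (2.3) with k = 2.
n   = numel(x);
D   = x(:) - x(:)';  D = D(~eye(n));             % off-diagonal differences
phi = @(z, v) exp(-z.^2/(2*v)) ./ sqrt(2*pi*v);  % N(0, v) density
Lam = @(z) phi(z, 4) - 2*phi(z, 3) + phi(z, 2);  % K*K*K*K - 2*K*K*K + K*K
VK  = 1 / (2*sqrt(pi));                          % V(K) for the Gaussian
h0  = 1.06 * std(x) * n^(-1/5);                  % normal-reference start
hIT = fzero(@(h) (4/n)*sum(Lam(D/h)) - VK, h0);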
3. Univariate Kernel Regression

Consider a standard regression model of the form
\[
Y_i = m(x_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{3.1}
\]
where m is an unknown regression function and $Y_1, \ldots, Y_n$ are observable data variables at the design points $x_1, \ldots, x_n$. The residuals $\varepsilon_1, \ldots, \varepsilon_n$ are independent identically distributed random variables with $\mathrm{E}(\varepsilon_i) = 0$ and $\mathrm{var}(\varepsilon_i) = \sigma^2 > 0$, $i = 1, \ldots, n$. The aim of kernel smoothing is to find a suitable approximation $\hat m$ of the unknown function m.

To avoid boundary effects, the estimate is obtained by applying the kernel to the extended series $\tilde Y_i$, $i = -n+1, -n+2, \ldots, 2n$, where $\tilde Y_{j\pm n} = Y_j$ for $j = 1, \ldots, n$; similarly, $x_i = i/n$ for $i = -n+1, -n+2, \ldots, 2n$. The assumption of this cyclic model leads to the kernel regression estimator
\[
\hat m(x_j, h) = \frac{1}{C}\sum_{i=-n+1}^{2n} K_h(x_j - x_i)\,\tilde Y_i, \qquad j = 1, \ldots, n, \tag{3.2}
\]
where $C = \sum_{i=-n+1}^{n-1} K_h(x_i)$. For more details about this estimator, see [9] and [10]. The quality of a kernel regression estimator can be described locally by the mean square error (MSE) or by the global criterion of the mean integrated square error (MISE). For the same reasons as in kernel density estimation, we employ the asymptotic mean integrated square error (AMISE), which can be written as a sum of the asymptotic integrated variance and the asymptotic integrated square bias,
\[
\mathrm{AMISE}\{\hat m(\cdot,h)\} = \underbrace{\frac{\sigma^2 V(K)}{nh}}_{\mathrm{AIV}} + \underbrace{\frac{\beta_k^2}{(k!)^2}\,h^{2k} A_k}_{\mathrm{AISB}}, \tag{3.3}
\]
where $A_k = \int\left(m^{(k)}(x)\right)^2 dx$. The optimal bandwidth considered here is $h_{\mathrm{opt},k}$, the minimizer of (3.3), i.e.,
\[
h_{\mathrm{opt},k} = \arg\min_{h \in H_n}\mathrm{AMISE}\{\hat m(\cdot,h)\},
\]
where $H_n = \left[a\,n^{-1/(2k+1)},\ b\,n^{-1/(2k+1)}\right]$ for some $0 < a < b < \infty$. The calculation gives
\[
h_{\mathrm{opt},k}^{2k+1} = \frac{\sigma^2 V(K)\,(k!)^2}{2kn\,\beta_k^2 A_k}. \tag{3.4}
\]
In nonparametric regression estimation, as in density estimation, a critical and inevitable step is the choice of the smoothing parameter (bandwidth) controlling the smoothness of the curve estimate; this parameter considerably affects the features of the estimated curve. One of the most widespread procedures for bandwidth selection is the cross-validation ("leave-one-out") method. It is based on the modified regression smoother (3.2) in which one observation, say the j-th, is left out:
\[
\hat m_{-j}(x_j, h) = \frac{1}{C_{-j}}\sum_{\substack{i=-n+1\\ i\neq j}}^{2n} K_h(x_j - x_i)\,\tilde Y_i, \qquad j = 1, \ldots, n,
\]
with $C_{-j}$ the correspondingly modified normalizing constant. Using these modified smoothers, the error function to be minimized takes the form
\[
\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^{n}\left\{\hat m_{-i}(x_i, h) - Y_i\right\}^2. \tag{3.5}
\]
The function CV(h) is commonly called the "cross-validation" function. Let $\hat h_{\mathrm{CV}}$ stand for its minimizer, i.e., $\hat h_{\mathrm{CV}} = \arg\min_{h\in H_n}\mathrm{CV}(h)$. The literature on this criterion is quite extensive, e.g., [3, 4, 6, 13].

4. Iterative Method for Kernel Regression

The proposed method is based on a relation similar to that in kernel density estimation. It is easy to show that the following equation holds:
\[
\mathrm{AIV}\{\hat m(\cdot, h_{\mathrm{opt},k})\} = 2k\,\mathrm{ASB}\{\hat m(\cdot, h_{\mathrm{opt},k})\}, \tag{4.1}
\]
where
\[
\mathrm{AIV}\{\hat m(\cdot,h)\} = \frac{\sigma^2 V(K)}{nh} \quad\text{and}\quad \mathrm{ASB}\{\hat m(\cdot,h)\} = \frac{1}{n}\sum_{i=1}^{n}\left\{\mathrm{E}\,\hat m(x_i, h) - m(x_i)\right\}^2.
\]
For estimating AIV and ASB in (4.1), we use
\[
\widehat{\mathrm{AIV}}\{\hat m(\cdot,h)\} = \frac{\hat\sigma^2 V(K)}{nh}, \quad\text{with}\quad \hat\sigma^2 = \frac{1}{2(n-1)}\sum_{i=1}^{n-1}\left(Y_{i+1} - Y_i\right)^2,
\]
and
\[
\widehat{\mathrm{ASB}}\{\hat m(\cdot,h)\} = \frac{1}{n}\sum_{j=1}^{n}\left(\frac{1}{C}\sum_{i=-n+1}^{2n} K_h(x_j - x_i)\left[\frac{1}{C}\sum_{l=-n+1}^{2n} K_h(x_i - x_l)\,\tilde Y_l\right] - \frac{1}{C}\sum_{i=-n+1}^{2n} K_h(x_j - x_i)\,\tilde Y_i\right)^2,
\]
i.e., the smoother is applied once more to the fitted values and compared with the original fit, in analogy with the convolution structure of $\Lambda$ in the density case. To find the bandwidth estimate $\hat h_{\mathrm{IT},k}$, we solve the equation
\[
h = \frac{\hat\sigma^2\,V(K)}{2kn\,\widehat{\mathrm{ASB}}\{\hat m(\cdot,h)\}}. \tag{4.2}
\]
We use Steffensen's iterative method with the starting approximation $h_0 = k/n$. This approach leads to an iterative, quadratically convergent process.
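A minimal Matlab sketch of the whole procedure of Section 4 for k = 2 and the Epanechnikov kernel. Plain fixed-point iteration is used in place of Steffensen's acceleration, and $\widehat{\mathrm{ASB}}$ is computed by smoothing the fitted values once more; both simplifications are ours, and convergence of the plain iteration is not guaranteed in general.

% Iterative bandwidth for cyclic kernel regression, Eq. (4.2), k = 2.
n  = numel(y);
xs = (1:n)'/n;  xe = (-n+1:2*n)/n;            % design points and extension
ye = [y(:); y(:); y(:)];                      % cyclically extended data
K  = @(u) 0.75*max(1 - u.^2, 0);              % Epanechnikov kernel
VK = 3/5;                                     % V(K) = int K^2
s2 = sum(diff(y(:)).^2) / (2*(n - 1));        % difference-based sigma^2
smooth = @(h, v) (K((xs - xe)/h) * v) ./ sum(K((xs - xe)/h), 2);
h = 2/n;                                      % starting value h0 = k/n
for it = 1:25
    m1  = smooth(h, ye);                      % fit at the design points
    m2  = smooth(h, [m1; m1; m1]);            % smooth the fit once more
    ASB = mean((m2 - m1).^2);                 % estimate of ASB(h)
    h   = s2 * VK / (4*n*ASB);                % right-hand side of (4.2)
end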
5. Simulation Study

We carry out two simulation studies to compare the performance of the bandwidth estimates. The comparison is done in the following way. The observations $Y_i$, $i = 1, \ldots, n = 100$, are obtained by adding independent Gaussian random variables with mean zero and variance $\sigma^2$ to a known regression function. Both regression functions used in our simulations are illustrated in Figure 1. They were not chosen randomly: the first is suitable for the extension to the cyclic model, while the second does not satisfy the assumptions of the cyclic model.

Figure 1. Regression functions.

One hundred series are generated. For each data set, we estimate the optimal bandwidth by both methods, i.e., for each method we obtain 100 estimates. Since we know the optimal bandwidth, we compare it with the mean of the estimates and look at their standard deviation, which describes the variability of each method. The Epanechnikov kernel $K(x) = \frac{3}{4}(1 - x^2)\,I_{[-1,1]}(x)$ is used in all cases.

5.1. Simulation 1. In this case, we use the regression function
\[
m(x) = \cos(20x) + 2\sin\!\left(4\left(x - \tfrac{6}{5}\right)\right) + 5,
\]
with $\sigma^2 = 0.3$. Table 1 summarizes the sample means and sample standard deviations of the bandwidth estimates; $\mathrm{E}(\hat h)$ is the average of all 100 values and $\mathrm{std}(\hat h)$ their standard deviation. Figure 2 illustrates the histogram of the results of all 100 experiments.

Table 1. Means and standard deviations ($h_{\mathrm{opt},2} = 0.0560$)

Method   E(h)     std(h)
CV       0.0550   0.0120
IT       0.0556   0.0048

Figure 2. Distribution of $\hat h$ for both methods.

As we see, the standard deviation of the results obtained by the proposed method is smaller than that of the cross-validation method, and the mean of these results is also a little closer to the theoretical optimal bandwidth. The reason is that the regression function is smooth and satisfies the conditions for the extension to the cyclic design, so the proposed method works very well in this case.

5.2. Simulation 2. In the second example, we use a regression function that does not satisfy the assumptions for the extension to the cyclic model (the second function shown in Figure 1), with $\sigma^2 = 0.05$. Table 2 summarizes the sample means and sample standard deviations of the bandwidth estimates, and Figure 3 illustrates the histogram of the results of all 100 experiments.

Table 2. Means and standard deviations ($h_{\mathrm{opt},2} = 0.0707$)

Method   E(h)     std(h)
CV       0.1466   0.0443
IT       0.0592   0.0112

Figure 3. Distribution of $\hat h$ for both methods.

It is evident that better results are obtained by the proposed method; it is successful despite the fact that the regression function does not meet the assumptions for the extension to the cyclic model. The cross-validation method often results in unsuitable bandwidths, and the variance of this criterion is also significant.

Acknowledgement

This research was supported by Masaryk University under the project MUNI/A/1001/2009.

References

[1] R. Cao, A. Cuevas and W. González Manteiga, A comparative study of several smoothing methods in density estimation, Computational Statistics and Data Analysis 17(2) (1994), 153-176.
[2] P. Chaudhuri and J. S. Marron, SiZer for exploration of structures in curves, Journal of the American Statistical Association 94(447) (1999), 807-823.
[3] P. Craven and G. Wahba, Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation, Numerische Mathematik 31(4) (1979), 377-403.
[4] B. Droge, Some Comments on Cross-Validation, Technical Report 1994-7, Humboldt-Universität zu Berlin, 1996.
[5] J. Fan and I. Gijbels, Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation, Journal of the Royal Statistical Society, Series B 57(2) (1995), 371-394.
[6] W. Härdle, Applied Nonparametric Regression, 1st Edition, Cambridge University Press, Cambridge, 1990.
[7] W. Härdle, M. Müller, S. Sperlich and A. Werwatz, Nonparametric and Semiparametric Models, 1st Edition, Springer, Heidelberg, 2004.
[8] I. Horová and J. Zelinka, Contribution to the bandwidth choice for kernel density estimates, Computational Statistics 22(1) (2007), 31-47.
[9] I. Horová, J. Koláček and J. Zelinka, Kernel Smoothing in MATLAB, World Scientific, Singapore, 2012.
[10] J. Koláček, Plug-in method for nonparametric regression, Computational Statistics 23(1) (2008), 63-78.
[11] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, Wiley, New York, 1992.
[12] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986.
[13] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, Series B 36(2) (1974), 111-147.
[14] M. P. Wand and M. C. Jones, Kernel Smoothing, Chapman and Hall, London, 1995.

Journal of Applied Probability and Statistics, Vol. 6, No. 1&2, pp. 73-85. © ISOSS Publications 2012

A GENERALIZED REFLECTION METHOD FOR KERNEL DISTRIBUTION AND HAZARD FUNCTIONS ESTIMATION

Jan Koláček, Department of Mathematics and Statistics, Masaryk University, Brno, Czech Republic. Email: kolacek@math.muni.cz
Rohana J. Karunamuni, Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada. Email: R.J.Karunamuni@ualberta.ca

Summary

In this paper we focus on kernel estimates of cumulative distribution and hazard functions (rates) when the observed random variables are nonnegative. It is well known that kernel distribution estimators are not consistent when estimating a distribution function near the point x = 0. This fact is clearly visible in many applications, for example in kernel ROC curve estimation [10]. In order to avoid this problem, we propose a bias-reducing technique that is a kind of generalized reflection method, based on the ideas of [8] and [19] developed for boundary correction in kernel density estimation. The proposed estimators are compared with the traditional kernel estimator and with the estimator based on the "classical" reflection method using simulation studies.

Keywords and phrases: kernel estimation, reflection, distribution function, hazard function.
AMS Classification: 30C40, 62G30.

1 Introduction

The most commonly used nonparametric estimate of a cumulative distribution function F is the empirical distribution function $F_n(x) = n^{-1}\sum_{i=1}^{n} I[X_i \le x]$, with $X_1, \ldots, X_n$ being the observations. But $F_n$ is a step function even when F is continuous. Another type of nonparametric estimator for F is derived from kernel smoothing methods. Kernel smoothing is widely used because it is easy to apply and produces estimators with good small-sample and asymptotic properties. Kernel smoothing has received a lot of attention in density estimation; good references in this area are [3], [16] and [17]. However, results on kernel distribution function estimation are relatively few. Theoretical properties of the kernel distribution function estimator have been investigated by [12], [14] and [1].
Although there is a vast literature on boundary correction in the density estimation context, the boundary effects problem in the distribution function context has been studied much less, and the same can be said about estimation of hazard functions (rates). In this paper, we develop a new kernel-type estimator of the cumulative distribution and hazard rates that removes boundary effects near the endpoints of the support. Our estimator is based on a new boundary-corrected kernel estimator of the distribution function and on the ideas of [6], [7], [8] and [19] developed for boundary correction in kernel density estimation. The basic construction of the proposed estimator is a kind of generalized reflection method involving reflecting a transformation of the observed data; in fact, the proposed method generates a class of boundary-corrected estimators. We derive expressions for the bias and variance of the proposed estimators. Furthermore, the proposed estimators are compared with the traditional estimator and with the estimator based on the "classical" reflection method using simulation studies. We observe that the proposed estimators successfully remove boundary effects and perform considerably better than the other two.

Kernel smoothing in distribution function estimation and boundary effects are discussed in the next section. The proposed estimator of distribution functions is given in Section 3. Section 4 discusses estimation of hazard functions (rates). Simulation results are given in Section 5, and our results are applied to real data in Section 6. Finally, some concluding remarks are given in Section 7.

2 Kernel distribution estimator and boundary effects

Let f denote a continuous density function with support [0, a], $0 < a \le \infty$, and consider nonparametric estimation of the cumulative distribution function F of f based on a random sample $X_1, \ldots, X_n$ from f. Suppose that $F^{(j)}$, the j-th derivative of F, exists and is continuous on [0, a], j = 0, 1, 2, with $F^{(0)} = F$ and $F^{(1)} = f$. Then the traditional kernel estimator of F is given by
\[
\hat F_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n} W\!\left(\frac{x - X_i}{h}\right), \qquad W(x) = \int_{-1}^{x} K(t)\,dt, \tag{2.1}
\]
where K is a unimodal symmetric density function with support [-1, 1] and h is the bandwidth ($h \to 0$ as $n \to \infty$). Set $\beta_2 = \int_{-1}^{1} t^2 K(t)\,dt$. The basic properties of $\hat F_{h,K}(x)$ at interior points are well known (e.g., [11]); under some smoothness assumptions, for $h \le x \le a - h$,
\[
\mathrm{E}\hat F_{h,K}(x) - F(x) = \frac{1}{2}\beta_2\, f^{(1)}(x)\,h^2 + o(h^2),
\]
\[
n\,\mathrm{Var}\,\hat F_{h,K}(x) = F(x)(1 - F(x)) + hf(x)\int_{-1}^{1} W(t)(W(t) - 1)\,dt + o(h).
\]
The performance of $\hat F_{h,K}(x)$ at boundary points, i.e., for $x \in [0, h) \cup (a - h, a]$, differs from that at interior points due to the so-called "boundary effects" that occur in nonparametric curve estimation problems: the bias of $\hat F_{h,K}(x)$ is of order O(h) instead of O(h²) at boundary points, while the variance is of the same order. This can be clearly seen by examining the behavior of $\hat F_{h,K}$ inside the left boundary region [0, h]. Let x be a point in the left boundary, i.e., $x \in [0, h]$, and write x = ch, $0 \le c \le 1$. It can be shown that the bias and variance of $\hat F_{h,K}(x)$ at x = ch are of the form
\[
\mathrm{E}\hat F_{h,K}(x) - F(x) = hf(0)\int_{-1}^{-c}W(t)\,dt + h^2 f^{(1)}(0)\left\{\frac{c^2}{2} + c\int_{-1}^{-c}W(t)\,dt - \int_{-1}^{c}tW(t)\,dt\right\} + o(h^2), \tag{2.2}
\]
\[
n\,\mathrm{Var}\,\hat F_{h,K}(x) = F(x)(1 - F(x)) + hf(0)\left\{\int_{-1}^{c}W^2(t)\,dt - c\right\} + o(h). \tag{2.3}
\]
From expression (2.2) it is now clear that the bias of $\hat F_{h,K}(x)$ is of order O(h) instead of O(h²). To remove this boundary effect in kernel distribution estimation, we investigate a new class of estimators in the next section.
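For the Epanechnikov kernel, the integrated kernel W has the closed form $W(x) = \frac{1}{2} + \frac{3}{4}x - \frac{1}{4}x^3$ on [-1, 1], so the traditional estimator (2.1) and its boundary behavior are easy to reproduce. A minimal Matlab sketch (the names are ours):

% Traditional kernel CDF estimator (2.1), Epanechnikov kernel.
W    = @(u) (u > 1) + (abs(u) <= 1).*(0.5 + 0.75*u - 0.25*u.^3);
Fhat = @(x, X, h) mean(W((x - X(:)) / h), 1);

% Example near the boundary: for X ~ Exp(1) the estimate overestimates
% F(0) = 0, since E Fhat(0) = h*f(0)*int_{-1}^{0} W(t) dt + O(h^2) > 0.
X = -log(rand(500, 1));               % Exp(1) sample
Fhat(0, X, 0.5)                       % noticeably above 0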
3 The proposed estimator

In this section we propose a class of estimators of the distribution function F of the form
\[
\tilde F_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n}\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(\frac{-x - g_2(X_i)}{h}\right)\right], \tag{3.1}
\]
where h is the bandwidth, W is the cumulative kernel defined in (2.1), and $g_1$ and $g_2$ are two transformations that need to be determined. We assume that the $g_i$, i = 1, 2, are nonnegative, continuous and monotonically increasing functions defined on $[0, \infty)$. Further assume that $g_i^{-1}$ exists, $g_i(0) = 0$, $g_i^{(1)}(0) = 1$, and that $g_i^{(2)}$ exists and is continuous on $[0, \infty)$, where $g_i^{(j)}$ denotes the j-th derivative of $g_i$, with $g_i^{(0)} = g_i$ and $g_i^{-1}$ denoting the inverse function of $g_i$. We will choose $g_1$ and $g_2$ so that $\tilde F_{h,K}(x) \ge 0$ everywhere. Note that the i-th term of the sum in (3.1) can be expressed as
\[
W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(\frac{-x - g_2(X_i)}{h}\right) = \int_{\frac{-x + g_1(X_i)}{h}}^{\frac{x + g_2(X_i)}{h}} K(t)\,dt.
\]
The preceding integral is non-negative provided the inequality $\frac{-x + g_1(X_i)}{h} \le \frac{x + g_2(X_i)}{h}$ holds. Since $x \ge 0$, this inequality is satisfied if $g_1(X_i) \le g_2(X_i)$ for $i = 1, \ldots, n$; thus we assume that $g_1$ and $g_2$ are chosen such that $g_1(x) \le g_2(x)$ for $x \in [0, \infty)$. Now we can obtain the bias and variance of (3.1) at x = ch, $0 \le c \le 1$, as
\[
\mathrm{E}\tilde F_{h,K}(x) - F(x) = h^2\Big\{f^{(1)}(0)\Big[\frac{c^2}{2} + 2c\int_{-1}^{-c}W(t)\,dt - \int_{-c}^{c}tW(t)\,dt\Big] - f(0)\,g_1^{(2)}(0)\int_{-1}^{c}(c - t)W(t)\,dt - f(0)\,g_2^{(2)}(0)\int_{-1}^{-c}(c + t)W(t)\,dt\Big\} + o(h^2), \tag{3.2}
\]
\[
n\,\mathrm{Var}\,\tilde F_{h,K}(x) = F(x)(1 - F(x)) + hf(0)\Big\{\int_{-1}^{c}W^2(t)\,dt - 2\int_{-1}^{c}W(t)W(t - 2c)\,dt + \int_{-1}^{-c}W^2(t)\,dt\Big\} + o(h). \tag{3.3}
\]
The proofs of (3.2) and (3.3) are given in [10]. Similarly, we can express the bias and variance of (3.1) at "interior" points x = ch, c > 1. Note that the contribution of $g_2$ to the bias vanishes as $c \to 1$. Comparing expressions (2.2), (3.2), (2.3) and (3.3) at boundary points, we see that the variances are of the same order, whereas the bias of $\hat F_{h,K}(x)$ is of order O(h) while the bias of $\tilde F_{h,K}(x)$ is of order O(h²). So our estimator removes boundary effects in kernel distribution estimation, since the bias at boundary points is of the same order as the bias at interior points.

It is clear that various choices are available for the pair $(g_1, g_2)$. However, we will choose $g_1$ and $g_2$ so that the condition $\tilde F_{h,K}(0) = 0$ is satisfied, because F(0) = 0. A sufficient (but not necessary) condition for this is that $g_1$ and $g_2$ be equal; thus we need to construct a single transformation function g such that $g = g_1 = g_2$. Another important property desirable in the estimator $\tilde F_{h,K}$ is local adaptivity, that is, the transformation function g depends on c; we write $g_c$. Some discussion of the choice of $g_c$ and of other improvements that can be made is appropriate here. The trivial choice is $g_c(y) = y$, which represents the "classical" reflection method estimator. However, it is possible to construct functions $g_c$ that improve the bias further under some additional conditions. For instance, if one examines the right-hand side of the bias expansion (3.2), it is not difficult to see that the terms inside the braces (i.e., the coefficient of h²) can be made equal to zero if $g_c$ is appropriately chosen. Set
\[
A_c = \begin{cases} d_1\,\dfrac{\frac{c^2}{2} + 2cI_1 - I_2}{c^2 + 2cI_1 - I_2}, & 0 \le c < 1,\\[3mm] d_1\,\dfrac{\beta_2}{c^2 + \beta_2}, & c > 1,\end{cases}
\]
where $d_1 = \dfrac{f^{(1)}(0)}{f(0)}$, $I_1 = \displaystyle\int_{-1}^{-c}W(t)\,dt$, $I_2 = \displaystyle\int_{-c}^{c}tW(t)\,dt$.
If $g_c$ is chosen such that $g_c^{(2)}(0) = A_c$, then the bias of $\tilde F_{h,K}(x)$ is theoretically of order O(h³). For such a function $g_c$, the second derivative at zero depends on the ratio $d_1 = f^{(1)}(0)/f(0)$, so the problem of estimating $d_1$ naturally arises, as in the papers [6], [7], [8] and [9]. For example, in [9] the ratio $d_1$ is estimated as the first derivative of the natural logarithm of f at zero; for the exact formula for $\hat d_1$ and for its statistical properties, in particular the asymptotic convergence rate, see that paper. Summarizing all the assumptions, it is clear now that $g_c$ should satisfy the following conditions:
(i) $g_c: [0, \infty) \to [0, \infty)$, $g_c$ is continuous, monotonically increasing, and $g_c^{(i)}$ exists, i = 1, 2;
(ii) $g_c^{-1}(0) = 0$, $g_c^{(1)}(0) = 1$;
(iii) $g_c^{(2)}(0) = A_c$.
Functions satisfying conditions (i)-(iii) are easy to construct. We will consider the following transformation: for $y \ge 0$, let
\[
g_c(y) = y + \frac{1}{2}\hat A_c\, y^2 + \lambda\,\hat A_c^2\, y^3, \tag{3.4}
\]
where $\hat A_c$ is an estimator of $A_c$ based on an estimator $\hat d_1$ of $d_1$, and $\lambda$ is a positive constant such that $\lambda > \frac{1}{12}$; this condition on $\lambda$ is necessary for $g_c(y)$ to be an increasing function of y. Based on extensive simulations, we find that this transformation adapts well to various shapes of distribution functions with the setting $\lambda = 0.1$.
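A minimal Matlab sketch of the transformation (3.4) and of the resulting estimator (3.1) with $g_1 = g_2 = g_c$, evaluated at a boundary point x (so that a single value of $\hat A_c$ applies); the construction of $\hat A_c$ via $\hat d_1$ is assumed available and is not shown, and W is the integrated Epanechnikov kernel from the earlier sketch.

% Boundary-corrected CDF estimate with transformation (3.4).
lambda = 0.1;
g      = @(y, A) y + 0.5*A*y.^2 + lambda*A^2*y.^3;       % Eq. (3.4)
Ftilde = @(x, X, h, A) mean( W(( x - g(X(:), A)) / h) ...
                           - W((-x - g(X(:), A)) / h), 1);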
4 Estimation of hazard rates

Given a distribution F with probability density function f, the hazard rate is defined by
\[
z(t) = \frac{f(t)}{1 - F(t)}. \tag{4.1}
\]
The hazard rate is also called the age-specific or conditional failure rate. It is useful particularly in the context of reliability theory and survival analysis, and hence in fields as diverse as engineering and medical statistics; see [2] for a discussion of the role of the hazard rate in understanding and modeling survival data, and [16] for a survey of methods for nonparametric hazard estimation. Given a sample $X_1, \ldots, X_n$ from the density f, a natural nonparametric estimator of the hazard rate is $\hat z(t) = \hat f(t)/(1 - \hat F(t))$, where $\hat f$ is a suitable density estimator based on $X_1, \ldots, X_n$ and $\hat F(t) = \int_{-\infty}^{t}\hat f(x)\,dx$ estimates F(t). If $\hat f$ is the traditional kernel estimator with kernel K and bandwidth h, then $\hat F(t)$ can be obtained as $\hat F(t) = n^{-1}\sum_{i=1}^{n} K_1((t - X_i)/h)$, where $K_1(u) = \int_{-\infty}^{u} K(t)\,dt$. [18] introduced and discussed $\hat z(t)$ and various alternative nonparametric estimators of z(t); for further properties of $\hat z(t)$, kernel and other related estimators, see, e.g., [15], [13] (Section 4.3) and [16] (Section 6.5).

It has been observed that, to a first approximation, the main contribution to the error of $\hat z$ is due to its numerator, i.e., to the estimator $\hat f$; see, e.g., [16] (Section 6.5). Thus, to obtain the best possible estimate of the hazard rate, one should aim to minimize the error in the estimation of the density f. If the support of f is the interval [0, a], $0 < a \le \infty$, which is usually the case for survival and reliability data, then the traditional kernel estimators of f suffer from boundary effects. It is therefore advisable to use boundary-adjusted estimators of the density f and the distribution F in this context. For this purpose we implement a boundary-adjusted kernel density estimator similar to the one proposed in [6], together with the boundary-adjusted distribution function estimator $\tilde F_{h,K}$ given above. Thus, the proposed estimator of the hazard rate z(t) is given by, for t = ch, $c \ge 0$,
\[
\tilde z(t) = \frac{\tilde f(t)}{1 - \tilde F_{h,K}(t)}, \tag{4.2}
\]
where $\tilde F_{h,K}$ is defined by (3.1) and $\tilde f$ is defined by
\[
\tilde f(t) = \frac{1}{nh}\sum_{i=1}^{n}\left[K\!\left(\frac{t - g_{1,c}(X_i)}{h}\right) + K\!\left(\frac{t + g_{1,c}(X_i)}{h}\right)\right], \tag{4.3}
\]
where
\[
g_{1,c}(x) = x + \frac{1}{2}\hat d_1 k_c\, x^2 + \lambda_0\left(\hat d_1 k_c\right)^2 x^3, \tag{4.4}
\]
with $\hat d_1$ as defined in [9], $\lambda_0$ a positive constant such that $12\lambda_0 > 1$, and $k_c$ given by, for $c \ge 0$,
\[
k_c = \frac{2\int_c^1 (u - c)K(u)\,du}{c + 2\int_c^1 (u - c)K(u)\,du}. \tag{4.5}
\]

Theorem 1. The mean squared error (MSE) of $\tilde z(t)$ is given by, for t = ch, $c \ge 0$,
\[
\mathrm{E}\left(\tilde z(t) - z(t)\right)^2 = \left(\frac{1 - F(t)}{w_1 w_2}\right)^2 \frac{f(0)}{nh}\left[2\int_c^1 K(u)K(2c - u)\,du + V(K)\right] + o\!\left(\frac{1}{nh}\right), \tag{4.6}
\]
where $w_1$, $w_2$ are finite constants satisfying $1 - \tilde F_{h,K}(t) \ge w_1 > 0$ and $1 - F(t) \ge w_2 > 0$, and $V(K) = \int_{-1}^{1} K^2(x)\,dx$.
Proof. For a detailed proof, see the Appendix.

5 A simulation study

To test the effectiveness of our estimator, we simulated its performance against the classical reflection method. The simulation is based on 1000 replications. In each replication, random variables X ~ Exp(1) were generated and the estimate of the hazard function was computed; note that the true hazard function in this case is constant, equal to one. In all replications the sample size n = 100 was used. In this case, the actual global optimal bandwidth (see [1]) for F is $h_F = 0.8479$, and for f it is $h_f = 0.7860$ (see [16]). For the kernel estimation of both functions (distribution and density), we used the Epanechnikov kernel $K(x) = \frac{3}{4}(1 - x^2)\,I_{[-1,1]}(x)$, where $I_A$ is the indicator function of the set A. For each estimated hazard function we calculated the mean integrated squared error (MISE) on the interval $[0, h_F]$ over all 1000 replications and displayed the results as a boxplot in Figure 1; the variance of each estimator can be gauged by the whiskers of the plot. The means and standard deviations of MISE for each method are given in Table 1. The reflection method gives smaller values of MISE than the classical estimator, but its variance is not as small; from this point of view the proposed estimator seems better.

Figure 1. MISE for the estimates of z(t): the classical estimator with boundary effects (1), the reflection method (2), and the proposed method (3).

Table 1. Means and STDs for MISE

Method      Mean     STD
Classical   0.1265   0.0376
Reflection  0.0273   0.0209
Proposed    0.0142   0.0185

To get more detailed information about the estimators, we calculated the mean squared error (MSE) at four points in the boundary region, $x = ch_F$, c = 0, 0.25, 0.5, 0.75. The boxplots of MSE for each estimator over all 1000 replications are illustrated in Figure 2, and the means and standard deviations at each point are given in Table 2. These values describe the performance of the proposed method with respect to MSE when compared with the classical and reflection-method estimators. Both the mean and the variance were smallest for the proposed estimator, which is explained by its local adaptivity; the classical and reflection-method estimators, on the other hand, are not locally adaptive.

Figure 2. MSE at the points $x = ch_F$, c = 0, 0.25, 0.5, 0.75, for the classical estimator with boundary effects, the reflection method, and the proposed method.

Table 2. Means and STDs for MSE at $x = ch_F$

        Classical          Reflection         Proposed
c       Mean     STD       Mean     STD       Mean     STD
0.00    0.3103   0.0591    0.0582   0.0369    0.0149   0.0195
0.25    0.1398   0.0528    0.0229   0.0261    0.0144   0.0194
0.50    0.0421   0.0346    0.0140   0.0198    0.0137   0.0198
0.75    0.0140   0.0183    0.0139   0.0210    0.0139   0.0210

From the figures and tables it is clear that the proposed estimator performed best among the three compared. It captures the features of the distribution and hazard functions correctly, with minimal bias, while holding on to a low variance.
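A minimal Matlab sketch assembling the hazard estimator (4.2) used in this study from the boundary-corrected density (4.3) and the corrected distribution estimate: ftilde implements (4.3) for a given transformation g1c, and Ftilde is the sketch given after Section 3 (the construction of g1c from $\hat d_1$ and $k_c$ is assumed; for simplicity, a single value of A is used for all evaluation points t).

% Hazard rate estimate (4.2) at points t (row vector).
K      = @(u) 0.75*max(1 - u.^2, 0);                 % Epanechnikov kernel
ftilde = @(t, X, h, g1c) mean( K((t - g1c(X(:))) / h) ...
                             + K((t + g1c(X(:))) / h), 1) / h;
zhat   = @(t, X, hf, hF, g1c, A) ftilde(t, X, hf, g1c) ...
                               ./ (1 - Ftilde(t, X, hF, A));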
6 Real data

In this section we apply our results to a real data set, the suicide data from [16]. The proposed hazard rate estimate is given in Figure 3: the solid line represents our proposed estimator (4.2), and the dashed line is the traditional kernel estimator (with boundary effects). For choosing the optimal bandwidths for the density and distribution function estimation, we used the iterative methods described in [5] and [4]; the optimal bandwidths for the density and the distribution function were estimated as $\hat h_f = 132.01$ and $\hat h_F = 144.83$, respectively. The proposed estimator again captures the proper features of the actual hazard rate, while the traditional estimator dips near the left endpoint due to boundary effects.

Figure 3. Hazard rate estimates constructed from the suicide data.

7 Conclusion

In this paper we proposed new kernel-type estimators of the distribution and hazard functions that avoid boundary effects near the endpoints of the support. The technique implemented is a kind of generalized reflection method involving reflecting a transformation of the data; the method generates a class of boundary-corrected estimators and is based on the ideas of boundary correction for kernel density estimators presented in [6], [7] and [8]. We showed some good properties of the proposed method (e.g., local adaptivity), and it was shown that the bias of the proposed estimator is better than that of the "classical" one. The proposed estimators performed well in the numerical studies compared to the classical and reflection-method estimators.

8 Acknowledgements

The research was supported by The Jaroslav Hájek Center for Theoretical and Applied Statistics (MŠMT LC 06024).

Appendix: Proof of Theorem 1

Theorem 1. The mean squared error (MSE) of $\tilde z(t)$ is given by, for t = ch, $c \ge 0$,
\[
\mathrm{E}\left(\tilde z(t) - z(t)\right)^2 = \left(\frac{1 - F(t)}{w_1 w_2}\right)^2 \frac{f(0)}{nh}\left[2\int_c^1 K(u)K(2c - u)\,du + V(K)\right] + o\!\left(\frac{1}{nh}\right),
\]
where $w_1$, $w_2$ are finite constants satisfying $1 - \tilde F_{h,K}(t) \ge w_1 > 0$, $1 - F(t) \ge w_2 > 0$, and $V(K) = \int_{-1}^{1} K^2(x)\,dx$.

Proof. The difference $\tilde z(t) - z(t)$ equals, for t = ch, $c \ge 0$,
\[
\tilde z(t) - z(t) = \frac{\tilde f(t)}{1 - \tilde F_{h,K}(t)} - \frac{f(t)}{1 - F(t)} = \frac{\tilde f(t)\,(1 - F(t)) - f(t)\,(1 - \tilde F_{h,K}(t))}{(1 - \tilde F_{h,K}(t))(1 - F(t))}.
\]
Since we are only concerned with the behavior of $\tilde z(t)$ near the left boundary, i.e., t = ch, $c \ge 0$, we only need to study this difference near the left endpoint 0. For t = ch, $c \ge 0$, we can assume that $1 - \tilde F_{h,K}(t) \ge w_1 > 0$ and $1 - F(t) \ge w_2 > 0$, where $w_1$, $w_2$ are finite constants. These conditions are reasonable, since $\tilde F_{h,K}(0) = 0$, F(0) = 0, and $\tilde F_{h,K}$ and F are continuous functions. Therefore, we obtain
\[
\left(\tilde z(t) - z(t)\right)^2 \le (w_1 w_2)^{-2}\left(\tilde f(t)\,(1 - F(t)) - f(t)\,(1 - \tilde F_{h,K}(t))\right)^2.
\]
To get the formula for the MSE of $\tilde z(t)$, we need to express $\mathrm{E}\left(\tilde f(t)\,(1 - F(t)) - f(t)\,(1 - \tilde F_{h,K}(t))\right)^2$.
\[
\mathrm{E}\left(\tilde f(t)(1 - F(t)) - f(t)(1 - \tilde F_{h,K}(t))\right)^2 = (1 - F(t))^2\,\mathrm{E}\tilde f^2(t) + f^2(t)\,\mathrm{E}(1 - \tilde F_{h,K}(t))^2 - 2f(t)(1 - F(t))\,\mathrm{E}\,\tilde f(t)(1 - \tilde F_{h,K}(t))
\]
\[
= (1 - F(t))^2\left[\mathrm{var}\,\tilde f(t) + (\mathrm{E}\tilde f(t))^2\right] + f^2(t)\left[\mathrm{var}\,\tilde F_{h,K}(t) + (1 - \mathrm{E}\tilde F_{h,K}(t))^2\right] - 2f(t)(1 - F(t))\left[\mathrm{E}\tilde f(t)\,(1 - \mathrm{E}\tilde F_{h,K}(t))\right] + o\!\left(\frac{1}{nh}\right)
\]
\[
= (1 - F(t))^2\left\{\frac{f(0)}{nh}\left[2\int_c^1 K(u)K(2c - u)\,du + V(K)\right] + o\!\left(\frac{1}{nh}\right) + f^2(t) + o(h)\right\}
\]
\[
\quad + f^2(t)\left\{\frac{1}{n}F(t)(1 - F(t)) + \frac{hf(0)}{n}\left[\int_{-1}^{c}W^2(u)\,du - 2\int_{-1}^{c}W(u)W(u - 2c)\,du + \int_{-1}^{-c}W^2(u)\,du\right] + o(h) + (1 - F(t))^2 + o(h^2)\right\}
\]
\[
\quad - 2f(t)(1 - F(t))\left[f(t) + o(h)\right]\left[1 - F(t) + o(h^2)\right] + o\!\left(\frac{1}{nh}\right)
\]
\[
= (1 - F(t))^2\,\frac{f(0)}{nh}\left[2\int_c^1 K(u)K(2c - u)\,du + V(K)\right] + o\!\left(\frac{1}{nh}\right).
\]

References

[1] A. Azzalini (1981). A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika 68, 326-328.
[2] D. Cox and D. Oakes (1984). Analysis of Survival Data. London: Chapman and Hall.
[3] T. Gasser, H. Müller and V. Mammitzsch (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, Series B 47, 238-252.
[4] I. Horová, J. Koláček, J. Zelinka and A. H. El-Shaarawi (2008). Smooth estimates of distribution functions with application in environmental studies. Advanced Topics on Mathematical Biology and Ecology, pp. 122-127.
[5] I. Horová and J. Zelinka (2007). Contribution to the bandwidth choice for kernel density estimates. Computational Statistics 22, 31-47.
[6] R. Karunamuni and T. Alberts (2005a). A generalized reflection method of boundary correction in kernel density estimation. Canad. J. Statist. 33, 497-509.
[7] R. Karunamuni and T. Alberts (2005b). On boundary correction in kernel density estimation. Statistical Methodology 2, 191-212.
[8] R. Karunamuni and T. Alberts (2006). A locally adaptive transformation method of boundary correction in kernel density estimation. J. Statist. Plann. Inference 136, 2936-2960.
[9] R. Karunamuni and S. Zhang (2007). Some improvements on a boundary corrected kernel density estimator. Statistics & Probability Letters 78, 497-507.
[10] J. Koláček and R. Karunamuni (2009). On boundary correction in kernel estimation of ROC curves. Austrian Journal of Statistics 38, 17-32.
[11] M. Lejeune and P. Sarda (1992). Smooth estimators of distribution and density functions. Computational Statistics & Data Analysis 14, 457-471.
[12] E. Nadaraya (1964). Some new estimates for distribution functions. Theory Probab. Appl. 15, 497-500.
[13] B. Prakasa Rao (1983). Nonparametric Functional Estimation. Academic Press.
[14] R. Reiss (1981). Nonparametric estimation of smooth distribution functions. Scandinavian Journal of Statistics 8, 116-119.
[15] J. Rice and M. Rosenblatt (1976). Estimation of the log survivor function and hazard function. Sankhya 38, 60-78.
[16] B. W. Silverman (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
[17] M. Wand and M. Jones (1995). Kernel Smoothing. London: Chapman and Hall.
[18] G. Watson and M. Leadbetter (1964). Hazard analysis I. Biometrika 51, 175-184.
[19] S. Zhang, R. Karunamuni and M. Jones (1999). An improved estimator of the density function at the boundary. J. Amer. Statist. Assoc. 94, 1231-1241.

AUSTRIAN JOURNAL OF STATISTICS, Volume 38 (2009), Number 1, 17-32

On Boundary Correction in Kernel Estimation of ROC Curves

Jan Koláček¹ and Rohana J. Karunamuni²
¹ Dept. of Mathematics and Statistics, Brno
² Dept. of Mathematical and Statistical Sciences, University of Alberta

Abstract: The Receiver Operating Characteristic (ROC) curve is a statistical tool for evaluating the accuracy of diagnostic tests. The empirical ROC curve (a step function) is the most commonly used nonparametric estimator of the ROC curve; on the other hand, kernel smoothing methods have been used to obtain smooth ROC curves. This process is based on kernel estimates of the distribution functions. It has been observed that kernel distribution estimators are not consistent when estimating a distribution function near the boundary of its support, a problem due to the "boundary effects" that occur in nonparametric functional estimation. To avoid these difficulties, we propose a generalized reflection method of boundary correction for the estimation problem of ROC curves. The proposed method generates a class of boundary-corrected estimators.

Keywords: Reflection, Distribution Estimation.

1 Introduction

The Receiver Operating Characteristic (ROC) curve describes the performance of a diagnostic test which classifies subjects into either a group without condition, G0, or a group with condition, G1, by means of a continuous discriminant score X: a subject is classified as G1 if X ≥ d and as G0 otherwise, for a given cutoff point d ∈ R. The ROC is defined as a plot of the probability of false classification of subjects from G1 versus the probability of true classification of subjects from G0 across all possible cutoff values of X. Specifically, let F0 and F1 denote the distribution functions of X in the groups G0 and G1, respectively. Then the ROC curve can be written as
\[
R(p) = 1 - F_1\!\left(F_0^{-1}(1 - p)\right), \qquad 0 < p < 1,
\]
where p is the false positive rate in (0, 1) as the corresponding cutoff point ranges from $-\infty$ to $+\infty$, and $F_0^{-1}$ denotes the inverse function of F0. A simple nonparametric estimator of R(p) uses the empirical distribution functions for F0 and F1; the resulting ROC curve is a step function and is called the empirical ROC curve. Another type of nonparametric estimator of R(p) is derived from kernel smoothing methods. Kernel smoothing is widely used mainly because it is easy to derive and has good asymptotic and small-sample properties. It has received considerable attention in the density estimation context; see, for example, the monographs of Silverman (1986) and Wand and Jones (1995). However, applications of kernel smoothing to distribution function estimation are relatively few.
Some theoretical properties of a kernel distribution function estimator have been investigated by Nadaraya (1964), Reiss (1981), and Azzalini (1981). Lloyd (1998) proposed a nonparametric estimator of the ROC curve by using kernel estimators for the distribution functions F0 and F1, and Lloyd and Yong (1999) showed that Lloyd's estimator has better mean squared error properties than the empirical ROC curve estimator. However, his estimator has some drawbacks: for example, it is unreliable near the endpoints of the support of the ROC curve due to the so-called "boundary effects" that occur in nonparametric functional estimation. Although there is a vast literature on boundary correction in the density estimation context, the boundary effects problem in the distribution function context has been studied much less. In this paper, we develop a new kernel-type estimator of the ROC curve that removes boundary effects near the endpoints of the support. Our estimator is based on a new boundary-corrected kernel estimator of distribution functions and on the ideas of Karunamuni and Alberts (2005a, 2005b, 2006), Zhang and Karunamuni (1998, 2000), Karunamuni and Zhang (2008), and Zhang, Karunamuni, and Jones (1999) developed for boundary correction in kernel density estimation. The basic technique of construction of the proposed estimator is a kind of generalized reflection method involving reflecting a transformation of the observed data; in fact, the proposed method generates a class of boundary-corrected estimators. We derive expressions for the bias and variance of the proposed estimator. Furthermore, the proposed estimator is compared with the "classical estimator" using simulation studies, and we observe that it successfully removes boundary effects and performs considerably better.

Kernel smoothing in distribution function and ROC curve estimation is discussed in the next section. The proposed estimator is given in Section 3. Simulation results are given in Section 4, a real data example is analyzed in Section 5, and some concluding remarks are given in Section 6.

2 Kernel Smoothing

2.1 Kernel ROC Estimator

Suppose that independent samples $X_{01}, \ldots, X_{0n_0}$ and $X_{11}, \ldots, X_{1n_1}$ are available from two unknown distributions F0 and F1, respectively, where $F_0 \in \mathcal{G}_0$, $F_1 \in \mathcal{G}_1$, and $\mathcal{G}_0$ and $\mathcal{G}_1$ denote two groups of continuous distribution functions. Then a simple nonparametric estimator of the ROC curve $R(p) = 1 - F_1(F_0^{-1}(1 - p))$, 0 < p < 1, is the empirical ROC curve
\[
\hat R_E(p) = 1 - \hat F_1\!\left(\hat F_0^{-1}(1 - p)\right), \qquad 0 \le p \le 1,
\]
where $\hat F_0$ and $\hat F_1$ denote the empirical distribution functions of F0 and F1 based on the data $X_{01}, \ldots, X_{0n_0}$ and $X_{11}, \ldots, X_{1n_1}$, respectively; that is,
\[
\hat F_0(x) = \frac{1}{n_0}\sum_{i=1}^{n_0} I(X_{0i} \le x), \qquad \hat F_1(x) = \frac{1}{n_1}\sum_{i=1}^{n_1} I(X_{1i} \le x).
\]
Note that $\hat R_E$ is not a continuous function; in fact, it is a step function on the interval [0, 1]. This is a notable weakness of the empirical ROC curve. Since the ROC curve is a smooth function of p, we would expect to have an estimator that is smooth as well. Lloyd (1998) proposed a smooth estimator using kernel smoothing techniques; his idea is to replace the unknown distributions F0 and F1 by two smooth kernel estimators.
Specifically, he employed the following kernel estimators of $F_0$ and $F_1$:
$$\widetilde{F}_0(x) = \frac{1}{n_0}\sum_{i=1}^{n_0} W\!\left(\frac{x - X_{0i}}{h_0}\right), \qquad \widetilde{F}_1(x) = \frac{1}{n_1}\sum_{i=1}^{n_1} W\!\left(\frac{x - X_{1i}}{h_1}\right),$$
where $W(x) = \int_{-1}^{x} K(t)\,dt$, $h_0$ and $h_1$ denote bandwidths ($h_0 \to 0$ and $h_1 \to 0$ as $n_0 \to \infty$ and $n_1 \to \infty$, respectively), and $K$ is a unimodal symmetric density function with support $[-1,1]$. The corresponding estimator of the ROC curve $R(p)$ is then given by
$$\widetilde{R}(p) = 1 - \widetilde{F}_1\bigl(\widetilde{F}_0^{-1}(1-p)\bigr), \quad 0 \le p \le 1.$$
An example of a smooth estimate of $R(p)$ using $\widetilde{R}(p)$ is illustrated in Figure 1.

Figure 1: Smooth estimate of $R(p)$ (TPR plotted against FPR).

When $G_0$ and $G_1$ contain distributions with finite support, the estimator $\widetilde{R}$ exhibits boundary effects near the endpoints of the support, due to the same boundary effects that occur in the uncorrected kernel estimators $\widetilde{F}_0$ and $\widetilde{F}_1$. The main purpose of this article is to improve the kernel distribution estimators and thereby to avoid boundary effects of smooth kernel ROC estimators. Details of the boundary problem with $\widetilde{F}_0$ and $\widetilde{F}_1$ are described in the next section.

2.2 Kernel Distribution Estimator and Boundary Effects

Let $f$ denote a continuous density function with support $[0,a]$, $0 < a \le \infty$, and consider nonparametric estimation of the cumulative distribution function $F$ of $f$ based on a random sample $X_1, \ldots, X_n$ from $f$. Suppose that $F^{(j)}$, the $j$-th derivative of $F$, exists and is continuous on $[0,a]$, $j = 0, 1, 2$, with $F^{(0)} = F$ and $F^{(1)} = f$. Then the traditional kernel estimator of $F$ is given by
$$\widehat{F}_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n} W\!\left(\frac{x - X_i}{h}\right), \qquad W(x) = \int_{-1}^{x} K(t)\,dt,$$
where $K$ is a symmetric density function with support $[-1,1]$ and $h$ is the bandwidth ($h \to 0$ as $n \to \infty$). The basic properties of $\widehat{F}_{h,K}(x)$ at interior points are well known (e.g. Lejeune and Sarda, 1992); under some smoothness assumptions these include, for $h \le x \le a - h$,
$$E\widehat{F}_{h,K}(x) - F(x) = \tfrac{1}{2}\beta_2 f^{(1)}(x)h^2 + o(h^2),$$
$$n\,\mathrm{var}\,\widehat{F}_{h,K}(x) = F(x)\bigl(1 - F(x)\bigr) + hf(x)\int_{-1}^{1} W(t)\bigl(W(t) - 1\bigr)\,dt + o(h).$$
The performance of $\widehat{F}_{h,K}(x)$ at boundary points, i.e., for $x \in [0,h) \cup (a-h,a]$, however, differs from that at interior points due to the so-called "boundary effects" that occur in nonparametric curve estimation problems. More specifically, the bias of $\widehat{F}_{h,K}(x)$ is of order $O(h)$ instead of $O(h^2)$ at boundary points, while the variance of $\widehat{F}_{h,K}(x)$ is of the same order. This fact can be clearly seen by examining the behavior of $\widehat{F}_{h,K}$ inside the left boundary region $[0,h]$. Let $x$ be a point in the left boundary, i.e., $x \in [0,h]$. Then we can write $x = ch$, $0 \le c \le 1$. The bias and variance of $\widehat{F}_{h,K}(x)$ at $x = ch$ are of the form
$$E\widehat{F}_{h,K}(x) - F(x) = hf(0)\int_{-1}^{-c} W(t)\,dt + h^2 f^{(1)}(0)\left[\frac{c^2}{2} + c\int_{-1}^{-c} W(t)\,dt - \int_{-1}^{c} tW(t)\,dt\right] + o(h^2), \tag{1}$$
$$n\,\mathrm{var}\,\widehat{F}_{h,K}(x) = F(x)\bigl(1 - F(x)\bigr) + hf(0)\left[\int_{-1}^{c} W^2(t)\,dt - c\right] + o(h). \tag{2}$$
From expression (1) it is now clear that the bias of $\widehat{F}_{h,K}(x)$ is of order $O(h)$ instead of $O(h^2)$. To remove this boundary effect in kernel distribution estimation we investigate a new class of estimators in the next section.
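Lloyd's construction can be sketched in a few lines of Python. This is our code, not the authors'; the quartic kernel (anticipating Section 4) and the numerical inversion of $\widetilde{F}_0$ on a grid are our implementation assumptions:

```python
import numpy as np

def W_quartic(x):
    """Integrated quartic kernel: W(x) = int_{-1}^x (15/16)(1 - t^2)^2 dt."""
    x = np.clip(x, -1.0, 1.0)
    return (15.0 / 16.0) * (x - 2.0 * x**3 / 3.0 + x**5 / 5.0 + 8.0 / 15.0)

def kernel_cdf(x, data, h):
    """Classical kernel distribution estimator: mean of W((x - X_i) / h)."""
    return W_quartic((np.atleast_1d(x)[:, None] - data[None, :]) / h).mean(axis=1)

def smooth_roc(p, x0, x1, h0, h1, grid_size=2000):
    """Lloyd-type smooth ROC estimate, inverting F0_tilde numerically."""
    lo = min(x0.min(), x1.min()) - max(h0, h1)
    hi = max(x0.max(), x1.max()) + max(h0, h1)
    grid = np.linspace(lo, hi, grid_size)
    F0 = kernel_cdf(grid, x0, h0)                  # nondecreasing in x
    idx = np.minimum(np.searchsorted(F0, 1.0 - np.atleast_1d(p)), grid_size - 1)
    return 1.0 - kernel_cdf(grid[idx], x1, h1)
```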
3 The Proposed Estimator

In this section we propose a class of estimators of the distribution function $F$ of the form
$$\widetilde{F}_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n}\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right], \tag{3}$$
where $h$ is the bandwidth, $K$ is a symmetric density function with support $[-1,1]$, and $g_1$ and $g_2$ are two transformations that need to be determined. The same type of estimator in the density estimation case has been discussed in Zhang et al. (1999). As in that paper, we assume that $g_i$, $i = 1, 2$, are nonnegative, continuous and monotonically increasing functions defined on $[0,\infty)$. Further assume that $g_i^{-1}$ exists, $g_i(0) = 0$, $g_i^{(1)}(0) = 1$, and that $g_i^{(2)}$ exists and is continuous on $[0,\infty)$, where $g_i^{(j)}$ denotes the $j$-th derivative of $g_i$, with $g_i^{(0)} = g_i$, and $g_i^{-1}$ denotes the inverse function of $g_i$, $i = 1, 2$. We will choose $g_1$ and $g_2$ such that $\widetilde{F}_{h,K}(x) \ge 0$ everywhere. Note that the $i$-th term of the sum in (3) can be expressed as
$$W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right) = \int_{(-x+g_1(X_i))/h}^{(x+g_2(X_i))/h} K(t)\,dt.$$
The preceding integral is non-negative provided the inequality $-x + g_1(X_i) \le x + g_2(X_i)$ holds. Since $x \ge 0$, the preceding inequality will be satisfied if $g_1$ and $g_2$ are such that $g_1(X_i) \le g_2(X_i)$ for $i = 1, \ldots, n$. Thus we will assume that $g_1$ and $g_2$ are chosen such that $g_1(x) \le g_2(x)$ for $x \in [0,\infty)$ for our proposed estimator. Now we can obtain the bias and variance of (3) at $x = ch$, $0 \le c \le 1$, as
$$E\widetilde{F}_{h,K}(x) - F(x) = h^2\Bigl[f^{(1)}(0)\Bigl(\frac{c^2}{2} + 2c\int_{-1}^{-c} W(t)\,dt - \int_{-c}^{c} tW(t)\,dt\Bigr) - f(0)g_1^{(2)}(0)\int_{-1}^{c}(c-t)W(t)\,dt - f(0)g_2^{(2)}(0)\int_{-1}^{-c}(c+t)W(t)\,dt\Bigr] + o(h^2), \tag{4}$$
$$n\,\mathrm{var}\,\widetilde{F}_{h,K}(x) = F(x)\bigl(1-F(x)\bigr) + hf(0)\Bigl[\int_{-1}^{c} W^2(t)\,dt - c - 2\int_{-1}^{c} W(t)W(t-2c)\,dt + \int_{-1}^{-c} W^2(t)\,dt\Bigr] + o(h). \tag{5}$$
The proofs of (4) and (5) are given in the Appendix. Note that the contribution of $g_2$ to the bias vanishes as $c \to 1$. By comparing expressions (1), (4), (2), and (5) at boundary points, we can see that the variances are of the same order, while the bias of $\widehat{F}_{h,K}(x)$ is of order $O(h)$ whereas the bias of $\widetilde{F}_{h,K}(x)$ is of order $O(h^2)$. So our proposed estimator removes boundary effects in kernel distribution estimation, since the bias at boundary points is of the same order as the bias at interior points.

It is clear that there are various possible choices available for the pair $(g_1, g_2)$. However, we will choose $g_1$ and $g_2$ so that the condition $\widetilde{F}_{h,K}(0) = 0$ is satisfied, because of the fact that $F(0) = 0$. A sufficient (but not necessary) condition for this requirement is that $g_1$ and $g_2$ be equal. Thus we need to construct a single transformation function $g$ such that $g = g_1 = g_2$. Other important properties that are desirable in the estimator $\widetilde{F}_{h,K}$ are local adaptivity (i.e., the transformation function $g$ depends on $c$) and that $\widetilde{F}_{h,K}(x)$ equals the usual kernel estimator $\widehat{F}_{h,K}(x)$ at interior points. For the latter, $g$ must satisfy $g(y) \to y$ as $c \to 1$. In order to display the dependence of $g$ on $c$, $0 \le c \le 1$, we shall denote $g$ by $g_c$ in what follows. Summarizing all the assumptions, it is clear now that $g_c$ should satisfy the conditions

(i) $g_c : [0,\infty) \to [0,\infty)$; $g_c$ is continuous and monotonically increasing, and $g_c^{(i)}$ exists, $i = 1, 2$;
(ii) $g_c^{-1}(0) = 0$ and $g_c^{(1)}(0) = 1$;
(iii) $g_c(y) \to y$ for $c \to 1$.

Functions satisfying conditions (i) to (iii) are easy to construct. The trivial choice is $g_c(y) = y$, which represents the "classical" reflection method estimator. Based on extensive simulations, we observed that the following transformation adapts well to various shapes of distributions:
$$g_c(y) = y + \tfrac{1}{2}\, I_c\, y^2, \tag{6}$$
for $y \ge 0$ and $0 \le c \le 1$, where $I_c = \int_{-1}^{-c} W(t)\,dt$.
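A minimal sketch of the proposed estimator (3) with the single transformation $g_c$ from (6), reusing W_quartic from the sketch above; the quadrature for $I_c$ and the local choice $c = \min(x/h, 1)$ are our implementation assumptions:

```python
def I_c(c, n_grid=400):
    """I_c = int_{-1}^{-c} W(t) dt, by trapezoidal quadrature."""
    t = np.linspace(-1.0, -c, n_grid)
    return np.trapz(W_quartic(t), t)

def g_transform(y, c):
    """Transformation (6): g_c(y) = y + (1/2) I_c y^2."""
    return y + 0.5 * I_c(c) * y**2

def corrected_kernel_cdf(x, data, h):
    """Proposed estimator (3) with g1 = g2 = g_c, applied locally with
    c = min(x/h, 1); for c = 1, I_c = 0 and the estimator reduces to the
    plain reflection estimator (and to the classical one at interior x)."""
    out = []
    for xj in np.atleast_1d(x):
        c = min(xj / h, 1.0)
        gX = g_transform(data, c)
        out.append(np.mean(W_quartic((xj - gX) / h)
                           - W_quartic(-(xj + gX) / h)))
    return np.array(out)
```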
Remark: Some discussion of the above choice of $g_c$, and of various other improvements that can be made, is appropriate here. It is possible to construct functions $g_c$ that improve the bias further under some additional conditions. For instance, if one examines the right-hand side of the bias expansion (4), then it is not difficult to see that the term inside the brackets (i.e., the coefficient of $h^2$) can be made equal to zero if $g_c$ is appropriately chosen. Indeed, if $g_c$ is chosen such that
$$f(0)g_c^{(2)}(0)\left[\int_{-1}^{c}(c-t)W(t)\,dt + \int_{-1}^{-c}(c+t)W(t)\,dt\right] = f^{(1)}(0)\left[\frac{c^2}{2} + 2c\int_{-1}^{-c}W(t)\,dt - \int_{-c}^{c} tW(t)\,dt\right],$$
then the bias of $\widetilde{F}_{h,K}(x)$ would theoretically be of order $O(h^3)$. For such a function $g_c$, the second derivative at zero, $g_c^{(2)}(0)$, will depend on the ratio $d_1 = f^{(1)}(0)/f(0)$. In this case, the function $g_c$ would probably be some cubic polynomial; see e.g. Karunamuni and Alberts (2005a, 2005b, 2006). Then the problem of estimating $d_1$ naturally arises, as in the papers cited above. Another problem one would face is that the second derivative $g_c^{(2)}(0)$ may not go to 0 as $c \to 1$, as it does in the density estimation context. Thus one may not be able to find any function $g_c$ which satisfies condition (iii), and hence the estimator $\widetilde{F}_{h,K}$ loses the property of "natural extension" to the classical estimator outside the boundary points. These are the main reasons why we decided to implement the quadratic function defined in (6) as our choice of transformation.

4 Simulation

To test the effectiveness of our estimator, we simulated its performance against the reflection method. The simulation is based on 1000 replications. In each replication, the random variables $X_0 \sim \mathrm{Exp}(2)$ and $X_1 \sim \mathrm{Gamma}(3,2)$ were generated and the estimate of the ROC curve was computed. The probability distributions of both groups $G_0$ and $G_1$ are illustrated in Figure 2. In all replications, sample sizes of $n_0 = n_1 = 50$ were used. In this case, the actual global optimal bandwidths (see Azzalini, 1981) for $F_0$ and $F_1$ are $h_{F_0} = 2.9149$ and $h_{F_1} = 5.8298$, respectively. For the kernel estimation of the cumulative distributions we used the quartic kernel
$$K(x) = \frac{15}{16}(1 - x^2)^2\, I_{[-1,1]}(x),$$
where $I_A$ is the indicator function of the set $A$. In our experience, the quality of the curve estimated with this kernel is not too sensitive to the optimal bandwidth choice; hence we used this kernel also in the next section. For each ROC curve we calculated the mean integrated squared error (MISE) on the interval $[0,1]$ over all 1000 replications and displayed the results in a boxplot in Figure 3. The variance of each estimator can be gauged by the whiskers of the plot. The values of the means and standard deviations of the MISE for each method are given in Table 1. We also obtained 10 typical realizations of each estimator and displayed them in Figure 4 for comparison with the theoretical ROC curve; the solid line represents the theoretical ROC curve and the dotted lines illustrate the 10 realizations.

Figure 2: The probability distributions of groups $G_0$ and $G_1$.

Table 1: Means and standard deviations of the MISE.

Method      Mean     STD
Proposed    0.0053   0.0047
Reflection  0.0065   0.0050
Classical   0.0084   0.0054

The final estimate of the ROC curve depends on estimates of the cumulative distribution functions $F_0$ and $F_1$. While boundary effects cause problems when estimating $F_0$ and $F_1$ inside the left boundary region, the quality of the final estimate of the ROC can also be influenced by these effects near the right boundary of the interval $[0,1]$.
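This experiment can be reproduced roughly as follows. The sketch below is our code; in particular, the parametrizations (exponential with scale 2, gamma with shape 3 and scale 2) are our reading of $\mathrm{Exp}(2)$ and $\mathrm{Gamma}(3,2)$, and scipy is used only to evaluate the true ROC:

```python
from scipy import stats

def mise_roc(est, n_rep=1000, n0=50, n1=50, h0=2.9149, h1=5.8298, seed=0):
    """Monte Carlo MISE over [0, 1] of a ROC estimator est(p, x0, x1, h0, h1)."""
    rng = np.random.default_rng(seed)
    F0, F1 = stats.expon(scale=2.0), stats.gamma(a=3.0, scale=2.0)
    p = np.linspace(0.001, 0.999, 500)
    true_roc = 1.0 - F1.cdf(F0.ppf(1.0 - p))
    ise = []
    for _ in range(n_rep):
        x0 = rng.exponential(2.0, n0)
        x1 = rng.gamma(3.0, 2.0, n1)
        ise.append(np.trapz((est(p, x0, x1, h0, h1) - true_roc) ** 2, p))
    return np.mean(ise), np.std(ise)

# e.g. mise_roc(smooth_roc) for the classical kernel ROC estimator
```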
As can be seen in Figure 4, the biggest difference between the above-mentioned methods is in the second half of the interval $[0,1]$. Table 1 describes the performance of our proposed method with respect to the MISE. The values of the mean and the standard deviation of the MISE were smallest in the case of our proposed estimator. Although the theoretical bias of our estimator is of the same order as that of the reflection method, the numerical results for the estimated ROC curves in the simulation were better for our estimator. In our opinion, this is due to the fact that our estimator is locally adaptive.

Figure 3: Boxplots of the MISE over $[0,1]$ for our proposed method (1), the reflection method (2), and the classical estimator with boundary effects (3).

Figure 4: Estimates of the ROC for our proposed method (1), the reflection method (2), and the classical estimator with boundary effects (3).

5 Consumer Loans Data

In this example we used some (unspecified) scoring function to predict the solidity of a client. The goal here is to determine which clients are able to pay their loans. We considered a test set of 332 clients; 309 paid their loans (group $G_0$) and 22 had problems with payments or did not pay (group $G_1$). We used the ROC curve to assess the discrimination between clients with and without good solidity. It is of interest to know whether our scoring function is a good predictor of solidity. Estimates of the ROC are illustrated in Figure 5; the dashed line represents the estimate obtained by our proposed method and the solid line the kernel ROC with boundary effects. When choosing the optimal bandwidths for distribution function estimation, we used the method described in Horová, Koláček, Zelinka, and El-Shaarawi (2008). A somewhat similar method for density estimation is given in Sheather and Jones (1991). The optimal bandwidths for the distribution functions $F_0$ and $F_1$ were estimated as $\hat{h}_{F_0} = 0.0068$ and $\hat{h}_{F_1} = 0.0286$, respectively.

Figure 5: The estimate of the ROC for the consumer loans data.

From the estimates of the ROC one can see that the scoring function is not a good predictor of the solidity of a client. This fact could also be affected by the different sizes of the two groups: when group $G_1$ is too small, it causes larger boundary effects. It is clearly visible that the estimate of the ROC obtained by the classical estimator (solid line) takes some values under the diagonal of the unit square, a situation which cannot occur theoretically. Thus boundary effects have a substantial influence on the quality of the final estimates of the ROC.

6 Conclusion

In this paper we proposed a new kernel-type distribution estimator to avoid the difficulties near the boundary. The technique implemented is a kind of generalized reflection method involving reflecting a transformation of the data.
The proposed method generates a class of boundary corrected estimators, and it is based on the ideas of boundary correction for kernel density estimators presented in Karunamuni and Alberts (2005a, 2005b, 2006). We showed some good properties of our proposed method (e.g., local adaptivity). Furthermore, it was shown that the bias of the proposed estimator is smaller than that of the "classical" one.

Acknowledgements

The research was supported by the Jaroslav Hájek Center for Theoretical and Applied Statistics (grant No. LC 06024). The second author's research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Appendix

Proof of (4). For $x = ch$, $0 \le c \le 1$, using the property $W(t) = 1 - W(-t)$ we obtain
$$E\widetilde{F}_{h,K}(x) = E\,W\!\left(\frac{x - g_1(X_i)}{h}\right) - E\,W\!\left(-\frac{x + g_2(X_i)}{h}\right) = \int_0^\infty W\!\left(\frac{x - g_1(y)}{h}\right) f(y)\,dy - \int_0^\infty W\!\left(-\frac{x + g_2(y)}{h}\right) f(y)\,dy$$
$$= h\int_{-1}^{c} W(t)\,\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt - h\int_{-1}^{-c} W(t)\,\frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\,dt$$
$$= h\int_{-1}^{-c} W(t)\left[\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)} - \frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\right]dt + h\int_{-c}^{c}\bigl(1 - W(-t)\bigr)\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt$$
$$= h\int_{-1}^{-c} W(t)\left[\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)} - \frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\right]dt + F\bigl(g_1^{-1}(2ch)\bigr) - h\int_{-c}^{c} W(t)\,\frac{f\bigl(g_1^{-1}((c+t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c+t)h)\bigr)}\,dt.$$
Using a Taylor expansion of order 2 of the function $F\bigl(g_1^{-1}(\cdot)\bigr)$ we have
$$F\bigl(g_1^{-1}(2ch)\bigr) = F(0) + f(0)\,2ch + \bigl[f^{(1)}(0) - f(0)g_1^{(2)}(0)\bigr]\,2c^2h^2 + o(h^2).$$
By the existence and continuity of $F^{(2)}(\cdot)$ near 0, we obtain for $x = ch$
$$F(0) = F(x) - f(x)ch + \tfrac{1}{2}f^{(1)}(x)c^2h^2 + o(h^2), \qquad f(x) = f(0) + f^{(1)}(0)ch + o(h), \qquad f^{(1)}(x) = f^{(1)}(0) + o(1).$$
Therefore,
$$F\bigl(g_1^{-1}(2ch)\bigr) = F(x) + f(0)ch + \Bigl[\tfrac{3}{2}f^{(1)}(0) - 2f(0)g_1^{(2)}(0)\Bigr]c^2h^2 + o(h^2). \tag{7}$$
Now, (7) and a Taylor expansion of order 1 of the functions $f\bigl(g_1^{-1}(\cdot)\bigr)/g_1^{(1)}\bigl(g_1^{-1}(\cdot)\bigr)$ and $f\bigl(g_2^{-1}(\cdot)\bigr)/g_2^{(1)}\bigl(g_2^{-1}(\cdot)\bigr)$ give
$$E\widetilde{F}_{h,K}(x) - F(x) = h\int_{-1}^{-c} W(t)\Bigl[2f^{(1)}(0)ch - f(0)h\bigl((c-t)g_1^{(2)}(0) + (c+t)g_2^{(2)}(0)\bigr) + o(h)\Bigr]dt$$
$$\quad + f(0)ch + \Bigl[\tfrac{3}{2}f^{(1)}(0) - 2f(0)g_1^{(2)}(0)\Bigr]c^2h^2 + o(h^2) - h\int_{-c}^{c} W(t)\Bigl[f(0) + \bigl(f^{(1)}(0) - f(0)g_1^{(2)}(0)\bigr)(c+t)h + o(h)\Bigr]dt$$
$$= h\Bigl[f(0)c - f(0)\int_{-c}^{c} W(t)\,dt\Bigr] + h^2\Bigl[\tfrac{3}{2}f^{(1)}(0)c^2 + 2f^{(1)}(0)c\int_{-1}^{-c} W(t)\,dt - 2f(0)g_1^{(2)}(0)c^2$$
$$\quad - f(0)g_1^{(2)}(0)\int_{-1}^{-c}(c-t)W(t)\,dt - f(0)g_2^{(2)}(0)\int_{-1}^{-c}(c+t)W(t)\,dt - \bigl(f^{(1)}(0) - f(0)g_1^{(2)}(0)\bigr)\int_{-c}^{c}(c+t)W(t)\,dt\Bigr] + o(h^2).$$
From the symmetry of $K$ and the definition of $W$, one can write $W(x) = \tfrac{1}{2} + b(x)$, where $b(x) = -b(-x)$ for all $x$ with $|x| \le 1$. Thus $\int_{-c}^{c} W(t)\,dt = c$, and therefore the coefficient of $h$ is zero. So after some algebra we obtain the bias expression
$$E\widetilde{F}_{h,K}(x) - F(x) = h^2\Bigl[f^{(1)}(0)\Bigl(\frac{c^2}{2} + 2c\int_{-1}^{-c} W(t)\,dt - \int_{-c}^{c} tW(t)\,dt\Bigr) - f(0)g_1^{(2)}(0)\int_{-1}^{c}(c-t)W(t)\,dt - f(0)g_2^{(2)}(0)\int_{-1}^{-c}(c+t)W(t)\,dt\Bigr] + o(h^2). \qquad \Box$$

Proof of (5). Observe that for $x = ch$, $0 \le c \le 1$, we have
$$n\,\mathrm{var}\,\widetilde{F}_{h,K}(x) = \frac{1}{n}\,\mathrm{var}\sum_{i=1}^{n}\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right]$$
$$= E\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right]^2 - \left(E\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right]\right)^2 = A_1 - A_2,$$
where
$$A_1 = \int_0^\infty\left[W\!\left(\frac{x - g_1(y)}{h}\right) - W\!\left(-\frac{x + g_2(y)}{h}\right)\right]^2 f(y)\,dy$$
$$= \int_0^\infty\left[W^2\!\left(\frac{x - g_1(y)}{h}\right) + W^2\!\left(-\frac{x + g_2(y)}{h}\right)\right] f(y)\,dy - \int_0^\infty 2\,W\!\left(\frac{x - g_1(y)}{h}\right) W\!\left(-\frac{x + g_2(y)}{h}\right) f(y)\,dy$$
$$= h\int_{-1}^{-c} W^2(t)\left[\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)} + \frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\right]dt + h\int_{-c}^{c} W^2(t)\,\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt$$
$$\quad - \int_0^\infty 2\,W\!\left(\frac{x - g_1(y)}{h}\right) W\!\left(-\frac{x + g_2(y)}{h}\right) f(y)\,dy =: A_{1,1} + A_{1,2} - A_{1,3}.$$
Using a Taylor expansion as in the preceding proof, it can be shown that
$$A_{1,1} = h\int_{-1}^{-c} W^2(t)\left[\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)} + \frac{f\bigl(g_2^{-1}((-c-t)h)\bigr)}{g_2^{(1)}\bigl(g_2^{-1}((-c-t)h)\bigr)}\right]dt = h\int_{-1}^{-c} W^2(t)\bigl(2f(0) + o(1)\bigr)\,dt.$$
For $A_{1,2}$ we use the identity $W(t) = 1 - W(-t)$, and similarly as in the preceding proof we get
$$A_{1,2} = h\int_{-c}^{c} W^2(t)\,\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt = h\int_{-c}^{c}\bigl(1 - 2W(-t) + W^2(-t)\bigr)\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt$$
$$= h\int_{-c}^{c}\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt - 2h\int_{-c}^{c} W(t)\,\frac{f\bigl(g_1^{-1}((c+t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c+t)h)\bigr)}\,dt + h\int_{-c}^{c} W^2(t)\,\frac{f\bigl(g_1^{-1}((c+t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c+t)h)\bigr)}\,dt$$
$$= F\bigl(g_1^{-1}(2ch)\bigr) - 2h\int_{-c}^{c} W(t)\bigl(f(0) + o(1)\bigr)\,dt + h\int_{-c}^{c} W^2(t)\bigl(f(0) + o(1)\bigr)\,dt = F(x) - f(0)ch + hf(0)\int_{-c}^{c} W^2(t)\,dt + o(h).$$
Using the continuity of $g_i^{(2)}$, $g_i(0) = 0$, and $g_i^{(1)}(0) = 1$, $i = 1, 2$, and a Taylor expansion of order 2 of $g_2\bigl(g_1^{-1}(\cdot)\bigr)$, we have
$$g_2\bigl(g_1^{-1}((c-t)h)\bigr) = g_2\bigl(g_1^{-1}(0)\bigr) + \frac{g_2^{(1)}\bigl(g_1^{-1}(0)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}(0)\bigr)}(c-t)h + o(h) = (c-t)h + o(h).$$
With the preceding expansion we obtain
$$A_{1,3} = \int_0^\infty 2\,W\!\left(\frac{x - g_1(y)}{h}\right)W\!\left(-\frac{x + g_2(y)}{h}\right)f(y)\,dy = 2h\int_{-1}^{c} W(t)\,W\!\left(-\frac{x}{h} - \frac{g_2\bigl(g_1^{-1}((c-t)h)\bigr)}{h}\right)\frac{f\bigl(g_1^{-1}((c-t)h)\bigr)}{g_1^{(1)}\bigl(g_1^{-1}((c-t)h)\bigr)}\,dt$$
$$= 2h\int_{-1}^{c} W(t)\,W\!\left(\frac{-ch - (c-t)h - o(h)}{h}\right)\bigl(f(0) + o(1)\bigr)\,dt = 2hf(0)\int_{-1}^{c} W(t)W(t - 2c)\,dt + o(h).$$
Now we can express $A_1$ as
$$A_1 = A_{1,1} + A_{1,2} - A_{1,3} = 2hf(0)\int_{-1}^{-c} W^2(t)\,dt + F(x) - f(0)ch + hf(0)\int_{-c}^{c} W^2(t)\,dt - 2hf(0)\int_{-1}^{c} W(t)W(t-2c)\,dt + o(h)$$
$$= F(x) + hf(0)\left[2\int_{-1}^{-c} W^2(t)\,dt - c + \int_{-c}^{c} W^2(t)\,dt - 2\int_{-1}^{c} W(t)W(t-2c)\,dt\right] + o(h).$$
With the expression obtained for the bias, we obtain for $A_2$
$$A_2 = \left(E\left[W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right]\right)^2 = \bigl(E\widetilde{F}_{h,K}(x)\bigr)^2 = F^2(x) + o(h).$$
Finally, we obtain the variance of the estimator as
$$n\,\mathrm{var}\,\widetilde{F}_{h,K}(x) = A_1 - A_2 = F(x)\bigl(1 - F(x)\bigr) + hf(0)\left[2\int_{-1}^{-c} W^2(t)\,dt - c + \int_{-c}^{c} W^2(t)\,dt - 2\int_{-1}^{c} W(t)W(t-2c)\,dt\right] + o(h). \qquad \Box$$

References

Azzalini, A. (1981). A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika, 68, 326-328.
Horová, I., Koláček, J., Zelinka, J., and El-Shaarawi, A. H. (2008). Smooth estimates of distribution functions with application in environmental studies. Advanced Topics on Mathematical Biology and Ecology, 122-127.
Karunamuni, R. J., and Alberts, T. (2005a). A generalized reflection method of boundary correction in kernel density estimation. Canadian Journal of Statistics, 33, 497-509.
Karunamuni, R. J., and Alberts, T. (2005b). On boundary correction in kernel density estimation. Statistical Methodology, 2, 191-212.
Karunamuni, R. J., and Alberts, T. (2006). A locally adaptive transformation method of boundary correction in kernel density estimation. Journal of Statistical Planning and Inference, 136, 2936-2960.
Karunamuni, R. J., and Zhang, S. (2008). Some improvements on a boundary corrected kernel density estimator. Statistics & Probability Letters, 78, 497-507.
Lejeune, M., and Sarda, P. (1992). Smooth estimators of distribution and density functions. Computational Statistics & Data Analysis, 14, 457-471.
Lloyd, C. J. (1998). The use of smoothed ROC curves to summarise and compare diagnostic systems. Journal of the American Statistical Association, 93, 1356-1364.
Lloyd, C. J., and Yong, Z. (1999). Kernel estimators of the ROC curve are better than empirical. Statistics and Probability Letters, 44, 221-228.
Nadaraya, E. A. (1964). Some new estimates for distribution functions. Theory of Probability and its Applications, 15, 497-500.
Reiss, R. D. (1981). Nonparametric estimation of smooth distribution functions. Scandinavian Journal of Statistics, 8, 116-119.
Sheather, S. J., and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Wand, M. P., and Jones, M. C. (1995). Kernel Smoothing. London: Chapman and Hall.
Zhang, S., and Karunamuni, R. J. (1998). On kernel density estimation near endpoints. Journal of Statistical Planning and Inference, 70, 301-316.
Zhang, S., and Karunamuni, R. J. (2000). On nonparametric density estimation at the boundary. Nonparametric Statistics, 12, 197-221.
Zhang, S., Karunamuni, R. J., and Jones, M. C. (1999). An improved estimator of the density function at the boundary. Journal of the American Statistical Association, 94, 1231-1241.

Authors' addresses:

Jan Koláček, Department of Mathematics and Statistics, Faculty of Science, Kotlářská 2, 611 37 Brno, Czech Republic. E-mail: kolacek@math.muni.cz

Rohana J. Karunamuni, Department of Mathematical and Statistical Sciences, University of Alberta, T6G 2G1 Edmonton, Canada. E-mail: R.J.Karunamuni@ualberta.ca

Computational Statistics (2008) 23:63-78, DOI 10.1007/s00180-007-0068-6

ORIGINAL PAPER

Plug-in method for nonparametric regression

Jan Koláček

Accepted: 5 October 2006 / Published online: 25 September 2007. © Springer-Verlag 2007

Abstract: The problem of bandwidth selection for non-parametric kernel regression is considered. We focus in particular on the Nadaraya-Watson and local linear estimators. The circular design is assumed in this work to avoid the difficulties caused by boundary effects. Most bandwidth selectors are based on the residual sum of squares (RSS). It is often observed in simulation studies that these selectors are biased toward undersmoothing. This leads to the consideration of a procedure which stabilizes the RSS by modifying the periodogram of the observations. As a result of this procedure, we obtain an estimate of the unknown parameters of the average mean squared error (AMSE) function. This process is known as a plug-in method. Simulation studies suggest that the plug-in method could have preferable properties to the classical one.

Keywords: Bandwidth selection, Fourier transform, Kernel estimation, Nonparametric regression

1 Introduction

In nonparametric regression estimation, a critical and inevitable step is to choose the smoothing parameter (bandwidth) to control the smoothness of the curve estimate. The smoothing parameter considerably affects the features of the estimated curve. Although in practice one can try several bandwidths and choose a bandwidth subjectively, automatic (data-driven) selection procedures can be useful in many situations; see Silverman (1985) for more examples. (Supported by the MSMT: LC 06024. J. Koláček, Faculty of Science, Masaryk University, Janáčkovo nám. 2a, Brno, Czech Republic; e-mail: kolacek@math.muni.cz)

Several automatic bandwidth selectors have been proposed and studied in Craven and Wahba (1979), Härdle (1990), Härdle et al. (1988), Droge (1996), and the references given therein. It is well recognized that these bandwidth estimates are subject to large sample variation.
The kernel estimates based on the bandwidths selected by these procedures can have very different appearances. Due to the large sample variation, classical bandwidth selectors might not be very useful in practice. In the simulation study of Chiu (1990), it was observed that Mallows' criterion gives smaller bandwidths more frequently than predicted by the asymptotic theorems. Chiu (1990) provided an explanation of the cause and suggested a procedure to overcome the difficulty. By applying this procedure, we introduce a new method for bandwidth selection which gives much more stable bandwidth estimates.

2 Kernel regression

Consider a standard regression model of the form
$$Y_t = m(x_t) + \varepsilon_t, \quad t = 0, \ldots, T-1, \ T \in \mathbb{N},$$
where $m$ is an unknown regression function, $x_t$ are design points, $Y_t$ are measurements and $\varepsilon_t$ are independent random variables for which $E(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = \sigma^2 > 0$, $t = 0, \ldots, T-1$. The aim of kernel smoothing is to find a suitable approximation $\widehat{m}$ of the unknown function $m$. In what follows we assume:

1. The design points $x_t$ are equidistantly distributed on the interval $[0,1]$, that is, $x_t = t/T$, $t = 0, \ldots, T-1$.
2. We use a "cyclic design", that is, we suppose $m(x)$ is a smooth periodic function and the estimate is obtained by applying the kernel to the extended series $Y_t$, where $Y_{t+kT} = Y_t$ for $k \in \mathbb{Z}$. Similarly $x_t = t/T$, $t \in \mathbb{Z}$.

$\mathrm{Lip}[a,b]$ denotes the class of continuous functions satisfying the inequality $|g(x) - g(y)| \le L|x - y|$ for all $x, y \in [a,b]$, where $L > 0$ is a constant.

Definition. Let $\kappa$ be a nonnegative even integer, $\kappa \ge 2$. A function $K \in \mathrm{Lip}[-1,1]$ with $\mathrm{support}(K) = [-1,1]$, satisfying the conditions

(i) $K(-1) = K(1) = 0$,
(ii) $\int_{-1}^{1} x^j K(x)\,dx = \begin{cases} 0, & 0 < j < \kappa \\ 1, & j = 0 \\ \beta_\kappa \ne 0, & j = \kappa, \end{cases}$

is called a kernel of order $\kappa$, and the class of all these kernels is denoted $S_{0\kappa}$. These kernels are used for the estimation of the regression function (see Wand and Jones, 1995). Let $K \in S_{0\kappa}$ and set $K_h(\cdot) = \frac{1}{h}K(\frac{\cdot}{h})$, $h \in (0,1)$. The parameter $h$ is called a bandwidth. Commonly used non-parametric methods for estimating $m(x)$ are the kernel estimators

1. the Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964)
$$\widehat{m}_{NW}(x;h) = \frac{\sum_{k=-T}^{2T-1} K_h(x_k - x)\,Y_k}{\sum_{k=-T}^{2T-1} K_h(x_k - x)},$$
2. the local linear estimator (Stone, 1977; Cleveland, 1979)
$$\widehat{m}_{LL}(x;h) = \frac{1}{T}\,\frac{\sum_{k=-T}^{2T-1}\bigl\{\hat{s}_2(x;h) - \hat{s}_1(x;h)(x_k - x)\bigr\} K_h(x_k - x)\,Y_k}{\hat{s}_2(x;h)\hat{s}_0(x;h) - \hat{s}_1(x;h)^2},$$
where $\hat{s}_r(x;h) = \frac{1}{T}\sum_{k=-T}^{2T-1}(x_k - x)^r K_h(x_k - x)$.

In the cyclic design, the kernel estimators can generally be expressed as
$$\widehat{m}(x;h) = \sum_{k=-T}^{2T-1} W_k^{(j)}(x)\,Y_k,$$
where the weights $W_k^{(j)}(x)$, $j \in \{NW, LL\}$, correspond to the weights of the estimators $\widehat{m}_{NW}$, $\widehat{m}_{LL}$. The assumption of the circular model leads to the fact that the weights of the Nadaraya-Watson and local linear estimators are identical at the design points, that is,
$$W_k^{(LL)}(x_t) = W_k^{(NW)}(x_t), \quad k \in \{-T, -T+1, \ldots, 2T-1\}, \ t \in \{0, 1, \ldots, T-1\},$$
so in what follows we write only $W_k(x_t)$ without the upper index. Let $K \in S_{0\kappa}$, $h \in (0,1)$, $t \in \{0, \ldots, T-1\}$. Then the sum $\sum_{k=-T}^{2T-1} K_h(x_k - x_t) = \sum_{k=-T+1}^{T-1} K_h(x_k)$ is independent of $t$. Set $C_T := \sum_{k=-T+1}^{T-1} K_h(x_k)$. We can then simply write the value of the weight functions at the design points $x_t$, $t = 0, \ldots, T-1$, as
$$W_k(x_t) = \frac{1}{C_T} K_h(x_k - x_t).$$
The optimal bandwidth considered here is $h_{opt}$, the minimizer of the average mean squared error (AMSE)
$$R_T(h) = \frac{1}{T}\,E\sum_{t=0}^{T-1}\bigl\{m(x_t) - \widehat{m}(x_t;h)\bigr\}^2.$$
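The cyclic Nadaraya-Watson fit at the design points can be sketched in a few lines of Python (our code, not the paper's; the kernel of order 2 from Table 1 below is assumed):

```python
import numpy as np

def epanechnikov(x):
    """Kernel of order 2: K(x) = -3/4 (x^2 - 1) on [-1, 1], 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x**2), 0.0)

def nw_cyclic(Y, h, kernel=epanechnikov):
    """Nadaraya-Watson fit at the design points x_t = t/T under the cyclic
    design: the series is extended periodically to indices -T, ..., 2T-1."""
    Y = np.asarray(Y, dtype=float)
    T = len(Y)
    t = np.arange(T) / T
    xk = np.concatenate([t - 1.0, t, t + 1.0])     # x_{-T}, ..., x_{2T-1}
    Yk = np.tile(Y, 3)
    Kmat = kernel((xk[None, :] - t[:, None]) / h) / h
    return (Kmat @ Yk) / Kmat.sum(axis=1)
```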
Let $K \in S_{0\kappa}$. Under some mild conditions, the AMSE converges to
$$R_T(h) = \frac{\sigma^2 V(K)}{Th} + \frac{h^{2\kappa}}{(\kappa!)^2}\,\beta_\kappa^2\, A_\kappa, \tag{1}$$
where $V(K) = \int_{-1}^{1} K^2(x)\,dx$, $\beta_\kappa = \int_{-1}^{1} x^\kappa K(x)\,dx$, and $A_\kappa = \int_0^1 \bigl(m^{(\kappa)}(x)\bigr)^2\,dx$. This function has a unique minimum
$$h_{opt} = \left(\frac{\sigma^2 V(K)(\kappa!)^2}{2\kappa T \beta_\kappa^2 A_\kappa}\right)^{\frac{1}{2\kappa+1}} \tag{2}$$
(for more details, see Wand and Jones, 1995). There exist many estimators of this error function which are asymptotically equivalent and asymptotically unbiased (see Härdle, 1990; Chiu, 1990, 1991). However, in simulation studies it is often observed that most selectors are biased toward undersmoothing and give smaller bandwidths more frequently than predicted by asymptotic results. Most bandwidth selectors are based on the residual sum of squares (RSS)
$$\mathrm{RSS}_T(h) = \frac{1}{T}\sum_{t=0}^{T-1}\bigl\{Y_t - \widehat{m}(x_t;h)\bigr\}^2.$$
For example, Rice (see Rice, 1984) considered
$$\widehat{R}_T(h) = \mathrm{RSS}_T(h) - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0, \tag{3}$$
where $\hat{\sigma}^2$ is an estimate of $\sigma^2$,
$$\hat{\sigma}^2 = \frac{1}{2T-2}\sum_{t=1}^{T-1}(Y_t - Y_{t-1})^2. \tag{4}$$
The estimate $\hat{h}_{opt}$ of the optimal bandwidth is defined as $\hat{h}_{opt} = \arg\min \widehat{R}_T(h)$.

3 Use of the Fourier transformation

Let $M_t = m(x_t)$, $t = 0, \ldots, T-1$. The periodogram of the vector of observations $\boldsymbol{Y}$ is defined by
$$I_{Y_\lambda} = |Y_\lambda^-|^2 / (2\pi T),$$
where $Y_\lambda^- = \sum_{k=0}^{T-1} Y_k e^{-\frac{i2\pi k\lambda}{T}}$ is the finite Fourier transform of the vector $\boldsymbol{Y}$. This transformation is denoted by $\boldsymbol{Y}^- = \mathrm{DFT}^-(\boldsymbol{Y})$. The periodograms and Fourier transforms of the series $\boldsymbol{\varepsilon}$ and $\boldsymbol{M}$ are defined similarly. Under mild conditions, the periodogram ordinates $I_{\varepsilon_t}$ at the Fourier frequencies $\frac{2\pi t}{T}$, for $t = 1, \ldots, N = \bigl[\frac{T-1}{2}\bigr]$, are approximately independently and exponentially distributed with means $\frac{\sigma^2}{2\pi}$. Here $[x]$ denotes the greatest integer less than or equal to $x$.

Definition. Let $\boldsymbol{x} = (x_0, \ldots, x_{T-1})$, $\boldsymbol{y} = (y_0, \ldots, y_{T-1}) \in \mathbb{C}^T$, and let $z_t = \sum_{k=0}^{T-1} x_{\langle t-k\rangle_T}\, y_k$, where $\langle t-k\rangle_T$ denotes $(t-k) \bmod T$. Then $\boldsymbol{z} = (z_0, \ldots, z_{T-1})$ is called the discrete cyclic convolution of the vectors $\boldsymbol{x}$ and $\boldsymbol{y}$; we write $\boldsymbol{z} = \boldsymbol{x} \circledast \boldsymbol{y}$.

Let us define a vector $\boldsymbol{w} := (w_0, w_1, \ldots, w_{T-1})$, where $w_t = W_0(x_t - 1) + W_0(x_t) + W_0(x_t + 1)$. Let $h \in (0,1)$, $K \in S_{0\kappa}$, $t \in \{0, \ldots, T-1\}$. Then we can write $\widehat{m}(x_t;h)$ as a discrete cyclic convolution of the vectors $\boldsymbol{w}$ and $\boldsymbol{Y}$:
$$\widehat{m}(x_t;h) = \sum_{k=0}^{T-1} w_{\langle t-k\rangle_T}\, Y_k. \tag{5}$$
Applying Parseval's formula yields
$$\mathrm{RSS}_T(h) = \frac{4\pi}{T}\sum_{t=1}^{N} I_{Y_t}\bigl(1 - w_t^-\bigr)^2, \tag{6}$$
where $w_t^- = \sum_{k=-T+1}^{T-1} W_0(x_k)\, e^{-\frac{i2\pi kt}{T}}$ is the finite Fourier transform of $\boldsymbol{w}$ (see Chiu, 1990, for details). From (3) and (6) we arrive at the equivalent expression for $\widehat{R}_T(h)$:
$$\widehat{R}_T(h) = \frac{4\pi}{T}\sum_{t=1}^{N} I_{Y_t}\{1 - w_t^-\}^2 - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0. \tag{7}$$
Similarly,
$$R_T(h) = \frac{4\pi}{T}\sum_{t=1}^{N}\left(I_{M_t} + \frac{\sigma^2}{2\pi}\right)\{1 - w_t^-\}^2 - \sigma^2 + 2\sigma^2 w_0. \tag{8}$$
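Rice's criterion (3) can be minimized directly over a bandwidth grid. The following sketch is our code (reusing nw_cyclic from above); it computes $\hat{\sigma}^2$ from (4) and the diagonal weight $w_0$, which for $h < 1$ reduces to $W_0(x_0) = K_h(0)/C_T$:

```python
def rice_selector(Y, h_grid, kernel=epanechnikov):
    """Bandwidth minimizing Rice's criterion (3):
    R_hat(h) = RSS_T(h) - sigma2 + 2 * sigma2 * w0."""
    Y = np.asarray(Y, dtype=float)
    T = len(Y)
    sigma2 = np.sum(np.diff(Y) ** 2) / (2.0 * T - 2.0)    # estimate (4)
    t = np.arange(T) / T
    xk = np.concatenate([t - 1.0, t, t + 1.0])
    scores = []
    for h in h_grid:
        C = kernel(xk / h).sum()          # proportional to C_T
        w0 = kernel(0.0) / C              # weight the fit puts on Y_t itself
        rss = np.mean((Y - nw_cyclic(Y, h, kernel)) ** 2)
        scores.append(rss - sigma2 + 2.0 * sigma2 * w0)
    return h_grid[int(np.argmin(scores))]
```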
4 The motivation and the plug-in method

Let $D(h) = \widehat{R}_T(h) - R_T(h)$. From the previous expressions we obtain
$$D(h) = \frac{4\pi}{T}\sum_{t=1}^{N}\left(I_{Y_t} - I_{M_t} - \frac{\sigma^2}{2\pi}\right)\{1 - w_t^-\}^2. \tag{9}$$
The periodogram ordinates $I_{M_t}$ decrease rapidly for smooth $m(x)$, so the $I_{Y_t}$ do not contain much information about $I_{M_t}$ at high frequencies (for a rigorous proof see Rice, 1984). This leads to the consideration of the procedure proposed by Chiu (1991). The main idea is to modify the RSS to make it less variable. We find the first index $J_1$ such that $I_{Y_{J_1}} < c\,\hat{\sigma}^2/2\pi$ for some constant $c > 1$, where $\hat{\sigma}^2$ is an estimate of $\sigma^2$. The constant $c$ sets a threshold; in our experience, setting $1 < c < 3$ yields good results. The modified residual sum of squares is defined by
$$\mathrm{MRSS}_T(h) = \frac{2\pi}{T}\sum_{t=0}^{T-1}\tilde{I}_{Y_t}\{1 - w_t^-\}^2, \qquad \tilde{I}_{Y_t} = \begin{cases} I_{Y_t}, & t < J_1 \\ \hat{\sigma}^2/2\pi, & t \ge J_1 \end{cases}$$
(see Figures 1 and 2). Thus, the proposed selector is
$$\widetilde{R}_T(h) = \mathrm{MRSS}_T(h) - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0 \tag{10}$$
and the new estimate of the optimal bandwidth is $\hat{h}_{opt} = \arg\min \widetilde{R}_T(h)$ [for more details see Chiu (1990, 1991)]. To simplify the discussion below, set $c = 2$ and rewrite (10) into the formula in the next lemma.

Lemma 1. Let $J_1$ be the least index such that $I_{Y_{J_1}} < \hat{\sigma}^2/\pi T$. Then
$$\widetilde{R}_T(h) = \frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}(w_t^-)^2 + \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2.$$

Fig. 1: The periodogram ordinates $I_{Y_t}$ as a function of $t$; the threshold $a = 2\frac{\hat{\sigma}^2}{2\pi}$ is marked.

Fig. 2: The modified periodogram ordinates $\tilde{I}_{Y_t}$ as a function of $t$, with $a = 2\frac{\hat{\sigma}^2}{2\pi}$ and $b = \frac{\hat{\sigma}^2}{2\pi}$.

The main idea of the plug-in method is to estimate the unknown parameters $\sigma^2$ and $A_\kappa$ in the expression (2) for the optimal bandwidth $h_{opt}$, which is the minimum of
$$R_T(h) = \frac{\sigma^2 V(K)}{Th} + \frac{h^{2\kappa}}{(\kappa!)^2}\,\beta_\kappa^2 A_\kappa.$$
As an estimate of $\sigma^2$ we can use (4), but for $A_\kappa$ the situation is more complicated. From the previous considerations we can replace the error function $R_T(h)$ by the selector $\widetilde{R}_T(h)$ expressed in Lemma 1. If we compare these two error functions, we arrive at the results described in the next theorems.

Theorem 1. Let $\boldsymbol{w}^-$ be the discrete Fourier transform of the vector $\boldsymbol{w}$. Then
$$\sum_{t=0}^{T-1}(w_t^-)^2 = \frac{1}{h}V(K) + O(T^{-1}). \tag{11}$$

The previous theorem implies that the first term of $\widetilde{R}_T(h)$ estimates the first term of $R_T(h)$, that is,
$$\frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}(w_t^-)^2 = \frac{\sigma^2 V(K)}{Th} + O(T^{-2}).$$
Next, we compare the second terms of these error functions to obtain an estimator of $A_\kappa$. Let $\varepsilon > 0$ and $h \in (0,1)$, and let $J_2$ be the last index from $\{0, \ldots, T-1\}$ for which
$$J_2 \le \frac{\sqrt[\kappa+1]{\varepsilon(\kappa+1)!}}{2\pi h}.$$
Let us remark that the parameter $\varepsilon$ is the error of the Taylor approximation used in the proof of Theorem 2, and the parameter $h$ is some "starting" approximation of $h_{opt}$. In our experience, setting $\varepsilon = 10^{-3}$ and $h = \frac{\kappa}{T}$ yields good results. Since we require that the conditions on the indexes $J_1$ and $J_2$ hold at the same time, we define the index
$$J = \min\{J_1, J_2 + 1\}. \tag{12}$$

Theorem 2. Let $J$ be the index defined by (12). Then for all $j \in \mathbb{N}$, $1 \le j \le J-1$, it holds that
$$\frac{1}{(2\pi j)^\kappa}\bigl(1 - w_j^-\bigr) = (-1)^{\frac{\kappa}{2}+1}\,\frac{h^\kappa}{\kappa!}\,\beta_\kappa + c + O(T^{-1}), \tag{13}$$
where $c$ is a constant satisfying $|c| < \varepsilon$.

Using the result of this theorem we can deduce an estimator of the unknown parameter $A_\kappa$.

Definition. Let $J$ be the index defined by (12). Then the estimator of the parameter $A_\kappa$ is of the form
$$\widehat{A}_\kappa = \frac{4\pi}{T}\sum_{j=1}^{J-1}(2\pi j)^{2\kappa}\left(I_{Y_j} - \frac{\hat{\sigma}^2}{2\pi}\right).$$
So we can estimate the error function (1) by
$$\widehat{R}_T(h) = \frac{\hat{\sigma}^2 V(K)}{Th} + \frac{h^{2\kappa}}{(\kappa!)^2}\,\beta_\kappa^2\,\widehat{A}_\kappa \tag{14}$$
and its minimum by
$$\hat{h}_{opt} = \left(\frac{\hat{\sigma}^2 V(K)(\kappa!)^2}{2\kappa T \beta_\kappa^2 \widehat{A}_\kappa}\right)^{\frac{1}{2\kappa+1}}. \tag{15}$$
The parameter $\hat{h}_{opt}$ given by (15) is the estimator of the theoretical optimal bandwidth $h_{opt}$ obtained by the plug-in method. We would like to point out the computational aspect of the plug-in method: it has preferable properties to classical methods, because no numerical minimization of an error function is needed. Also, the sample size necessary to compute the estimate is far smaller than for classical methods. On the other hand, a small disadvantage is the fact that we need some "starting" approximation of the unknown parameter $h$.

Table 1: Kernels of class $S_{0\kappa}$.

κ = 2:  $-\frac{3}{4}(x^2 - 1)$
κ = 4:  $\frac{15}{32}(x^2 - 1)(7x^2 - 3)$
κ = 6:  $-\frac{105}{256}(x^2 - 1)(33x^4 - 30x^2 + 5)$

Table 2: Summary of sample means and standard deviations of the bandwidth estimates.

          κ = 2 (h_opt = 0.1374)   κ = 4 (h_opt = 0.3521)   κ = 6 (h_opt = 0.5783)
          E(ĥ_opt)  std(ĥ_opt)     E(ĥ_opt)  std(ĥ_opt)     E(ĥ_opt)  std(ĥ_opt)
Rice      0.1269    0.0402         0.3354    0.0938         0.4432    0.1078
Plug-in   0.1383    0.0074         0.3422    0.0348         0.5604    0.0623
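The plug-in estimate (15) can be assembled directly from the periodogram. The following sketch is our code for $\kappa = 2$ only; the FFT index conventions and the $c = 2$ threshold $c\,\hat\sigma^2/2\pi$ are our assumptions, and $V(K) = 3/5$, $\beta_2 = 1/5$ for the kernel of order 2 from Table 1:

```python
import math

def plugin_bandwidth(Y, kappa=2, eps=1e-3):
    """Plug-in estimate (15) of h_opt under the cyclic design (kappa = 2)."""
    Y = np.asarray(Y, dtype=float)
    T = len(Y)
    sigma2 = np.sum(np.diff(Y) ** 2) / (2.0 * T - 2.0)     # estimate (4)
    I_Y = np.abs(np.fft.fft(Y)) ** 2 / (2.0 * np.pi * T)   # periodogram
    N = (T - 1) // 2
    below = np.nonzero(I_Y[1:N + 1] < sigma2 / np.pi)[0]   # c = 2 threshold
    J1 = int(below[0]) + 1 if below.size else N
    h0 = kappa / T                                         # starting value
    J2 = int((eps * math.factorial(kappa + 1)) ** (1.0 / (kappa + 1))
             / (2.0 * np.pi * h0))
    J = min(J1, J2 + 1)                                    # definition (12)
    j = np.arange(1, J)
    # estimator of A_kappa; a sketch: assumes the low-frequency signal
    # dominates so that A_hat > 0
    A_hat = 4.0 * np.pi / T * np.sum(
        (2.0 * np.pi * j) ** (2 * kappa) * (I_Y[j] - sigma2 / (2.0 * np.pi)))
    V_K, beta = 3.0 / 5.0, 1.0 / 5.0
    num = sigma2 * V_K * math.factorial(kappa) ** 2
    den = 2.0 * kappa * T * beta ** 2 * A_hat
    return (num / den) ** (1.0 / (2 * kappa + 1))
```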
5 A simulation study

We carried out a small simulation study to compare the performance of the bandwidth estimates. The observations $Y_t$, for $t = 0, \ldots, T-1 = 74$, were obtained by adding independent Gaussian random variables with mean zero and variance $\sigma^2 = 0.2$ to the function $m(x) = \sin(2\pi x)$. Table 1 describes the kernels used in our simulation study, and the theoretical optimal bandwidths (see Wand and Jones, 1995; Koláček, 2005) for these cases are given in Table 2. Two hundred series were generated. Table 2 summarizes the sample means and the sample standard deviations of the bandwidth estimates; $E(\hat{h})$ is the average of all 200 values and $\mathrm{std}(\hat{h})$ is their standard deviation. Figure 3 shows the histogram of the results of all 200 experiments for $\kappa = 2$. As we can see, the standard deviation of the results obtained by the plug-in method is smaller than in the case of Rice's selector, and the mean of these results is also closer to the theoretical optimal bandwidth.

Fig. 3: Histogram of the results of all 200 experiments obtained by Rice's selector (grey) and by the plug-in method (black).

6 Examples

In this section we solve some practical examples. We used data from Eurostat (see http://epp.eurostat.cec.eu.int) and followed the count of marriages in Austria and Switzerland in May in 1950-2003. We transformed the data to the interval $[0,1]$ and used two selectors to get the optimal bandwidth. First, we found the optimal bandwidth by Rice's selector $\widehat{R}_T(h)$, which is the classical bandwidth selector. Then we used our proposed selector $\widetilde{R}_T(h)$. We made estimates of the regression function with both bandwidths using the kernel of order (0, 4)
$$K(x) = \begin{cases} \frac{15}{16}\bigl(\frac{7}{2}x^4 - 5x^2 + \frac{3}{2}\bigr), & |x| \le 1 \\ 0, & |x| > 1, \end{cases}$$
and used the Nadaraya-Watson estimator to obtain the final result.

Marriages in Switzerland. In this example we followed the count of marriages in Switzerland in May in 1950-2003. In this case, the bandwidth obtained by Rice's selector is too small and the final curve is undersmoothed (Figures 4, 5).

Fig. 4: Estimate of the regression function (solid line); the parameter h = 0.0740 was found by Rice's selector $\widehat{R}_T(h)$.

Fig. 5: Estimate of the regression function (solid line); the parameter h = 0.2180 was found by the plug-in method.

Marriages in Austria. In this example we followed the count of marriages in Austria in May in 1950-2003. In this case, we think that the value of the bandwidth obtained by Rice's selector is too large and the final curve is oversmoothed (Figures 6, 7). If we compare the results of both examples, we can see that the plug-in method is more stable than the classical one.

Fig. 6: Estimate of the regression function (solid line); the parameter h = 0.4084 was found by Rice's selector $\widehat{R}_T(h)$.
Fig. 7: Estimate of the regression function (solid line); the parameter h = 0.2945 was found by the plug-in method.

7 Conclusion

The problem of bandwidth selection for non-parametric kernel regression is considered. In many studies it has been observed that classical methods give smaller bandwidths more frequently than predicted by the asymptotic theorems. Chiu (1990) provided an explanation of the cause and suggested a procedure to overcome the difficulty. By applying this procedure, we introduced a new approach to estimating the unknown parameters of the average mean squared error (AMSE) function (this process is known as a plug-in method). Let us remark that Chiu's procedure was proposed for the Priestley-Chao estimator and for a special class of symmetric probability density functions from $S_{02}$ as kernels. We followed the Nadaraya-Watson and local linear estimators in particular and extended the procedure to these estimators; it was shown that they are identical in the circular model (see Koláček, 2005). In this paper, this approach has been generalized to kernels from the class $S_{0\kappa}$, $\kappa$ even. The main result of this work is Theorem 2 and the resulting definition, where the unknown parameter $A_\kappa$ is estimated. The simulation study and practical examples suggest that our proposed method could have preferable properties to the classical one. We remark that the proposed method is developed for a rather limited case: circular design and equally spaced design points. Further research is required for more general situations.

8 Appendix

Lemma 1. Let $J_1$ be the least index such that $I_{Y_{J_1}} < \hat{\sigma}^2/\pi T$. Then
$$\widetilde{R}_T(h) = \frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}(w_t^-)^2 + \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2.$$
Proof.
$$\widetilde{R}_T(h) = \frac{4\pi}{T}\sum_{t=1}^{N}\tilde{I}_{Y_t}\{1 - w_t^-\}^2 - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0 = \frac{4\pi}{T}\sum_{t=1}^{J_1-1} I_{Y_t}\{1 - w_t^-\}^2 + \frac{4\pi}{T}\sum_{t=J_1}^{N}\frac{\hat{\sigma}^2}{2\pi}\{1 - w_t^-\}^2 - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0$$
$$= \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2 + \frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}\{1 - w_t^-\}^2 - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0$$
$$= \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2 + \frac{\hat{\sigma}^2}{T}\left[T - 2Tw_0 + \sum_{t=0}^{T-1}(w_t^-)^2\right] - \hat{\sigma}^2 + 2\hat{\sigma}^2 w_0$$
$$= \frac{4\pi}{T}\sum_{t=1}^{J_1-1}\left(I_{Y_t} - \frac{\hat{\sigma}^2}{2\pi}\right)\{1 - w_t^-\}^2 + \frac{\hat{\sigma}^2}{T}\sum_{t=0}^{T-1}(w_t^-)^2. \qquad \Box$$

Lemma 2. Let $t \in \{0, \ldots, T-1\}$; then $W_0(x_t) = \frac{1}{T}K_h(x_t) + O(T^{-2})$.

Proof. $W_0(x_t) = \frac{1}{TC_T}K_h(x_t)$, where $C_T = \frac{1}{T}\sum_{k=-T+1}^{T-1}K_h(x_k)$. We can express this constant in another way:
$$C_T = \int_{-1}^{1} K(x)\,dx + O(T^{-1}) = 1 + O(T^{-1}),$$
and after substitution we arrive at the result
$$W_0(x_t) = \frac{1}{T\bigl(1 + O(T^{-1})\bigr)}K_h(x_t) = \frac{1}{T}K_h(x_t) + O(T^{-2}). \qquad \Box$$

Theorem 1. Let $\boldsymbol{w}^-$ be the discrete Fourier transform of the vector $\boldsymbol{w}$. Then
$$\sum_{t=0}^{T-1}(w_t^-)^2 = \frac{1}{h}V(K) + O(T^{-1}).$$
Proof.
$$\sum_{t=0}^{T-1}(w_t^-)^2 = \sum_{t=0}^{T-1}|w_t^-|^2 = \sum_{t=0}^{T-1} w_t^-\,\overline{w_t^-} = \sum_{t=0}^{T-1}\sum_{j=-T+1}^{T-1}\sum_{k=-T+1}^{T-1} W_0(x_j)W_0(x_k)\,e^{\frac{i2\pi(k-j)t}{T}}$$
$$= \sum_{j=-T+1}^{T-1}\sum_{k=-T+1}^{T-1} W_0(x_j)W_0(x_k)\sum_{t=0}^{T-1}e^{\frac{i2\pi(k-j)t}{T}} = T\sum_{k=-T+1}^{T-1} W_0^2(x_k)$$
$$= \sum_{k=-T+1}^{T-1}\frac{1}{T}K_h^2(x_k) + O(T^{-1}) = \int_{-1}^{1}K_h^2(u)\,du + O(T^{-1}) = \frac{1}{h}\int_{-1}^{1}K^2(x)\,dx + O(T^{-1}). \qquad \Box$$

Theorem 2. Let $J$ be the index defined by (12). Then for all $j \in \mathbb{N}$, $1 \le j \le J-1$, it holds that
$$\frac{1}{(2\pi j)^\kappa}\bigl(1 - w_j^-\bigr) = (-1)^{\frac{\kappa}{2}+1}\frac{h^\kappa}{\kappa!}\beta_\kappa + c + O(T^{-1}), \tag{16}$$
where $c$ is a constant satisfying $|c| < \varepsilon$.

Proof.
$$\frac{1}{(2\pi j)^\kappa}\bigl(1 - w_j^-\bigr) = \frac{1}{(2\pi j)^\kappa}\left[1 - 2\sum_{t=0}^{T-1} W_0(x_t)\cos\frac{2\pi t j}{T}\right] = \frac{1}{(2\pi j)^\kappa}\left[1 - 2\sum_{t=0}^{T-1}\frac{1}{T}K_h(x_t)\cos\frac{2\pi t j}{T}\right] + O(T^{-1})$$
$$= \frac{1}{(2\pi j)^\kappa}\left[1 - 2\int_0^1 K_h(u)\cos(2\pi j u)\,du\right] + O(T^{-1}) = \frac{1}{(2\pi j)^\kappa}\left[\int_{-1}^{1}K_h(u)\,du - \int_{-1}^{1}K_h(u)\cos(2\pi j u)\,du\right] + O(T^{-1})$$
$$= \frac{1}{(2\pi j)^\kappa}\int_{-1}^{1}\bigl\{1 - \cos(2\pi j u)\bigr\}K_h(u)\,du + O(T^{-1}).$$
We can replace the function $1 - \cos(2\pi j u)$ by its Taylor polynomial of degree $\kappa$.
Let $R_\kappa$ be the error of this approximation. Then
$$\frac{1}{(2\pi j)^\kappa}\bigl(1 - w_j^-\bigr) = \frac{1}{(2\pi j)^\kappa}\int_{-1}^{1}\left[\frac{(2\pi j u)^2}{2} - \frac{(2\pi j u)^4}{24} + \cdots + (-1)^{\frac{\kappa}{2}+1}\frac{(2\pi j u)^\kappa}{\kappa!}\right]K_h(u)\,du + \frac{R_\kappa}{(2\pi j)^\kappa} + O(T^{-1})$$
$$= \frac{(-1)^{\frac{\kappa}{2}+1}}{\kappa!}\int_{-1}^{1} u^\kappa K_h(u)\,du + \frac{R_\kappa}{(2\pi j)^\kappa} + O(T^{-1}) = (-1)^{\frac{\kappa}{2}+1}\frac{h^\kappa}{\kappa!}\int_{-1}^{1} x^\kappa K(x)\,dx + \frac{R_\kappa}{(2\pi j)^\kappa} + O(T^{-1}).$$
The last two terms are negligible, because $O(T^{-1})$ tends to zero as $T \to \infty$, and from the assumptions on the index $j$ it holds that
$$\left|\frac{R_\kappa}{(2\pi j)^\kappa}\right| \le \frac{\varepsilon}{(2\pi)^\kappa}$$
for any $\varepsilon > 0$. $\Box$

References

Chiu, S. T. (1990). Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika, 77, 222-226.
Chiu, S. T. (1991). Some stabilized bandwidth selectors for nonparametric regression. Annals of Statistics, 19, 1528-1546.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatter plots. Journal of the American Statistical Association, 74, 829-836.
Craven, P., and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377-403.
Droge, B. (1996). Some comments on cross-validation. Statistical Theory and Computational Aspects of Smoothing, 178-199.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Härdle, W., Hall, P., and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83, 86-95.
Koláček, J. (2005). Kernel Estimation of the Regression Function. PhD thesis, Brno.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications, 10, 186-190.
Rice, J. (1984). Bandwidth choice for nonparametric regression. Annals of Statistics, 12, 1215-1230.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47, 1-52.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5, 595-645.
Wand, M. P., and Jones, M. C. (1995). Kernel Smoothing. London: Chapman & Hall.
Watson, G. S. (1964). Smooth regression analysis. Sankhya, Series A, 26, 359-372.

AUSTRIAN JOURNAL OF STATISTICS, Volume 35 (2006), Number 2&3, 281-288

A Comparative Study of Boundary Effects for Kernel Smoothing

Jan Koláček and Jitka Poměnková
Masaryk University, Brno, Czech Republic

Abstract: The problem of boundary effects for nonparametric kernel regression is considered. We follow in particular the problem of bandwidth selection for the Gasser-Müller estimator. Two ways of avoiding the difficulties caused by boundary effects are considered in this work. The first one is to assume a circular design; this idea is effective mainly for smooth periodic regression functions. The second method presented is the reflection method for kernels of the second order; the reflection method has an influence on the estimate beyond the edge points. The method of penalizing functions is used as a bandwidth selector. This work compares both techniques in a simulation study.

Keywords: Bandwidth Selection, Kernel Estimation, Nonparametric Regression.

1 Basic Terms and Definitions

Consider a standard regression model of the form
$$Y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \ n \in \mathbb{N},$$
where $m$ is an unknown regression function, $x_i$ are design points, $Y_i$ are measurements and $\varepsilon_i$ are independent random variables for which $E(\varepsilon_i) = 0$, $\mathrm{var}(\varepsilon_i) = \sigma^2 > 0$, $i = 1, \ldots, n$. (This work was supported by the GACR: 402/04/1308.) The aim of kernel smoothing is to find a suitable approximation $\widehat{m}$ of the unknown function $m$. In what follows we assume the design points $x_i$ are equidistantly distributed on the interval $[0,1]$, that is, $x_i = (i-1)/n$, $i = 1, \ldots, n$. $\mathrm{Lip}[a,b]$ denotes the class of continuous functions satisfying the inequality $|g(x) - g(y)| \le L|x - y|$ for all $x, y \in [a,b]$, where $L > 0$ is a constant.

Definition. Let $\kappa$ be a nonnegative even integer, $\kappa \ge 2$.
A function $K \in \mathrm{Lip}[-1,1]$ with $\mathrm{support}(K) = [-1,1]$, satisfying the conditions

1. $K(-1) = K(1) = 0$,
2. $\int_{-1}^{1} x^j K(x)\,dx = \begin{cases} 0, & 0 < j < \kappa \\ 1, & j = 0 \\ \beta_\kappa \ne 0, & j = \kappa, \end{cases}$

is called a kernel of order $\kappa$, and the class of all these kernels is denoted $S_{0\kappa}$. These kernels are used for the estimation of the regression function (see Wand and Jones, 1995). Let $K \in S_{0\kappa}$ and set $K_h(\cdot) = \frac{1}{h}K(\frac{\cdot}{h})$, $h \in (0,1)$. The parameter $h$ is called a bandwidth.

2 Kernel Estimation of the Regression Function

A commonly used non-parametric method for estimating $m(x)$ is the Gasser-Müller estimator (1979)
$$\widehat{m}_{GM}(x;h) = \sum_{i=1}^{n} Y_i\int_{s_{i-1}}^{s_i} K_h(t - x)\,dt, \qquad s_i = \frac{x_i + x_{i+1}}{2}, \ i = 1, \ldots, n-1, \quad s_0 = 0, \ s_n = 1.$$
The kernel estimators can generally be expressed as
$$\widehat{m}(x;h) = \sum_{i=1}^{n} W_i(x)\,Y_i,$$
where the weights $W_i(x)$ correspond to the weights of the estimator $\widehat{m}_{GM}$. The quality of the estimated curve is affected by the smoothing parameter $h$, the bandwidth. The optimal bandwidth considered here is $h_{opt}$, the minimizer of the average mean squared error (AMSE)
$$R_n(h) = \frac{1}{n}\,E\sum_{i=1}^{n}\bigl(m(x_i) - \widehat{m}(x_i;h)\bigr)^2.$$
Let $K \in S_{0\kappa}$. There exist many estimators of this error function which are asymptotically equivalent and asymptotically unbiased (see Chiu, 1991, 1990; Härdle, 1990). Most of them are based on the residual sum of squares (RSS)
$$\mathrm{RSS}_n(h) = \frac{1}{n}\sum_{i=1}^{n}\bigl[Y_i - \widehat{m}(x_i;h)\bigr]^2.$$
We will use the method of penalizing functions (see Koláček, 2005, 2002) for choosing the smoothing parameter. The prediction error $\mathrm{RSS}_n(h)$ is thus adjusted by a penalizing function $\Xi(n^{-1}W_i(x_i))$, that is, modified to
$$\widehat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n}\bigl[\widehat{m}(x_i;h) - Y_i\bigr]^2\cdot\Xi\bigl(n^{-1}W_i(x_i)\bigr).$$
The reason for this adjustment is that the correction function $\Xi(n^{-1}W_i(x_i))$ penalizes values of $h$ that are too low. For example, Rice (see Rice, 1984) considered
$$\Xi_R(u) = \frac{1}{1 - 2u}.$$
This penalizing function will be used.

Figure 1: Demonstration of boundary effects; the effective window $[x-h, x+h]$ around a point $x$ near the edge.
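Before turning to the boundary problem, the Gasser-Müller estimator itself can be evaluated through the antiderivative of the kernel. The following sketch is our code (Epanechnikov kernel of order 2, no boundary treatment, which is exactly the situation in which the boundary effects below arise):

```python
import numpy as np

def gasser_muller(x_eval, x, Y, h):
    """Gasser-Mueller estimate: each Y_i is weighted by the integral of
    K_h over its cell (s_{i-1}, s_i]."""
    x, Y = np.asarray(x, dtype=float), np.asarray(Y, dtype=float)
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    s = np.concatenate([[0.0], (x[:-1] + x[1:]) / 2.0, [1.0]])  # cell edges
    out = np.empty(len(x_eval))
    for j, xe in enumerate(x_eval):
        u = np.clip((s - xe) / h, -1.0, 1.0)
        # antiderivative of the Epanechnikov kernel 3/4 (1 - t^2) on [-1, 1]
        antider = 0.75 * (u - u**3 / 3.0 + 2.0 / 3.0)
        out[j] = np.sum(Y * np.diff(antider))
    return out
```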
3 Boundary Effects

In the finite sample situation, the quality of the estimate in the boundary region $[0,h] \cup [1-h,1]$ is affected, since the effective window $[x-h, x+h]$ is no longer contained in $[0,1]$, so that the finite equivalent of the moment conditions on the kernel function does not apply any more. There are several methods to avoid the difficulties caused by boundary effects.

3.1 Cyclic Model

One possible way to solve the problem of boundary effects is to use a cyclic design. That is, suppose $m(x)$ is a smooth periodic function and the estimate is obtained by applying the kernel to the extended series $Y_i$, where $Y_{i+kn} = Y_i$ for $k \in \mathbb{Z}$; similarly $x_i = (i-1)/n$, $i \in \mathbb{Z}$. In the cyclic design, the kernel estimators can generally be expressed as
$$\widehat{m}(x;h) = \sum_{i=-n+1}^{2n} W_i(x)\,Y_i,$$
where the weights $W_i(x)$ correspond to the weights of the estimator $\widehat{m}_{GM}$,
$$W_i(x) = \int_{s_{i-1}}^{s_i} K_h(t - x)\,dt, \qquad s_i = \frac{x_i + x_{i+1}}{2}, \ i = -n+1, \ldots, 2n-1, \quad s_{-n} = -1, \ s_{2n} = 2.$$
Let us define a vector $\boldsymbol{w} := (w_1, \ldots, w_n)$, where $w_i = W_1(x_i - 1) + W_1(x_i) + W_1(x_i + 1)$. Let $h \in (0,1)$, $K \in S_{0\kappa}$, $i \in \{1, \ldots, n\}$. Then we can write $\widehat{m}(x_i;h)$ as a discrete cyclic convolution of the vectors $\boldsymbol{w}$ and $\boldsymbol{Y}$:
$$\widehat{m}(x_i;h) = \sum_{k=1}^{n} w_{\langle i-k\rangle_n}\,Y_k, \tag{1}$$
where $\langle i-k\rangle_n$ denotes $(i-k) \bmod n$. We write $\widehat{\boldsymbol{m}} = \boldsymbol{w} \circledast \boldsymbol{Y}$, where $\widehat{\boldsymbol{m}} = (\widehat{m}(x_1;h), \ldots, \widehat{m}(x_n;h))$.

As the bandwidth selector, the method of Rice's penalizing function will be used. In the case of the cyclic model we can simplify the error function $\widehat{R}_n(h)$, because the weights $W_i(x_i)$ are independent of $i$. Set
$$I(h) := \int_{-1/2n}^{1/2n} K_h(x)\,dx.$$
Then we can express $\widehat{R}_n(h)$ as
$$\widehat{R}_n(h) = \frac{n}{n - 2I(h)}\,\mathrm{RSS}_n(h) \tag{2}$$
and the estimate $\hat{h}_{opt}$ of the optimal bandwidth is defined as $\hat{h}_{opt} = \arg\min_{h\in(0,1)}\widehat{R}_n(h)$.

3.2 Reflection Technique

Consider observations $(x_i, Y_i)$, $i = 1, \ldots, n$, the regression model described in Section 1, and design points $x_i \in [0,1]$ such that $0 = a \le x_1 \le \cdots \le x_n \le b = 1$. Now the technique of design point reflection will be discussed. We begin by estimating the function $m$ at the edge points $a$ and $b$, with corresponding bandwidths $h_a$ and $h_b$ and edge kernels $K_L, K_R \in S_{02}$:
$$\widehat{m}(a) = \frac{1}{h_a}\sum_{i=1}^{n} Y_i\int_{s_{i-1}}^{s_i} K_L\!\left(\frac{a-u}{h_a}\right)du, \qquad \widehat{m}(b) = \frac{1}{h_b}\sum_{i=1}^{n} Y_i\int_{s_{i-1}}^{s_i} K_R\!\left(\frac{b-u}{h_b}\right)du.$$
For the choice of the bandwidths $h_a$, $h_b$ and the edge kernels $K_L$, $K_R$ for $\widehat{m}(a)$, $\widehat{m}(b)$, see Poměnková (2005). Next, data reflection is made. We proceed from the original data set $(x_i, Y_i)$, $i = 1, \ldots, n$. To obtain the left mirrors, the point $(a, \widehat{m}(a))$ and the relations
$$x_{Li} = 2a - x_i, \qquad Y_{Li} = 2\widehat{m}(a) - Y_i$$
are used. To obtain the right mirrors, the point $(b, \widehat{m}(b))$ and the relations
$$x_{Ri} = 2b - x_{n-i+1}, \qquad Y_{Ri} = 2\widehat{m}(b) - Y_{n-i+1}$$
are used. The original data set $(x_i, Y_i)$ is then connected with the left mirrors $(x_{Li}, Y_{Li})$ and with the right mirrors $(x_{Ri}, Y_{Ri})$. By this connection a new data set is obtained, called pseudodata and denoted $(\bar{x}_j, \bar{Y}_j)$, $j = 1, \ldots, 3n$. How to find the bandwidth for an estimate on the pseudodata at the design points is described below. Finally, the function $m$ at the design points, including the points $a$ and $b$, is estimated using the pseudodata. Let $K \in S_{02}$ be a symmetric second-order kernel with support $[-1,1]$. The final estimate of the function $m$ at the points $x_i$, $i = 0, \ldots, n+1$, where $x_0 = a$, $x_{n+1} = b$, on the pseudodata $\bar{x}_j$, $j = 1, \ldots, 3n$, with kernel $K$ and bandwidth $h$, is defined as
$$\widehat{\bar{m}}(x) = \frac{1}{h}\sum_{j=1}^{3n}\bar{Y}_j\int_{\bar{s}_{j-1}}^{\bar{s}_j} K\!\left(\frac{x-u}{h}\right)du, \qquad \bar{s}_j = \frac{\bar{x}_j + \bar{x}_{j+1}}{2}, \ j = 1, \ldots, 3n-1, \quad \bar{s}_0 = -1, \ \bar{s}_{3n} = 2.$$

Bandwidth selection for pseudodata. In this part, an estimate of the bandwidth for the pseudodata is sought. Note that the estimates at the edge points, $\widehat{m}(a)$ and $\widehat{m}(b)$, are functions of $h$. Therefore, for any chosen value $h \in H = [1/n, 2]$, the values $\widehat{m}(a)$, $\widehat{m}(b)$ have to be evaluated, then the data reflection is made and the pseudodata are obtained. Thereafter, the minimum of the error function on these pseudodata is sought. We propose to find the value of $h$ using Rice's penalizing function. Consider the pseudodata $(\bar{x}_j, \bar{Y}_j)$, $j = 1, \ldots, 3n$, $\bar{x}_j \in [-1,2]$, and $\widehat{\bar{m}}(x)$ defined as above. Then
$$\widehat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n}\bigl[\widehat{\bar{m}}(x_i;h) - Y_i\bigr]^2\cdot\Xi_R\bigl(n^{-1}\bar{W}_i(x_i)\bigr),$$
where $\bar{W}_i(x_i)$ denote the corresponding weights on the pseudodata. The resulting bandwidth $\hat{h} = \hat{h}_{opt}$ is the value of $h$ that corresponds to the minimum of the function $\widehat{R}_n(h)$, i.e.,
$$\hat{h}_{opt} = \arg\min_{h\in H}\widehat{R}_n(h). \tag{3}$$
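The pseudodata construction itself takes only a few lines. The following sketch is our code; the edge estimates m_a = m̂(a) and m_b = m̂(b) are assumed to have been computed beforehand with the edge kernels $K_L$, $K_R$:

```python
def reflect_pseudodata(x, Y, m_a, m_b, a=0.0, b=1.0):
    """Build the 3n pseudodata: original data plus left and right mirrors
    through the edge estimates (a, m_a) and (b, m_b)."""
    x, Y = np.asarray(x, dtype=float), np.asarray(Y, dtype=float)
    xL, YL = 2.0 * a - x, 2.0 * m_a - Y               # left mirrors
    xR, YR = 2.0 * b - x[::-1], 2.0 * m_b - Y[::-1]   # right mirrors
    xs = np.concatenate([xL, x, xR])
    Ys = np.concatenate([YL, Y, YR])
    order = np.argsort(xs)                            # sort over [-1, 2]
    return xs[order], Ys[order]
```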
4 A Simulation Study

We carried out a small simulation study to compare the performance of the bandwidth estimates. The observations $Y_i$, for $i = 1, \ldots, n = 75$, were obtained by adding independent Gaussian random variables with mean zero and variance $\sigma^2 = 0.2$ to the function
$$m(x) = \cos(9x - 7) - (3 + x^{12})/6 + 8x - 1.$$
We made estimates of the regression function using the kernel of order 2
$$K(x) = \begin{cases} -\frac{3}{4}(x^2 - 1), & |x| \le 1 \\ 0, & |x| > 1. \end{cases}$$
In this case, $\hat{h} = 0.0367$ was selected using an estimate without any elimination of boundary effects (Figure 2). Second, $\hat{h} = 0.0867$ was selected using the method of the cyclic model (Figure 3), and third, $\hat{h} = 0.2036$ was selected using the reflection method (Figure 4). From the figures it can be seen that both the cyclic model and the reflection method are very useful for removing the problems caused by boundary effects.

Figure 2: Graph of the smoothed function with bandwidth h = 0.0367: the real regression function m and an estimate of m.
Figure 3: Graph of the smoothed function with bandwidth h = 0.0867: the real regression function m and an estimate of m.
Figure 4: Graph of the smoothed function with bandwidth h = 0.2036: the real regression function m and an estimate of m.

5 A Practical Example

We carried out a short real application to compare the performance of the bandwidth estimates. The observations $Y_i$, for $i = 1, \ldots, n = 230$, were average spring temperatures measured in Prague between 1771 and 2000. The data were obtained from the Department of Geography, Masaryk University. We made estimates of the regression function using the kernel of order 2
$$K(x) = \begin{cases} -\frac{3}{4}(x^2 - 1), & |x| \le 1 \\ 0, & |x| > 1. \end{cases}$$
In this case, $\hat{h} = 0.0671$ was selected using an estimate without any elimination of boundary effects (Figure 5). Second, $\hat{h} = 0.0671$ was selected using the method of the cyclic model (Figure 6), and third, $\hat{h} = 0.2211$ was selected using the reflection method (Figure 7). These figures show that both the cyclic model and the reflection method are very useful for removing the problems caused by boundary effects.

Figure 5: Graph of the smoothed function with bandwidth h = 0.0671: an estimate of m.
Figure 6: Graph of the smoothed function with bandwidth h = 0.0671: an estimate of m.
Figure 7: Graph of the smoothed function with bandwidth h = 0.2211: an estimate of m.

References

Chiu, S. (1990). Why bandwidth selectors tend to choose smaller bandwidths, and a remedy. Biometrika, 77, 222-226.
Chiu, S. (1991). Some stabilized bandwidth selectors for nonparametric regression. Annals of Statistics, 19, 1528-1546.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Koláček, J. (2002). Kernel estimation of the regression function - bandwidth selection. Summer School DATASTAT'01 Proceedings FOLIA, 1, 129-138.
Koláček, J. (2005). Kernel Estimators of the Regression Function. Brno: PhD thesis.
Poměnková, J. (2005). Some Aspects of Regression Function Smoothing (in Czech). Ostrava: PhD thesis.
Rice, J. (1984). Bandwidth choice for nonparametric regression. The Annals of Statistics, 12, 1215-1230.
Wand, M., and Jones, M. (1995). Kernel Smoothing. London: Chapman & Hall.

Authors' address:

Jan Koláček, Jitka Poměnková
Masaryk University in Brno
Department of Applied Mathematics
Janáčkovo náměstí 2a
CZ-602 00 Brno
Czech Republic
E-mail: kolacek@math.muni.cz