Journal of Nonparametric Statistics, 2013
Vol. 25, No. 4, 829–853, http://dx.doi.org/10.1080/10485252.2013.810742

Large sample results for varying kernel regression estimates

Hira L. Koul^a and Weixing Song^b*

^a Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA; ^b Department of Statistics, Kansas State University, Manhattan, KS, USA

(Received 16 December 2012; accepted 23 May 2013)

The varying kernel density estimates are particularly designed for positive random variables. Unlike the commonly used symmetric kernel density estimates, the varying kernel density estimates do not suffer from the boundary problem. This paper establishes asymptotic normality and uniform almost sure convergence results for a varying kernel density estimate when the underlying random variable is positive. Similar results are also obtained for a varying kernel nonparametric estimate of the regression function when the covariate is positive. Pros and cons of the varying kernel regression estimate are also discussed via a simulation study.

Keywords: varying kernel regression; inverse gamma distribution; almost sure convergence; central limit theorem

AMS Subject Classifications: 62G08; 62G20

*Corresponding author. Email: weixing@ksu.edu
© American Statistical Association and Taylor & Francis 2013

1. Introduction

In this paper, we investigate consistency and asymptotic normality of a varying kernel density estimator when the density function is supported on (0, ∞). We also propose a varying kernel regression function estimator when a covariate in the underlying regression model is non-negative and investigate its similar asymptotic properties.

The problem of estimating the density function of a random variable X taking values on the real line has been of long-lasting research interest among statisticians, and numerous interesting and fundamental results have been obtained. The kernel density estimation method is no doubt the most popular among all the proposed nonparametric procedures. In the commonly used kernel estimation setup, the kernel function K is often chosen to be a density function symmetric around 0 satisfying some moment conditions. With a random sample X_1, X_2, ..., X_n of X, and a bandwidth h depending on n, the kernel estimate of the density function f of X is

f̂(x) = (nh)^{-1} Σ_{i=1}^{n} K((x − X_i)/h).

The contribution from each sample point X_i to f̂(x) is mainly controlled by how far X_i is from x on the h-scale. Moreover, because of the symmetry of K, the sample points on either side of x with the same distance from x make the same contribution to the estimate. Consequently, a symmetric kernel assigns positive weights outside the density support set near the boundaries, which is also the very reason why the commonly used symmetric kernel density estimates have the unpleasant boundary problem. This boundary problem is also present in the Nadaraya–Watson (N–W) estimators of a nonparametric regression function. Numerous ways have been proposed to remove the boundary effect. In the context of density estimation, see Schuster (1985), Marron and Ruppert (1994), Jones (1993), Fan and Gijbels (1992), Cowling and Hall (1996), etc.; for nonparametric regression, see Gasser and Müller (1979), Müller (1991), Müller and Wang (1994), John (1984), and references therein.
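To make the boundary problem concrete, the following small sketch (our own illustration, not part of the paper; the exponential target density and the normal-reference bandwidth are arbitrary choices) estimates a density supported on (0, ∞) with a symmetric normal kernel and shows the systematic downward bias near x = 0, where part of every kernel's mass falls on the negative half-line.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=2000)     # true density f(x) = exp(-x) on (0, inf)
h = 1.06 * X.std() * len(X) ** (-1 / 5)       # normal-reference bandwidth (illustrative)

def normal_kde(x, data, h):
    """Symmetric (standard normal) kernel density estimate."""
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.array([0.01, 0.05, 0.1, 0.5, 1.0])
print(np.c_[grid, normal_kde(grid, X, h), np.exp(-grid)])
# Near x = 0 the estimate drops towards roughly f(0)/2: part of each kernel's
# mass has leaked to (-inf, 0), which is the boundary problem described above.
```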
The research on estimating the density functions not supported on the entire real line using asymmetric density kernels started from late 1990s. When density has a compact support, motivated by the Bernstein polynomial approximation theorem in mathematical function analysis, Chen (1999) proposed Beta kernel density estimators and analysed bias and variance of these estimators. By reversing the role of estimation point and data point in Chen’s (1999) estimation procedure, and using the Gaussian copula kernel, Jones and Henderson (2007) proposed two density estimators. When density functions are supported on (0, ∞), Chen (2000b) constructed a Gamma kernel density estimate and Scaillet (2004) proposed an inverse Gaussian kernel and a reciprocal inverse Gaussian kernel density estimate. Chaubey, Sen and Sen (2012) also proposed a density estimator for non-negative random variables via smoothing of the empirical distribution function using a generalisation of Hille’s lemma. A varying kernel density estimate, which is an asymmetric kernel density estimate and based on a modification of Chen’s Gamma kernel density estimate, was recently proposed by Mnatsakanov and Sarkisian (2012) (M–S). Compared to the traditional symmetric kernel estimation procedures, there are two unique features about all of the above asymmetric kernel methods: (1) the smoothness of the density estimate is controlled by the shape or scale parameter of the asymmetric kernel and the location where the estimation is made; and (2) the asymmetric kernels have the same support as the density functions to be estimated, thus the kernels do not allocate any weight outside the support. As a consequence, all of the above asymmetric kernel density estimators can effectively reduce the boundary bias and they all achieve the optimal rate of convergence for the mean integrated squared error. Some asymmetric density estimators are bona fide densities, such as the ones proposed by Jones and Henderson (2007). Most of them are not, but they become one after a slight modification, for example, the M–S varying kernel density estimate. In principle, the commonly used symmetric kernel density estimate, in which the kernel is supported over a symmetric interval around 0, can still be used for estimating the density function of a random variable with some restricted range, but the resulting estimate itself may not be a density function any more. For example, using standard normal kernel to estimate the density function of a positive random variable, the resulting kernel density estimate does not integrate to 1 over (0, ∞). Most of the research on asymmetric kernel estimation methodology has been focused on the density estimation, and the asymptotic theories are limited to the bias, variance or mean square error (MSE) derivations. To the best of our knowledge, literature is scant on the investigation of the consistency of asymmetric kernel density estimators except for Bouezmarni and Rolin (2003) and Chaubey et al. (2012). Nothing is available on their asymptotic distributions. The situation in the context of nonparametric regression is also surprising. Using Beta or Gamma kernel function, Chen (2000a, 2002) proposed the local linear estimators for regression function and derived their asymptotic bias and variance, but did not analyse their asymptotic distributions. 
The present paper makes an attempt at filling this void by investigating the large sample properties of the M–S kernel procedure in the fields of both density and regression function estimation. First, in the context of density estimation, we investigate the asymptotic normality and uniform almost sure convergence of the M–S kernel density estimate. Second, in the context of nonparametric regression, we investigate the asymptotic behaviour of the M–S kernel regression function estimate. We derive its asymptotic conditional bias and conditional variance, and establish its uniform almost sure consistency and asymptotic normality. Third, bandwidth selection is explored for the sake of implementing the methodology. As a byproduct, the paper provides a theoretical framework for investigating similar properties of other asymmetric kernel estimators.

2. M–S kernel regression estimation

Suppose X_1, X_2, ..., X_n is a random sample from a population X supported on (0, ∞). Let

K*_α(x, t) = (1/(tΓ(α))) (αx/t)^α exp(−αx/t),   α > 0, t > 0, x > 0.   (1)

For a sequence of positive real numbers αn, M–S proposed the following nonparametric estimate for the density of X:

f*_{αn}(x) = (1/n) Σ_{i=1}^{n} K*_{αn}(x, X_i) = (1/n) Σ_{i=1}^{n} (1/(X_i Γ(αn))) (αn x/X_i)^{αn} exp(−αn x/X_i).   (2)

The estimate (2) is constructed using the technique of recovering a function from its Mellin transform, as applied in the moment-identifiability problem. There is a close connection between K*_α(x, t) and the Gamma and inverse Gamma density functions. For each fixed t, K*_α(·, t) is a Gamma density function with scale parameter t/α and shape parameter α + 1; for each fixed x, K*_α(x, ·) is the density function of an inverse Gamma distribution with shape parameter α and scale parameter αx. Unfortunately, as seen in M–S, the asymptotic bias of f*_{αn}(x) depends on the first derivative of the underlying density function of X, which is due to the fact that αx/(α − 1), instead of x, is the mean of the density K*_α(x, t) viewed as a function of t. To reduce the bias, M–S used a modified version of K*_α(x, t), viz.

K_α(x, t) = (1/(tΓ(α + 1))) (αx/t)^{α+1} exp(−αx/t),   (3)

to construct the density estimate. For fixed x, K_α(x, ·) now is the density function of an inverse Gamma distribution with shape parameter α + 1 and scale parameter αx, the mean of which is exactly x; for fixed t, K_α(·, t) is not a Gamma density function any more, but αK_α(x, t)/(α + 1) is a Gamma density function with shape parameter α + 2 and scale parameter t/α. These properties imply a very interesting connection between the M–S kernel K_α(x, t) and the normal kernel used in the commonly used density estimate for large values of α. In fact, for a fixed x, let T_α be a random variable having density function K_α(x, ·), and, for a fixed t, let X_α be a random variable having density function αK_α(·, t)/(α + 1). Then one can verify that

√α (T_α/x − 1) →d N(0, 1),   √α (X_α/t − 1) →d N(0, 1),   as α → ∞.

Here, and in the following, →d denotes convergence in distribution. If we let h = 1/√α, then from the above facts it follows that, as α → ∞,

K_α(x, t) ≈ (1/h) φ((x/t − 1)/h)   or   K_α(x, t) ≈ (1/h) φ((t/x − 1)/h),

where φ(·) denotes the standard normal density function. Therefore, the M–S kernel K_α approximately behaves like the standard normal kernel, but the distance between x and t is not the usual Euclidean distance |x − t|; rather, it is the relative distance |x − t|/t or |x − t|/x. For the commonly used kernel function, x and t are symmetric in the sense of difference, while in the kernel function K_α(x, t), x and t are asymptotically symmetric in the sense of division; the parameter 1/√α plays the role of the bandwidth in the commonly used kernel setup.
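A small numerical sketch of the kernels (1) and (3) and of the limit statements above (the function names, the point x = 2 and the value α = 400 are our own illustrative choices); computing on the log scale avoids overflow of Γ(α + 1) and (αx/t)^{α+1} for large α.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import norm

def K_star(x, t, alpha):
    """Varying kernel K*_alpha(x, t) of Equation (1), computed on the log scale."""
    return np.exp(alpha * (np.log(alpha) + np.log(x) - np.log(t))
                  - alpha * x / t - np.log(t) - gammaln(alpha))

def K(x, t, alpha):
    """Modified M-S kernel K_alpha(x, t) of Equation (3), computed on the log scale."""
    return np.exp((alpha + 1) * (np.log(alpha) + np.log(x) - np.log(t))
                  - alpha * x / t - np.log(t) - gammaln(alpha + 1))

x, alpha = 2.0, 400.0
t = np.linspace(0.5, 5.0, 20001)
dt = t[1] - t[0]
print(np.sum(K(x, t, alpha)) * dt)           # ~1: inverse Gamma density in t, with mean x
print(np.sum(t * K_star(x, t, alpha)) * dt)  # ~ alpha*x/(alpha - 1): mean of K*_alpha(x, .)

# Standardised T_alpha ~ K_alpha(x, .) is approximately N(0, 1) for large alpha:
h = 1.0 / np.sqrt(alpha)
z = np.linspace(-3.0, 3.0, 7)
print(np.abs(x * h * K(x, x * (1 + h * z), alpha) - norm.pdf(z)).max())  # small for large alpha
```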
To have a better understanding of the smoothing effect of the kernel function K_α(x, t), we plot the functions for a pseudo-data set 0.5, 1, 2, 3 and α = 5, 20 over the range x ∈ (0, 7); the curves are shown in Figure 1.

Figure 1. The kernel function K_α(x, t) for the four pseudo-data points listed in the text and two choices of α. The solid curves are for α = 5, and the dotted curves for α = 20. The curve with the highest peak is for the data point 0.5, the one with the second highest peak is for the data point 1, and so on.

Clearly, all the curves are skewed to the right, implying that more weight is put on the values to the right of the observed data points; as α gets larger, all the curves shrink towards the data points, and the shape of the kernels changes according to the values of the data points.

For a random sample X_1, X_2, ..., X_n from the population X supported on (0, ∞), the M–S kernel density estimate based on the modified kernel K_α(x, t) is

f̂_n(x) = (1/n) Σ_{i=1}^{n} K_{αn}(x, X_i).   (4)

The expression for the MSE of f̂_n(x) is derived in M–S, as well as the L_1-consistency. Different from the commonly used symmetric kernel density estimates, the M–S kernel density estimates do not suffer from the boundary effect, which is confirmed both by the theory developed and by the simulation studies conducted in M–S. Although f(x) is not defined at x = 0, it is clear that f̂_n(0) = 0 almost surely. This intrinsic constraint is only desirable if lim_{x→0} f(x) = 0. Some other asymmetric kernel estimates also suffer from a similar disturbance, such as the inverse and reciprocal inverse Gaussian kernel estimates proposed in Scaillet (2004) and the copula-based kernel estimate suggested by Jones and Henderson (2007). If lim_{x→0} f(x) > 0, then to analyse the boundary behaviour of f̂_n(x) around 0, similar to the symmetric kernel case, we analyse the limiting behaviour of the bias of f̂_n(x) at x = u/αn, where 0 < u < 1. This is done in Section 6.

There is no discussion in the literature on the asymptotic normality of the M–S kernel density estimate (4). This paper will try to fill this void, not just because this topic itself is very interesting, but also because it has some very practical implications; for example, knowing the asymptotic distribution of f̂_n(x) enables us to construct confidence intervals for the density function f(x). Parallel to the commonly used symmetric kernel estimation methodology, and also as a further development, we also investigate the large sample behaviour of the nonparametric estimator of the regression function using the M–S kernel, when the covariate is positive.
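Before turning to regression, here is a minimal numerical sketch of the density estimate (4). The helper name ms_density, the log-normal example, the sample size and the fixed value αn = 25 are our own illustrative choices, not recommendations from the paper.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import lognorm

def ms_density(x, data, alpha):
    """M-S kernel density estimate (4): average of K_alpha(x, X_i) over the sample,
    with the kernel evaluated on the log scale for numerical stability."""
    x = np.asarray(x, dtype=float)[..., None]
    logk = ((alpha + 1) * (np.log(alpha) + np.log(x) - np.log(data))
            - alpha * x / data - np.log(data) - gammaln(alpha + 1))
    return np.exp(logk).mean(axis=-1)

rng = np.random.default_rng(1)
X = rng.lognormal(mean=0.0, sigma=1.0, size=500)
alpha_n = 25.0                                  # illustrative smoothing value
grid = np.linspace(0.05, 5.0, 100)
f_hat = ms_density(grid, X, alpha_n)
print(np.c_[grid[:5], f_hat[:5], lognorm.pdf(grid[:5], s=1.0)])  # estimate vs true density
```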
The relationship between a scalar response Y and a covariate X is often investigated through the regression model Y = m(X) + ε, where ε is the random error and X is a one-dimensional positive random variable. Furthermore, we assume that E(ε|X = x) = 0 and σ²(x) := E(ε²|X = x) > 0 for almost all x > 0. Let {(X_i, Y_i), i = 1, 2, ..., n} be a random sample from this regression model. Inspired by the construction of the N–W kernel regression estimate, the M–S kernel regression estimate of m(x) is defined to be

m̂_n(x) = Σ_{i=1}^{n} K_{αn}(x, X_i) Y_i / Σ_{i=1}^{n} K_{αn}(x, X_i).   (5)

In spite of the similarity between this estimate and its N–W kernel counterpart, the very different characteristics of the M–S kernel function relative to the commonly used symmetric kernel imply that many technical challenges encountered in the development of asymptotic theories for the new estimates are different. Under some regularity conditions on the underlying density function f(x) and the regression function m(x), the asymptotic normality of the M–S kernel estimate m̂_n(x), as well as its uniform consistency, is established in this paper.

From the definition of K_{αn}, one can derive a much simpler expression for m̂_n(x). In fact, after some cancellation, we have

m̂_n(x) = Σ_{i=1}^{n} X_i^{−αn−2} exp(−αn x/X_i) Y_i / Σ_{i=1}^{n} X_i^{−αn−2} exp(−αn x/X_i).

This formula is mainly useful for the computation of m̂_n, while Equation (5) is convenient for theoretical development.

Being asymmetric kernels, K*_α(x, t) defined in Equation (1) and K_α(x, t) defined in Equation (3) are rather different from the asymmetric kernels discussed in Cline (1988) and Abadir and Lawford (2004). In Equations (1) and (3), at each x > 0, the data points X_i behave like a scale parameter of x, while in Cline (1988) and Abadir and Lawford (2004), the X_i's appear as a location parameter of x. Therefore, the inadmissibility of the asymmetric kernel proved in Cline (1988) does not apply to the varying kernels defined in Equations (1) and (3).

The proposed estimation procedure is mainly developed for a univariate X. It is desirable to seek its extensions for higher dimensional positive covariates. Similar to the commonly used symmetric kernel regression, one way to proceed is to use a product kernel in the definition of the regression function estimate. Another way is to use a multivariate extension of the Gamma or inverse Gamma density function as the kernel function. The product kernel method is the most straightforward and natural choice, and theoretical results similar to those in one dimension can be easily derived. However, using multivariate extensions of the Gamma or inverse Gamma density as the kernel may not be practical, since the multivariate Gamma density functions proposed in the literature all have complicated forms, which makes the computation and the theoretical development of the corresponding varying kernel estimates much more challenging. For some definitions of the multivariate Gamma distribution, see Kotz, Balakrishnan and Johnson (2000).

The paper is organised as follows. Section 3 discusses the large sample results about m̂_n(x) along with the needed technical assumptions. In particular, it contains an approximate expression for the conditional MSE, a central limit theorem, and a uniform consistency result about m̂_n. Section 4 contains a discussion on the selection of the smoothing parameter αn. Findings of a simulation study are presented in Section 5, and the proofs of the main results appear in Section 6. Unless specified otherwise, all limits are taken as n → ∞.
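The simplified expression for m̂_n(x) given above is convenient for computation, but the factors X_i^{−αn−2} exp(−αn x/X_i) underflow easily for moderately large αn. A standard remedy (our choice, not something prescribed in the paper) is to form the weights on the log scale and subtract the maximum before exponentiating, as in the following sketch.

```python
import numpy as np

def ms_regression(x, X, Y, alpha):
    """M-S kernel regression estimate m_hat_n(x) via the simplified formula,
    with log-scale weights to avoid underflow of X_i**(-alpha-2) * exp(-alpha*x/X_i)."""
    x = np.asarray(x, dtype=float)[..., None]
    logw = -(alpha + 2) * np.log(X) - alpha * x / X
    logw -= logw.max(axis=-1, keepdims=True)      # log-sum-exp style stabilisation
    w = np.exp(logw)
    return (w * Y).sum(axis=-1) / w.sum(axis=-1)

# Toy usage: m(x) = (x - 1.5)^2 with a log-normal design, as in the simulations of Section 5.
rng = np.random.default_rng(2)
X = rng.lognormal(size=200)
Y = (X - 1.5) ** 2 + rng.normal(scale=0.5, size=200)
print(ms_regression(np.array([0.5, 1.0, 2.0]), X, Y, alpha=30.0))  # alpha chosen arbitrarily
```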
3. Large sample results of m̂_n(x)

We start with analysing the asymptotic properties of the conditional bias and conditional variance, hence the conditional MSE, of m̂_n(x) defined in Equation (5). Then a typical application of the Lindeberg–Feller central limit theorem will lead to the asymptotic normality of m̂_n(x). As a byproduct, the asymptotic normality of the M–S kernel density estimate f̂_n(x) is also a natural consequence. Thus, confidence intervals for the true density function and regression function can be constructed. Finally, uniform almost sure convergence results for f̂_n(x) and m̂_n(x) over any bounded sub-interval of (0, ∞) are developed by using the Borel–Cantelli lemma, after verifying the Cramér condition for the M–S kernel function. The following is a list of technical assumptions used for deriving these results:

(A1) The second-order derivative of f(x) is continuous and bounded on (0, ∞).
(A2) The second-order derivative of f(x)m(x) is continuous and bounded on (0, ∞).
(A3) The second-order derivative of σ²(x) = E(ε²|X = x) is continuous and bounded for all x > 0.
(A4) For some δ > 0, the second-order derivative of E(|ε|^{2+δ}|X = x) is continuous and bounded in x ∈ (0, ∞).
(A5) αn → ∞, √αn/n → 0.

Condition (A1) on f(x) is the same as the one adopted by M–S when deriving the bias and variance of f̂_n(x). Condition (A3) is required for dealing with the large sample argument pertaining to the random error and is not needed if one is willing to assume homoscedasticity. Condition (A4) is needed in proving the asymptotic normality of the proposed estimators, while (A5) is a minimal condition needed for the smoothing parameter. Additional assumptions on αn are stated as needed in the various theorems presented below. In the following, for any function g(x), g′(x) and g″(x) denote the first and second derivatives of g(x), respectively.

3.1. Bias and variance

The following theorem presents the asymptotic expansions of the conditional bias and the variance, hence the conditional MSE, of m̂_n(x). Let

b(x) := x² [ m′(x) f′(x)/f(x) + m″(x)/2 ],   v(x) := σ²(x) / (2 x f(x) √π),   (6)

and X := {X_1, X_2, ..., X_n}.

Theorem 3.1 Suppose the assumptions (A1), (A2), (A3), and (A5) hold. Then, for any x ∈ (0, ∞) with f(x) > 0,

Bias(m̂_n(x)|X) = b(x)/αn + O_p(1/(n^{1/2} αn^{1/4})) + o_p(1/αn),   (7)

Var(m̂_n(x)|X) = v(x) √αn/n + o_p(√αn/n).   (8)

Thus, the conditional MSE of m̂_n(x) has the asymptotic expansion

MSE(m̂_n(x)|X) = b²(x)/αn² + v(x) √αn/n + o_p(1/αn²) + o_p(√αn/n) + o_p(1/(n^{1/2} αn^{5/4})).

Remark The unconditional version of Theorem 3.1 is very hard to derive. This is also true for N–W kernel regression. Although Härdle, Müller, Sperlich and Werwatz (2004) indicated that the conditional MSE of the N–W kernel regression estimate could be derived from a linearisation technique, and the result is summarised in Theorem 4.1 of Härdle et al. (2004), a rigorous proof is not provided. But we can show that the unconditional version of Theorem 3.1 remains valid for

m̂*_n(x) = Σ_{i=1}^{n} K_{αn}(x, X_i) Y_i / [ n^{−2} + Σ_{i=1}^{n} K_{αn}(x, X_i) ],

a slightly modified version of m̂_n(x). A similar idea was used in Fan (1993) when dealing with local linear regression, and a proof of the unconditional MSE of m̂*_n(x) can follow the same thread as the proof of Theorem 3 in Fan (1993).

Recalling the above discussion on the analogy between αn and the bandwidth in the commonly used symmetric kernel density estimate, one can easily see the similarity of the bias and variance expressions between the M–S kernel estimate and the N–W kernel estimate. Similar to the N–W kernel regression case, one can choose the optimal smoothing parameter αn,opt by minimising the leading term in the conditional MSE of m̂_n with respect to αn. One can verify that αn,opt has the order of n^{2/5}, with the corresponding MSE having the order of n^{−4/5}. Recall that the same order is obtained for the N–W kernel regression estimate based on the same criterion.
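For the record, the n^{2/5} order follows from a one-line minimisation of the two leading terms of the conditional MSE (a sketch, treating b(x) and v(x) as fixed constants):

d/dα [ b²(x)/α² + v(x)√α/n ] = −2b²(x)/α³ + v(x)/(2n√α) = 0   ⟹   αn,opt = (4b²(x)/v(x))^{2/5} n^{2/5},

and substituting this back, both b²(x)/αn,opt² and v(x)√αn,opt/n are of order n^{−4/5}.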
3.2. Asymptotic normality

First, we give the asymptotic normality of the M–S kernel density estimate.

Theorem 3.2 Suppose the assumptions (A1), (A4), and (A5) hold. Then, for any x ∈ (0, ∞) with f(x) > 0,

( f(x) √αn / (2xn√π) )^{−1/2} [ f̂_n(x) − f(x) − x² f″(x) / (2(αn − 1)) ] →d N(0, 1).

The asymptotic normality of f̂_n(x) implies that f̂_n(x) converges to f(x) in probability, hence 1/f̂_n(x) converges to 1/f(x) in probability, whenever f(x) > 0. This result is used in the proof of the asymptotic normality of m̂_n(x), which is stated in the next theorem.

Theorem 3.3 Suppose the assumptions in Theorem 3.1 hold. Then, for any x ∈ (0, ∞) with f(x) > 0,

( v(x) √αn / n )^{−1/2} [ m̂_n(x) − m(x) − b(x)/(αn − 1) ] →d N(0, 1),

where b(x) and v(x) are defined in Equation (6).

It is noted that there is a non-negligible asymptotic bias appearing in the above results, a characteristic shared with the N–W kernel regression estimate. This bias can be eliminated by under-smoothing which, in the current setup, means selecting a larger αn such that √n/αn^{5/4} → 0, without violating the conditions αn → ∞, √αn/n → 0. Large sample confidence intervals for m(x) can thus be constructed with the help of Theorem 3.3.

3.3. Almost sure uniform convergence

In this section, we develop an almost sure uniform convergence result for m̂_n(x) over an arbitrary bounded sub-interval of (0, ∞). In the N–W kernel regression estimation scenario, a similar result is obtained by using the Borel–Cantelli lemma and the Bernstein inequality, but the Cramér condition must be verified before applying these well-known results. That is, for any fixed x > 0 and k ≥ 2, we have to show that

E|K_α(x, X)|^k ≤ k! ( c√α / n )^{k−2} E K_α²(x, X)

for some positive constant c when α is large.

The following two theorems give the almost sure uniform convergence of f̂_n to f and of m̂_n to m over bounded sub-intervals of (0, ∞).

Theorem 3.4 In addition to (A1) and (A5), assume that αn^{1/2} log n/n → 0. Then, for any constants a and b such that 0 < a < b < ∞,

sup_{x∈[a,b]} |f̂_n(x) − f(x)| = O(1/αn) + o( αn^{1/4} √(log n) / √n ),   a.s.

Theorem 3.5 In addition to (A1)–(A5), assume that αn^{1/2} log n/n → 0. Then, for any constants a and b such that 0 < a < b < ∞,

sup_{x∈[a,b]} |m̂_n(x) − m(x)| = O(1/αn) + o( αn^{1/4} √(log n) / √n ),   a.s.

By assuming some stronger conditions on the tails of f and m at the boundaries, the above uniform almost sure convergence results can be extended to suitable intervals increasing to (0, ∞). However, we do not pursue this here, simply because of the involved technical details and the lack of a useful application.
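Before moving on to smoothing parameter selection, here is a minimal sketch of how Theorem 3.3 can be used to form a pointwise confidence interval for m(x) in practice. The plug-in estimates of f(x) and σ²(x), the decision to ignore the bias term (i.e. to under-smooth), and all function names are our own illustrative choices, not prescriptions of the paper.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import norm

def ms_kernel(x, t, alpha):
    """M-S kernel K_alpha(x, t) of Equation (3), computed on the log scale."""
    return np.exp((alpha + 1) * (np.log(alpha) + np.log(x) - np.log(t))
                  - alpha * x / t - np.log(t) - gammaln(alpha + 1))

def ms_regression_ci(x, X, Y, alpha, level=0.95):
    """Pointwise CI for m(x) based on Theorem 3.3, with the bias term ignored
    (assuming alpha_n is chosen large enough to under-smooth)."""
    n = len(X)
    K = ms_kernel(x, X, alpha)                    # kernel weights K_alpha(x, X_i)
    m_hat = np.sum(K * Y) / np.sum(K)             # M-S regression estimate (5)
    f_hat = np.mean(K)                            # density estimate (4)
    sigma2_hat = np.sum(K * (Y - m_hat) ** 2) / np.sum(K)   # crude local variance estimate (our assumption)
    v_hat = sigma2_hat / (2 * x * f_hat * np.sqrt(np.pi))   # v(x) of Equation (6)
    se = np.sqrt(v_hat * np.sqrt(alpha) / n)
    z = norm.ppf(0.5 + level / 2)
    return m_hat - z * se, m_hat + z * se
```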
4. Selection of smoothing parameters

It is well known that the smoothing parameter plays a crucial role in nonparametric kernel regression. Abundant research has been conducted for the N–W kernel-type regression estimation methodology (see, e.g., Wand and Jones 1994; Hart 1997 for data-driven choices of the smoothing parameters in this setup). However, to the best of our knowledge, no such work has been done for asymmetric kernel regression. In this section, we propose several smoothing parameter selection procedures for implementing the M–S kernel technique. First, we recall the least-squares cross-validation (LSCV) procedure from M–S and discuss its extension, k-fold LSCV. Second, we propose smoothing parameter selection procedures in the nonparametric regression setup; the k-fold LSCV and the generalised cross-validation (GCV) will be discussed. These procedures are analogous to the commonly used data-driven procedures in the N–W kernel regression estimation context. The theoretical properties, such as the consistency of these smoothing parameter selectors for some 'optimal' smoothing parameter, might be discussed in a similar way as in John (1984), Härdle, Hall and Marron (1988, 1992) and references therein. However, we will not investigate this important topic in the current paper, as it deserves an independent in-depth study.

4.1. Density estimation: k-fold LSCV

The motivation of the LSCV comes from expanding the mean integrated squared error (MISE) of f̂. Define

LSCV(α) = ∫ f̂²(x) dx − (2/n) Σ_{i=1}^{n} f̂_{−i}(X_i),

where f̂_{−i}(X_i) is the leave-one-out M–S kernel density estimate of f(X_i) computed without using the ith observation. Then the LSCV smoothing parameter is defined by α̂_LSCV = argmin_α LSCV(α). For the M–S kernel density estimate (4),

LSCV(α) = [ Γ(2α + 3) / (n² α Γ²(α + 1)) ] Σ_{i=1}^{n} Σ_{j=1}^{n} (X_i X_j)^{α+1} / (X_i + X_j)^{2α+3}
          − [ 2 / (n(n − 1) Γ(α + 1)) ] Σ_{i≠j} (1/X_j) (αX_i/X_j)^{α+1} exp(−αX_i/X_j).

An extension of the above leave-one-out LSCV is the k-fold LSCV procedure. First, split the data into k roughly equal-sized parts; then, for each part, calculate the prediction error based on the M–S kernel density estimate constructed from the data in the other k − 1 parts; finally, take the sum of the k prediction errors as the quantity to be minimised. In particular, for our current setup, the k-fold LSCV has the same structure as the leave-one-out LSCV except that the second term is now

[ 2 / (n Γ(α + 1)) ] Σ_{i=1}^{n} (1/(n − n_i)) Σ_{j∉D(i)} (1/X_j) (αX_i/X_j)^{α+1} exp(−αX_i/X_j),

where D(i) is the set of indices of the data part containing X_i. For convenience, if we use D_1, D_2, ..., D_k to denote the index sets of the first part, the second part, and so on, then D(i) = {j : i, j ∈ D_l, l = 1, 2, ..., k}, and n_i is the size of D(i). The k-fold LSCV reduces to the leave-one-out LSCV when k = n.

4.2. M–S kernel regression: k-fold LSCV

The basic idea of LSCV in the regression setup is to select the smoothing parameter by minimising the prediction error. For this purpose, let m̂_{D\D(i)}(X_i) be the M–S kernel estimate of m(x) at x = X_i of the same type as m̂_n(x), except that it is computed without using the data part containing the ith observation (X_i, Y_i), where D = {1, 2, ..., n}. The LSCV smoothing parameter α̂_LSCV is the value of α that minimises the LSCV criterion

CV(α) = Σ_{i=1}^{n} [ Y_i − m̂_{D\D(i)}(X_i) ]²
      = Σ_{i=1}^{n} [ Y_i − Σ_{j∉D(i)} X_j^{−α−2} exp(−αX_i/X_j) Y_j / Σ_{j∉D(i)} X_j^{−α−2} exp(−αX_i/X_j) ]².

The independence between (X_i, Y_i) and m̂_{D\D(i)}(X_i) indicates that CV(α) gives an accurate assessment of how well the estimate m̂_n(x) will predict future observations.
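A minimal sketch of the five-fold version of CV(α) above; the fold assignment, the candidate grid of α values and the helper names are our own choices.

```python
import numpy as np

def ms_regression_fit(x, X, Y, alpha):
    """M-S kernel regression estimate (5) at the points x, via the simplified
    weight formula, with log-scale stabilisation of the weights."""
    logw = -(alpha + 2) * np.log(X) - alpha * x[:, None] / X
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    return (w * Y).sum(axis=1) / w.sum(axis=1)

def kfold_lscv(X, Y, alphas, k=5, seed=0):
    """Select alpha by k-fold least-squares cross-validation of CV(alpha)."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(X)) % k          # roughly equal-sized parts
    scores = []
    for a in alphas:
        cv = 0.0
        for l in range(k):
            test, train = folds == l, folds != l
            pred = ms_regression_fit(X[test], X[train], Y[train], a)
            cv += np.sum((Y[test] - pred) ** 2)
        scores.append(cv)
    return alphas[int(np.argmin(scores))]

# Illustrative usage on simulated data (the candidate grid is arbitrary).
rng = np.random.default_rng(4)
X = rng.lognormal(size=200)
Y = (X - 1.5) ** 2 + rng.normal(scale=0.5, size=200)
print(kfold_lscv(X, Y, alphas=np.arange(5, 101, 5)))
```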
4.3. M–S kernel regression: GCV

The GCV procedure from the N–W kernel regression can also be adapted to the current setup. Define

w_{ij} = X_j^{−α−2} exp(−αX_i/X_j) / Σ_{k=1}^{n} X_k^{−α−2} exp(−αX_i/X_k),   i, j = 1, 2, ..., n.

Then the GCV smoothing parameter α̂_GCV is the value of α that minimises the GCV criterion GCV(α) defined as

GCV(α) = n Σ_{i=1}^{n} [ Y_i − Σ_{j=1}^{n} w_{ij} Y_j ]² / [ n − Σ_{i=1}^{n} w_{ii} ]².

There is no single smoothing parameter selection procedure that is uniformly superior to the others, in the sense that the selected smoothing values always produce estimates with the smallest MSE. The simulation study conducted in the next section shows that, for some data sets, a selection procedure might not even work. A common practice is to try several procedures and make an overall evaluation to decide on a proper smoothing value.
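A matching sketch of the GCV criterion above, again with the simplified weights computed on the log scale; the candidate grid in the usage line is arbitrary.

```python
import numpy as np

def gcv_score(X, Y, alpha):
    """GCV(alpha) for the M-S kernel regression estimate, using the weights w_ij
    defined above (rows of the smoother matrix), computed on the log scale."""
    X = np.asarray(X, dtype=float)
    logw = -(alpha + 2) * np.log(X)[None, :] - alpha * X[:, None] / X[None, :]
    W = np.exp(logw - logw.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)
    n = len(X)
    rss = np.sum((Y - W @ Y) ** 2)
    return n * rss / (n - np.trace(W)) ** 2

def select_alpha_gcv(X, Y, alphas):
    """Pick the candidate alpha minimising GCV(alpha)."""
    scores = [gcv_score(X, Y, a) for a in alphas]
    return alphas[int(np.argmin(scores))]

# Illustrative usage on simulated data.
rng = np.random.default_rng(5)
X = rng.lognormal(size=200)
Y = (X - 1.5) ** 2 + rng.normal(scale=0.5, size=200)
print(select_alpha_gcv(X, Y, alphas=np.arange(5, 101, 5)))
```

As suggested above, one can run both the k-fold LSCV and the GCV selectors over the same grid of candidate α values and compare the resulting fits before settling on a smoothing value.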
5. Simulation study

To evaluate the finite sample performance of the proposed M–S kernel regression estimates, we conducted a simulation study. In the simulation, the underlying density function of the design variable is chosen to be log-normal with µ = 0, σ = 1, and the random error ε is normal with mean 0 and standard deviation 0.5. Two simple regression functions, m(x) = 1/x² and m(x) = (x − 1.5)², are considered. For m(x) = 1/x², the estimate is evaluated at 1024 equally spaced values over the interval (0.1, 1); for m(x) = (x − 1.5)², the estimate is evaluated at 1024 equally spaced values over the interval (0, 3). The sample sizes used are 100 and 200. The MSEs between the estimated values and the true values of the regression function are then used for comparison.

It is always controversial to compare two different nonparametric smoothing procedures, especially when one or both procedures involve smoothing parameters, which play a crucial role in determining the smoothness of the fitted regression function, since by selecting a proper smoothing parameter one method can often be made to outperform the other. Therefore, for the sake of fairness in comparison, one should use the same criterion to select the smoothing parameters whenever possible. Unfortunately, sometimes the chosen criterion works for one procedure but does not work for another. In this case, one might try different criteria for both procedures and make an overall comparison. The five-fold LSCV and GCV criteria are tried to select the bandwidth for both the M–S kernel and N–W kernel estimates. The standard normal kernel is used to construct the N–W kernel estimate.

Table 1 presents the simulation results when m(x) = 1/x². The numbers within the parentheses are the smoothing values selected by the various criteria, and the numbers outside the parentheses are the MSEs. For n = 100, the five-fold LSCV criterion and GCV do not work for the N–W procedure, and a crossed sign is used in the table to indicate this case. For n = 200, LSCV still does not work for the N–W estimator, while GCV works. Also, h = n^{−1/5}, the bandwidth based on the optimal order of the conditional MSE, is used to calculate the N–W estimate. The five-fold LSCV criterion works for the M–S kernel estimate. We also try α = n^{2/5}, the smoothing value based on the optimal order of the conditional MSE for the M–S kernel estimate, to calculate the M–S kernel estimate.

Table 1. MSE comparison: m(x) = 1/x², x ∈ (0.1, 1).

         M–S kernel                                  N–W kernel
n     LSCV         GCV           α = n^{2/5}     LSCV   GCV             h = n^{−1/5}
100   (22) 7.944   (12) 54.522   119.113         ×      ×               (0.398) 148.185
200   (29) 5.116   (10) 10.530   14.497          ×      (0.009) 2.359   (0.347) 174.172

Note: entries are MSEs; selected smoothing values are in parentheses; × indicates the criterion did not work.

Figure 2 provides a visual comparison between these two procedures. To keep the figure neat, we only plot the M–S kernel estimate with an LSCV bandwidth and the N–W kernel estimate with h = n^{−1/5}. Here, and in the subsequent figures, the thick solid curve denotes the true regression function, the thin solid line denotes the M–S kernel estimate, and the dashed line is for the N–W estimate. Clearly, with respect to the boundary area, the M–S kernel estimate does better than the N–W kernel estimate.

Figure 2. Estimates of m(x) = 1/x². Smoothing values are selected by LSCV and optimal order of MSE. (Panels for n = 100 and n = 200.)

We also tried the GCV criterion to choose the smoothing parameters. Figure 3 provides a visual comparison between the M–S and N–W procedures with smoothing values selected by GCV. The MSEs reported in Table 1 and Figure 3 clearly indicate that the GCV favours the N–W kernel estimate more than the M–S kernel estimate, although the N–W kernel estimate possesses a larger variability. We also tried fitting the regression function using the boundary kernel suggested by Gasser and Müller (1979). The MSEs are generally smaller than those of the N–W kernel estimates, but still much larger than those of the M–S kernel estimates. For example, for h = n^{−1/5}, the MSEs using the boundary kernel are 162.133 when n = 100 and 38.62 when n = 200.

Figure 3. Estimates of m(x) = 1/x². Smoothing values are selected by GCV.

Table 2 reports the MSEs from the simulation study when m(x) = (x − 1.5)². Now the five-fold LSCV and GCV criteria work for both procedures.

Table 2. MSE comparison: m(x) = (x − 1.5)², x ∈ (0, 3).

         M–S kernel                                      N–W kernel
n     LSCV         GCV          α = n^{2/5}        LSCV            GCV             h = n^{−1/5}
100   (71) 0.025   (30) 0.016   (6.310) 0.049      (0.429) 0.086   (0.089) 0.065   (0.398) 0.030
200   (54) 0.014   (37) 0.012   (8.326) 0.055      (0.917) 0.250   (0.090) 0.067   (0.347) 0.030

The M–S kernel estimate with both LSCV and GCV bandwidths outperforms the N–W kernel estimates with all the selected bandwidths, but the contrary is true when both procedures use the bandwidth obtained by minimising the asymptotic integrated mean square error (AIMSE) expression. Figure 4 shows the fitted curves from the M–S kernel estimate with an LSCV bandwidth and the N–W kernel estimate with h = n^{−1/5}. It is clear that the M–S kernel estimator is a very promising competitor for the N–W kernel estimator. This point is reconfirmed by Figure 5, which shows the fitted curves from both kernel estimators with smoothing values selected by the GCV criterion.

Figure 4. Estimates of m(x) = (x − 1.5)². Smoothing values are selected by LSCV and optimal order of MSE. (Panels for n = 100 and n = 200.)

Figure 5. Estimates of m(x) = (x − 1.5)². Smoothing values are selected by GCV. (Panels for n = 100 and n = 200.)
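Finally, a compact sketch of the simulation design of this section for m(x) = (x − 1.5)², using only the fixed-order smoothing values α = n^{2/5} and h = n^{−1/5}. The number of replications, the evaluation grid endpoints and the helper names are our own choices, so the numbers are not expected to reproduce the corresponding columns of Table 2 exactly.

```python
import numpy as np

def ms_fit(x, X, Y, alpha):
    """M-S kernel regression fit via the simplified, log-stabilised weights."""
    logw = -(alpha + 2) * np.log(X) - alpha * x[:, None] / X
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    return (w * Y).sum(axis=1) / w.sum(axis=1)

def nw_fit(x, X, Y, h):
    """N-W regression fit with a standard normal kernel."""
    w = np.exp(-0.5 * ((x[:, None] - X) / h) ** 2)
    return (w * Y).sum(axis=1) / w.sum(axis=1)

def one_replication(n, m, grid, rng):
    X = rng.lognormal(size=n)                    # design density: log-normal(0, 1)
    Y = m(X) + rng.normal(scale=0.5, size=n)     # errors N(0, 0.5^2)
    mse_ms = np.mean((ms_fit(grid, X, Y, n ** 0.4) - m(grid)) ** 2)
    mse_nw = np.mean((nw_fit(grid, X, Y, n ** -0.2) - m(grid)) ** 2)
    return mse_ms, mse_nw

rng = np.random.default_rng(3)
grid = np.linspace(0.003, 3.0, 1024)             # evaluation grid over (0, 3)
m = lambda x: (x - 1.5) ** 2
print(np.mean([one_replication(200, m, grid, rng) for _ in range(50)], axis=0))
```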
Downloaded by [98.239.145.180] at 20:47 16 April 2014 6. Proofs of main results This section contains the proofs of all the large sample results presented in Section 2. Inverse Gamma density function and its moments will be repeatedly referred to in the following proofs. For convenience, we list all the needed results here. Density function of an inverse Gamma distribution with shape parameter p and rate parameter λ is % % &p+1 & λp λ 1 g(u, p, λ) = exp − , u > 0. "(p) u u Its mean µ, variance τ 2 , and the fourth central moment ν4 , respectively, are µ= ν4 = Let λ , p−1 τ2 = λ2 , (p − 1)2 (p − 2) λ4 (3p + 15) . (p − 1)4 (p − 2)(p − 3)(p − 4) pk = k(αn + 2) − 1, λk = kαn x, k = 1, 2, . . . , x > 0. Write µk , τk , and ν4k for µ, τ , and ν4 when λ and p are replaced by λk and pk , respectively. The following lemma on the inverse Gamma distribution is crucial for the subsequent arguments. Lemma 6.1 Let l(u) be a function such that the second-order derivative of l(u) is continuous and bounded on (0, ∞). Then, for αn large enough, and for all x > 0 and k ≥ 1, , ∞ (2 − 2k)xl ( (x) g(u; pk , λk )l(u) du = l(x) + pk − 1 0 % & [(2 − 2k)2 (pk − 2) + k 2 αn2 ]x 2 l (( (x) 1 + +o . 2 2(pk − 1) (pk − 2) αn Proof of Lemma 6.1 Fix an x > 0. Note that µk := λk /(pk − 1) = x + (2 − 2k)x/(pk − 1). A Taylor expansion of l(µk ) around x up to the second order yields l(µk ) = l(x) + (2 − 2k)xl ( (x) (2 − 2k)2 x 2 l (( (ξ ) , + pk − 1 2(pk − 1)2 (9) where ξ is some value between x + (2 − 2k)x/(pk − 1) and x. Recall µk is the mean of g(u; pk , λk ). A Taylor expansion of l(u) around µk yields , ∞ , ∞ 1 l(u)g(u, pk , λk ) du = l(µk ) + l (( (µk ) (u − µk )2 g(u; pk , λk ) du 2 0 0 , 1 ∞ + (u − µk )2 g(u; pk , λk )[l (( (˜u) − l (( (µk )] du (10) 2 0 842 H.L. Koul and W. Song for some u˜ between u and µk . From Equation (9) and the continuity of l (( , we can verify that the two leading terms on the right-hand side of Equation (10) match the expansion in the lemma. Therefore, it is sufficient to show that the third term on the right-hand side of Equation (10) is of the order o(1/αn ). Since l (( is continuous, so it is uniformly continuous over any closed sub-intervals in (0, ∞). For any , > 0, select a 0 < γ < x, such that for any y with |y − x| ≤ γ , |l (( (x) − l (( (y)| < ,. Let δ1 = x − γ /2. The boundedness of l (( implies 3, 3 3 3 δ1 Downloaded by [98.239.145.180] at 20:47 16 April 2014 0 3 , 3 3 (u − µk ) g(u; pk , λk )[l (˜u) − l (µk )] du3 ≤ c (( 2 (( δ1 0 (u − µk )2 g(u; pk , λk ) du. Note that the inverse Gamma density function g(u, pk , λk ) is unimodal, and the mode is αn x/(αn + 2), which approaches x when αn → ∞. Therefore, for αn large enough, δ1 < αn x/(αn + 2), and for all u ∈ (0, δ1 ), g(u, pk , λk ) ≤ g(δ1 , pk , λk ). Hence, , δ1 0 (u − µk )2 g(u; pk , λk ) du ≤ g(δ1 , pk , λk ) , δ1 0 ' u−x− (2 − 2k)x k(αn + 2) − 2 (2 du. Clearly the integral on the right-hand side is finite. From the definitions of pk and λk , g(δ1 , pk , λk ) = (kαn x)k(αn +2)−1 −k(αn +2) −kαn xδ1−1 δ e . "(k(αn + 2) − 1) 1 By the Stirling approximation, as αn → ∞, &k(αn +2)−2 kαn x ek(αn +2)−2 kαn x [1 + o(1)] √ k(αn + 2) − 2 2π[k(αn + 2) − 2] √ = O(x kαn ekαn αn ). (kαn x)k(αn +2)−1 = "(k(αn + 2) − 1) % (11) Therefore, −1 √ g(δ1 , pk , λk ) = O(x kαn ekαn δ1−kαn e−kαn xδ1 αn ) )' + % &(kαn √ x x =O exp 1 − αn . δ1 δ1 This relation and δ1 < x now readily implies that g(δ1 , pk , λk ) = o(1/αn ), which in turn implies that % & , δ1 1 2 (( (( (u − µk ) g(u; pk , λk )[l (˜u) − l (µk )] du = o . 
(12) αn 0 Now take δ2 = x + γ /2. Then, 3 3, ∞ , 3 3 2 (( (( 3 3 (u − µk ) g(u; pk , λk )[l (˜u) − l (µk )] du3 ≤ c 3 δ2 ∞ δ2 (u − µk )2 g(u; pk , λk ) du. But, , ∞ δ2 (kαn x)k(αn +2)−1 (u − µk ) g(u; pk , λk ) du = "(k(αn + 2) − 1) 2 , ∞ δ2 % &k(αn +2) % & 1 kαn x (u − µk ) exp − du. u u 2 843 Journal of Nonparametric Statistics The integral on the right-hand side is bounded above by 4 , ∞ δ2 % &k(αn +2)−2 % & 1 kαn x exp − du. u u By the change of variable, v = kαn x/u, we obtain Downloaded by [98.239.145.180] at 20:47 16 April 2014 , ∞ δ2 % &k(αn +2)−2 % & % &k(αn +2)−3 , kαn x/δ2 1 kαn x 1 exp − du = vk(αn +2)−4 exp(−v) dv. (13) u u kαn x 0 As a function of v, vk(αn +2)−4 exp(−v) is increasing in v ≤ k(αn + 2) − 4 and decreasing in v ≥ k(αn + 2) − 4. Since δ2 > x, so kαn x/δ2 < k(αn + 2) − 4 for αn sufficiently large. Therefore, for all v ∈ [0, kαn x/δ2 ], % & % & kαn x k(αn +2)−4 kαn x exp − . vk(αn +2)−4 exp(−v) ≤ δ2 δ2 Plugging the above inequality into Equation (13), we obtain that , ∞ δ2 % &k(αn +2)−2 % % & % & & % & 1 kαn x kαn x 1 k(αn +2)−3 kαn x k(αn +2)−4 kαn x exp − exp − du ≤ u u kαn x δ2 δ2 δ2 % &k(αn +2)−3 % & 1 kαn x = exp − . δ2 δ2 From Equation (11), we have , ∞ δ2 % & % & 1 k(αn +2)−3 kαn x exp − δ2 δ2 )' + % &(kαn % & √ x x 1 =O exp 1 − αn = o , δ2 δ2 αn 4 √ 5 (u − µk ) g(u; pk , λk ) du ≤ O x kαn ekαn αn · 2 because 0 < x < δ2 implies 0 < (x/δ2 ) exp(1 − x/δ2 ) < 1. Hence, % & , ∞ 1 2 (( (( (u − µk ) g(u; pk , λk )[l (˜u) − l (µk )] du = o . αn δ2 (14) Finally, we shall show that % & , δ2 1 2 (( (( . (u − µk ) g(u; pk , λk )[l (˜u) − l (µk )] du = o αn δ1 By uniform continuity of l(( , , , δ2 (u − µk )2 g(u; pk , λk )|l (( (˜u) − l (( (µk )| du ≤ , 0 δ1 ∞ (u − µk )2 g(u; pk , λk ) du, 6by∞ the fact 2that |˜u − µk | ≤ |u − µk | < γ , for u ∈ [δ1 , δ2 ] and αn sufficiently large. Because 0 (u − µk ) g(u; pk , λk ) du = O(1/αn ), we obtain , δ2 δ1 % 1 (u − µk ) g(u; pk , λk )|l (˜u) − l (µk )| du = , · O αn 2 (( (( & . (15) 844 H.L. Koul and W. Song The arbitrariness of , combined with Equations (12), (14), and (15) finally yield % & , ∞ 1 2 (( (( (u − µk ) g(u; pk , λk )[l (˜u) − l (µk )] du = o . αn 0 ! Hence, the desired result in the lemma. In particular, if k = 1, then % & , ∞ x 2 l (( (x) 1 g(u; p1 , λ1 )l(u) du = l(x) + . +o 2(αn − 1) αn 0 (16) Downloaded by [98.239.145.180] at 20:47 16 April 2014 To analyse the limiting behaviour of fˆn (x) as x → 0, similar to the symmetric kernel case, we analyse the limiting bias of fˆn (x) at x = u/αn , where 0 < u < 1. It is easy to see that fˆn % u αn & n = 1$ 1 n i=1 Xi "(αn + 1) % u Xi &αn +1 e−u/Xi . Let p = αn + 1, λ = u, we can show that % & , ∞ % & % & u u 1 ˆ E fn = g(x, p, λ)f (x) dx = f +O . αn αn αn 0 Therefore, fˆn (x) does not suffer from the boundary effect. The following decomposition of m ˆ n (x) will be used repeatedly in the proofs below. 1 2 Bn (x) + Vn (x) 1 1 m ˆ n (x) − m(x) = + − [Bn (x) + Vn (x)], f (x) fˆn (x) f (x) where n 1$ Kα (x, Xi )[m(Xi ) − m(x)], Bn (x) = n i=1 n n 1$ Vn (x) = Kα (x, Xi )εi , n i=1 n with Kαn (x, Xi ) defined in Equation (4). Now we are ready to prove Theorem 3.1. Proof of Theorem 3.1 First, we shall compute the conditional bias of m ˆ n (x). Direct calculations shows that E[m ˆ n (x)|X] − m(x) = Bn (x)/fˆn (x). Since fˆn (x) = f (x) + op (1), it suffices to discuss the asymptotic property of Bn (x). Note that EBn (x) = EKαn (x, X)m(X) − m(x)EKαn (x). 
But , ∞ " 1 αn x #αn +1 exp(−αn x/u) E(Kαn (x, X)m(X)) = m(u)f (u) du u u "(αn + 1) 0 , ∞ = g(u, p1 , λ1 )m(u)f (u) du, 0 where p1 = αn + 1, λ1 = αn x. Let H(u) = m(u)f (u). Applying Equation (16) with l(u) = H(u) and with l(u) = f (u), respectively, yields % & 1 x 2 H (( (x) E(Kαn (x, X)m(X1 )) = m(x)f (x) + +o , 2(αn − 1) αn ' ( % & x 2 f (( (x) 1 m(x)EKαn (x, X) = m(x) f (x) + . +o 2(αn − 1) αn Therefore, Journal of Nonparametric Statistics 845 % & x 2 [H (( (x) − m(x)f (( (x)] 1 EBn (x) = +o . 2(αn − 1) αn (17) Direct calculations show that x 2 [H (( (x) − m(x)f (( (x)]/2 = b(x)f (x), where b(x) is defined in Equation (6). Next, consider Downloaded by [98.239.145.180] at 20:47 16 April 2014 Var(Bn (x)) = 1 2 1 EKαn (x, X)[m(X) − m(x)]2 − [EKαn (x, X)(m(X) − m(x))]2 . n n Note that EKα2n (x, X)(m(X) − m(x))2 equals % & , ∞ 1 " αn x #2(αn +1) 1 2αn x exp − (m(u) − m(x))2 f (u) du u2 u "(αn + 1) u 0 , ∞ "(2αn + 3) = g(u; p2 , λ2 )(m(u) − m(x))2 f (u) du, xαn 22αn +3 " 2 (αn + 1) 0 where p2 = 2αn + 3, λ2 = 2αn x. By the Stirling approximation, for αn sufficiently large, √ αn "(2αn + 3) = √ [1 + o(1)]. 2α 2 +3 n αn 2 " (αn + 1) 2 π A Taylor expansion 6 ∞ of m(u) and f (u) around αn x/(αn + 1) up to the first order gives the following expansion for 0 g(u; p2 , λ2 )(m(u) − m(x))2 f (u) du: & % & , ∞% αn x 2 1 (m( (x))2 f (x) u− , g(u; p2 , λ2 ) du + o α + 1 α n n 0 by the assumptions (A1) and (A2), and the fact & % & , ∞% αn x 2 x 2 αn2 1 u− . g(u; p2 , λ2 ) du = =O 2 αn + 1 (αn + 1) (2αn + 1) αn 0 Therefore, & % 1 2 1 2 . EK (x, X)[m(X) − m(x)] = O √ n αn n αn From Equation (17), EBn (x) = O(1/αn ). Hence, & % & % 1 1 Var(Bn (x)) = O √ +O . n αn nαn2 (18) Therefore, Equations (17) and (18), and the fact x 2 [H (( (x) − m(x)f (( (x)]/2 = b(x)f (x) together yield ) + % & Bn (x) 1 1 b(x) + Op * √ + op . (19) = f (x) αn − 1 αn n αn Moreover, ' ( 1 E[m ˆ n (x)|X] − m(x) = + op (1) · [EBn (x) + Bn (x) − EBn (x)] f (x) ' % & % &( ( ' 1 1 1 b(x)f (x) = + Op , + op + op (1) · √ f (x) αn αn n αn which implies the claim (7) about the conditional bias of m ˆ n (x). 846 H.L. Koul and W. Song Next, we verify the claim (8) about the conditional variance of m ˆ n (x). In fact, with σ 2 (x) = 2 E(ε |X = x), % & n 1 1 $ 2 x Var[m ˆ n (x)|X] = · K αn σ 2 (Xi ). (20) ˆfn2 (x) n2 Xi i=1 Verify that under condition (A3) about σ 2 (x), 1 2 %√ & √ n σ 2 (x)f (x) αn αn 1 $ 2 2 E 2 Kαn (x, Xi )σ (Xi ) = , +o √ n i=1 n 2nx π Downloaded by [98.239.145.180] at 20:47 16 April 2014 which, together with Equation (20) and the fact fˆn (x) = f (x) + op (1), implies the claim (8). Proof of Theorem 3.2 ! Let ξin (x) = n−1 [Kαn (x, Xi ) − EKαn (x, X)]. Then, fˆn (x) = n $ i=1 ξin (x) + EKαn (x, X). Since EKαn (x, X) = f (x) + x 2 f (( (x)/2(αn − 1) + o(1/αn ), % & $ n x 2 f (( (x) 1 fˆn (x) − f (x) − +o = ξin (x). 2(αn − 1) αn i=1 Lindeberg–Feller Central Limit Theory (CLT) will be used to show the asymptotic normality ! of ni=1 ξin (x). For any a > 0, b > 0, and r > 1, using the well-known inequality (a + b)r ≤ 2r−1 (ar + br ), we have E|ξin (x)|2+δ ≤ n−(2+δ) 21+δ [E(Kαn (x, X))2+δ + (EKαn (x, X))2+δ ]. Let λδ = (2 + δ)αn x, pδ = (2 + δ)(2 + αn ) − 1.A tedious calculation shows that E(Kαn (x, X))2+δ can be written as , 1 "((2 + δ)(2 + αn ) − 1) ∞ g(u; pδ , λδ )f (u) du. (αn x)1+δ (2 + δ)(2+δ)(2+αn )−1 " 2+δ (αn + 1) 0 For n and αn , large enough, using the Stirling approximation, we have "((2 + δ)(2 + αn ) − 1) = O((2 + δ)(2+δ)(2+αn ) αn2(2+δ)−(5+δ)/2 ). " 2+δ (αn + 1) Also, we have 6∞ 0 g(u; pδ , λδ )f (u) du = f (x) + o(1). 
Hence, EKα2+δ (x, X) = O(αn(δ+1)/2 ). n Note that EKαn (x, X) = EKα2n (x, X) = , 0 ∞ g(u; p1 , λ1 )f (u) du, "(2αn + 3) xαn 22αn +3 " 2 (αn + 1) , ∞ 0 g(u; p2 , λ2 )f (u) du. 847 Journal of Nonparametric Statistics Hence, by Lemma 6.1, we obtain vn2 = Var −1 ) n $ 7 i=1 + ξin (x) = Var(fˆn (x)) EKα2n (x, X) − (EKαn (x, X))2 =n %√ & √ αn f (x) αn = . √ +o n 2nx π This fact together with EKαn (x, X) = f (x) + o(1) imply Downloaded by [98.239.145.180] at 20:47 16 April 2014 vn−(2+δ) n $ Eξin2+δ (x) i=1 = 2+δ nvn−(2+δ) Eξ1n 8 (21) )% √ & + αn δ/2 =O , n which converges to 0, by assumption (A4). Hence, the Lindeberg–Feller condition holds. This completes the proof of the Theorem 3.2. ! Proof of Theorem 3.3 Fix an x > 0. To show the asymptotic normality of m ˆ n (x), again we use the decomposition (21). We shall first show that Vn! (x) is asymptotically normal. For this purpose, let ηin = −1 n Kαn (x, Xi )εi so that Vn (x) = ni=1 ηin . Clearly, Eηin√= 0. By assumption (A3) on σ 2 (x), a √ 2 = [ αn f (x)σ 2 (x)/(2n2 x π)][1 + o(1)]. Therefore, routine argument leads to Eηin sn2 = Var ) n $ ηin i=1 + = 2 nEηin √ f (x)σ 2 (x) αn = [1 + o(1)]. √ 2nx π Using a similar argument as in dealing with E|ξin (x)|2+δ in the proof of Theorem 3.2, verify that for any δ > 0, E|ηin |2+δ = n−(2+δ) EKα2+δ (x, X)E(|ε|2+δ |X = x) = O(n−(2+δ) αn(1+δ)/2 ). n Hence, sn−(2+δ) n $ i=1 E|ηin |2+δ )% √ & + αn δ/2 =O = o(1). n Hence, by the Lindeberg–Feller Central Limit Theorem (CLT), sn−1 Vn (x) →d N(0, 1). From the asymptotic results on fˆn (x) and Vn (x) in Theorem 3.2 and fact (19) about Bn (x), we obtain that 1 2 1 1 −1 sn − [Bn (x) + Vn (x)] = op (1). fˆn (x) f (x) * √ * √ This, together with the result that n/ αn · Op (1/ n αn ) = op (1), implies f (x)sn−1 % % && b(x) 1 m ˆ n (x) − m(x) − = sn−1 Vn (x) →d N(0, 1). +o αn − 1 αn √ The proof is completed by noting that f (x)sn−1 = (v(x) αn /n)−1/2 . ! 848 H.L. Koul and W. Song 6∞ Proof of Theorem 3.4 Recall that E fˆn (x) = 0 g(u; p1 , λ1 )f (u) du. By Equation (16) and the boundedness of x 2 f (( (x) on [a, b], we obtain % & 1 ˆ E fn (x) − f (x) = O , for any x ∈ [a, b]. αn Downloaded by [98.239.145.180] at 20:47 16 April 2014 Hence supa≤x≤b |E fˆn (x) − f (x)| = O(1/αn ). Therefore, we only need to show that fˆn (x) − √ 1/4 √ E fˆn (x) = o(αn log n/ n). For this purpose, let ξin (x) = n−1 [Kαn (x, Xi ) − EKαn (x, Xi )], hence ! fˆn (x) − E fˆn (x) = ni=1 ξin (x). In order to apply Bernstein inequality, we have to verify the Cram´er 2 condition for ξin , that is, we need to show that, for k ≥ 3, E|ξ1n |k ≤ cnk−2 k!Eξ1n for some cn only depending on n. Note that Kαn (x, X) can be written as Kαn (x, X) = " x #αn +2 " α x# αnαn +1 n exp − . x"(αn + 1) X X As a function of u, uαn +2 exp(−αn u) attains its maximum at u = (αn + 2)/αn . Therefore, for any x and X, by Stirling formula, % & αnαn +1 αn + 2 αn +2 Kαn (x, X) ≤ exp(−(αn + 2)) x"(αn + 1) αn ≤ (αn + 2)2 (αn + 2)αn exp(−(αn + 2)) xαn "(αn + 1) (αn + 2)2 (αn + 2)αn exp(−(αn + 2)) αn √ xαn αn 2παn e−αn (1 + o(1)) √ c αn ≤ , x = (22) for some positive constant c. Therefore, for any k ≥ 3, and αn large enough, With vn := ( or !n i=1 E|ξin |k = n−k E|Kαn (x, Xi ) − EKαn (x, Xi )|k % √ &k−2 c αn ≤ n−2 E|Kαn (x, Xi ) − EKαn (x, Xi )|2 xn % √ &k−2 c αn = Eξin2 . xn Eξin2 )1/2 , this immediately implies % √ &k−2 c αn k E|ξin | ≤ k! Eξin2 nx ∀ 1 ≤ i ≤ n, & % √ &k−2 ' (2 c αn ξin ξin k ≤ k! E ∀ 1 ≤ i ≤ n. E vn nxvn vn √ √ √ By Equation (21), vn2 = αn f (x)/2nx π + o( αn /n). 
This, together with the fact that xf (x) is bounded away from 0 and ∞ on [a, b], implies ' (k % &k−2 ' (2 ξin cαn 1/4 ξin E ≤ k! E . (23) √ vn vn n % Journal of Nonparametric Statistics 849 Then, by Equation (23) and the Bernstein inequality, for any positive number c, ) + 3 & %3 !n 2 * 3 i=1 ξin 3 c log n 3 ≥ c log n ≤ 2 exp − . P 33 √ 1/4 √ vn 3 4(1 + cαn log n/ n) 1/2 Since αn log n/n → 0, so for n large enough, 3 %3 !n & % 2 & * 3 i=1 ξin 3 c log n 3 3 P 3 ≥ c log n ≤ 2 exp − . vn 3 8 Downloaded by [98.239.145.180] at 20:47 16 April 2014 Upon taking c = 8, we have Since !∞ n=1 3 )3 n + 3$ 3 * 2 3 3 P 3 ξin 3 ≥ c log nvn = 8 . 3 3 n i=1 √ n−8 < ∞, so by the Borel–Cantelli lemma and by the fact vn2 = O( αn /n), we obtain ) 1/4 √ + n $ α log n n fˆn (x) − E fˆn (x) = ξin = o . √ n i=1 ! To bound ni=1 ξin uniformly for all x ∈ [a, b], we partition the interval [a, b] by the equally spaced points xi , i = 0, 1, 2, . . . , Nn , such that a = x0 < x1 < x2 < · · · < xNn = b, Nn = n3 . It is easily seen that 3 n 3 ) + 1/4 √ 3$ 3 2 αn log n 2Nn 3 3 P max 3 ξin (xj )3 > c ≤ 8 = 5. √ 0≤j≤Nn 3 3 n n n i=1 The Borel–Cantelli lemma implies that 3 n 3 ) 1/4 √ + 3$ 3 αn log n 3 3 max 3 ξin (xj )3 = o . √ 0≤j≤Nn 3 3 n (24) i=1 For any x ∈ [xj , xj+1 ], ξin (x) − ξin (xj ) = n−1 [Kαn (x, Xi ) − EKαn (x, Xi )] − n−1 [Kαn (xj , Xi ) − EKαn (xj , Xi )]. Then, a Taylor expansion of Kαn (x, Xi ) at x = xj up to the first order leads to the following expression for the difference Kαn (x, Xi ) − Kαn (xj , Xi ): 1 % & % & % & % &2 x − xj αn x˜ αn +2 αn x˜ αn x˜ αn +3 αn x˜ (αn + 1) exp − − exp − , "(αn + 1)αn x˜ 2 Xi Xi Xi Xi where |x − x˜ | ≤ xj+1 − xj ≤ (b − a)/Nn . Note that for p > 0, the maximum of x p e−x for x > 0 is attained at x = p and equals pp e−p . Hence, % αn x˜ Xi &αn +2 % αn x˜ Xi &αn +3 & % αn x˜ ≤ (αn + 2)αn +2 e−αn −2 , exp − Xi % αn x˜ exp − Xi & ≤ (αn + 3)αn +3 e−αn −3 . 850 H.L. Koul and W. Song Therefore, for all 1 ≤ i ≤ n, |Kαn (x, Xi ) − Kαn (xj , Xi )| ≤ (x − xj )αnαn +2 exp(−αn ) "(αn + 1)˜x 2 1% 2 & & % 2 αn +3 −2 3 αn +3 −3 × 1+ e + 1+ . e αn αn With this upper bound together with the Stirling approximation for the Gamma function, one concludes that for n and αn large enough, 3/2 Downloaded by [98.239.145.180] at 20:47 16 April 2014 |Kαn (x, Xi ) − Kαn (xj , Xi )| ≤ c(x − xj )αn , x˜ 2 for some positive constant c. Because 0 ≤ x − xj ≤ (b − a)/Nn , and x˜ > 1/a, 3/2 |Kαn (x, Xi ) − Kαn (xj , Xi )| ≤ cαn , Nn which implies that when n is large enough, for some constant c, 3/2 |ξin (x) − ξin (xj )| ≤ cαn , nNn 1 ≤ i ≤ n. These bounds imply that for all x ∈ [xj , xj+1 ] and 0 ≤ j ≤ Nn − 1, 3 n 3 ) 1/4 √ + n 3$ 3 cα 3/2 $ α log n n n 3 3 ξin (x) − ξin (xj )3 ≤ =o . √ 3 3 3 n3 n i=1 i=1 (25) (26) Finally, from Equations (24) and (26), we obtain 3 n 3 3 n 3 3$ 3 3$ 3 3 3 3 3 ξin (x)3 ≤ max 3 ξin (xj )3 sup |fˆn (x) − E fˆn (x)| = sup 3 3 0≤j≤Nn 3 3 a≤x≤b a≤x≤b 3 i=1 i=1 3 n 3 n 3$ 3 $ 3 3 + max sup 3 ξin (x) − ξin (xj )3 0≤j≤Nn −1 x∈[xj ,xj+1 ] 3 3 i=1 i=1 ) 1/4 √ + log n αn =o . √ n This, together with the result supa≤x≤b |E fˆn (x) − f (x)| = O(1/αn ), completes the proof of Theorem 3.4. ! Proof of Theorem 3.5 By Equation (21) and Theorem 3.4, it suffices to prove the following two facts: 3 3 + ) 1/4 √ % & 3 B (x) 3 1 αn log n 3 n 3 sup 3 , (27) +o √ 3=O αn n x∈[a,b] 3 fˆn (x) 3 3 3 ) 1/4 √ + % & 3 V (x) 3 log n 1 α n 3 n 3 sup 3 +o . (28) √ 3=O αn n x∈[a,b] 3 fˆn (x) 3 We shall prove Equation (28) only, the proof of Equation (27) being similar. 
851 Journal of Nonparametric Statistics Let β, η be such that β < 25 , β(2 + η) > 1, and β(1 + η) > dn dn write εi = εi1 + εi2 + µdi n , with dn εi1 = εi I(|εi | > dn ), dn εi2 = εi I(|εi | ≤ dn ) − µdi n , 2 5 and define dn = nβ . For each i, µdi n = E[εi I(|εi | ≤ dn )|Xi ]. Hence, Downloaded by [98.239.145.180] at 20:47 16 April 2014 !n !n !n dn dn dn Vn (x) i=1 Kαn (x, Xi )εi1 i=1 Kαn (x, Xi )εi2 i=1 Kαn (x, Xi )µi = !n + !n + ! . n fˆn (x) i=1 Kαn (x, Xi ) i=1 Kαn (x, Xi ) i=1 Kαn (x, Xi ) Since E(εi |Xi ) = 0, so µdi n = −E[εi I(|εi | > dn )|Xi ], then from assumption (A4), we have |µdi n | ≤ cdn−(1+η) . Hence, 3! 3 ) 1/4 + 3 n K (x, X )µdn 3 αn i 3 i=1 αn i 3 sup 3 !n . 3 ≤ cdn−(1+η) = o √ 3 3 n x∈[a,b] i=1 Kαn (x, Xi ) dn Now, consider the part involving εi1 . By the Markov inequality, ∞ $ n=1 P(|εn | > dn ) ≤ E|ε|2+η n $ 1 2+η n=1 dn < ∞. The Borel–Cantelli lemma implies that P{∃N, |εn | ≤ dn for n > N} = 1 ⇒ P{∃N, |εi | ≤ dn , i = 1, 2, . . . , n, for n > N} = 1 dn ⇒ P{∃N, εi,1 = 0, i = 1, 2, . . . , n, for n > N} = 1. Hence, 3 !n 3 dn 3 3 3 i=1 Kαn (x, Xi )εi,1 3 sup 3 !n 3 = O(n−k ) ∀k > 0. 3 3 K (x, X ) i x∈[a,b] i=1 αn dn dn , we have E[εi,2 |Xi ] = 0, and it is easy to show that For the term εi,2 dn Var(εi,2 |Xi ) = σ 2 (Xi ) + O[dn−η + dn−2(1+η) ] dn k dn 2 and for k ≥ 2, E(|εi,n | |Xi ) ≤ 2k−2 dnk−2 E(|εi,n | |Xi ). Then, from Equation (22) and the bounded2 ness of σ (x) over (0, ∞), we have dn k dn k | ≤ n−k E[Kαkn (x, X)E(|εi,n | |Xi )] E|n−1 Kαn (x, Xi )εi,2 ≤ cn−k 2k−2 dnk−2 EKαkn (x, X)σ 2 (X) % √ &k−2 cdn αn dn 2 ≤ E|n−1 Kαn (x, Xi )εi,2 | . n Because 1 E[Kα2n (x, X)σ 2 (X)][1 + o(1)] n2 √ αn f (x)σ 2 (x) = [1 + o(1)], √ 2n2 πx dn 2 | = E|n−1 Kαn (x, Xi )εi,2 852 H.L. Koul and W. Song dn the random variable n−1 Kαn (x, Xi )εi,2 satisfies the Cram´er condition. Therefore, using the Bernstein inequality as in proving Theorem 3.4, one establishes the fact that for all c > 0, ; 3 3 < n % 2 & n 3$ 3 $ < * c log n 3 dn 3 dn 2 = P 3 Kαn (x, Xi )εi,2 3 ≥ c log n E[Kαn (x, Xi )εi,2 ] ≤ 2 exp − . 3 3 8 i=1 i=1 Downloaded by [98.239.145.180] at 20:47 16 April 2014 * √ Take c = 4 and C(x) = c f (x)σ 2 (x)/(2x π) in the above inequality to obtain @ 3 3 n 1/2 31 $ 3 α log n n 3 dn 3 ≤ 2, P 3 Kαn (x, Xi )εi,2 3 ≥ C(x) 3n 3 n n2 i=1 by the Borel–Cantelli Lemma and the boundedness of f (x)σ 2 (x)/x over x ∈ [a, b], this implies, for each x ∈ [a, b], 3 n 3 ) 1/4 √ + 31 $ 3 α log n n 3 dn 3 Kαn (x, Xi )εi,2 . √ 3 3=o 3 3n n i=1 To show the above bound is indeed uniform, we can use the similar technique as in showing the uniform convergence of fˆn (x) as in the proof of Theorem 3.4. In fact, the only major difference is that, instead of using Equation (25), we should use the inequality dn dn |Kαn (x, Xi )εi,2 − Kαn (xj , Xi )εi,2 |≤ 3/2 cαn dn , Nn x ∈ [xj , xj+1 ], 1 ≤ i ≤ n. The above result, together with the facts that f (x) is bounded below from 0 on [a, b], and supx∈[a,b] |fˆn (x) − f (x)| = o(1), implies 3 !n 3 ) 1/4 √ + dn 3 3 K (x, X )ε log n α α i n 3 i=1 n i,2 3 sup 3 !n , a.s. √ 3=o 3 K (x, X ) n α i x∈[a,b] 3 n i=1 This concludes the proof of Theorem 3.5. ! Acknowledgements The authors gratefully acknowledge the editors and two referees for their helpful comments which improved the presentation of the paper. Research supported in part by the NSF DMS Collaborative Grants 1205271 and 1205276. References Abadir, K.M., and Lawford, S. (2004), ‘Optimal Asymmetric Kernels’, Economics Letters, 83, 61–68. Bouezmarni, T., and Rolin, J. 
(2003), ‘Consistency of the Beta Kernel Density Function Estimator’, Canadian Journal of Statistics, 31, 89–98.
Chaubey, Y.P., Sen, A., and Sen, P.K. (2012), ‘A New Smooth Density Estimator for Non-Negative Random Variables’, Journal of Indian Statistical Association, 50, 83–104.
Chen, S.X. (1999), ‘Beta Kernel Estimators for Density Functions’, Computational Statistics & Data Analysis, 31, 131–145.
Chen, S.X. (2000a), ‘Beta Kernel Smoothers for Regression Curves’, Statistica Sinica, 10, 73–91.
Chen, S.X. (2000b), ‘Probability Density Function Estimation Using Gamma Kernels’, Annals of the Institute of Statistical Mathematics, 52, 471–480.
Chen, S.X. (2002), ‘Local Linear Smoothers Using Asymmetric Kernels’, Annals of the Institute of Statistical Mathematics, 54, 312–323.
Cline, D.B. (1988), ‘Admissible Kernel Estimators of a Multivariate Density’, The Annals of Statistics, 16, 1421–1427.
Cowling, A., and Hall, P. (1996), ‘On Pseudodata Methods for Removing Boundary Effects in Kernel Density Estimation’, Journal of the Royal Statistical Society. Series B (Methodological), 58, 551–563.
Fan, J. (1993), ‘Local Linear Regression Smoothers and Their Minimax Efficiencies’, The Annals of Statistics, 21, 196–216.
Fan, J., and Gijbels, I. (1992), ‘Variable Bandwidth and Local Linear Regression Smoothers’, The Annals of Statistics, 20, 2008–2036.
Gasser, T., and Müller, H.G. (1979), ‘Kernel Estimation of Regression Functions’, in Smoothing Techniques for Curve Estimation (Vol. 757), Lecture Notes in Mathematics, eds. T. Gasser and M. Rosenblatt, Berlin: Springer Berlin Heidelberg, pp. 23–68.
Härdle, W., Hall, P., and Marron, J.S. (1988), ‘How Far Are Automatically Chosen Regression Smoothing Parameters from Their Optimum?’, Journal of the American Statistical Association, 83, 86–95.
Härdle, W., Hall, P., and Marron, J. (1992), ‘Regression Smoothing Parameters that Are Not Far from Their Optimum’, Journal of the American Statistical Association, 87, 227–233.
Härdle, W., Müller, M., Sperlich, S., and Werwatz, A. (2004), Nonparametric and Semiparametric Models, Berlin Heidelberg: Springer Verlag.
Hart, J.D. (1997), Nonparametric Smoothing and Lack-of-Fit Tests, New York: Springer.
John, R. (1984), ‘Boundary Modification for Kernel Regression’, Communications in Statistics – Theory and Methods, 13, 893–900.
Jones, M. (1993), ‘Simple Boundary Correction for Kernel Density Estimation’, Statistics and Computing, 3, 135–146.
Jones, M., and Henderson, D. (2007), ‘Kernel-Type Density Estimation on the Unit Interval’, Biometrika, 94, 977–984.
Kotz, S., Balakrishnan, N., and Johnson, N.L. (2000), Continuous Multivariate Distributions, Models and Applications (Vol. 1), New York: John Wiley & Sons, Inc.
Marron, J.S., and Ruppert, D. (1994), ‘Transformations to Reduce Boundary Bias in Kernel Density Estimation’, Journal of the Royal Statistical Society. Series B (Methodological), 56, 653–671.
Mnatsakanov, R., and Sarkisian, K. (2012), ‘Varying Kernel Density Estimation on ℝ+’, Statistics & Probability Letters, 82, 1337–1345.
Müller, H.G. (1991), ‘Smooth Optimum Kernel Estimators Near Endpoints’, Biometrika, 78, 521–530.
Müller, H.G., and Wang, J.L. (1994), ‘Hazard Rate Estimation Under Random Censoring with Varying Kernels and Bandwidths’, Biometrics, 50, 61–76.
Scaillet, O. (2004), ‘Density Estimation Using Inverse and Reciprocal Inverse Gaussian Kernels’, Nonparametric Statistics, 16, 217–226.
Schuster, E.F.
(1985), ‘Incorporating Support Constraints into Nonparametric Estimators of Densities’, Communications in Statistics – Theory and Methods, 14, 1123–1136. Wand, M.P., and Jones, M.C. (1994), Kernel Smoothing (Vol. 60), Boca Raton, FL: Chapman & Hall, CRC Press.