Izenman 1991
To cite this article: Alan Julian Izenman (1991) Review Papers: Recent Developments in
Nonparametric Density Estimation, Journal of the American Statistical Association, 86:413,
205-224, DOI: 10.1080/01621459.1991.10475021
Advances in computation and the fast and cheap computational facilities now available to statisticians have had a significant impact upon statistical research, and especially the development of nonparametric data analysis procedures. In particular, theoretical and applied research on nonparametric density estimation has had a noticeable influence on related topics, such as nonparametric regression, nonparametric discrimination, and nonparametric pattern recognition. This article reviews recent developments in nonparametric density estimation and includes topics that have been omitted from review articles and books on the subject. The early density estimation methods, such as the histogram, kernel estimators, and orthogonal series estimators are still very popular, and recent research on them is described. Different types of restricted maximum likelihood density estimators, including order-restricted estimators, maximum penalized likelihood estimators, and sieve estimators, are discussed, where restrictions are imposed upon the class of densities or on the form of the likelihood function. Nonparametric density estimators that are data-adaptive and lead to locally smoothed estimators are also discussed; these include variable partition histograms, estimators based on statistically equivalent blocks, nearest-neighbor estimators, variable kernel estimators, and adaptive kernel estimators. For the multivariate case, extensions of methods of univariate density estimation are usually straightforward but can be computationally expensive. A method of multivariate density estimation that did not spring from a univariate generalization is described, namely, projection pursuit density estimation, in which both dimensionality reduction and density estimation can be pursued at the same time. Finally, some areas of related research are mentioned, such as nonparametric estimation of functionals of a density, robust parametric estimation, semiparametric models, and density estimation for censored and incomplete data, directional and spherical data, and density estimation for dependent sequences of observations.
KEY WORDS: Adaptive estimators; Censored data; Delta sequences; Directional data; Histograms; Kernel estimators; Maximum penalized likelihood; Method of sieves; Multivariate density estimation; Nearest neighbor methods; Order-restricted maximum likelihood methods; Orthogonal series; Projection pursuit density estimation; Statistically equivalent blocks.
206 Journal of the American Statistical Association, March 1991
Figure 1. Gaussian Kernel Density Estimates of (a) Resting Heart Rate and (b) Maximum Heart Rate Following Exercise for a Group of 117
Male Heart Patients (Dotted Lines) and for a Group of 117 Age-Matched Male "Normals" (Solid Lines) in a Study of Coronary Heart Disease
(Kasser and Bruce 1969). For each density estimate, the window-width was taken to reflect sample variation. Note especially the bimodal
density estimate for maximum heart rate for the patient group and the bimodal density estimate for resting heart rate for the normal group.
Source of data: Kronmal and Tarter (1973).
timates effective in the following situations: (a) In exploratory analysis, descriptive features of the density estimate, such as multimodality, tail behavior, and skewness, are of special interest, and a nonparametric approach may be more
[Figure panels (a) and (b); axes: heart rate and maximum heart rate.]
Figure 3. Three-Dimensional Perspective Plots of Bivariate Gaussian Kernel Density Estimates of Resting Heart Rate and Maximum Heart
Rate From Figure 1. The normals group is displayed in (a) and the patient group in (b).
flexible than the traditional parametric methods; (b) in confirmatory analysis, nonparametric density estimates are used in decision making, such as nonparametric discrimination and classification analysis, testing for modes, and random variate testing; and (c) for presentational purposes, statistical peculiarities of the data often can be readily explained to clients through simple graphical displays of estimated density curves (see Silverman 1981a). There is a very revealing example of (a) by Park and Marron (1990) where they display a sequence of annual lognormal density estimates for net income data that indicated unimodal densities hardly changing from year to year, while nonparametric density estimates indicated at least two modes and significant changes in shape over time. Further published applications of nonparametric density estimation can be found listed and briefly described in Table 1.

The last two decades have seen a consolidation and a critical assessment of nonparametric density estimation methods. Several review articles (Bean and Tsokos 1980; Fryer 1977; Leonard 1978; Rosenblatt 1971; Tarter and Kronmal 1976; and Wegman 1972, 1982) and an extensive bibliography (Wertz and Schneider 1979) were published, as well as nine books (Devroye 1987; Devroye and Gyorfi 1985; Hand 1982; Nadarya 1989; Prakasa Rao 1983; Silverman 1986; Tapia and Thompson 1978; Van Es 1990; and Wertz 1978); certain books emphasized density estimation methods preferred by the authors, while others were more comprehensive in their treatment of the diverse material. As with most statistical research, much of what has been written on the subject of nonparametric density estimation, including most of these books, has been completely theoretical; some books (such as Silverman 1986), however, contain discussions of real-data examples, simulation studies, and computational issues. References to JASA reviews of some of these books are listed in Table 2. See also the book review by Silverman (1985). The successful development of nonparametric density estimation techniques led, in turn, to the formulation of nonparametric regression (Eubank 1988; Muller 1988; Nadarya 1989), including the nonparametric analysis of growth curves, and nonparametric statistical pattern recognition (Devijver and Kittler 1982; Fukunaga 1972, chap. 6).

This article surveys recent developments in nonparametric density estimation, as well as topics that were omitted from previous review articles and books. Section 2 discusses desirable statistical properties of nonparametric den-
Reference | Application | Method | Data and objective
Silverman (1978c) | Identifying the causes of "cot death" | MPL | Univariate data; assessing bimodality
Scott, Gotto, Cole, and Gorry (1978) | Coronary heart disease | Kernel | Bivariate data; classification problem
Good and Gaskins (1980) | High-energy physics and "bump-hunting" | MPL | Univariate grouped data; assessing a bump in a mass spectrum histogram
Dubuisson and Lavison (1982) | Surveillance of a nuclear reactor | Kernel | Multivariate data; classification problem
Scott and Thompson (1983) | Remote sensing of satellite agricultural crop data | ASH | Trivariate data; exploratory analysis
Aitchison and Lauder (1985) | Compositional data for geology and consumer demand analysis | Kernel | Multivariate data vectors of proportions summing to unity
De Jager, Swanepoel, and Raubenheimer (1986) | Gamma-ray astronomy for estimating light curves and identifying periodic sources | Kernel | Univariate data; assessing whether light curve differs from uniform density
Izenman and Sommer (1988) | Identifying the components of a philatelic mixture | Kernel | Univariate data; assessing multimodality; comparison with parametric mixture
Book | JASA review | Reviewer | Comments
Wertz (1978) | JASA, 75 (1980), 241 | K.-S. Lii | Emphasizes kernel methods; theoretical
Tapia and Thompson (1978) | no JASA review | | Emphasizes MPL method; theoretical; Monte Carlo simulations
Hand (1982) | JASA, 78 (1983), 990-991 | J. D. Knoke | Kernel methods only; some applications; univariate and multivariate approaches
Prakasa Rao (1983) | JASA, 81 (1986), 264 | V. Susarla | Comprehensive; theoretical; applications to different topics
Devroye and Gyorfi (1985) | JASA, 82 (1987), 344 | J. R. Thompson | Comprehensive; theoretical; L1 viewpoint
Silverman (1986) | JASA, 83 (1988), 269-270 | A. J. Izenman | Comprehensive; numerous real-data applications; univariate and multivariate approaches; computational details
Devroye (1987) | no JASA review | | Emphasizes kernel methods; theoretical; L1 viewpoint
Nadarya (1989) | JASA, 85 (1990), 598 | D. W. Scott | Emphasizes kernel methods; theoretical
sity estimates, followed in Sections 3-9 by reviews of the various estimation methods. Finally, in Section 10, some remarks are made about related research areas. Note that the references, though numerous, should not be regarded as exhaustive.

2. STATISTICAL PROPERTIES OF DENSITY ESTIMATORS

Like any statistical procedure, nonparametric density estimators are recommended only if they possess desirable properties. Finite-sample properties of nonparametric density estimators are available for special situations (Deheuvels 1977; Fryer 1976), but, in general, research emphasis has settled on developing large-sample properties.

2.1 Unbiasedness

Consider, for example, unbiasedness. An estimator $\hat f$ of a probability density function $f$ is unbiased for $f$ if, for all $x \in R^d$, $E_f[\hat f(x)] = f(x)$. Although unbiased estimators of parametric densities, such as the normal, Poisson, exponential, and geometric, do exist (Ghurye and Olkin 1969), no bona fide density estimator [that is, satisfying (1.1)] can exist that is unbiased for all continuous densities (Rosenblatt 1956). Hence attention has since focused on sequences $\{\hat f_n\}$ of nonparametric density estimators that are asymptotically unbiased for $f$; that is, for all $x \in R^d$, $E_f[\hat f_n(x)] \to f(x)$ as $n \to \infty$.

2.2 Consistency

A more important property is consistency. The simplest notion of consistency of a density estimator is where $\hat f$ is (weakly) pointwise consistent for a univariate $f$ if $\hat f(x) \to f(x)$ in probability for every $x \in R$, and is strongly pointwise consistent for $f$ if convergence holds almost surely. Other types of consistency depend upon the error criterion ($L_1$ or $L_2$, in general); see Hall (1989b).

The $L_2$ Approach. If $f$ is assumed square integrable, then the performance of $\hat f$ at $x \in R$ is measured by the mean squared error,

$$\mathrm{MSE}(x) = E_f[\hat f(x) - f(x)]^2 = \operatorname{var}[\hat f(x)] + \{\operatorname{bias}[\hat f(x)]\}^2, \tag{2.1}$$

where $\operatorname{var}[\hat f(x)] = E_f[\hat f(x) - E_f\hat f(x)]^2$ and $\operatorname{bias}[\hat f(x)] = E_f[\hat f(x)] - f(x)$. If $\mathrm{MSE}(x) \to 0$ for all $x \in R$ as $n \to \infty$, then $\hat f$ is said to be a pointwise consistent estimator of $f$ in quadratic mean. A more important performance criterion relates to how well the entire curve $\hat f$ estimates $f$. One such measure of goodness of fit is found by integrating (2.1) over all values of $x$, yielding the integrated mean squared error,

$$\mathrm{IMSE} = \int_{-\infty}^{\infty} E_f[\hat f(x) - f(x)]^2\,dx. \tag{2.2}$$

Another measure commonly used is integrated squared error (or $L_2$ norm),

$$\mathrm{ISE} = \int_{-\infty}^{\infty} [\hat f(x) - f(x)]^2\,dx. \tag{2.3}$$

Taking expectations over $f$ in (2.3) gives the mean integrated squared error, $\mathrm{MISE} = E_f(\mathrm{ISE})$. Note that MISE = IMSE. ISE is often preferred as a criterion, rather than its expected value MISE, since ISE determines how closely $\hat f$ approximates $f$ for a given data set, whereas MISE is concerned with the average over all possible data sets. Under mild conditions, ISE has been shown to be a reasonably random approximation to MISE (Marron and Hardle 1986), while, in certain situations, MISE may actually be a better performance criterion than ISE (Hall and Marron 1988). Farrell (1972) showed that for bona fide density estimates, the best possible asymptotic rate of convergence for MISE is $O(n^{-4/5})$, and Boyd and Steele (1978) proved that no $\hat f$ can exist with a MISE better than $O(n^{-1})$, even if $f$ is a normal density.

The $L_1$ Approach. One problem with the $L_2$ approach to nonparametric density estimation is that the tail behavior of a density becomes less important, possibly resulting in peculiarities in the tails of the density estimate. Further objections to the $L_2$ approach can be found in Donoho and Johnstone (1989). In two books (Devroye 1987; Devroye and Gyorfi 1985), and in a host of articles, an alternative $L_1$ theory of nonparametric density estimation was vigorously pursued by Devroye and his colleagues. Specifically, Devroye and Gyorfi (1985, p. 1) claimed that $L_1$ is "the
Izenman: Recent Developments in Nonparametric Density Estimation 209
natural space for densities," and showed that the integrated absolute error (also known as the total variation or the $L_1$ norm),

$\sum_{i=1} N_i = n$. Then, the histogram, defined by

$$\hat f(x) = \sum_{i=1} \frac{N_i/n}{t_{n,i+1} - t_{n,i}}\, I_{T_i}(x), \tag{3.1}$$
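As a rough illustration of (3.1), the following Python sketch computes a fixed-width histogram density estimate, taking the Freedman-Diaconis width $2(\mathrm{IQR})n^{-1/3}$ discussed in this section as the bin width. The function names (`fd_bin_width`, `histogram_density`), the choice of the sample minimum as bin origin, and the simulated data are assumptions made for the example and do not come from the article.

```python
import numpy as np

def fd_bin_width(x):
    """Freedman-Diaconis bin width: 2 * IQR * n^(-1/3)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)

def histogram_density(x, h):
    """Histogram estimate in the spirit of (3.1): (N_i / n) / (bin width).

    Assumption: equal-width bins starting at the sample minimum; the
    bin origin is an arbitrary choice made for this sketch.
    Returns (edges, heights).
    """
    x = np.asarray(x, dtype=float)
    lo = x.min()
    # One extra bin guarantees that the last edge lies beyond the maximum.
    n_bins = int(np.floor((x.max() - lo) / h)) + 1
    edges = lo + h * np.arange(n_bins + 1)
    counts, _ = np.histogram(x, bins=edges)
    heights = counts / (x.size * np.diff(edges))
    return edges, heights

# Hypothetical data, simulated only to exercise the functions.
rng = np.random.default_rng(0)
sample = rng.normal(loc=70.0, scale=10.0, size=117)
h = fd_bin_width(sample)
edges, heights = histogram_density(sample, h)
```

Because each height is a bin proportion divided by the bin width, the heights times the widths sum to one, so the estimate is a bona fide density whatever bin width is chosen.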
well for Gaussian samples, while it led to overly large bin widths and hence oversmoothing otherwise. Freedman and Diaconis (1981b) suggested a "simple, robust rule [that] often gives quite reasonable results," namely, $\hat h_n = 2(\mathrm{IQR})n^{-1/3}$, where IQR is the interquartile range of the data. Numerical comparisons by Emerson and Hoaglin (1983) of the Scott and Freedman-Diaconis rules showed the Freedman-Diaconis rule led to narrower bin widths, although "in practical applications the two rules will often lead to the same choice of interval width." Terrell and Scott (1985) and Terrell (1990) argued that $h_n$ should be chosen conservatively by restricting the choice of bin width to the value that yields the smoothest density, subject to a given measure of spread (such as the standard deviation or range). Information-based methods for the histogram were studied by Taylor (1987), who used Akaike's information criterion for determining an optimal histogram bin width, and by Rodriguez and van Ryzin (1985), who defined maximum entropy histograms. Scott (1988) studied hexagonal and square bin shapes for bivariate histograms.

3.4 Related Estimators

By modifying the block-like shape of the histogram, a faster rate of IMSE convergence of $O(n^{-4/5})$ (or close to it) can be attained by the following estimators.

The averaged shifted histogram (ASH) of Scott and Thompson (1983) and Scott (1985a) is constructed by averaging several histograms with equal bin widths but different bin locations and was motivated by the need to resolve the problem of a choice of bin origin; its computational efficiency in the multivariate case has made the ASH popular among many researchers.

The classical frequency polygon (FP), studied by Scott (1985b), is constructed by connecting the mid-bin values of the histogram with straight lines. The FP was especially recommended for interpolating the ASH, leading to the ASH-FP. Jones (1989) studied discretization and interpolation problems related to the ASH and ASH-FP.

The histospline of Boneva, Kendall, and Stefanov (1971) is a cardinal quadratic spline fitted to the histogram and is obtained by interpolating the knots of the sample distribution function $F_n(x) = n^{-1}\sum_{i=1}^{n} I_{[X_i \le x]}$ and then differentiating the cubic spline estimator of the distribution function $F$.

A weighted histogram estimator of $f$, also referred to as a Bernstein polynomial-type approximation, was proposed by Vitale (1975) and Gawronski and Stadtmuller (1980), where the bin counts were weighted by empirical Poisson probabilities.

4. KERNEL DENSITY ESTIMATION

The multivariate kernel density estimator of $f$ has the form

$$\hat f_h(x) = (nh^d)^{-1} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h}\right), \tag{4.1}$$

where the choice of kernel function $K$ and the window width $h = h_n > 0$ determine the performance of $\hat f_h$ as an estimator of $f$. It is interesting to note that Cacoullos (1966) appears to have been the first to call $K$ in (4.1) a kernel function; previously, $K$ was referred to as a weight function. Note that the same amount of smoothing is used in (4.1) for each of the $d$ dimensions. The fast Fourier transform is recommended for computing (4.1) in the univariate case ($d = 1$); see Silverman (1982a) and Jones and Lotwick (1984). Since (4.1) shows that $\hat f_h$ inherits whatever properties the kernel $K$ possesses, it is important that $K$ have desirable properties.

The simplest class of kernels consists of probability density functions that satisfy

$$K(x) \ge 0, \qquad \int_{R^d} K(x)\,dx = 1. \tag{4.2}$$

If a kernel $K$ from this class is used in (4.1), then $\hat f_h$ will always be a bona fide probability density. Popular choices of univariate kernels include the Gaussian kernel with unbounded support,

$$K(x) = (2\pi)^{-1/2} e^{-x^2/2}, \qquad x \in R, \tag{4.3}$$

and the compactly supported "polynomial" kernels,

$$K_{rs}(x) = \kappa_{rs}\,(1 - |x|^r)^s\, I_{[|x| \le 1]}, \qquad \kappa_{rs} = \frac{r}{2\,\mathrm{Beta}(s + 1, 1/r)}, \qquad r > 0,\ s \ge 0. \tag{4.4}$$

The rectangular kernel obtains in (4.4) if $s = 0$ ($\kappa_{r0} = 1/2$); the triangular kernel if $r = 1$, $s = 1$ ($\kappa_{11} = 1$); the Bartlett-Epanechnikov kernel if $r = 2$, $s = 1$ ($\kappa_{21} = 3/4$); the biweight kernel if $r = 2$, $s = 2$ ($\kappa_{22} = 15/16$); the triweight kernel if $r = 2$, $s = 3$ ($\kappa_{23} = 35/32$); and, after a suitable rescaling, the Gaussian kernel if $r = 2$, $s = \infty$. The triangular kernel density estimate is asymptotically related to the ASH since the former is obtained as a limit of the latter as the number of shifted histograms becomes infinite. For $x \in R^d$, multivariate kernels are usually radially symmetric unimodal densities such as the Gaussian, $K(x) = (2\pi)^{-d/2} e^{-(1/2)x^{T}x}$, and the Bartlett-Epanechnikov, $K(x) = ((d + 2)/2c_d)(1 - x^{T}x)\,I_{[x^{T}x \le 1]}$, $c_d = \pi^{d/2}/\Gamma((d/2) + 1)$.

In certain situations (Cacoullos 1966), product kernels may be appropriate, where $K(x) = \prod_{j=1}^{d} K(x_j)$ is a product of univariate kernel functions. For example, Figures 2 and 3 were computed using bivariate product Gaussian kernel density estimates. In a similar study, Scott, Gotto, Cole, and Gorry (1978) used bivariate product biweight kernel density estimates.

4.1 Statistical Properties

Deriving asymptotic properties of kernel density estimates depends on the particular viewpoint considered. Devroye (1983), using the $L_1$ approach, proved the remarkably simple result that if $K$ satisfies (4.2), then the kernel estimator (4.1) will be a strongly consistent estimator of $f$ if and only if $h_n \to 0$ and $nh_n^d \to \infty$, as $n \to \infty$, without any conditions on $f$. Devroye and Penrod (1984) also showed that, for the univariate case, MIAE was of order $O(n^{-2/5})$, better than the $L_1$ rate for histograms. Explicit formulas for minimum MIAE and asymptotically optimal smoothing
parameters for kernel estimators were obtained by Hall and Wand (1988).

For the $L_2$ approach, under regularity conditions on $K$ and $f$, Parzen (1962) showed that if $h_n \to 0$ as $n \to \infty$, then the univariate kernel estimator was both asymptotically unbiased and asymptotically normal. Cacoullos (1966) showed that the asymptotic expression for IMSE for the $d$-dimensional case was minimized over all $h$ satisfying the above conditions by $h^{*}_{\mathrm{IMSE}} = \alpha(K)\beta(f)n^{-1/(d+4)}$, where $\alpha(K)$ depends only on the kernel $K$ and $\beta(f)$ depends only on $f$; furthermore, $\mathrm{IMSE} \to 0$ at rate $O(n^{-4/(d+4)})$. The results show clearly the dimensionality effect, since these convergence rates become slower as $d$ increases. In the univariate case, if $K$ is the standard Gaussian kernel (4.3) and $f$ is a Gaussian density with variance $\sigma^2$, then $h^{*}_{\mathrm{IMSE}} = 1.06\sigma n^{-1/5}$ would be the optimal window width. Additional consistency results were obtained by Hall and Hannan (1988).

4.2 Choice of Kernel

It has been known for some time that although the Bartlett-Epanechnikov kernel minimizes the optimal asymptotic IMSE with respect to $K$, IMSE is quite insensitive to the shape of the kernel. Marron and Nolan (1987) gave further results in this direction. As a result, more exotic types of kernels are now being studied. The most important of these developments concerns a hierarchy of classes of kernels defined by the existence of certain moments of $K$. In this scheme, those univariate symmetric kernels $K$ that integrate to unity are called order 0 kernels, while order $s$ kernels, for some positive integer $s$, are those order 0 kernels whose first $s - 1$ moments vanish but whose $s$th moment is finite. Thus second-order kernels have zero mean and finite variance and include all compactly supported kernels. Order $s$ kernels, for $s \ge 3$, have zero variance, which can be achieved only if $K$ takes on negative values. Such kernels are important for bias reduction and improving the IMSE convergence rate. For example, if $K$ is an order $s$ kernel, then the fastest asymptotic rate of MSE convergence of $\hat f$ to $f$ is $O(n^{-2s/(2s+1)})$; thus, for a fourth-order kernel, which cannot be nonnegative, the minimum asymptotic MSE convergence rate of $\hat f$ to $f$ is of order $O(n^{-8/9})$, which is faster than the best such rate, $O(n^{-4/5})$, for nonnegative kernels (see Gasser, Muller, and Mammitzsch 1985). Hall and Marron (1988) considered optimal selection of the order $s$. Cline (1988) defined the admissibility of kernel estimators and showed that while the Bartlett-Epanechnikov kernel is not admissible among all kernels, it is admissible among all nonnegative kernels.

4.3 Choice of Window Width

Early work on the kernel method emphasized asymptotic results, whereas determining an optimal $h$ is the main research focus today. Since the optimal window width, $h^{*}_{\mathrm{IMSE}}$, depends explicitly on the unknown $f$ through $\beta(f)$, it cannot be computed exactly. Several "plug-in" procedures were proposed whereby $\beta(\hat f)$ was used to estimate $\beta(f)$, but these were generally unsatisfactory (e.g., see Scott and Terrell 1987).

An automatic method for determining the optimal window width is cross-validation (CV). The basic algorithm involves removing a single value, say $X_i$, from the sample, computing the appropriate density estimate at that $X_i$ from the remaining $n - 1$ sample values,

$$\hat f_{h,i}(X_i) = \frac{1}{(n - 1)h} \sum_{j \ne i} K\!\left(\frac{X_i - X_j}{h}\right), \tag{4.5}$$

and then choosing $h$ to optimize some given criterion involving all values of $\hat f_{h,i}(X_i)$ ($i = 1, 2, \ldots, n$). Two different versions of CV have been used in density estimation: likelihood cross-validation and least squares cross-validation. For likelihood cross-validation, $h_{\mathrm{LCV}}$ is that $h$ that maximizes the "pseudo-likelihood" $L(h) = \prod_{i=1}^{n} \hat f_{h,i}(X_i)$. For least squares cross-validation, $h_{\mathrm{LSCV}}$ is that $h$ that minimizes $LS(h) = R(\hat f_h) - (2/n)\sum_{i=1}^{n} \hat f_{h,i}(X_i)$, which is exactly unbiased for $\mathrm{MISE} - R(f)$. Marron (1987b) provided an excellent survey of these and other automatic smoothing parameter methods.

Mixed results have been obtained for CV methods in kernel density estimation. It has been shown, for example, that when using compactly supported kernels [such as (4.4)], likelihood CV produces consistent estimates of compactly supported densities (Chow, Geman, and Wu 1983) but does not necessarily do so for estimating infinitely supported densities (Schuster and Gregory 1981). The complex influence that the tails of both $K$ and $f$ have on likelihood CV was studied by Hall (1987a) in terms of the Kullback-Leibler norm. Broniatowski, Deheuvels, and Devroye (1989) related such convergence problems to the stability of the extreme order statistics. Simulation studies by Scott and Factor (1981) indicated that, depending upon the type of kernel employed, likelihood CV could lead to either a severely undersmoothed or oversmoothed density estimate. Furthermore, the criterion $L(h)$ was found to be very sensitive to outliers. Obvious modifications of $L(h)$, including truncating $f$, have been considered; see Hall (1982) and Marron (1985).

Least squares CV does not seem to display the peculiar behavior exhibited by likelihood CV. Indeed, very mild tail conditions on $f$ and $K$ are needed to prove asymptotic optimality results for least squares CV. See, for example, Hall (1983a) and Stone (1984), who showed that $h_{\mathrm{LSCV}}$ asymptotically minimized ISE. Bowman (1984) also showed, via simulation, that least squares CV achieved satisfactory results for long-tailed $f$. Hall and Marron (1987a, b) proved that $h_{\mathrm{LSCV}}$ performed asymptotically as well as the optimal (but unattainable) window width $h_{\mathrm{IMSE}}$; they then went on to show that although $h_{\mathrm{LSCV}}$ converged very slowly, the least squares CV choice of window width could not be improved upon asymptotically. Scott and Terrell (1987) introduced a version of the criterion $LS(h)$ that was biased for MISE and showed that although large asymptotic performance gains could be obtained from such a biased CV procedure, no currently available (biased or unbiased) CV procedure could be considered highly reliable for very small samples.

The high sampling variability of CV estimates led Terrell (1990) to propose that the smoothest density estimate be chosen that is compatible with the estimated scale of the density. Taylor (1989) and Hall (1990) showed that the bootstrap also works well for selecting $h$ in large samples and if resampling is carried out with a reduced sample size.

ming algorithm, but gave no asymptotic rate of convergence for the estimator.
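To make the window-width discussion of Section 4.3 concrete, the sketch below evaluates the univariate Gaussian kernel estimate (4.1), the Gaussian-reference width $1.06\sigma n^{-1/5}$ (with $\sigma$ replaced by the sample standard deviation), and the least squares CV criterion $LS(h)$ built from the leave-one-out values (4.5). It is a minimal illustration under stated assumptions: $R(g)$ is taken to mean $\int g^2$, which for the Gaussian kernel has the closed form used in the code; the grid search, the function names, and the simulated data are all hypothetical choices, not the article's.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel (4.3)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_eval, data, h):
    """Univariate kernel estimate (4.1) with d = 1 and a Gaussian kernel."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (data.size * h)

def normal_reference_width(data):
    """1.06 * s * n^(-1/5); s is the sample s.d. standing in for sigma."""
    data = np.asarray(data, dtype=float)
    return 1.06 * data.std(ddof=1) * data.size ** (-0.2)

def lscv_score(data, h):
    """LS(h) = R(f_h) - (2/n) * sum_i f_{h,i}(X_i), Gaussian kernel only.

    Assumption: R(g) is the integral of g^2. For the Gaussian kernel,
    R(f_h) = (n^2 h)^(-1) * sum_{i,j} phi_2((X_i - X_j) / h), where
    phi_2 is the N(0, 2) density (the kernel convolved with itself).
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    u = (data[:, None] - data[None, :]) / h
    r_fh = (np.exp(-0.25 * u**2) / np.sqrt(4.0 * np.pi)).sum() / (n**2 * h)
    # Leave-one-out values (4.5): drop the j = i term, which equals K(0).
    loo = (gaussian_kernel(u).sum(axis=1) - gaussian_kernel(0.0)) / ((n - 1) * h)
    return r_fh - 2.0 * loo.mean()

def lscv_width(data, grid):
    """Grid minimizer of LS(h); the grid itself is a user choice."""
    scores = [lscv_score(data, h) for h in grid]
    return grid[int(np.argmin(scores))]

# Hypothetical data, simulated only to exercise the functions.
rng = np.random.default_rng(1)
data = rng.normal(size=200)
h_ref = normal_reference_width(data)
h_cv = lscv_width(data, np.linspace(0.05, 1.5, 60))
```

The two widths need not agree on a given sample, which is consistent with the high sampling variability of CV choices noted above; a grid minimizer is also only as fine as the grid supplied.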
Gessaman 1972). Each probability square will be the union of about $k_n$ se-blocks and, therefore, will contain about $k_n$ observations. If $B_n$ is a bounded probability square and $x \in B_n$, set

$$\hat f(x) = \frac{k_n/(n + 1)}{\operatorname{area}(B_n)}. \tag{5.2}$$

On unbounded probability squares, estimate $f$ as 0. Gessaman (1970) showed that if $k_n \to \infty$ and $k_n/n \to 0$ as $n \to \infty$, then the estimator (5.2) was weakly consistent for $f$. Convergence rates and some optimal choice for $k_n$ in (5.2) have yet to be determined, however.

5.3 Nearest Neighbor Methods

Fix and Hodges (1951) proposed the nearest neighbor estimator in the context of nonparametric discrimination. See Silverman and Jones (1988) for a modern interpretation. At a fixed point $x$ and for fixed integer $k$, let $D_k(x)$ be the Euclidean distance from $x$ to its $k$th nearest neighbor among the $X_1, X_2, \ldots, X_n$, and let $\operatorname{vol}_k(x) = c_d[D_k(x)]^d$ be the volume of the $d$-dimensional sphere of radius $D_k(x)$, where $c_d$ is the volume of the unit $d$-dimensional sphere. The $k$th nearest neighbor ($k$-NN) density estimator is then given by

$$\hat f(x) = \frac{k/n}{\operatorname{vol}_k(x)}. \tag{5.3}$$

Tukey and Tukey (1981, sec. 11.3.2) called (5.3) the balloon density estimate of $f$. An advantage of the $k$-NN estimator is that it is always positive, even in regions of sparse data. Loftsgaarden and Quesenberry (1965) proved (5.3) was consistent if $k = k_n \to \infty$ and $k_n/n \to 0$ as $n \to \infty$. Abramson (1984) proposed that in the $d$-dimensional case, $k_n$ should be chosen proportional to $n^{4/(d+4)}$, the constant of proportionality depending on $x$. The $k$-NN estimator (5.3) can be written as a kernel density estimator by setting

$$\hat f(x) = \frac{1}{n[D_k(x)]^d} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{D_k(x)}\right), \tag{5.4}$$

where the smoothing parameter is now $k$ and the kernel $K$ is the rectangular kernel. Moore and Yackel (1977) and Mack and Rosenblatt (1979) analyzed the bias and variance of (5.3). Rosenblatt (1979) studied the global behavior of generalized nearest neighbor estimates of $f$. See also Mack (1980) and Abramson (1984). Although the $k$-NN estimator appeared reasonable for estimating a density at a point, it was not particularly successful for estimating the entire density function $f$. Indeed, the estimator was not a bona fide density since (5.3) was discontinuous and had an infinite integral due to very heavy tails. Devroye and Gyorfi (1985, p. 21) noted that, because of these difficulties, "it is impossible to study its properties in $L_1$."

The variable kernel estimator, which was an attempt to avoid the problems associated with the $k$-NN estimator, was defined by setting

$$\hat f(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{H_{jk}^{d}}\, K\!\left(\frac{x - X_j}{H_{jk}}\right), \tag{5.5}$$

where the variable window width $H_{jk} = hD_k(X_j)$ does not depend on $x$ as did (5.4), $h$ is a smoothing parameter, and $k$ controls the local behavior of $H_{jk}$. The estimator (5.5) is a bona fide density if the kernel $K$ satisfies (4.2). It was apparently first considered by Meisel in 1973 in the context of pattern recognition and then studied empirically by Breiman, Meisel, and Purcell (1977), who listed its advantages as having the smoothness properties of kernel estimators, the data-adaptive character of the $k$-NN approach, and very little computational penalty. In their simulation studies, the estimator (5.5) performed very poorly unless $k$ was large, on the order of $.10n$. Conditions for consistency of the variable kernel estimator were obtained by Wagner (1975) and Devroye (1985); Devroye and Penrod (1986) proved the strong uniform consistency of (5.5).

5.5 Adaptive Kernel Estimators

The variable kernel estimator (5.5) led, in turn, to the adaptive kernel estimator. Abramson (1982a,b), who was concerned with estimating $f$ at a point, proposed a two-step algorithm for computing a data-adaptive window width. First, a clipped (or winsorized) version $\tilde f$ is constructed from a pilot kernel density estimate $\hat f$ with fixed window width $h$ and then the adaptive kernel estimator is defined as

$$\hat f_h(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h_j^{d}}\, K\!\left(\frac{x - X_j}{h_j}\right), \tag{5.6}$$

where $h_j = h[\tilde f(X_j)]^{-1/2}$. Two modifications of Abramson's $h_j$ have been suggested. Silverman (1986, sec. 5.3) set $h_j = h[(1/g)\hat f(X_j)]^{-\alpha}$, where $g$ is a scale factor [such as the geometric mean of the $\hat f(X_i)$, $i = 1, 2, \ldots, n$] and $0 \le \alpha \le 1$ reflects the sensitivity of the window width to variations in the pilot estimate; examples of Silverman's adaptive window widths and $\alpha = 1/2$ were also given that demonstrated better tail behavior than the corresponding fixed window width kernel estimator. Hall and Marron (1988) set $h_j = h_F[\hat f_P(X_j)]^{-1/2}$ in (5.6), where $h_P$ was the smoothing parameter of the pilot estimate and $h_F$ was the smoothing parameter of the final estimate; they showed that their modification had a very fast rate of MSE convergence.

6. ORTHOGONAL SERIES ESTIMATORS

Orthogonal series density estimators were introduced by Cencov (1962) and have since been applied to several different areas, especially pattern recognition and discrimination and classification; see Greblicki and Pawlak (1981). The method has been used to estimate multivariate densities for dichotomous (Ott and Kronmal 1976), polychotomous (Butler and Kronmal 1985), and mixed continuous and discrete variables (Hall 1983b).

This method assumes that a square-integrable $f$ can be represented as a convergent orthogonal series expansion,

$$f(x) = \sum_{k=-\infty}^{\infty} a_k \varphi_k(x), \qquad x \in \Omega, \tag{6.1}$$
on a set $\Omega$ of the real line [that is, satisfying $\int_\Omega \varphi_j(x)\varphi_k^*(x)\,dx = \delta_{jk}$, where $\delta_{jk}$ is the Kronecker delta] and $\{a_k\}$ are coefficients defined by $a_k = E_f[\varphi_k^*(X)]$, where $\varphi_k^*$ is the complex conjugate of $\varphi_k$. This formulation allows for systems of real- or complex-valued orthonormal functions. Orthonormal systems proposed for $\{\varphi_k\}$ are those with compact support (such as the Fourier, trigonometric, and Haar systems on [0, 1], and Legendre system on [-1, 1]) and those with unbounded support [such as the Hermite system on R and Laguerre system on $[0, \infty)$].

Given an independent sample, $X_1, X_2, \ldots, X_n$, from f and a system $\{\varphi_k\}$, the $\{a_k\}$ can be estimated unbiasedly by

$$\hat a_k = \frac{1}{n} \sum_{j=1}^{n} \varphi_k^*(X_j). \qquad (6.2)$$

The obvious estimator of f, obtained by plugging (6.2) into (6.1) in place of $a_k$, may not be well defined: It has infinite variance and is not consistent in the ISE sense. Tapered estimators of the form

$$\hat f(x) = \sum_{k=-\infty}^{\infty} b_k \hat a_k \varphi_k(x), \qquad x \in \Omega, \qquad (6.3)$$

have been studied, where $0 < b_k < 1$ is a symmetric weight ($b_{-k} = b_k$) that shrinks $\hat a_k$ towards the origin, and $\sum |b_k| < \infty$ is needed for pointwise convergence of (6.3). See, for example, Watson (1969), Rosenblatt (1971), Brunk (1978), and Hall (1986). Tapered orthogonal series estimators were used by Johnstone and Silverman (1990) to estimate bivariate glucose density within the brain. The choice $b_k = 1$ for $-r \le k \le r$ and 0 otherwise leads to the partial sums of (6.1) being approximated by

$$\hat f_r(x) = \sum_{k=-r}^{r} \hat a_k \varphi_k(x), \qquad x \in \Omega, \qquad (6.4)$$

where $\{\hat a_k\}$ are given by (6.2). Wahba (1981) considered a two-parameter system of weights, $b_k = (1 + \lambda (2\pi k)^{2m})^{-1}$ for $-r \le k \le r$, where $\lambda > 0$ is a smoothing parameter and $m > 1/2$ is a shape parameter. Other systems of weights were discussed by Hall (1987) and Lock (1990). To estimate the $\{b_k\}$, likelihood cross-validation was proposed by Wahba (1981) and least squares cross-validation by Hall (1987b). In related work, Anderson and de Figueiredo (1980) developed an adaptive orthogonal series estimator.

6.2 Statistical Properties

The most popular orthogonal series estimator for densities with unbounded support, usually R or $[0, \infty)$, is the Hermite series estimator. The normalized Hermite functions given by $\varphi_k(x) = c_k(x) H_k(x)$ (k = 0, 1, 2, ...), where $c_k(x) = e^{-x^2/2} / (2^k k! \pi^{1/2})^{1/2}$ and $H_k(x) = (-1)^k e^{x^2} (d^k/dx^k)(e^{-x^2})$ is the kth Hermite polynomial, form an orthonormal basis for an $L_2$ approach. They are heavily weighted in the tails by $e^{-x^2/2}$ and provide sufficient protection against unusual tail behavior of X; see Hall (1987b). Schwartz (1967) showed that if $r = r_n$ in (6.4) satisfies $r_n/n \to 0$ as $r_n \to \infty$, then IMSE $\to 0$ as $n \to \infty$; moreover, if $r_n = O(n^{1/q})$ for $q \ge 2$, then IMSE $= O(n^{-(1-1/q)})$. Walter (1977) improved this last result slightly. Note that the IMSE convergence rate is independent of the dimension of the data, which gives the Hermite series estimator an advantage over the kernel estimator for multivariate density estimation. The Hermite system does not form a basis for the $L_1$ approach, however, and the Hermite series estimator is neither translation invariant nor consistent in the $L_1$ sense.

If f has compact support [0, 1], say, the popular Fourier (or trigonometric) series estimate, which is the real part of (6.4), is formed from the system of discrete Fourier functions, defined by $\varphi_k(x) = e^{2\pi i k x}$ [$i = (-1)^{1/2}$, k = 0, 1, 2, ...]. See Wahba (1975a, 1975b, 1981) and Hall (1981) for details and comments about the influence of periodicity and the Gibbs phenomenon on Fourier series density estimates. Devroye and Gyorfi (1985, sec. 12.4) proved that for the Fourier series estimator, under suitable conditions on f and if $r_n/n \to 0$ as $r_n \to \infty$, then MIAE $\to 0$ as $n \to \infty$.

Arguments about the relative merits of the Hermite system versus the Fourier system can be found in Walter (1977) and Good and Gaskins (1980). Wahba (1981) suggested that "in many applications it might be preferable to assume the true density has compact support and to scale the data to the interior of [0, 1]."

6.3 Choice of Number of Terms

The performance and smoothness of the orthogonal series density estimate (6.4) depend on r, the number of terms in the expansion. Kronmal and Tarter (1968) proposed a term-by-term optimal stopping rule for choosing r by minimizing an estimated MISE criterion. Disadvantages of that rule were pointed out by Crain (1973), who suggested that it might not yield the optimal r; by Hart (1985), who noted from simulation studies that the rule tended to stop too soon, thus yielding oversmoothed density estimates; and by Diggle and Hall (1986), who warned about the possible poor performance and inconsistency of the rule in multimodal situations. Improvements were suggested by Hart (1985) and Diggle and Hall (1986), and Lock (1990) combined choice of the number of terms with a tapered estimator and showed its advantages in a simulation study.

7. DELTA SEQUENCE DENSITY ESTIMATORS

Many of the different methods described so far for nonparametric density estimation are special cases of the following general class of density estimators. Let $\delta_\lambda(x, y)$ ($x, y \in R$), be a bounded function indexed by a smoothing parameter $\lambda > 0$. The sequence $\{\delta_\lambda(x, y)\}$ is called a delta sequence on R if $\int_{-\infty}^{\infty} \delta_\lambda(x, y)\phi(y)\,dy \to \phi(x)$ as $\lambda \to \infty$ for every infinitely differentiable function $\phi$ on R. Any estimator that can be written in the form

$$\hat f(x) = \frac{1}{n} \sum_{j=1}^{n} \delta_\lambda(x, X_j), \qquad x \in R, \qquad (7.1)$$

where $\{\delta_\lambda(x, y)\}$ is a delta sequence, is called a delta sequence density estimator. Thus histograms, kernel estimators, and orthogonal series estimators can each be written in the form (7.1):
Izenman: Recent Developments in Nonparametric Density Estimation 215
histograms: $\delta_m(x, X) = \sum_{i=1}^{m} (t_{i+1} - t_i)^{-1} I_{T_i}(x) I_{T_i}(X)$ [see (3.1)]

kernels: $\delta_h(x, X) = \frac{1}{h} K((x - X)/h)$ [see (4.1)]

orthogonal series: $\delta_r(x, X) = \sum_{k=-r}^{r} \varphi_k(x)\varphi_k^*(X)$ [see (6.2), (6.4)]

In some cases (such as histograms and orthogonal series estimators), $\lambda$ will be integer-valued as in the number of terms in an expansion, while in others (such as kernel estimators), $\lambda$ will be real-valued. Such general density estimators were first studied by Whittle (1958). Watson and Leadbetter (1964) called them $\delta$-function sequences and showed that they were asymptotically unbiased as density estimators. Further work along the same lines was carried out by Foldes and Revesz (1974). Walter and Blum (1979) and Prakasa Rao (1983, sec. 2.8) gave a long list of special cases and established MSE rates of convergence; but, see Hall (1981) for a cautionary note. Silverman (1986, sec. 2.9) referred to (7.1) as a general weight function estimator. Marron (1987a) used delta sequence estimators as a means of comparing different density estimators.

8. RESTRICTED MAXIMUM LIKELIHOOD ESTIMATORS

The ML method of Section 3.1 fails miserably when the class of densities H over which the likelihood L is to be maximized is otherwise unrestricted. For that case, the likelihood is maximized by a linear combination of Dirac delta functions (or "spikes") at the n sample values, resulting in a value of $+\infty$ for the likelihood. In this section, approaches to the ML problem are described in which restrictions are placed either on H or L.

8.1 Order-Restricted Methods

Consider, first, an order restriction on H. For example, densities that are monotone decreasing over the range $[0, \infty)$ are especially important in survival analysis; see Denby and Vardi (1986). Grenander (1956) showed that the ML estimator for a nonincreasing density on $[0, \infty)$ was a step function with jumps at the order statistics $\{X_{(i)}\}$. Specifically, if $F_n$ is the sample distribution function, then the ML estimator of a nonincreasing density is the slope of the least concave majorant of $F_n$; namely,

$$\hat f(x) = \min_{s \le i-1} \max_{t \ge i} \frac{F_n(X_{(t)}) - F_n(X_{(s)})}{X_{(t)} - X_{(s)}}, \qquad X_{(i-1)} < x < X_{(i)}, \qquad (8.1)$$

and 0 for $x < 0$ and $x > X_{(n)}$. Figure 4 displays the least concave majorant for a sample of size n = 15. The Grenander estimator (8.1) is strongly consistent for monotone decreasing f (Groeneboom 1983) with an MIAE convergence rate of $O(n^{-1/3})$ (Devroye 1987, chap. 8). It is also reasonably well behaved when f is close to decreasing (Birge 1986, 1989). Some modifications have been suggested to improve the performance of (8.1), including smoothing in the neighborhood of zero. For different approaches to computing (8.1), see Barlow, Bartholomew, Bremner, and Brunk (1972, chap. 5) and Denby and Vardi (1986). Alternative approaches to estimating a decreasing density were given by Birge (1987a,b).

A related order restriction concerns unimodal densities. First, without loss of generality, assume that the mode M = 0 is known. Since a unimodal density f is nondecreasing in x prior to the mode and nonincreasing thereafter, it suffices to consider only ML estimation of $f^+$, the conditional density on $[0, \infty)$, since a similar argument holds for $f^-$, the conditional density on $(-\infty, 0)$. The ML estimate of f is then given by $\hat f = \Delta \hat f^+ + (1 - \Delta)\hat f^-$, where $\hat f^+$ is the slope of the least concave majorant of $F_n$, $\hat f^-$ is the slope of the greatest convex minorant of $F_n$, and $0 \le \Delta \le 1$ is the proportion of sample values that fall into $[0, \infty)$. See, for example, Robertson, Wright, and Dykstra (1988, chap. 7). Robertson (1967) showed that the ML estimate for a univariate, unimodal density with known mode can also be expressed as a conditional expectation given the $\sigma$-lattice of all intervals that contained the mode, together with the empty set, and demonstrated that isotonic regression algorithms can efficiently compute the ML estimate. When the mode is unknown, Wegman (1969) obtained the appropriate ML estimator and showed consistency; in this case the $\sigma$-lattice was defined in terms of all intervals that contained a consistent estimate of the mode. Sager (1982) generalized the results of Robertson and Wegman and illustrated his results by estimating the contours of a bivariate density applied to a problem in cartography. See also Sager (1986). A related minimum-distance estimator for unimodal densities was studied by Reiss (1976).

8.2 Method of Sieves

The method of sieves is another restricted ML density estimation method in which H is restricted. It is different, however, in that the choice of "sieve" determines the density estimation method. The essence of the method of sieves is the following: For each h > 0, select a subset $S_h$ of densities for which a ML estimator does exist; next, find the restricted ML density estimator $\hat f_h$ by maximizing the likelihood function

$$L_h(f) = \prod_{i=1}^{n} f(X_i), \qquad (8.2)$$

and, finally, let the subset $S_h$ grow (in some sense) with the sample size n, while allowing $h = h_n \to 0$ as $n \to \infty$ in such a way as to ensure that the ML estimator converges to a density function. The sequence $\{S_h\}$ of these subsets is called a sieve, h is called the sieve parameter or mesh size, and the estimation procedure is called the method of sieves. For specific sieves, this method produced the histogram, MPL, and orthogonal series estimators, but, surprisingly, not the Gaussian kernel estimator.

The method was introduced by Grenander (1981, part III), motivated by his work in pattern analysis and "based on an idea of Wald refined by Bahadur." It was further developed by Geman and Hwang (1982) and Walter and Blum (1984). See also Wegman (1975). As with density estimators in
216 Journal of the American Statistical Association. March 1991
Figure 4. The Empirical Distribution Function $F_n$ and Its Least Concave Majorant for a Sample of Size n = 15.
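As a rough illustration of (8.1) and of the construction shown in Figure 4, the following Python sketch computes the Grenander estimate as the slopes of the least concave majorant of the empirical distribution function. It is only a sketch under stated assumptions: the function names `least_concave_majorant` and `grenander_density` are hypothetical, every observation is assumed to be strictly positive, and the majorant is anchored at the origin.

```python
from bisect import bisect_left


def least_concave_majorant(sample):
    """Return the vertices of the least concave majorant of the empirical CDF.

    Assumption: all observations are strictly positive, so the majorant
    can start at the point (0, 0).
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0 or xs[0] <= 0:
        raise ValueError("sample must be non-empty and strictly positive")
    # Points (X_(i), i/n) of the empirical CDF, preceded by the origin.
    points = [(0.0, 0.0)] + [(x, (i + 1) / n) for i, x in enumerate(xs)]
    hull = []
    for px, py in points:
        # Drop the last vertex while it lies on or below the chord joining
        # the vertex before it to the new point; this keeps the hull concave.
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            cross = (ax - ox) * (py - oy) - (ay - oy) * (px - ox)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull


def grenander_density(sample):
    """Return a function x -> slope of the least concave majorant at x."""
    hull = least_concave_majorant(sample)
    knots = [x for x, _ in hull]
    slopes = [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(hull, hull[1:])]

    def f_hat(x):
        # The estimate is zero outside (0, X_(n)].
        if x <= 0 or x > knots[-1]:
            return 0.0
        # Index of the hull segment (knots[j], knots[j + 1]] containing x.
        j = bisect_left(knots, x) - 1
        return slopes[j]

    return f_hat
```

Calling `grenander_density(sample)` returns a step function whose values are nonincreasing on the positive half-line and which integrates to one over the range of the data. The stack-based hull scan is one of several ways to obtain the majorant; pool-adjacent-violators algorithms are a common alternative.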
general which depend upon a smoothing parameter, the performance of the method of sieves estimator depends particularly upon the sequence of sieve parameters which should decrease to zero "at a sufficiently slow rate" (Grenander 1981, p. 426). It has been shown that this method leads to consistent estimators in the $L_1$ sense, although exact rates of convergence have not yet been determined. To date, the method has been studied only theoretically.

8.3 Maximum Penalized Likelihood Method

The most popular method for restricted ML density estimation, however, involves penalizing the likelihood function L for producing density estimates that are "too rough." See Good and Gaskins (1971). Thus, if $\Phi$ is a given nonnegative (roughness) penalty functional defined on H, then the $\Phi$-penalized likelihood of f is defined to be

$$L(f) = \prod_{i=1}^{n} f(X_i)\, e^{-\Phi(f)}. \qquad (8.3)$$

The optimization problem calls for L(f) in (8.3), or its logarithm, to be maximized subject to $f \in H(\Omega)$, $\int_\Omega f(t)\,dt = 1$, and $f(t) \ge 0$ ($\forall t \in \Omega$). If it exists, a solution, $\hat f$, of that problem is called a maximum penalized likelihood (MPL) estimate of f corresponding to the penalty function $\Phi$ and class of functions H. For example, $\Phi(f) = \alpha \int_{-\infty}^{\infty} [f''(x)]^2\,dx$ is used in the International Mathematical and Statistical Libraries, Inc. (1987) routine DESPL, where $\alpha > 0$ is a smoothing parameter. Based on this penalty function, Figure 5 shows MPL density estimates with different $\alpha$ using n = 63 observations of Buffalo snowfall recorded during 1910-1972. Good and Gaskins observed that the MPL method could, for certain types of problems, be interpreted as "quasi-Bayesian" since (8.3) resembles a posterior density for a parametric estimation problem. Furthermore, the MPL method is closely related to Tikhonov's method of regularization used for solving ill-posed inverse problems (O'Sullivan 1986).

De Montricher, Tapia, and Thompson (1975) rigorously established the existence and uniqueness of MPL density estimates, and showed that the MPL method was intimately related to spline methods. For example, if f has finite support $\Omega$ and $H(\Omega)$ is a suitable class of smooth functions on $\Omega$, then the MPL estimate $\hat f$ exists, is unique, and is a polynomial spline with join points (or "knots") only at the sample values.

The case when f has infinite support is more complicated. Good and Gaskins (1971) proposed penalty functionals designed to estimate the root-density, $\gamma = f^{1/2}$, so that $\hat f = \hat\gamma^2$ would be a nonnegative (and bona fide) estimator of f.
Figure 5. Maximum Penalized Likelihood Density Estimates of the 63 Annual Observations on Buffalo Snowfall, 1910-1972. The data are given in Scott (1985a). The penalty function used was $\Phi(f) = \alpha \int [f''(x)]^2\,dx$, and the smoothing-parameter values were (a) $\alpha = 10^7$, and (b) $\alpha = 10^6$. The trimodal shape [see (b)] is generally regarded as the most reasonable density estimate for these data. [Two panels; vertical axis: density, 0.0 to 0.025; horizontal axis: 40 to 120.]
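To make the role of the smoothing parameter concrete, the sketch below evaluates the logarithm of the penalized likelihood (8.3) with the curvature penalty $\Phi(f) = \alpha \int [f''(x)]^2\,dx$ for a candidate density. The assumptions are deliberate simplifications: candidates are restricted to Gaussian densities purely for illustration (this is neither the spline solution nor the DESPL routine), the integral is approximated by a midpoint rule over ten standard deviations on either side of the mean, and the names `roughness` and `penalized_loglik` are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_pdf_second_derivative(x, mu, sigma):
    """Second derivative of the normal density with respect to x."""
    z = (x - mu) / sigma
    return normal_pdf(x, mu, sigma) * (z * z - 1.0) / (sigma * sigma)


def roughness(second_derivative, lo, hi, steps=4000):
    """Midpoint-rule approximation to the integral of [f''(x)]^2 on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(second_derivative(lo + (j + 0.5) * h) ** 2 for j in range(steps))


def penalized_loglik(sample, mu, sigma, alpha):
    """Log of (8.3): sum_i log f(X_i) minus alpha times the roughness of f."""
    loglik = sum(math.log(normal_pdf(x, mu, sigma)) for x in sample)
    penalty = alpha * roughness(
        lambda x: normal_pdf_second_derivative(x, mu, sigma),
        mu - 10.0 * sigma,
        mu + 10.0 * sigma,
    )
    return loglik - penalty
```

With $\alpha = 0$ the criterion reduces to the ordinary log-likelihood and favors a narrow, spiky candidate; as $\alpha$ grows, the roughness term (which scales like $\sigma^{-5}$ for a Gaussian) dominates and a smoother candidate wins. This mirrors, in a very small family, the trade-off between the two panels of Figure 5.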
The penalty functionals were

$$\Phi_1(f) = 4\alpha \int_{-\infty}^{\infty} [\gamma'(x)]^2\,dx, \qquad \alpha > 0, \qquad (8.4)$$

$$\Phi_2(f) = 4\alpha \int_{-\infty}^{\infty} [\gamma'(x)]^2\,dx + \beta \int_{-\infty}^{\infty} [\gamma''(x)]^2\,dx, \qquad \alpha \ge 0,\ \beta \ge 0, \qquad (8.5)$$

where the hyperparameters $\alpha$ and $\beta$, with $\alpha + \beta > 0$ in (8.5), control the amount of smoothing. Motivation for $\Phi_1$ and $\Phi_2$ rested on how best to represent the "roughness" of f. Good and Gaskins preferred (8.5) to (8.4), arguing that curvature as well as slope of the density estimate should be penalized. In follow-up papers, Good and Gaskins (1980) and Good and Deaton (1981) set $\alpha = 0$ in (8.5) and used $\beta \int [\gamma''(x)]^2\,dx$ as the measure of roughness of f, where $\beta$ was to be determined from the data. Klonias and Nash (1983) and Klonias (1984) investigated a very general class of penalty functionals [that included (8.4) and (8.5) as special cases] whose primary motivation was to improve estimation of peaks and valleys of f.

For the penalty function (8.4) and a given value of $\alpha$, De Montricher et al. (1975) showed that, if the optimization problem is set up correctly, then the resulting estimator $\hat\gamma_\alpha$, say, exists, is unique, and is a positive exponential spline with knots only at the sample values. An exponential spline rather than a polynomial spline is the price to be paid for requiring nonnegativity of the density estimate. The MPL estimator is then given by $\hat f_\alpha = \hat\gamma_\alpha^2$. Klonias (1982) demonstrated consistency of $\hat f_\alpha$ in a number of different norms, including $L_1$ and $L_2$. As for determining the value of $\alpha$, Silverman (1978c) suggested, in a slightly different setup, that $\alpha$ be chosen informally using graphical methods. If the penalty function is (8.5) and given values of $\alpha$ and $\beta$, then, provided the optimization problem is set up correctly, the resulting estimate $\hat\gamma_{\alpha,\beta}$ exists and is unique. The MPL estimate of f is given by $\hat f_{\alpha,\beta} = \hat\gamma_{\alpha,\beta}^2$. Good and Gaskins also gave some recommendations for $(\alpha, \beta)$ that performed well in their examples.

Another way of guaranteeing a bona fide density estimate using the MPL method was devised by Silverman (1982b), who used a roughness penalty based on $g = \log f$, and showed that this approach led to a wide range of possible density estimates. Solving the appropriate optimization problem yielded an estimator $\hat g$ of g, so that a nonnegative MPL estimate for f was given by $\hat f = e^{\hat g}$. Silverman developed a very general theory of penalty functionals based on $\log f$, and then proved the existence, consistency, and asymptotic normality of the resulting estimators. This approach was studied further by Silverman (1984).

Implementation of the MPL method depends upon the quality of the numerical solutions to the restricted optimization problems. Since $\gamma = f^{1/2}$ is square-integrable, Good and Gaskins (1980) suggested using mixtures of orthonormal expansions for $\gamma$, terminating the expansions at some finite number of terms. Scott, Tapia, and Thompson (1980) studied a discrete approximation to the spline solutions of the MPL problems, and proved that the resulting discrete MPL estimator exists, is unique, converges to the spline MPL estimator, and is a strongly pointwise consistent estimator of f. Further computational work on the discrete MPL estimator was carried out by Good and Deaton (1981).

9. PROJECTION PURSUIT DENSITY ESTIMATION

Multivariate kernel density estimators tend to be poor performers when it comes to dealing with high-dimensional data since extremely large sample sizes are needed to match the sort of numerical accuracy that is possible in low dimensions. In light of this, Friedman and Stuetzle (1982) and Friedman, Stuetzle, and Schroeder (1984) developed projection pursuit density estimation (PPDE). The PPDE method has been shown in simulations to possess excellent properties, and several quite striking applications of PPDE to real data have also been published.
9.1 The PPDE Paradigm

When dealing with small samples of high-dimensional data, the PPDE procedure may be jump-started by restricting attention to the subspace spanned by the first few significant principal components; see Friedman (1987) and Jee (1987) for examples. A PPDE of f is then formed using the following stepwise procedure. First, transform the data to have center the origin and covariance matrix the identity. Second, choose $\hat f^{(0)}$ to be an initial multivariate density estimate of f, usually taken to be standard multivariate Gaussian. Third, find the direction $a_1 \in R^d$ for which the (model) marginal $f_{a_1}$ along $a_1$ differs most from the current estimated (data) marginal $\hat f_{a_1}$ along $a_1$. Choice of direction $a_1$ will not generally be unique. Fourth, given $a_1$, define a univariate "augmenting function" $g_1(a_1^{\tau}x)$ as the ratio of the two marginals, namely, $g_1(a_1^{\tau}x) = f_{a_1}(a_1^{\tau}x)/\hat f_{a_1}(a_1^{\tau}x)$, and update the initial estimate so that $\hat f^{(1)}(x) = \hat f^{(0)}(x) g_1(a_1^{\tau}x)$. Repeat this procedure on the modified density $\hat f^{(1)}$ so that a second direction $a_2 \in R^d$ and augmenting function $g_2(a_2^{\tau}x) = f_{a_2}(a_2^{\tau}x)/\hat f_{a_2}(a_2^{\tau}x)$ are found, and the density is again modified to be $\hat f^{(2)}(x) = \hat f^{(1)}(x) g_2(a_2^{\tau}x)$. Repeat the procedure as many times as necessary so that, at the kth iteration,

$$\hat f^{(k)}(x) = \hat f^{(0)}(x) \prod_{j=1}^{k} g_j(a_j^{\tau}x)$$

I(f) should be absolutely continuous with easily computable first derivatives. "Interesting" projections should correspond to large values of I(f), while small values of I(f) should correspond to random or unstructured projections. Estimates of I(f) should be amenable to fast computation, unaffected by the overall covariance structure of the data and by outliers or heavy tails; see Huber (1985, sec. 4). Friedman (1987) stressed that a very reliable and thorough numerical optimizer was absolutely essential for finding "substantive" maxima of I(f), since sampling fluctuations tend to trap ineffective optimizers within a multitude of local maxima. If $\{z_i\}$ are the projected data, then (9.3) is estimated by $\hat I(f) = \int J(\hat f(z))\,dF_n(z) = (1/n) \sum_{i=1}^{n} J(\hat f(z_i))$. Thus if $J(f(z)) = f(z)$, then $I(f) = \int [f(z)]^2\,dz$ can be estimated by $\hat I(f) = (1/n) \sum_{i=1}^{n} \hat f_h(z_i)$, where $\hat f_h$ is a kernel estimate with window width h; see Friedman and Tukey (1974) and Tukey and Tukey (1981). Another choice is to take $J(f(z)) = \log f(z)$, so that $I(f) = \int f(z) \log f(z)\,dz$, which is (negative) cross-entropy, and (9.3) can be estimated at the kth iteration by $(1/n) \sum_{i=1}^{n} \log \hat f^{(k)}(z_i)$; see Friedman et al. (1984). Joe (1987) discussed kernel estimation of functionals such as (9.3) and showed that, for moderate-sized samples, statistical properties of $\hat I$ were im-
ing than that best suited for estimating f. Kernel estimation of the hazard rate was discussed by Singpurwalla and Wong (1983) and Hassani, Sarda, and Vieu (1986), and that of the quantile function $\xi_p = F^{-1}(p)$, $0 < p < 1$, by Parzen (1979), Falk (1984), and Sheather and Marron (1988). The bootstrap and its smoothed versions have been used to estimate a(F) directly, especially for kernel quantile estimation. See Silverman and Young (1987), Yang (1985), Hall, Diciccio, and Romano (1989), and Hall (1990). Note, however, that bootstrap smoothing using a non-bona fide kernel density estimator of a nonnegative quantity, such as a probability or a variance, can make a nonnegative estimate negative.

Assessing Multimodality. Integer-valued nonlinear functionals of f, such as the number of mixture components needed to represent f, and the number of modes of f, are also of interest, and different nonparametric approaches to determining the values of such functionals have been considered. Donoho (1988) developed a general theory for determining nonparametric lower bounds on such functionals. Good and Gaskins (1980) used the MPL method together with certain "bump hunting" surgical techniques to assess the existence of any "real" dips and bumps in mass spectra obtained from scattering experiments. Silverman (1981b, 1983) used the kernel method together with the smoothed bootstrap procedure to develop a confirmatory test of the most probable number of modes in a density; see Silverman (1986, sec. 6.6) and Izenman and Sommer (1988).

Robust Estimation. Nonparametric density estimation has been used to obtain robust estimators for parametric inference. The main tool has been the use of Hellinger distance between two probability densities f and g, namely,

$$HD(f, g) = \frac{1}{2} \int_{-\infty}^{\infty} \left([f(x)]^{1/2} - [g(x)]^{1/2}\right)^2 dx. \qquad (10.1)$$

The minimum Hellinger distance (MHD) estimator is that value $\hat\theta$ of $\theta$ that minimizes $HD(\hat f, f_\theta)$, where $\hat f$ is a nonparametric density estimator of f and $f_\theta$, $\theta \in \Theta$, is a member of some parametric family. The distance HD is always finite and is invariant under strictly monotone transformations. Beran (1977a,b), Birge (1986), Tamura and Boos (1986), and Simpson (1987, 1989) proved asymptotic results and established impressive robustness properties of MHD location estimators based on the kernel density estimator. For related work on minimum distance estimators of densities, see Reiss (1976) and Birge (1983).

Semiparametric Models. Olkin and Spiegelman (1987) developed an approach to density estimation that combined parametric and nonparametric approaches. Their density estimator was given by

$$\hat f_\pi(x) = \pi f_{\hat\theta}(x) + (1 - \pi)\hat f(x), \qquad (10.2)$$

where $f_{\hat\theta}$ is a ML parametric estimator of f, $\hat f$ is a kernel estimator of f, and $0 \le \pi \le 1$ is unknown. The parameter $\pi$ was chosen to minimize the Hellinger distance, $HD(\hat f_\pi, f)$, and asymptotic results were obtained under regularity conditions on f. Figure 6 shows the semiparametric density estimate constructed from annual wind speed measurements from Olkin and Spiegelman. For that example, the parametric model appeared to be appropriate.

Directional Data. In astronomy, geology, and studies of animal behavior, it is often of interest to estimate the
Figure 6. Density Estimates for 20 Measurements on Annual Maximum Wind Speeds in the N. Direction Taken in Sheridan, Wyoming, During 1958-1977. Reproduced from Olkin and Spiegelman (1987). The dotted-and-dashed line shows the kernel density estimate with smoothing parameter h = .7s, where s is the sample standard deviation; the dashed line shows the parametric density estimate; and the solid line shows the semiparametric density estimate with estimated weight, $\hat\pi$ = .8. [Vertical axis: density, 0.01 to 0.07; horizontal axis: 40 to 65.]
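The Hellinger distance (10.1), which underlies both the MHD estimator and the choice of weight in (10.2), is straightforward to approximate numerically. The sketch below uses a midpoint rule; the function names `hellinger_distance`, `make_normal_pdf`, and `semiparametric_density`, the integration limits, and the normal densities used as inputs are assumptions chosen only for illustration.

```python
import math


def hellinger_distance(f, g, lo, hi, steps=20000):
    """Approximate (10.1): 0.5 * integral over [lo, hi] of (sqrt f - sqrt g)^2."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * h
        total += (math.sqrt(f(x)) - math.sqrt(g(x))) ** 2
    return 0.5 * h * total


def make_normal_pdf(mu, sigma):
    """Return the density function of a normal distribution."""
    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf


def semiparametric_density(pi, parametric, nonparametric):
    """Convex combination (10.2) of a parametric and a nonparametric estimate."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    return lambda x: pi * parametric(x) + (1.0 - pi) * nonparametric(x)
```

For two normal densities with common standard deviation $\sigma$ and means differing by $\mu$, the distance as normalized in (10.1) has the closed form $1 - e^{-\mu^2/(8\sigma^2)}$, which gives a convenient check on the numerical integration. In practice the true f in $HD(\hat f_\pi, f)$ is unknown, so the sketch only illustrates the ingredients, not the selection of $\pi$ itself.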
Figure 7. Perspective Plots for 685 Measurements on the Orbits of all Known Comets. Reproduced from Hall, Watson, and Cabrera (1987).
Smoothing was obtained by (a) likelihood cross-validation, and (b) least squares cross-validation. Notice that likelihood CV produces a smoother
density estimate having lower peaks than least squares CV. With permission of the Biometrika trustees.
density f of measurements, $X_1, \ldots, X_n$, observed on the surface of a d-dimensional unit sphere $S_d$, $d \ge 2$. Kernel density estimators for such "directional data" have the forms

$$\hat f_{\kappa,K_1}(x) = n^{-1} c(\kappa) \sum_{i=1}^{n} K_1(\kappa x^{\tau} X_i), \qquad (10.3)$$

$$\hat f_{\kappa,K_2}(x) = n^{-1} d(\kappa) \sum_{i=1}^{n} K_2(\kappa(1 - x^{\tau} X_i)), \qquad (10.4)$$

Time Series Data. For dependent observations generated by a strictly stationary process, kernel density estimators were studied by Roussas (1969), Rosenblatt (1970, 1971), Nguyen (1979), and Hart (1984), recursive density estimators were studied by Masry (1986, 1989) and Masry and Gyorfi (1987), and survival function and hazard rate estimators were studied by Roussas (1989, 1990) and Izenman and Tran (1990).
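As an illustration of the directional estimator (10.3), consider the ordinary unit sphere in three-dimensional space and the exponential kernel $K_1(t) = e^t$. With that choice the estimator is an equal-weight mixture of von Mises-Fisher densities centered at the observations, and the normalizing constant is $c(\kappa) = \kappa/(4\pi \sinh \kappa)$. The kernel choice, the restriction to three dimensions, and the function names below are assumptions made for this sketch and are not taken from the article.

```python
import math


def vmf_constant(kappa):
    """Normalizing constant c(kappa) for the kernel exp(kappa * t) on the sphere in R^3."""
    return kappa / (4.0 * math.pi * math.sinh(kappa))


def directional_kde(points, kappa):
    """Return x -> n^{-1} c(kappa) sum_i exp(kappa * <x, X_i>) for unit vectors X_i."""
    n = len(points)
    c = vmf_constant(kappa)

    def f_hat(x):
        total = 0.0
        for p in points:
            dot = x[0] * p[0] + x[1] * p[1] + x[2] * p[2]
            total += math.exp(kappa * dot)
        return c * total / n

    return f_hat


def integrate_over_sphere(f, n_theta=200, n_phi=400):
    """Midpoint rule in spherical coordinates; close to 1 for a density on the sphere."""
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    total = 0.0
    for a in range(n_theta):
        theta = (a + 0.5) * d_theta
        st, ct = math.sin(theta), math.cos(theta)
        for b in range(n_phi):
            phi = (b + 0.5) * d_phi
            total += f((st * math.cos(phi), st * math.sin(phi), ct)) * st
    return total * d_theta * d_phi
```

Here $\kappa$ plays the role of an inverse bandwidth: larger values concentrate each kernel more tightly around its observation. The helper `integrate_over_sphere` is included only to check that the estimate integrates to approximately one.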
Some Other Strange Facts," Probability Theory and Related Fields, 71, 271-291.
--- (1987a), "Estimating a Density Under Order Restrictions: Nonasymptotic Minimax Risk," The Annals of Statistics, 15, 995-1012.
--- (1987b), "On the Risk of Histograms for Estimating Decreasing Densities," The Annals of Statistics, 15, 1013-1022.
--- (1989), "The Grenander Estimator: A Nonasymptotic Approach," The Annals of Statistics, 17, 1532-1549.
Blum, J. R., and Susarla, V. (1980), "Maximal Deviation Theory of Density and Failure Rate Function Estimates Based on Censored Data," in Multivariate Analysis V, ed. P. R. Krishnaiah, Amsterdam: North-Holland, pp. 213-222.
Boneva, L. I., Kendall, D. G., and Stefanov, I. (1971), "Spline Transformations: Three New Diagnostic Aids for the Statistical Data-Analyst" (with discussion), Journal of the Royal Statistical Society, Ser. B, 33, 1-70.
Bowman, A. W. (1984), "An Alternative Method of Cross-Validation for the Smoothing of Density Estimates," Biometrika, 71, 353-360.
Boyd, D. W., and Steele, J. M. (1978), "Lower Bounds for Nonparametric Density Estimation Rates," The Annals of Statistics, 6, 932-934.
Breiman, L., Meisel, W., and Purcell, E. (1977), "Variable Kernel Estimates of Multivariate Densities," Technometrics, 19, 135-144.
Broniatowski, M., Deheuvels, P., and Devroye, L. (1989), "On the Relationship Between Stability of Extreme Order Statistics and Convergence of the Maximum Likelihood Kernel Density Estimate," The Annals of Statistics, 17, 1070-1086.
Brunk, H. B. (1978), "Univariate Density Estimation by Orthogonal Series," Biometrika, 65, 521-528.
Butler, W. J., and Kronmal, R. A. (1985), "Discrimination with Polychotomous Predictor Variables Using Orthogonal Functions," Journal of the American Statistical Association, 80, 443-448.
Cacoullos, T. (1966), "Estimation of a Multivariate Density," Annals of the Institute of Statistical Mathematics, 18, 178-189.
Carroll, R. J. (1976), "On Sequential Density Estimation," Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete, 36, 136-151.
Cencov, N. N. (1962), "Evaluation of an Unknown Distribution Density From Observations," Soviet Mathematics, 3, 1559-1562.
Chow, Y. S., Geman, S., and Wu, L. D. (1983), "Consistent Cross-Validated Density Estimation," The Annals of Statistics, 11, 25-38.
Cline, D. (1988), "Admissible Kernel Estimators of a Multivariate Density," The Annals of Statistics, 16, 1421-1427.
Crain, B. R. (1973), "A Note on Density Estimation Using Orthogonal Expansions," The Annals of Statistics, 2, 454-463.
Davies, H. I., and Wegman, E. J. (1975), "Sequential Nonparametric Density Estimation," IEEE Transactions on Information Theory, 21, 619-628.
Deheuvels, P. (1973), "Sur l'Estimation Sequentielle de la Densite," Comptes Rendus de l'Academie des Sciences de Paris, 276, 1119-1121.
--- (1977), "Estimation nonparametrique de la densite par histogrammes generalises," Revue de Statistique Appliquee, 25/3, 5-42.
De Jager, O. C., Swanepoel, J. W. H., and Raubenheimer, B. C. (1986), "Kernel Density Estimators Applied to Gamma Ray Light Curves," Astronomy and Astrophysics, 170, 187-196.
De Montricher, G. M., Tapia, R. A., and Thompson, J. R. (1975), "Nonparametric Maximum Likelihood Estimation of Probability Densities by Penalty Function Methods," The Annals of Statistics, 3, 1329-1348.
Denby, L., and Vardi, Y. (1986), "The Survival Curve With Decreasing Density," Technometrics, 28, 359-367.
Devijver, P. A., and Kittler, J. (1982), Pattern Recognition: A Statistical Approach, London: Prentice-Hall.
Devroye, L. (1979), "On the Pointwise and Integral Convergence of Recursive Kernel Estimates of Probability Densities," Utilitas Mathematica, 15, 113-128.
--- (1983), "The Equivalence of Weak, Strong, and Complete Convergence in $L_1$ For Kernel Density Estimates," The Annals of Statistics, 11, 896-904.
--- (1985), "A Note on the $L_1$ Consistency of Variable Kernel Estimates," The Annals of Statistics, 13, 1041-1049.
--- (1987), A Course in Density Estimation, Boston: Birkhauser.
Devroye, L., and Gyorfi, L. (1985), Nonparametric Density Estimation: The $L_1$ View, New York: John Wiley.
Devroye, L., and Penrod, C. S. (1984), "The Consistency of Automatic Kernel Density Estimates," The Annals of Statistics, 12, 1231-1249.
--- (1986), "The Strong Uniform Convergence of Multivariate Variable Kernel Estimates," The Canadian Journal of Statistics, 14, 211-219.
Diggle, P. J., and Hall, P. (1986), "The Selection of Terms in an Orthogonal Series Density Estimator," Journal of the American Statistical Association, 81, 230-233.
Donoho, D. L. (1988), "One-Sided Inference About Functionals of a Density," The Annals of Statistics, 16, 1390-1420.
Donoho, D. L., and Johnstone, I. M. (1989), "Projection-Based Approximation and a Duality with Kernel Methods," The Annals of Statistics, 17, 58-106.
Dubuisson, B., and Lavison, P. (1980), "Surveillance of a Nuclear Reactor by Use of a Pattern Recognition Methodology," IEEE Transactions on Systems, Man, and Cybernetics, 10, 603-609.
Emerson, J. D., and Hoaglin, D. C. (1983), "Stem and Leaf Displays," in Understanding Robust and Exploratory Data Analysis, eds. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, New York: John Wiley.
Eubank, R. L. (1988), Spline Smoothing and Nonparametric Regression, New York: Marcel Dekker.
Falk, M. (1984), "Relative Deficiency of Kernel Type Estimators of Quantiles," The Annals of Statistics, 12, 261-268.
Farrell, R. H. (1972), "On the Best Obtainable Asymptotic Rates of Convergence in Estimation of a Density Function at a Point," The Annals of Mathematical Statistics, 43, 170-180.
Fix, E., and Hodges, J. L. (1951), "Discriminatory Analysis, Nonparametric Estimation: Consistency Properties," Report No. 4, Project No. 21-49-004, Randolph Field, Texas: USAF School of Aviation Medicine.
Foldes, A., and Revesz, P. (1974), "A General Method for Density Estimation," Studia Scientiarum Mathematicarum Hungarica, 9, 81-92.
Fraser, D. A. S. (1951), "Sequentially Determined Statistically Equivalent Blocks," The Annals of Mathematical Statistics, 22, 372-381.
--- (1953), "Nonparametric Tolerance Regions," The Annals of Mathematical Statistics, 24, 44-55.
--- (1957), Nonparametric Methods in Statistics, New York: John Wiley.
Fraser, D. A. S., and Guttman, I. (1956), "Tolerance Regions," The Annals of Mathematical Statistics, 27, 162-179.
Freedman, D., and Diaconis, P. (1981a), "On the Maximum Deviation Between the Histogram and the Underlying Density," Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete, 58, 139-167.
--- (1981b), "On the Histogram as a Density Estimator: $L_2$ Theory," Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete, 57, 453-476.
Friedman, J. H. (1987), "Exploratory Projection Pursuit," Journal of the American Statistical Association, 82, 249-266.
Friedman, J. H., and Stuetzle, W. (1982), "Projection Pursuit Methods for Data Analysis," in Modern Data Analysis, eds. R. L. Launer and A. F. Siegel, New York: Academic Press, pp. 123-147.
Friedman, J. H., Stuetzle, W., and Schroeder, A. (1984), "Projection Pursuit Density Estimation," Journal of the American Statistical Association, 79, 599-608.
Friedman, J. H., and Tukey, J. W. (1974), "A Projection Pursuit Algorithm for Exploratory Data Analysis," IEEE Transactions on Computing, 23, 881-890.
Fryer, M. J. (1976), "Some Errors Associated With the Nonparametric Estimation of Density Functions," Journal of the Institute of Mathematics and Its Applications, 18, 371-380.
--- (1977), "A Review of Some Nonparametric Methods of Density Estimation," Journal of the Institute of Mathematics and Its Applications, 20, 335-354.
Fukunaga, K. (1972), Introduction to Statistical Pattern Recognition, London: Academic Press.
Gajek, L. (1986), "On Improving Density Estimators Which Are Not Bona Fide Functions," The Annals of Statistics, 14, 1612-1618.
Gasser, T., Muller, H.-G., and Mammitzsch, V. (1985), "Kernels for Nonparametric Curve Estimation," Journal of the Royal Statistical Society, Ser. B, 47, 238-252.
Gawronski, W., and Stadtmuller, U. (1980), "On Density Estimation by Means of Poisson's Distribution," Scandinavian Journal of Statistics, 7, 90-94.
Geman, S., and Hwang, C.-R. (1982), "Nonparametric Maximum Likelihood Estimation by the Method of Sieves," The Annals of Statistics, 10, 401-414.
Gessaman, M. P. (1970), "A Consistent Nonparametric Multivariate Density Estimator Based on Statistically Equivalent Blocks," The Annals of Mathematical Statistics, 41, 1344-1346.
Gessaman, M. P., and Gessaman, P. H. (1972), "A Comparison of Some Multivariate Discrimination Procedures," Journal of the American Statistical Association, 67, 468-472.
Ghurye, S. G., and Olkin, I. (1969), "Unbiased Estimation of Some Multivariate Probability Densities and Related Functions," The Annals of Mathematical Statistics, 40, 1261-1271.
Huber, P. J. (1985), "Projection Pursuit" (with discussion), The Annals of Statistics, 13, 435-525.
International Mathematical and Statistical Libraries, Inc. (1987), STAT/LIBRARY (Version 1.0), Houston, TX: Author.
Good, I. J., and Deaton, M. L. (1981), "Recent Advances in Bump Hunting," in Computer Science and Statistics: Proceedings of the 13th
Symposium on the Interface, ed. W. F. Eddy, New York: Springer- Izenman, A. J., and Sommer, C. J. (1988), "Philatelic Mixtures and
Verlag, pp. 92-104. Multimodal Densities," Journal of the American Statistical Associa-
Good,1. J., and Gaskins, R. A. (1971), "Nonparametric Roughness Pen- tion, 83, 941-953.
alties for Probability Densities," Biometrika, 58, 255-277. Izenman, A. J., and Tran, L. T. (1990), "Kernel Estimation of the Sur-
- - - (1980), "Density Estimation and Bump-Hunting by the Penalized vival Function and Hazard Rate Under Weak Dependence," Journal
Likelihood Method Exemplified by Scattering and Meteorite Data" (with of Statistical Planning and Inference, 24, 233-247.
discussion), Journal of the American Statistical Association, 75, 42- Jee, J. R. (1987), "Exploratory Projection Pursuit Using Nonparametric
73. Density Estimation," Proceedings of the Statistical Computing Section
Greblicki, W., and Pawlak, M. (1981), "Classification Using the Fourier of the American Statistical Association, 335-339.
Series Estimate of Multivariate Density Functions," IEEE Transactions Joe, H (1987), "Estimation of Entropy and Other Functionals of a Mul-
on Systems, Man, and Cybernetics, II, 726-730. tivariate Density," Technical Report, University of British Columbia.
Grenander, U. (1956), "On the Theory of Mortality Measurement. Part Johnstone, I. M., and Silverman, B. W. (1990), "Speed of Estimation
II," Skandinavisk Aktuarietidskrift, 39, 125-153. in Positron Emission Tomography and Related Inverse Problems," The
- - - (1981), Abstract Inference, New York: John Wiley. Annals of Statistics, 18, 251-280.
Groeneboom, P. (1983), "Estimating a Monotone Density," Proceedings Jones, M. C. (1989), "Discretized and Interpolated Kernel Density Es-
of the Berkeley Conference in Honor ofJerzy Neyman and Jack Keifer, timates," Journal of the American Statistical Association, 84, 733-
eds. L. M. LeCam and R. A. Olshen, 2, 539-555. Belmont, CA: 741.
Wadsworth. Jones, M. C., and Lotwick, H. W. (1984), "A Remark on Algorithm
Hall, P. (1981), "On Trigonometric Series Estimates of Densities," The AS 176. Kernel Density Estimation Using the Fast Fourier Transform,"
Annals of Statistics, 9, 683-685. Applied Statistics, 33, 120-122.
- - - (1982), "Cross-Validation in Density Estimation," Biometrika, Jones, M. C. and Sibson, R. (1987), "What is Projection Pursuit?" (with
69, 383-390. discussion), Journal of the Royal Statistical Society, Ser. A, 150,
- - - (1983a), "Large Sample Optimality of Least Squares Cross-Val- 1-36.
idation in Density Estimation," The Annals of Statistics, 11, 1156- Kanazawa, Y. (1988), "An Optimal Variable Cell Histogram," Com-
1174. munications in Statistics, 17, 1401-1422.
- - - (1983b), "Orthogonal Series Methods for Both Qualitative and Kasser, I. S., and Bruce, R. A. (1969), "Comparative Effects of Aging
Quantitative Data," The Annals of Statistics, 11, 1004-1007. and Coronary Heart Disease on Submaximal and Maximal Exercise,"
- - - (1986), "On the Rate of Convergence of Orthogonal Series Den- Circulation, 39, 759-774.
sity Estimators," Journal of the Royal Statistical Society, Ser. B, 48, Klonius, V. K. (1982), "Consistency of Two Nonparametric Maximum
115-122. Penalized Estimators of the Probability Density Function," The Annals
- - - (1987a), "On Kullback-Leibler Loss and Density Estimation," of Statistics, 10, 811-824.
The Annals of Statistics, 15, 1491-1519. - - - (1984), "On a Class of Nonparametric Density and Regression
- - - (1987b), "Cross-Validation and the Smoothing of Orthogonal Se- Estimators," The Annals of Statistics, 12, 1263-1284.
ries Density Estimators," Journal of Multivariate Analysis, 21, 189- Klonius, V. K., and Nash, S. G. (1983), "On the Computation ofa Class
206. of Maximum Penalized Likelihood Estimators of the Probability Den-
- - - (1989a), "On Polynomial-Based Projection Indices for Explora- sity Function," in Computer Science and Statistics: The Interface, ed.
tory Projection Pursuit," The Annals of Statistics, 17,589-605. J. E. Gentle, Amsterdam: North-Holland, pp. 310-314.
- - - (1989b), "On Convergence Rates in Nonparametric Problems," Kogure, A. (1987), "Asymptotically Optimal Cells for a Histogram,"
International Statistical Review, 57,45-58. The Annals of Statistics, 15, 1023-1030.
- - - (1990), "Using the Bootstrap to Estimate Mean Squared Error Kronmal, R. and Tarter, M. (1968), "The Estimation of Probability Den-
and Select Smoothing Parameter in Nonparametric Problems," Journal sities and Cumulatives by Fourier Series Methods," Journal of the
of Multivariate Analysis, 32, 177-203. American Statistical Association, 63, 925-952.
Hall, P., Diciccio, T. J., and Romano, J. P. (1989), "On Smoothing and - - - (1973), "The Use of Density Estimates Based on Orthogonal Ex-
the Bootstrap," The Annals of Statistics, 17, 692-704. pansions," in Exploring Data Analysis: The Computer Revolution in
Hall, P., and Hannan, E. J. (1988), "On Stochastic Complexity and Non- Statistics, eds. W. J. Dixon and W. L. Nicholson, Los Angeles: Uni-
parametric Density Estimation," Biometrika, 75, 705-714. versity of California Press, pp. 365-395.
Hall, P., and Marron, J. S. (1987a), "Extent to Which Least-Squares Lecoutre, J.-P. (1986), "The Histogram with Random Partition," in New
Cross-Validation Minimises Integrated Square Error in Nonparametric Perspectives in Theoretical and Applied Statistics, eds. M. L. Puri, J.
Density Estimation," Probability Theory and Related Fields, 74, 567- P. Vilaplana, and W. Wertz, New York: John Wiley, pp. 265-276.
581. Leonard, T. (1978), "Density Estimation, Stochastic Processes, and Prior
- - - (1987b), "On the Amount of Noise Inherent in Bandwidth Se- Information" (with discussion), Journal of the Royal Statistical Soci-
lection for a Kernel Density Estimator," The Annals of Statistics, 15, ety, Ser. B, 40, 113-146.
163-181. Liu, R. Y. C., and Van Ryzin, J. (1985), "A Histogram Estimator of
- - - (1988), "Choice of Kernel Order in Density Estimation," The the Hazard Rate With Censored Data," The Annals of Statistics, 13,
Annals of Statistics, 16, 161-173. 592-605.
Hall, P., and Wand, M. P. (1988), "Minimizing L, Distance in Non- Lock, M. D. (1990), "Optimizing Density Estimates Based On Un-
parametric Density Estimation," Journal of Multivariate Analysis, 26, weighted and Weighted Mean Integrated Squared Error," unpublished
59-88. Ph.D. dissertation, University of California, Berkeley, Group in Bio-
Hall, P., Watson, G. S., and Cabrera, J. (1987), "Kernel Density Es- statistics.
timation With Spherical Data," Biometrika, 74, 751-762. Loftsgaarden, D.O., and Quesenberry, C. P. (1965), "A Nonparametric
Hand, D. J. (1982), Kernel Discriminant Analysis. Chichester, U.K.: Estimate of a Multivariate Density Function," The Annals of Mathe-
Research Studies Press. matical Statistics, 36, 1049-1051.
Hart, J. D. (1984), "Efficiency of a Kernel Density Estimator Under an Lubecke, A. M., and Padgett, W. J. (1985), "Nonparametric Maximum
Autoregressive Dependence Model," Journal of the American Statis- Penalized Likelihood Estimation of a Density from Arbitrarily Right-
tical Association, 79, 110-117. Censored Observations," Communications in Statistics, Part A-The-
- - - (1985), "On the Choice of Truncation Point in Fourier Series ory and Methods, 14, 257-271 (corrigendum, p. 2007).
Density Estimation," Journal of Statistical Computation and Simula- Mack, Y. P. (1980), "Asymptotic Normality of Multivariate K-NN Den-
tion, 21,95-116. sity Estimates," Sankhya, Ser. A, 42, 53-63.
Hassani, S., Sarda, P., and Vieu, P. (1986), "Nonparametric Approaches Mack, Y. P., and Rosenblatt, M. (1979), "Multivariate K-Nearest Neigh-
to Hazard Functions: Bibliographical Review (in French)," Revue de bor Density Estimates," Journal of Multivariate Analysis, 9, 1-15.
Statistique Appliquee, 34/4, 27-42. Marron, J. S. (1985), "An Asymptotically Efficient Solution to the Band-
Hendriks, H. (1990), "Nonparametric Estimation of a Probability Density width Problem of Kernel Density Estimation," The Annals ofStatistics,
on a Riemannian Manifold Using Fourier Expansions," The Annals of 13, 1011-1023.
Statistics, 18, 832-849. - - - (1987a), "A Comparison of Cross- Validation Techniques in Den-
Izenman: Recent Developments in Nonparametric Density Estimation 223
sity Estimation," The Annals of Statistics, 15, 152-162. timation (Lecture Notes in Mathematics No. 757), eds. T. Gasser and
- - - (1987b), "Automatic Smoothing Parameter Selection: A Sur- M. Rosenblatt, Berlin: Springer-Verlag, pp. 181-190.
vey," Empirical Economics, 13, 187-208. Roussas, G. (1969), "Nonpararnetric Estimation of the Transition Dis-
Marron, J. S., and Hardle, W. (1986), "Random Approximations to Some tribution Function of a Markov Process," The Annals of Mathematical
Measures of Accuracy in Nonparametric Curve Estimation," Journal Statistics, 40, 1386-1400.
of Multivariate Analysis, 20,91-113. - - - (1989), "Hazard Rate Estimation Under Dependence Condi-
Marron, J. S., and Nolan, D. (1987), "Canonical Kernels for Density tions," Journal of Statistical Planning and Inference, 22, 81-93.
Estimation," Technical Report, University of North Carolina, Chapel - - - (1990), "Asymptotic Normality of the Kernel Estimate Under
Hill. Dependence Conditions: Application to Hazard Rate," Journal of Sta-
Marron, 1. S., and Padgett, W. J. (1987), "Asymptotically Optimal tistical Planning and Inference, 25, 81-104.
Bandwidth Selection for Kernel Density Estimators from Randomly Sager, T. W. (1982), "Nonparametric Maximum Likelihood Estimation
Right-Censored Samples," The Annals of Statistics, 15,1520-1535. of Spatial Patterns," The Annals of Statistics, 10, 1125-1136.
Masry, E. (1986), "Recursive Probability Density Estimation for Weakly - - - (1986), "An Application of Isotonic Regression to Multivariate
Dependent Stationary Processes," IEEE Transactions on Information Density Estimation," in Advances in Order Restricted Statistical In-
Theory, 32, 254-267. ference (Springer Lecture Notes in Statistics, Vol. 37), eds. R.· Dyk-
- - - (1989), "Nonparametric Estimation of Conditional Probability stra, T. Robertson, and F. T. Wright, New York: Springer-Verlag, pp.
Densities and Expectations of Stationary Processes: Strong Consistency 69-90.
and Rates," Stochastic Processes and Their Applications, 32, 109- Schafer, H. (1985), "A Note on Data-Adaptive Kernel Estimation of the
128. Hazard and Density Function in the Random Censorship Situation,"
Masry, E., and Gyorfi, L. (1987), "Strong Consistency and Rates for The Annals of Statistics, 13,818-820.
Recursive Probability Density Estimators of Stationary Processes," Schuster, E. F., and Gregory, C. G. (1981), "On the Nonconsistency of
Journal of Multivariate Analysis, 22, 79-93. Maximum Likelihood Nonparametric Density Estimators," in Com-
Mielniczuk, J. (1986), "Some Asymptotic Properties of Kernel Esti- puter Science and Statistics: Proceedings of the 13th Symposium on
mators in Case of Censored Data," The Annals of Statistics, 14, 766- the Interface, ed. W. F. Eddy, New York: Springer-Verlag pp. 295-
773. 298.
Moore, D. S., and Yackel, J. W. (1977), "Consistency Properties of Schwartz, S. C. (1967), "Estimation of Probability Density by an Or-
Nearest Neighbor Density Function Estimators," The Annals of Statis- thogonal Series," The Annals of Mathematical Statistics, 38, 1261-
tics, 5, 143-154. 1265.
Muller, H.-G. (1988), Nonparametric Regression Analysis of Longitu- Scott, D. W. (1979), "On Optimal and Data-Based Histograms," Biom-
dinal Data (Springer Lecture Notes in Statistics), New York: Springer- etrika, 66, 605-610.
Verlag pp. 295-298. - - - (1985), "Average Shifted Histograms: Effective Nonparametric
Nadarya, E. A. (1989), Nonparametric Estimation of Probability Den- Density Estimators in Several Dimensions," The Annals of Statistics,
sities and Regression Curves. Dordrecht, Neth.: Kluwer Academic 13, 1024-1040.
Publishers. - - - (1985b), "Frequency Polygons," Journal of the American Sta-
Nguyen, H. T. (1979), "Density Estimation in a Continuous-Time Sta- tistical Association, 80, 348-354.
tionary Markov Process," The Annals of Statistics, 7, 341-348. - - - (1988), "A Note on Choice of Bivariate Histogram Bin Shape,"
OIkin, 1., and Spiegelman, C. H. (1987), "A Semiparametric Approach Journal of Official Statistics, 4, 47-51.
to Density Estimation," Journal of the American Statistical Associa- Scott, D. W., and Factor, L. E. (1981), "Monte Carlo Study of Three
tion, 82, 858-865. Data-Based Nonparametric Density Estimators," Journal of the Amer-
O'Sullivan, F. (1986), "A Statistical Perspective on Ill-Posed Inverse ican Statistical Association, 76, 9-15.
Problems," Statistical Science, I, 502-527. Scott, D. W., Gotto, A. M., Cole, J. S., and Gorry, G. A. (1978),
Ott, J., and Kronmal, R. A. (1976), "Some Classification Procedures for "Plasma Lipids as Collateral Risk Factors in Coronary Artery Dis-
Multivariate Binary Data Using Orthogonal Functions," Journal of the ease-A Study of 371 Males With Chest Pain," Journal of Chronic
American Statistical Association, 71, 391-399. Diseases, 31, 337-345.
Padgett, W. J., and McNichols, D. T. (1984), "Nonparametric Density Scott, D. W., Tapia, R. A., and Thompson, J. R. (1980), "Nonpara-
Estimation From Censored Data," Communications in Statistics-The- metric Probability Density Estimation by Discrete Maximum Penal-
ory and Methods, 13, 1581-1611. ized-Likelihood Criteria," The Annals of Statistics, 8, 820-832.
Park, B. U., and Marron, J. S. (1990), "Comparison of Data-Driven Scott, D. W., and Terrell, G. R. (1987), "Biased and Unbiased Cross-
Bandwidth Selectors," Journal of the American Statistial Association, Validation in Density Estimation," Journal of the American Statistical
85,66-72. Association, 82, 1131-1146.
Parzen, E. (1962), "On Estimation of a Probability Density Function and Scott, D. W., and Thompson, J. R. (1983), "Probability Density Esti-
Mode," The Annals of Mathematical Statistics, 33, 1065-1076. mation in Higher Dimensions," in Computer Science and Statistics:
- - - (1979), "Nonparametric Statistical Data Modeling" (with dis- Proceedings of the Fifteenth Symposium on the Interface, ed. J. E.
cussion), Journal of the American Statistical Association, 74, 105-131. Gentle, Amsterdam: North-Holland, pp. 173-179.
Prakasa Rao, B. L. S. (1983), Nonparametric Functional Estimation. Sheather, S. J., and Marron, J. S. (1988), "Kernel Quantile Estimators,"
New York: Academic Press. Working Paper 88-012, Australian Graduate School of Management,
Quesenberry, C. P., and Gessaman, M. P. (1968), "Nonparametric Dis- The University of New South Wales, Australia.
crimination Using Tolerance Regions," The Annals of Mathematical Silverman, B. W. (1978c), "Density Ratios, Empirical Likelihood, and
Statistics, 39, 664-673. Cot Death," Applied Statistics, 27, 26-33.
Reiss, R.-D. (1976), "On Minimum Distance Estimators for Unimodal - - - (198Ia), "Density Estimation for Univariate and Bivariate Data,"
Densities," Metrika, 23,7-14. in Interpreting Multivariate Data, ed. V. Barnett, New York: John
Robertson, T. (1967), "On Estimating a Density Which is Measurable Wiley, Ch. 3, pp. 37-53.
With Respect to a a Lattice," The Annals of Mathematical Statistics, - - - (198Ib), "Using Kernel Density Estimates to Investigate Multi-
33, 482-493. modality," Journal of the Royal Statistical Society, Ser. B, 43, 97-
Robertson, T., Wright, F. T., and Dykstra, R. L. (1988), Order Re- 99.
stricted Statistical Inference, New York: John Wiley. - - - (1982a), "Algorithm AS 176. Kernel Density Estimation Using
Rodriguez, C. C., and van Ryzin, J. (1985), "Maximum Entropy His- the Fast Fourier Transform," Applied Statistics, 31,93-97.
tograms," Statistics and Probability Letters, 3, 117-120. - - - (1982b), "On the Estimation of a Probability Density Function
Rosenblatt, M. (1956), "Remarks on Some Nonparametric Estimates of by the Maximum Penalized Likelihood Method," The Annals of Sta-
a Density Function," The Annals of Mathematical Statistics, 27, 832- tistics, 10, 795-810.
837. - - - (1983), "Some Properties of a Test for Multimodality Based on
- - - (1970), "Density Estimates and Markov Sequences," in Non- Kernel Density Estimates," in Probability, Statistics, and Analysis, eds.
parametric Techniques in Statistical Inference, ed. M. L. Puri, Cam- J. F. C. Kingman and G. E. H. Reuter, 248-259, Cambridge: Cam-
bridge, U.K.: Cambridge University Press, pp. 199-210. bridge University Press.
- - - (1971), "Curve Estimates," The Annals of Mathematical Statis- - - - (1984), "Spline Smoothing: The Equivalent Variable Kernel
tics, 42, 1815-1842. Method," The Annals of Statistics, 12, 898-916.
- - - (1979), "Global Measures of Deviation for Kernel and Nearest - - - (1985), "Two Books on Density Estimation," The Annals of Sta-
Neighbor Density Estimates," in Smoothing Techniques for Curve Es- tistics, 13, 1630-1638.
- - - (1986), Density Estimation for Statistics and Data Analysis, New York: Chapman and Hall.
Silverman, B. W., and Jones, M. C. (1988), "E. Fix and J. L. Hodges (1951): An Important Unpublished Contribution to Nonparametric Discriminant Analysis and Density Estimation," Technical Report, University of Bath.
Silverman, B. W., and Young, G. A. (1987), "The Bootstrap: To Smooth or Not To Smooth?" Biometrika, 74, 469-479.
Simpson, D. G. (1987), "Minimum Hellinger Distance Estimation for the Analysis of Count Data," Journal of the American Statistical Association, 82, 802-807.
- - - (1989), "Hellinger Deviance Tests: Efficiency, Breakdown Points, and Examples," Journal of the American Statistical Association, 84, 107-113.
Singpurwalla, N. D., and Wong, M.-Y. (1983), "Estimation of the Failure Rate-A Survey of Nonparametric Methods, Part I: Non-Bayesian Methods," Communications in Statistics, Part A-Theory and Methods, 12, 559-588.
Stone, C. J. (1984), "An Asymptotically Optimal Window Selection Rule for Kernel Density Estimates," The Annals of Statistics, 12, 1285-1297.
Tamura, R. N., and Boos, D. D. (1986), "Minimum Hellinger Distance Estimation for Multivariate Location and Covariance," Journal of the American Statistical Association, 81, 223-229.
Tanner, M. A. (1983), "A Note on the Variable Kernel Estimator of the Hazard Function from Randomly Censored Data," The Annals of Statistics, 11, 994-998.
Tanner, M. A., and Wong, W. H. (1983), "The Estimation of the Hazard Function from Randomly Censored Data by the Kernel Method," The Annals of Statistics, 11, 989-993.
Tapia, R. A., and Thompson, J. R. (1978), Nonparametric Probability Density Estimation, Baltimore, MD: Johns Hopkins University Press.
Tarter, M. E., and Kronmal, R. A. (1976), "An Introduction to the Implementation and Theory of Nonparametric Density Estimation," The American Statistician, 30, 105-112.
Taylor, C. C. (1987), "Akaike's Information Criterion and the Histogram," Biometrika, 74, 636-639.
- - - (1989), "Bootstrap Choice of the Smoothing Parameter in Kernel Density Estimation," Biometrika, 76, 705-712.
Terrell, G. R. (1990), "The Maximal Smoothing Principle in Density Estimation," Journal of the American Statistical Association, 85, 470-477.
Terrell, G. R., and Scott, D. W. (1980), "On Improving Convergence Rates for Nonnegative Kernel Density Estimators," The Annals of Statistics, 8, 1160-1163.
- - - (1985), "Oversmoothed Nonparametric Density Estimates," Journal of the American Statistical Association, 80, 209-214.
Titterington, D. M., and Mill, G. M. (1983), "Kernel-Based Density Estimates from Incomplete Data," Journal of the Royal Statistical Society, Ser. B, 45, 258-266.
Tukey, J. W. (1947), "Non-Parametric Estimation II. Statistically Equivalent Blocks and Tolerance Regions-The Continuous Case," The Annals of Mathematical Statistics, 18, 529-539.
- - - (1948), "Nonparametric Estimation, III. Statistically Equivalent Blocks and Multivariate Tolerance Regions-The Discontinuous Case," The Annals of Mathematical Statistics, 19, 30-39.
Tukey, P. A., and Tukey, J. W. (1981), "Data-Driven View Selection; Agglomeration and Sharpening," in Interpreting Multivariate Data, ed. V. Barnett, New York: John Wiley, Ch. 11, pp. 215-243.
Van Es, A. J. (1990), Aspects of Nonparametric Density Estimation, CWI Tract, Amsterdam: Centre for Mathematics and Computer Science.
Van Ryzin, J. (1973), "On a Histogram Method of Density Estimation," Communications in Statistics, 2, 493-506.
Vitale, R. A. (1975), "A Bernstein Polynomial Approach to Density Estimation," in Statistical Inference and Related Topics (Vol. 2), ed. M. L. Puri, San Francisco: Academic Press, pp. 87-99.
Wagner, T. J. (1975), "Nonparametric Estimates of Probability Densities," IEEE Transactions on Information Theory, 21, 438-440.
Wahba, G. (1971), "A Polynomial Algorithm for Density Estimation," The Annals of Mathematical Statistics, 42, 1870-1886.
- - - (1975a), "Optimal Convergence Properties of Variable Knot, Kernel, and Orthogonal Series Methods for Density Estimation," The Annals of Statistics, 3, 15-29.
- - - (1975b), "Interpolating Spline Methods for Density Estimation I. Equi-Spaced Knots," The Annals of Statistics, 3, 30-48.
- - - (1981), "Data-Based Optimal Smoothing of Orthogonal Series Density Estimates," The Annals of Statistics, 9, 146-156.
Wald, A. (1943), "An Extension of Wilks' Method for Setting Tolerance Limits," The Annals of Mathematical Statistics, 14, 45-55.
Walter, G. (1977), "Properties of Hermite Series Estimation of Probability Density," The Annals of Statistics, 5, 1258-1264. [Addendum: Annals of Statistics, 8, 454-455 (1980)].
Walter, G., and Blum, J. R. (1979), "Probability Density Estimation Using Delta Sequences," The Annals of Statistics, 7, 328-340.
- - - (1984), "A Simple Solution to a Nonparametric Maximum Likelihood Estimation Problem," The Annals of Statistics, 12, 372-379.
Watson, G. S. (1969), "Density Estimation by Orthogonal Series," The Annals of Mathematical Statistics, 40, 1496-1498.
Watson, G. S., and Leadbetter, M. R. (1964), "Hazard Analysis I," Biometrika, 51, 175-184.
Wegman, E. J. (1969), "Maximum Likelihood Histograms," Technical Report, University of North Carolina, Chapel Hill.
- - - (1972), "Nonparametric Probability Density Estimation: I. A Summary of Available Methods," Technometrics, 14, 533-546.
- - - (1975), "Maximum Likelihood Estimation of a Probability Density Function," Sankhya, Ser. A, 37, 211-224.
- - - (1982), "Density Estimation," in Encyclopedia of Statistical Sciences (Vol. 2), eds. S. Kotz and N. L. Johnson, New York: John Wiley, pp. 309-315.
Wegman, E. J., and Davies, H. I. (1979), "Remarks on Some Recursive Estimators of a Probability Density," The Annals of Statistics, 7, 316-327.
Wertz, W. (1978), Statistical Density Estimation: A Survey, Gottingen, F.R.G.: Vandenhoeck and Ruprecht.
Wertz, W., and Schneider, B. (1979), "Statistical Density Estimation: A Bibliography," International Statistical Review, 47, 155-175.
Whittle, P. (1958), "On the Smoothing of Probability Density Functions," Journal of the Royal Statistical Society, Ser. B, 20, 334-343.
Wilks, S. S. (1962), Mathematical Statistics, New York: John Wiley.
Wolverton, C. T., and Wagner, T. J. (1969), "Recursive Estimates of Probability Densities," IEEE Transactions on Systems, Science, and Cybernetics, 5, 307.
Yamato, H. (1971), "Sequential Estimation of a Continuous Probability Density Function and the Mode," Bulletin of Mathematical Statistics, 14, 1-12.
Yandell, B. S. (1983), "Nonparametric Inference for Rates With Censored Survival Data," The Annals of Statistics, 11, 1119-1135.
Yang, S.-S. (1985), "A Smooth Nonparametric Estimator of a Quantile Function," Journal of the American Statistical Association, 80, 1004-1011.