CHAPTER 47
Density Estimation
[Sim96] is a very encompassing text. A more elementary introduction with good explanations is [WJ95]. This also has some plots with datasets relevant to economics, see pp. 1, 11, and there is an R and Splus package called KernSmooth associated with it (but this package does not contain the datasets). A more applied book is [BA97], which goes together with the R and Splus package sm.
47.1 How to Measure the Precision of a Density Estimator
Let $\hat f$ be the estimated density and $f$ the true density. Then for every fixed value $u$, the estimation error at $u$ is $\hat f(u) - f(u)$. This is a random variable which depends on $u$ as a nonrandom parameter. Its expected value is the bias at $u$, $E[\hat f(u) - f(u)]$, and the expected value of its square is the MSE at $u$, $E\bigl[(\hat f(u) - f(u))^2\bigr]$. This is a measure of the precision of the density estimate at point $u$ only.
The overall deviation of the estimated density from the true density can be measured by the integrated squared error (ISE) $\int_{u=-\infty}^{+\infty} (\hat f(u) - f(u))^2\,du$. This is a random variable; for each observation vector, it gives a different number. The mean integrated squared error (MISE) is the expected value of the ISE, and at the same time (as long as integration and formation of the expected value can be interchanged) it is the integral of the MSE at $u$ over all $u$:
$$\text{MISE} = E\Bigl[\int_{u=-\infty}^{+\infty} (\hat f(u) - f(u))^2\,du\Bigr] = \int_{u=-\infty}^{+\infty} E\bigl[(\hat f(u) - f(u))^2\bigr]\,du.$$
The asymptotic value of this is called AMISE. The MSE is the variance plus the squared bias. In density estimation, bias arises if one oversmoothes, and variance increases if one undersmoothes.
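The trade-off between bias and variance can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the data are standard Normal, the estimator is a naive "box" estimate (share of observations within $h$ of $u$, divided by $2h$), and the names `box_estimate` and `pointwise_errors` are invented for this illustration.

```python
import math
import random

def box_estimate(sample, u, h):
    """Naive density estimate at u: share of points within h of u, divided by 2h."""
    inside = sum(1 for x in sample if abs(x - u) <= h)
    return inside / (len(sample) * 2.0 * h)

def pointwise_errors(u, h, n=50, reps=2000, seed=0):
    """Monte Carlo bias, variance and MSE of box_estimate at u for N(0,1) data."""
    rng = random.Random(seed)
    true_density = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    estimates = [box_estimate([rng.gauss(0.0, 1.0) for _ in range(n)], u, h)
                 for _ in range(reps)]
    mean_est = sum(estimates) / reps
    bias = mean_est - true_density
    # Dividing by reps (not reps - 1) makes mse == variance + bias**2 hold exactly.
    variance = sum((e - mean_est) ** 2 for e in estimates) / reps
    mse = sum((e - true_density) ** 2 for e in estimates) / reps
    return bias, variance, mse

# Small h: little smoothing, high variance. Large h: heavy smoothing, large bias.
for h in (0.05, 0.5, 2.0):
    bias, variance, mse = pointwise_errors(u=0.0, h=h)
    print(f"h={h}: bias={bias:.4f} variance={variance:.5f} mse={mse:.5f}")
```

In each row the MSE equals the variance plus the squared bias, and the two components move in opposite directions as $h$ grows.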
47.2 The Histogram
Histograms are density estimates. They are easy to understand, easy to construct, and do not require advanced graphical tools.
Here the number of bins is important. Too few bins lead to oversmoothing, too many to undersmoothing. [Sim96, p. 16] has some math on how to compute the MISE of a histogram, and which bin size is optimal. If the underlying distribution is Normal, the optimal bin width is
$$h = 3.491\,\sigma n^{-1/3}.$$
This is often used also for non-Normal distributions, but if these distributions are bimodal, then one needs narrower bins. The R/S-function dpih (which stands for Direct Plug In Histogram) in the library KernSmooth uses more sophisticated methods to select an optimal bin width.
Also the anchor positions can have a big impact on the appearance of a histogram. To demonstrate this, cd /usr/share/ecmet/xlispstat/anchor-position then do xlispstat, then (load "fde"), then (fde-demo), and pick animate anchor-moving.
Regarding the labeling on the vertical axis of a histogram there is a naive and a more sophisticated approach. The naive approach gives the number of data points in each bin. The preferred, more sophisticated approach is to divide the number of points in each bin by the overall size of the dataset and by the bin width. In this way one gets the relative frequency density. With this normalization, the total area under the histogram is 1 and the histogram is directly comparable with other estimates of the probability density function.
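The Normal-reference bin width and the relative frequency density normalization can be sketched in a few lines. This is an illustration, not the method used by dpih; the function names are hypothetical, $\sigma$ is replaced by the sample standard deviation, and by default the first bin is anchored at the smallest observation (both are assumptions of this sketch).

```python
import math
import statistics

def normal_reference_bin_width(data):
    """h = 3.491 * sigma * n**(-1/3), with sigma estimated by the sample standard deviation."""
    return 3.491 * statistics.stdev(data) * len(data) ** (-1.0 / 3.0)

def density_histogram(data, bin_width, anchor=None):
    """Return (left_edges, heights) with heights = count / (n * bin_width)."""
    if anchor is None:
        anchor = min(data)  # assumption: first bin starts at the smallest observation
    if anchor > min(data):
        raise ValueError("anchor must not exceed the smallest observation")
    n_bins = int(math.floor((max(data) - anchor) / bin_width)) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int(math.floor((x - anchor) / bin_width))] += 1
    heights = [c / (len(data) * bin_width) for c in counts]
    left_edges = [anchor + j * bin_width for j in range(n_bins)]
    return left_edges, heights

data = [2.1, 3.4, 3.9, 4.4, 4.7, 5.0, 5.2, 5.8, 6.3, 7.9]
h = normal_reference_bin_width(data)
edges, heights = density_histogram(data, h)
print(sum(heights) * h)  # total area under the histogram
```

Shifting `anchor` to the left while keeping `bin_width` fixed gives a quick way to see how much the shape depends on the anchor position.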
47.3 The Frequency Polygon
The frequency polygon is derived from the histogram by connecting the mid-points of each bin. It gives a better approximation to the actual density. Now the optimal bin width for a Normal is
$$h = 2.15\,\sigma n^{-1/5}.$$
The frequency polygon dominates the histogram, and is not really more difficult to construct. Simonoff argues that one should never draw histograms, only frequency polygons.
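A frequency polygon can be built directly from the heights of a density-scaled histogram with equal-width bins. The sketch below assumes such a histogram is already available as a list of left bin edges and heights; the function names are hypothetical. Adding one empty bin on each side lets the polygon start and end at zero, and the area under it then equals the area of the histogram.

```python
def frequency_polygon(left_edges, heights, bin_width):
    """Vertices (xs, ys) of the polygon through the bin mid-points.

    One empty bin is added on each side so the polygon starts and ends at zero.
    """
    xs = [left_edges[0] - 0.5 * bin_width]
    ys = [0.0]
    for edge, height in zip(left_edges, heights):
        xs.append(edge + 0.5 * bin_width)
        ys.append(height)
    xs.append(left_edges[-1] + 1.5 * bin_width)
    ys.append(0.0)
    return xs, ys

def polygon_area(xs, ys):
    """Area under the piecewise linear curve (trapezoid rule, exact for a polygon)."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

def normal_reference_polygon_width(sigma, n):
    """Bin width h = 2.15 * sigma * n**(-1/5) for the frequency polygon."""
    return 2.15 * sigma * n ** (-1.0 / 5.0)

xs, ys = frequency_polygon([0.0, 1.0, 2.0], [0.2, 0.5, 0.3], 1.0)
print(list(zip(xs, ys)), polygon_area(xs, ys))
```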
47.4 Kernel Densities
For every observation draw a standard Normal with that point as the mode, and then add them up. An illustration is sm.script(sp build). It can also be a Normal with variance different than 1; the greater the variance, the smoother the density estimate. Instead of the Normal density one can also take other smoothing kernels, i.e., functions $k$ with $\int_{u=-\infty}^{+\infty} k(u)\,du = 1$ and $\int_{u=-\infty}^{+\infty} u\,k(u)\,du = 0$. An often-used kernel is the Triweight kernel $\frac{35}{32}(1 - x^2)^3$ for $|x| \le 1$ and 0 otherwise, but these kernel functions may also assume negative values (in which case they are no longer densities). The choice of the functional form of the kernel is much less important than the bandwidth, i.e., the variance of the kernel (if interpreted as a density function)
$$\mu_2 = \int_{u=-\infty}^{+\infty} u^2 k(u)\,du = 1,$$
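A minimal sketch of a kernel density estimate with the Triweight kernel follows. The scale parameter `h`, which stretches the kernel horizontally, and the function names are assumptions of this illustration; dividing by the number of observations and by `h` makes the estimate integrate to 1.

```python
def triweight(x):
    """Triweight kernel 35/32 * (1 - x^2)^3 on [-1, 1], zero elsewhere."""
    return 35.0 / 32.0 * (1.0 - x * x) ** 3 if abs(x) <= 1.0 else 0.0

def kernel_density(u, data, h, kernel=triweight):
    """Average of kernels centred at the observations and rescaled by h."""
    return sum(kernel((u - x) / h) for x in data) / (len(data) * h)

data = [1.0, 1.5, 3.0]
step = 0.001
grid = [-1.0 + (j + 0.5) * step for j in range(6000)]  # mid-points covering [-1, 5]
area = step * sum(kernel_density(u, data, h=0.8) for u in grid)
print(area)  # numerical integral of the estimate over its support
```

A larger `h` spreads each kernel over more of the axis and gives a smoother curve; a smaller `h` lets individual observations show up as separate bumps.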
Problem 446. If $u \mapsto k(u)$ is the kernel, and $x = [x_1 \cdots x_n]^\top$ the data vector, then $\hat f(u) = \frac{1}{n}\sum_{i=1}^{n} k(u - x_i)$ is the kernel estimate of the density at $u$.
• a. 3 points Compute the mean of the kernel estimator at $u$.
Answer. $E[\hat f(u)] = \frac{1}{n}\sum_{i=1}^{n} E[k(u - x_i)]$, but since all $x_i$ are assumed to come from the same distribution, it follows that $E[\hat f(u)] = E[k(u - x)] = \int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx$.
• b. 4 points Assuming the $x_i$ are independent, show that
(47.4.1) $\operatorname{var}[\hat f(u)] = \frac{1}{n}\Bigl(\int_{x=-\infty}^{+\infty} k^2(u - x) f(x)\,dx - \Bigl(\int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx\Bigr)^2\Bigr)$
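Answer.
(47.4.2) $\operatorname{var}[\hat f(u)] = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}[k(u - x_i)]$
(47.4.3) $= \frac{1}{n}\operatorname{var}[k(u - x)]$
(47.4.4) $= \frac{1}{n}\Bigl(E\bigl[k(u - x)^2\bigr] - \bigl(E[k(u - x)]\bigr)^2\Bigr)$
(47.4.5) $= \frac{1}{n}\Bigl(\int_{x=-\infty}^{+\infty} k^2(u - x) f(x)\,dx - \Bigl(\int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx\Bigr)^2\Bigr)$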
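The two results of Problem 446 can be checked numerically. The sketch below is an illustration under stated assumptions: both the kernel $k$ and the true density $f$ are taken to be standard Normal, the integrals are approximated by a midpoint rule on $[-10, 10]$, and all function names are hypothetical. The simulated mean and variance of $\hat f(u)$ should be close to the values from the formulas.

```python
import math
import random

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def integrate(g, lo=-10.0, hi=10.0, steps=4000):
    """Midpoint rule; adequate for the smooth, rapidly decaying integrands used here."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (j + 0.5) * width) for j in range(steps))

def theoretical_mean_and_variance(u, n, k=normal_pdf, f=normal_pdf):
    """Mean from part a and variance (47.4.1) of the kernel estimate at u."""
    first = integrate(lambda x: k(u - x) * f(x))
    second = integrate(lambda x: k(u - x) ** 2 * f(x))
    return first, (second - first ** 2) / n

def simulated_mean_and_variance(u, n, reps=4000, seed=1):
    """Monte Carlo counterpart, drawing x_1, ..., x_n independently from N(0,1)."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        values.append(sum(normal_pdf(u - x) for x in sample) / n)
    mean = sum(values) / reps
    return mean, sum((v - mean) ** 2 for v in values) / (reps - 1)

print(theoretical_mean_and_variance(0.0, n=20))
print(simulated_mean_and_variance(0.0, n=20))
```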
47.5 Transformational Kernel Density Estimators
This approach transforms the data first, then estimates a density of the transformed data, and then re-transforms this density to the original scale. This can be used, for instance, for the income distribution; see [WJ95, p. 11].
47.6 Confidence Bands
The variance of a density estimate is usually easier to compute than the bias. One method to get a confidence band is to draw additional curves 2 estimated pointwise standard deviations above and below the plot. This makes the assumption that the bias is 0. Therefore it is not really usable for inference, but it may give some idea whether certain features of the plot should be taken seriously: sm.script(air band). Another approach is bootstrapping: sm.script(air boot). The expected value of the bootstrapped density functions is $\hat f$ (and not $f$); therefore bootstrapping will not reveal the bias, but it does reveal the variance of the density estimate.
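The band of two pointwise standard deviations can be sketched as follows. This is not the code behind the sm scripts; it is an illustration that assumes a Normal kernel with scale `h`, estimates the variance in (47.4.1) by the sample variance of the $n$ kernel contributions divided by $n$, and ignores the bias, as described above. The function names are hypothetical, and at least two observations are needed.

```python
import math

def normal_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def variability_band(u, data, h):
    """Return (lower, estimate, upper) at u, with upper - estimate = 2 standard errors."""
    n = len(data)
    terms = [normal_kernel((u - x) / h) / h for x in data]
    estimate = sum(terms) / n
    # Sample variance of the individual contributions, then divide by n for the mean.
    sample_var = sum((t - estimate) ** 2 for t in terms) / (n - 1)
    std_err = math.sqrt(sample_var / n)
    return estimate - 2.0 * std_err, estimate, estimate + 2.0 * std_err

data = [2.1, 3.4, 3.9, 4.4, 4.7, 5.0, 5.2, 5.8, 6.3, 7.9]
for u in (3.0, 5.0, 7.0):
    print(u, variability_band(u, data, h=0.7))
```

Because the band is centred on the estimate rather than on the true density, it shows variability only; a feature that disappears inside the band is probably not worth interpreting.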
47.7 Other Approaches to Density Estimation
Variable bandwidth methods
Nearest Neighbor methods
Orthogonal Series Methods: project the data on an orthogonal basis and only use the first few terms. Advantage: here one actually knows the functional form of the estimated density. See [BA97, pp. 19–21].
47.8 Two- and Three-Dimensional Densities
sm.script(air dens) and sm.script(air imag) give different representations of a two-dimensional density; sm.script(air cont) gives the evolution over time (dotted line is first, dashed second, and solid line third).
sm.script(mag scat) is the plot of a dataset containing 3-dimensional directions (longitude and latitude). Here is a kernel function and a smoothed representation of this dataset: sm.script(mag dens).
Problem 447. Write a function that translates the latitude and longitude data of the magrem dataset into a 3-dimensional dataset which can be loaded into xgobi.
Here is a 3-dimensional rendering of the geyser data: provide.data(geys3d) and then xgobi(geys3d). The script which draws a 3-dimensional density contour does not work right now: sm.script(geys td).
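One possible approach to the translation asked for in Problem 447 is to map each direction to a point on the unit sphere. The sketch below assumes that latitude and longitude are given in degrees and that the data arrive as (latitude, longitude) pairs; the layout and units of the actual magrem dataset are not stated here, so both are assumptions, and the function names are hypothetical.

```python
import math

def direction_to_xyz(latitude_deg, longitude_deg):
    """Unit vector for a direction given by latitude and longitude in degrees."""
    lat = math.radians(latitude_deg)
    lon = math.radians(longitude_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def directions_to_xyz(pairs):
    """Convert a list of (latitude, longitude) pairs into a list of 3-dimensional points."""
    return [direction_to_xyz(lat, lon) for lat, lon in pairs]

print(directions_to_xyz([(0.0, 0.0), (45.0, 90.0), (90.0, 10.0)]))
```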
47.9 Other Characterizations of Distributions
Instead of the density function one can also give smoothed versions of the empirical cumulative distribution function, or of the hazard function $\frac{f(u)}{1 - F(u)}$.
47.10 Quantile-Quantile Plots
The QQ-plot is a plot of the quantile functions, as defined in (3.4.14), of two different distributions against each other.
The graph of a cumulative distribution function is given in Figure 1, and the corresponding quantile function is given in Figure 2. The bullets at the beginning of the lines in the cumulative distribution function indicate that the line includes its infimum but not its supremum. The quantile function has the bullets at the end of the lines.
The “theoretical QQ plot” of two distributions which have distribution functions $F_1$ and $F_2$ and quantile functions $F_1^{-1}$ and $F_2^{-1}$ is the set of all $(x_1, x_2) \in \mathbb{R}^2$ for which there exists a $p$ such that $x_1 = F_1^{-1}(p)$ and $x_2 = F_2^{-1}(p)$.
If both distributions are continuous and strictly increasing, then the theoretical QQ-plot is continuous as well. If the cumulative distribution functions have horizontal straight line segments, then the theoretical QQ-plot has gaps. If one of the two distribution functions is a step function and the other is continuous, then the theoretical QQ-plot is a step function; and if both distribution functions are step functions, then the theoretical QQ-plot consists of isolated points.
Here is a practical instruction how to construct a QQ plot from the given cumulative distribution functions: Draw the cumulative distribution functions of the two distributions which you want to compare into the same diagram. Then, for every value $p$ between 0 and 1, plot the abscissa of the intersection of the horizontal line with height $p$ with the first cumulative distribution function against the abscissa of its intersection with the second. If there are horizontal line segments in these distribution functions, then the suprema of these line segments should be used. If the cumulative distribution function is a step function stepping over $p$, then the value at which the step occurs should be used.
If the QQ-plot is a straight line through the origin, then the two distributions are either identical, or the underlying random variables differ only by a scale factor. The plots have special sensitivity regarding differences in the tail areas of the two distributions.
Problem 448. Let $F_1$ be the cumulative distribution function of random variable $x_1$, and $F_2$ that of the variable $x_2$ whose distribution is the same as that of $\alpha x_1$, where $\alpha$ is a positive constant. Show that the theoretical QQ plot of these two distributions is contained in the straight line $q_2 = \alpha q_1$.
Answer. $(x_1, x_2) \in$ QQ-plot $\iff$ a $p$ exists with $x_1 = F_1^{-1}(p) = \inf\{u : \Pr[x_1 \le u] \ge p\}$ and $x_2 = F_2^{-1}(p) = \inf\{u : \Pr[x_2 \le u] \ge p\} = \inf\{u : \Pr[\alpha x_1 \le u] \ge p\}$. Write $v = u/\alpha$, i.e., $u = \alpha v$; then $x_2 = \inf\{\alpha v : \Pr[\alpha x_1 \le \alpha v] \ge p\} = \inf\{\alpha v : \Pr[x_1 \le v] \ge p\} = \alpha \inf\{v : \Pr[x_1 \le v] \ge p\} = \alpha x_1$.
In other words, if one makes a QQ plot of a normal with mean zero and variance 2 on the vertical axis against a normal with mean zero and variance 1 on the horizontal axis, one gets a straight line with slope $\sqrt{2}$, the ratio of the two standard deviations. This makes such plots so valuable, since visual inspection can easily discriminate whether a curve is a straight line or not. To repeat, QQ plots have the great advantage that one only needs to know the correct distribution up to a scale factor!
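The result of Problem 448 can be illustrated numerically: if the second variable is $\alpha$ times the first, every point of the theoretical QQ plot lies on the line through the origin with slope $\alpha$. The sketch uses `statistics.NormalDist` from the Python standard library, whose second argument is the standard deviation, so variance 2 corresponds to $\alpha = \sqrt{2}$; the function name `theoretical_qq` and the probability grid are chosen for this illustration only.

```python
from statistics import NormalDist

def theoretical_qq(dist_x, dist_y, probabilities):
    """Pairs (F1^{-1}(p), F2^{-1}(p)) for the given probabilities."""
    return [(dist_x.inv_cdf(p), dist_y.inv_cdf(p)) for p in probabilities]

probs = [i / 20 for i in range(1, 20)]
points = theoretical_qq(NormalDist(0.0, 1.0), NormalDist(0.0, 2.0 ** 0.5), probs)

# Slope of the line from the origin to each point (the median point is (0, 0) and is skipped).
slopes = [y / x for x, y in points if abs(x) > 1e-12]
print(min(slopes), max(slopes))
```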
QQ-plots can not only be used to compare two probability measures, but an important application is to decide whether a given sample comes from a given distribution by plotting the quantile function of the empirical distribution of the sample, compare (3.4.17), against the quantile function of the given cumulative distribution function. Since empirical cumulative distribution functions and quantile functions are step functions, the resulting theoretical QQ plot is also a step function.
In order to make it easier to compare this QQ plot with a straight line, one usually does not draw the full step function but one chooses one point on the face of each step, so that the plot contains one point per observation. This is like plotting the given sample against a very regular sample from the given distribution. Where on the face of each step should one choose these points? One wants to choose that ordinate where the first step in an empirical cumulative distribution function should usually be.
It is a mathematically complicated problem to compute for instance the “usual location” (say, the “expected value”) of the smallest of 50 normally distributed variables. But there is one simple method which gives roughly the right locations independently of the distribution used. Draw the cumulative distribution function (cdf) which you want to test against, and then draw, between the zero line and the line $p = 1$, $n$ parallel lines which divide the unit strip into $n + 1$ equidistant strips. The intersection points of these $n$ lines with the cdf will roughly give the locations where the smallest, second smallest, etc., of a sample of $n$ normally distributed observations should be found.
For a mathematical justification of this, make the following thought experiment. Assume you have $n$ observations from a uniform distribution on the unit interval. Where should you expect the smallest observation to be? The answer is given by the simple result that the expected value of the smallest observation is $1/(n + 1)$, the expected value of the second-smallest observation is $2/(n + 1)$, etc. In other words, on average, the $n$ observations cut the unit interval into $n + 1$ intervals of equal length.
Therefore we do know where the first step of an empirical cumulative distribution function of a uniform random variable should be, and it is a very simple formula. But this can be transferred to the general case by the following fact: if one plugs any continuous random variable into its cumulative distribution function, one obtains a uniform distribution! These locations will therefore give, strictly speaking, the usual values of the smallest, second smallest, etc., observation of $F_x(x)$, but the usual values for $x$ itself cannot be far from this.
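The rule of dividing the unit strip into $n + 1$ equal parts translates into the plotting positions $p_i = i/(n + 1)$. The sketch below pairs the $i$-th smallest observation with the theoretical quantile at $p_i$, here for a standard Normal reference distribution; the function names and the small sample are made up for illustration, and statistical packages may use slightly different plotting positions.

```python
from statistics import NormalDist

def plotting_positions(n):
    """Heights i/(n+1), i = 1..n, dividing the unit strip into n+1 equal strips."""
    return [i / (n + 1) for i in range(1, n + 1)]

def qq_points(sample, dist=NormalDist()):
    """One point per observation: (theoretical quantile at i/(n+1), i-th smallest value)."""
    ordered = sorted(sample)
    positions = plotting_positions(len(ordered))
    return [(dist.inv_cdf(p), x) for p, x in zip(positions, ordered)]

sample = [0.3, -1.2, 0.8, 2.1, -0.4, 0.1, 1.3]
for theoretical, observed in qq_points(sample):
    print(f"{theoretical:7.3f} {observed:7.3f}")
```

Plotting the second column against the first gives the usual one-point-per-observation QQ plot, with the data on the vertical axis.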
If one plots the data on the vertical axis versus the standard normal on the horizontal axis (the default for the R-function qqnorm), then an S-shaped plot indicates a light-tailed distribution, an inverse S says that the distribution is heavy-tailed (try qqnorm(rt(25,df=1)) as an example), a C is left-skewed, and an inverse C, a J, is right-skewed. A right-skewed, or positively skewed, distribution is one which has a long right tail, like the lognormal qqnorm(rlnorm(25)) or chisquare qqnorm(rchisq(25,df=3)).
The classic reference which everyone has read and which explains it all is [Gum58, pp. 28–34 and 46/47]. Also [WG68] is useful, with many examples.
47.11 Testing for Normality
[Vas76] is a test for Normality based on entropy.
CHAPTER 48
Measuring Economic Inequality
48.1 Web Resources about Income Inequality
• UNU/Wider-UNDP World Income Inequality Database (4500 Gini-coefficients) www.wider.unu.edu/wiid/wiid.htm
• Luxembourg Income Study (detailed household surveys) http://lissy.ceps.lu
• World Bank site on Inequality, Poverty, and Socio-economic Performance http://www.worldbank.org/poverty/inequal/index.htm
• World Bank Data on Poverty and Inequality http://www.worldbank.org/poverty/data/index.htm
• World Bank Living Standard Measurement Surveys http://www.worldbank.org/lsms/
• University of Texas Inequality Project (Theil indexes on manufacturing and industrial wages for 71 countries) http://utip.gov.utexas.edu/
• MacArthur Network on the Effects of Inequality on Economic Performance, Institute of International Studies, University of California, Berkeley http://globetrotter.berkeley.edu/macarthur/inequality/
• Inter-American Development Bank site on Poverty and Inequality http://www.iadb.org/sds/document.cfm/5/ENGLISH
48.2 Graphical Representations of Inequality
See [Cow77] (there is now a second edition out, but our library only has this one):
• Pen’s Parade: Suppose that everyone’s height is proportional to his or her
income. Line everybody in the population up in order of height. Then the
curve which their heights draw out is “Pen’s Parade.” It is the empirical
quantile function of the sample. It highlights the presence of any extremely
large incomes and, to a certain extent, abnormally small incomes. But the
information on middle income receivers is not so obvious.
• Histogram or other estimates of the underlying density function. Suppose
you are looking down on a field. On one side, there is a long straight fence
marked off with income categories: the physical distance between any two
points directly corresponds to the income differences they represent. Get
the whole population to come into the field and line up in the strip of land
marked off by the piece of fence corresponding to their income bracket. The
shape you will see is a histogram of the income distribution. This shows