Class Notes in Statistics and Econometrics, Part 24


CHAPTER 47

Density Estimation

[Sim96] is a very encompassing text. A more elementary introduction with good explanations is [WJ95]. This also has some plots with datasets relevant to economics, see pp. 1, 11, and there is an R and Splus package called KernSmooth associated with it (but this package does not contain the datasets). A more applied book is [BA97], which goes together with the R and Splus package sm.

47.1 How to Measure the Precision of a Density Estimator

Let $\hat f$ be the estimated density and $f$ the true density. Then for every fixed value $u$, the estimation error at $u$ is $\hat f(u) - f(u)$. This is a random variable which depends on $u$ as a nonrandom parameter. Its expected value is the bias at $u$, $E[\hat f(u) - f(u)]$, and the expected value of its square is the MSE at $u$, $E\bigl[(\hat f(u) - f(u))^2\bigr]$. This is a measure of the precision of the density estimate at the point $u$ only.

The overall deviation of the estimated density from the true density can be measured by the integrated squared error (ISE) $\int_{u=-\infty}^{+\infty} (\hat f(u) - f(u))^2\,du$. This is a random variable; for each observation vector, it gives a different number. The mean integrated squared error (MISE) is the expected value of the ISE, and at the same time (as long as integration and formation of the expected value can be interchanged) it is the integral of the MSE at $u$ over all $u$:

\[
\mathrm{MISE} = E\Bigl[\int_{u=-\infty}^{+\infty} (\hat f(u) - f(u))^2\,du\Bigr] = \int_{u=-\infty}^{+\infty} E\bigl[(\hat f(u) - f(u))^2\bigr]\,du .
\]

The asymptotic value of this is called AMISE.

The MSE is the variance plus the squared bias. In density estimation, bias arises if one oversmooths, and variance increases if one undersmooths.
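The decomposition of the MSE into variance plus squared bias can be checked by simulation. The following is only a sketch under assumptions not made in the text: the true density is taken to be standard Normal, the estimator is a Normal kernel estimate, and the names `pointwise_error_summary`, `n`, `h`, `reps` and `seed` are hypothetical.

```python
import math
import random

def gauss_kernel(z, h):
    """Normal density with mean 0 and standard deviation h, evaluated at z."""
    return math.exp(-0.5 * (z / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def kde_at(u, data, h):
    """Kernel density estimate at the single point u."""
    return sum(gauss_kernel(u - x, h) for x in data) / len(data)

def pointwise_error_summary(u, n=50, h=0.4, reps=2000, seed=1):
    """Simulate reps samples of size n from N(0,1) and summarize the error at u."""
    rng = random.Random(seed)
    true_f = gauss_kernel(u, 1.0)  # assumed true density: standard Normal
    estimates = [kde_at(u, [rng.gauss(0.0, 1.0) for _ in range(n)], h)
                 for _ in range(reps)]
    mean_est = sum(estimates) / reps
    bias = mean_est - true_f
    variance = sum((e - mean_est) ** 2 for e in estimates) / reps
    mse = sum((e - true_f) ** 2 for e in estimates) / reps
    return bias, variance, mse

bias, variance, mse = pointwise_error_summary(u=0.0)
print("bias:", bias, "variance:", variance)
print("MSE:", mse, "variance + bias^2:", variance + bias ** 2)
```

With a fairly large bandwidth such as the one assumed here, the estimate at the mode is pulled downward, which is the bias due to oversmoothing mentioned above.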

47.2 The Histogram

Histograms are density estimates. They are easy to understand, easy to construct, and do not require advanced graphical tools.

Here the number of bins is important. Too few bins lead to oversmoothing, too many to undersmoothing. [Sim96, p. 16] has some math on how to compute the MISE of a histogram, and which bin size is optimal. If the underlying distribution is Normal, the optimal bin width is

\[
h = 3.491\,\sigma n^{-1/3} .
\]

This is often used also for non-Normal distributions, but if these distributions are bimodal, then one needs narrower bins. The R/S-function dpih (which stands for Direct Plug In Histogram) in the library KernSmooth uses more sophisticated methods to select an optimal bin width.
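As a small illustration, the Normal-reference rule can be evaluated directly. This is a sketch only; the function name `normal_reference_bin_width` is hypothetical, and in practice $\sigma$ would have to be replaced by an estimate such as the sample standard deviation.

```python
def normal_reference_bin_width(sigma, n):
    """Bin width h = 3.491 * sigma * n^(-1/3) for a histogram of Normal data."""
    return 3.491 * sigma * n ** (-1.0 / 3.0)

# The bin width shrinks only slowly as the sample size grows.
for n in (100, 1000, 10000):
    print(n, normal_reference_bin_width(1.0, n))
```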

Also the anchor positions can have a big impact on the appearance of a histogram. To demonstrate this, cd /usr/share/ecmet/xlispstat/anchor-position then do xlispstat, then (load "fde"), then (fde-demo), and pick animate anchor-moving.

Regarding the labeling on the vertical axis of a histogram there is a naive and a more sophisticated approach. The naive approach gives the number of data points in each bin. The preferred, more sophisticated approach is to divide the number of points in each bin by the overall size of the dataset and by the bin width. In this way one gets the relative frequency density. With this normalization, the total area under the histogram is 1 and the histogram is directly comparable with other estimates of the probability density function.
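Both points, the density normalization and the influence of the anchor position, can be tried out on a toy dataset. The data values and the function name `density_histogram` below are made up for illustration; bins are taken to be closed on the left and open on the right, which is an assumption.

```python
import math

def density_histogram(data, anchor, h):
    """Return (left_edges, heights) with heights = count / (n * h)."""
    n = len(data)
    first = math.floor((min(data) - anchor) / h)
    last = math.floor((max(data) - anchor) / h)
    counts = [0] * (last - first + 1)
    for x in data:
        counts[math.floor((x - anchor) / h) - first] += 1
    edges = [anchor + (first + j) * h for j in range(len(counts))]
    heights = [c / (n * h) for c in counts]
    return edges, heights

data = [1.2, 1.9, 2.1, 2.4, 2.5, 3.3, 3.8, 4.0, 4.6, 6.1]  # hypothetical data
h = 1.0
results = {anchor: density_histogram(data, anchor, h) for anchor in (0.0, 0.5)}
for anchor, (edges, heights) in results.items():
    area = sum(height * h for height in heights)
    print("anchor", anchor, "heights", heights, "area", area)
```

The area is 1 for either anchor, but the heights, and therefore the visual impression, change when the anchor is shifted by half a bin width.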


47.3 The Frequency Polygon

The frequency polygon is derived from the histogram by connecting the midpoints of the tops of the bins. It gives a better approximation to the actual density. Now the optimal bin width for a Normal is

\[
h = 2.15\,\sigma n^{-1/5} .
\]

The frequency polygon dominates the histogram, and is not really more difficult to construct. Simonoff argues that one should never draw histograms, only frequency polygons.
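A minimal sketch of the construction, assuming a density-scaled histogram with equal bin width is already given as left bin edges and heights. The function names are hypothetical, and adding one empty bin on each side (so that the polygon returns to zero) is a common convention assumed here, not something the text prescribes.

```python
def frequency_polygon(edges, heights, h):
    """Vertices (x, y) at the bin midpoints, with an empty bin added on each side."""
    xs = [edges[0] - h / 2.0] + [e + h / 2.0 for e in edges] + [edges[-1] + 1.5 * h]
    ys = [0.0] + list(heights) + [0.0]
    return list(zip(xs, ys))

def polygon_area(vertices):
    """Area under the piecewise-linear curve (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]))

def polygon_bin_width(sigma, n):
    """Bin width h = 2.15 * sigma * n^(-1/5) for Normal data."""
    return 2.15 * sigma * n ** (-1.0 / 5.0)

edges = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # hypothetical left bin edges
heights = [0.2, 0.3, 0.2, 0.2, 0.0, 0.1]     # density scale, bin width 1
vertices = frequency_polygon(edges, heights, h=1.0)
print(vertices)
print("area:", polygon_area(vertices))
```

With the two flanking empty bins the area under the polygon is again 1, so the polygon is itself a density estimate.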

47.4 Kernel Densities

For every observation draw a standard Normal density with that point as the mode, and then add them up. An illustration is sm.script(sp build). It can also be a Normal with variance different from 1; the greater the variance, the smoother the density estimate.

Instead of the Normal density one can also take other smoothing kernels, i.e., functions $k$ with $\int_{u=-\infty}^{+\infty} k(u)\,du = 1$ and $\int_{u=-\infty}^{+\infty} u\,k(u)\,du = 0$. An often-used kernel is the Triweight kernel $\frac{35}{32}(1 - x^2)^3$ for $|x| \le 1$ and 0 otherwise, but these kernel functions may also assume negative values (in which case they are no longer densities). The choice of the functional form of the kernel is much less important than the bandwidth, i.e., the variance of the kernel (if interpreted as a density function)

\[
\mu_2 = \int_{-\infty}^{+\infty} u^2 k(u)\,du = 1 .
\]
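The following sketch evaluates a kernel estimate with either the Normal or the Triweight kernel. The explicit scale parameter `h`, which stretches the kernel as $k(z/h)/h$, is an assumption made here for convenience; all function names and the data values are hypothetical.

```python
import math

def triweight(z):
    """Triweight kernel 35/32 (1 - z^2)^3 on [-1, 1], zero elsewhere."""
    return 35.0 / 32.0 * (1.0 - z * z) ** 3 if abs(z) <= 1.0 else 0.0

def std_normal(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kernel_density(u, data, kernel, h=1.0):
    """Average of kernels centred at the observations, rescaled by h."""
    return sum(kernel((u - x) / h) / h for x in data) / len(data)

def integrate(func, lo, hi, steps=20000):
    """Midpoint rule; adequate for these smooth integrands."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))

data = [-1.0, 0.0, 0.4, 2.0]  # hypothetical observations
print("integral of k:", integrate(triweight, -1.0, 1.0))
print("integral of u k(u):", integrate(lambda z: z * triweight(z), -1.0, 1.0))
for u in (-1.0, 0.0, 1.0):
    print(u, kernel_density(u, data, std_normal), kernel_density(u, data, triweight, h=0.5))
```

Because each kernel integrates to 1, so does the average of the kernels, and the estimate is a density whenever the kernel is nonnegative.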


Problem 446. If $u \mapsto k(u)$ is the kernel, and $x = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}^\top$ the data vector, then $\hat f(u) = \frac{1}{n}\sum_{i=1}^{n} k(u - x_i)$ is the kernel estimate of the density at $u$.

• a. 3 points Compute the mean of the kernel estimator at $u$.

Answer. $E[\hat f(u)] = \frac{1}{n}\sum_{i=1}^{n} E[k(u - x_i)]$, but since all $x_i$ are assumed to come from the same distribution, it follows $E[\hat f(u)] = E[k(u - x)] = \int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx$.

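• b. 4 points Assuming the $x_i$ are independent, show that

\[
\operatorname{var}[\hat f(u)] = \frac{1}{n}\biggl(\int_{x=-\infty}^{+\infty} k^2(u - x) f(x)\,dx - \Bigl(\int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx\Bigr)^2\biggr) \tag{47.4.1}
\]

Answer.

\begin{align}
\operatorname{var}[\hat f(u)] &= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}[k(u - x_i)] \tag{47.4.2}\\
&= \frac{1}{n}\operatorname{var}[k(u - x)] \tag{47.4.3}\\
&= \frac{1}{n}\Bigl(E\bigl[k(u - x)^2\bigr] - \bigl(E[k(u - x)]\bigr)^2\Bigr) \tag{47.4.4}\\
&= \frac{1}{n}\biggl(\int_{x=-\infty}^{+\infty} k^2(u - x) f(x)\,dx - \Bigl(\int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx\Bigr)^2\biggr) \tag{47.4.5}
\end{align}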
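The two results of Problem 446 can be evaluated numerically in a special case. The sketch below assumes, beyond what the problem states, that $f$ is the standard Normal density and that $k$ is a Normal density with standard deviation `h`; in that case both integrals have well-known closed forms (the convolution of two Normal densities is again Normal), which are used here only as a cross-check. All names are hypothetical.

```python
import math

def normal_pdf(z, sd):
    """Normal density with mean 0 and standard deviation sd."""
    return math.exp(-0.5 * (z / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def integrate(func, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint rule on [lo, hi]; the integrands are negligible outside."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))

def kde_mean_and_variance(u, n, h):
    """E[f_hat(u)] and var[f_hat(u)] from (47.4.1), by numerical integration."""
    first = integrate(lambda x: normal_pdf(u - x, h) * normal_pdf(x, 1.0))
    second = integrate(lambda x: normal_pdf(u - x, h) ** 2 * normal_pdf(x, 1.0))
    return first, (second - first ** 2) / n

u, n, h = 0.5, 100, 0.3
mean, var = kde_mean_and_variance(u, n, h)

# Closed forms for this Normal/Normal special case.
closed_mean = normal_pdf(u, math.sqrt(1.0 + h ** 2))
closed_second = normal_pdf(u, math.sqrt(1.0 + h ** 2 / 2.0)) / (2.0 * h * math.sqrt(math.pi))
closed_var = (closed_second - closed_mean ** 2) / n

print("mean:", mean, closed_mean)
print("variance:", var, closed_var)
print("true density at u:", normal_pdf(u, 1.0))
```

The mean of the estimator differs from the true density at $u$, which is the bias, while the variance falls in proportion to $1/n$.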


47.5 Transformational Kernel Density Estimators

This approach transforms the data first, then estimates a density of the transformed data, and then re-transforms this density to the original scale. For instance, an estimate of the income distribution can use this, see [WJ95, p. 11].
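As an illustration, here is a sketch with the logarithm as the transformation, which is a natural but here merely assumed choice for positive, right-skewed data such as incomes. If $g$ is the estimated density of $y = \log x$, the density on the original scale is $g(\log x)/x$. The income figures and function names are made up.

```python
import math

def normal_pdf(z, sd):
    """Normal density with mean 0 and standard deviation sd."""
    return math.exp(-0.5 * (z / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def kde(u, data, h):
    """Normal-kernel density estimate at u."""
    return sum(normal_pdf(u - x, h) for x in data) / len(data)

def log_transform_kde(x, data, h):
    """Density estimate on the original scale for positive data."""
    if x <= 0.0:
        return 0.0
    logs = [math.log(v) for v in data]
    # Estimate on the log scale, then multiply by the Jacobian 1/x.
    return kde(math.log(x), logs, h) / x

incomes = [12.0, 18.0, 25.0, 31.0, 40.0, 55.0, 90.0, 250.0]  # hypothetical
for x in (10.0, 30.0, 100.0, 300.0):
    print(x, log_transform_kde(x, incomes, h=0.4))
```

A fixed bandwidth on the log scale corresponds to a bandwidth that grows with income on the original scale, and the estimate puts no mass on negative values.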

47.6 Confidence Bands

The variance of a density estimate is usually easier to compute than the bias. One method to get a confidence band is to draw additional curves with 2 estimated pointwise standard deviations above and below the plot. This makes the assumption that the bias is 0. Therefore it is not really usable for inference, but it may give some idea whether certain features of the plot should be taken seriously: sm.script(air band). Another approach is bootstrapping: sm.script(air boot). The expected value of the bootstrapped density functions is $\hat f$ (and not $f$); therefore bootstrapping will not reveal the bias, but it does reveal the variance of the density estimate.
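A rough sketch of how the two ideas can be combined: the pointwise standard deviations are estimated by bootstrapping, and the band is drawn 2 standard deviations above and below the estimate. This is not what the sm scripts do internally; the Normal kernel, the bandwidth, the number of resamples and all names are assumptions.

```python
import math
import random

def normal_pdf(z, sd):
    """Normal density with mean 0 and standard deviation sd."""
    return math.exp(-0.5 * (z / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def kde(u, data, h):
    """Normal-kernel density estimate at u."""
    return sum(normal_pdf(u - x, h) for x in data) / len(data)

def bootstrap_band(grid, data, h, reps=200, seed=0):
    """Estimate with +/- 2 bootstrap standard deviations at each grid point."""
    rng = random.Random(seed)
    n = len(data)
    estimate = [kde(u, data, h) for u in grid]
    # Each resample draws n observations with replacement from the data.
    draws = [[kde(u, resample, h) for u in grid]
             for resample in (rng.choices(data, k=n) for _ in range(reps))]
    band = []
    for j, centre in enumerate(estimate):
        values = [d[j] for d in draws]
        mean = sum(values) / reps
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (reps - 1))
        band.append((centre - 2.0 * sd, centre, centre + 2.0 * sd))
    return band

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(40)]  # simulated sample
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
band = bootstrap_band(grid, data, h=0.5)
for u, (lower, centre, upper) in zip(grid, band):
    print(u, lower, centre, upper)
```

The band is centred on the estimate itself, so it describes sampling variability only and says nothing about the bias.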

47.7 Other Approaches to Density Estimation

• Variable bandwidth methods

• Nearest Neighbor methods

• Orthogonal Series Methods: project the data on an orthogonal base and only use the first few terms. Advantage: here one actually knows the functional form of the estimated density. See [BA97, pp. 19–21].

47.8 Two- and Three-Dimensional Densities

sm.script(air dens) and sm.script(air imag) give different representations of a two-dimensional density; sm.script(air cont) gives the evolution over time (dotted line is first, dashed second, and solid line third).

sm.script(mag scat) is the plot of a dataset containing 3-dimensional directions (longitude and latitude). Here is a kernel function and a smoothed representation of this dataset: sm.script(mag dens).

Problem 447. Write a function that translates the latitude and longitude data of the magrem dataset into a 3-dimensional dataset which can be loaded into xgobi.

Here is a 3-dimensional rendering of the geyser data: provide.data(geys3d) and then xgobi(geys3d). The script which draws a 3-dimensional density contour does not work right now: sm.script(geys td).


47.9 Other Characterizations of Distributions

Instead of the density function one can also give smoothed versions of the empirical cumulative distribution function, or of the hazard function $\frac{f(u)}{1 - F(u)}$.

47.10 Quantile-Quantile Plots

The QQ-plot is a plot of the quantile functions, as defined in (3.4.14), of two different distributions against each other.

The graph of a cumulative distribution function is given in Figure 1, and the corresponding quantile function is given in Figure 2. The bullets at the beginning of the lines in the cumulative distribution function indicate that the line includes its infimum but not its supremum. The quantile function has the bullets at the end of the lines.

The “theoretical QQ plot” of two distributions which have distribution functions $F_1$ and $F_2$ and quantile functions $F_1^{-1}$ and $F_2^{-1}$ is the set of all $(x_1, x_2) \in \mathbb{R}^2$ for which there exists a $p$ such that $x_1 = F_1^{-1}(p)$ and $x_2 = F_2^{-1}(p)$.

If both distribution functions are continuous and strictly increasing, then the theoretical QQ-plot is continuous as well. If the cumulative distribution functions have horizontal straight line segments, then the theoretical QQ-plot has gaps. If one of the two distribution functions is a step function and the other is continuous, then the theoretical QQ-plot is a step function; and if both distribution functions are step functions, then the theoretical QQ-plot consists of isolated points.

Here is a practical instruction on how to construct a QQ plot from the given cumulative distribution functions: Draw the cumulative distribution functions of the two distributions which you want to compare into the same diagram. Then, for every value $p$ between 0 and 1, plot the abscissa of the intersection of the horizontal line with height $p$ with the first cumulative distribution function against the abscissa of its intersection with the second. If there are horizontal line segments in these distribution functions, then the suprema of these line segments should be used. If a cumulative distribution function is a step function stepping over $p$, then the value at which the step occurs should be used.

If the QQ-plot is a straight line, then the two distributions are either identical, or the underlying random variables differ only by a scale factor. The plots have special sensitivity regarding differences in the tail areas of the two distributions.

Problem 448. Let $F_1$ be the cumulative distribution function of random variable $x_1$, and $F_2$ that of the variable $x_2$ whose distribution is the same as that of $\alpha x_1$, where $\alpha$ is a positive constant. Show that the theoretical QQ plot of these two distributions is contained in the straight line $q_2 = \alpha q_1$.

Answer. $(x_1, x_2) \in$ QQ-plot $\iff$ a $p$ exists with $x_1 = F_1^{-1}(p) = \inf\{u : \Pr[x_1 \le u] \ge p\}$ and $x_2 = F_2^{-1}(p) = \inf\{u : \Pr[x_2 \le u] \ge p\} = \inf\{u : \Pr[\alpha x_1 \le u] \ge p\}$. Write $v = u/\alpha$, i.e., $u = \alpha v$; then $x_2 = \inf\{\alpha v : \Pr[\alpha x_1 \le \alpha v] \ge p\} = \inf\{\alpha v : \Pr[x_1 \le v] \ge p\} = \alpha \inf\{v : \Pr[x_1 \le v] \ge p\} = \alpha x_1$.
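The result of Problem 448 can also be seen with empirical distributions. The sketch below implements the quantile function $\inf\{u : F(u) \ge p\}$ for the empirical distribution function of a sample; the sample values, the grid of probabilities and the function names are hypothetical.

```python
import math

def sample_quantile(sample, p):
    """inf{u : F(u) >= p} for the empirical distribution function F of sample."""
    ordered = sorted(sample)
    # F jumps to k/n at the k-th smallest value; find the smallest k with k/n >= p.
    k = math.ceil(len(ordered) * p - 1e-12)  # small tolerance against rounding
    return ordered[max(k, 1) - 1]

def qq_pairs(sample1, sample2, probs):
    """Points of the QQ plot of two empirical distributions on a grid of p."""
    return [(sample_quantile(sample1, p), sample_quantile(sample2, p)) for p in probs]

alpha = 2.5
x1 = [0.3, 1.7, 0.9, 4.2, 2.8]
x2 = [alpha * v for v in x1]          # same distribution as alpha * x1
probs = [j / 20 for j in range(1, 21)]
pairs = qq_pairs(x1, x2, probs)
print(sorted(set(pairs)))
print(all(q2 == alpha * q1 for q1, q2 in pairs))
```

Since both empirical distribution functions are step functions, the plot consists of isolated points, all of which lie on the line through the origin with slope `alpha`.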

In other words, if one makes a QQ plot of a normal with mean zero and variance 2 on the vertical axis against a normal with mean zero and variance 1 on the horizontal axis, one gets a straight line with slope $\sqrt{2}$, the ratio of the two standard deviations. This is what makes such plots so valuable, since visual inspection can easily discriminate whether a curve is a straight line or not. To repeat, QQ plots have the great advantage that one only needs to know the correct distribution up to a scale factor!
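For two zero-mean Normal distributions the scale factor is the ratio of the standard deviations, so with variances 2 and 1 the slope of the theoretical QQ plot is $\sqrt{2}$. A quick check with the Python standard library (the function name `theoretical_qq` and the grid of probabilities are hypothetical):

```python
import math
from statistics import NormalDist

def theoretical_qq(dist_x, dist_y, probs):
    """Points (F_x^{-1}(p), F_y^{-1}(p)) of the theoretical QQ plot."""
    return [(dist_x.inv_cdf(p), dist_y.inv_cdf(p)) for p in probs]

horizontal = NormalDist(mu=0.0, sigma=1.0)            # variance 1
vertical = NormalDist(mu=0.0, sigma=math.sqrt(2.0))   # variance 2
probs = [j / 10 for j in range(1, 10)]
points = theoretical_qq(horizontal, vertical, probs)

# Slope of the line through the origin and each point (skip the median, x = 0).
slopes = [y / x for x, y in points if x != 0.0]
print(slopes)
print(math.sqrt(2.0))
```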

QQ-plots can not only be used to compare two probability measures, but an important application is to decide whether a given sample comes from a given distribution by plotting the quantile function of the empirical distribution of the sample, compare (3.4.17), against the quantile function of the given cumulative distribution function. Since empirical cumulative distribution functions and quantile functions are step functions, the resulting theoretical QQ plot is also a step function.

In order to make it easier to compare this QQ plot with a straight line, one usually does not draw the full step function but one chooses one point on the face of each step, so that the plot contains one point per observation. This is like plotting the given sample against a very regular sample from the given distribution. Where on the face of each step should one choose these points? One wants to choose that ordinate where the first step in an empirical cumulative distribution function should usually be.

It is a mathematically complicated problem to compute for instance the “usual location” (say, the “expected value”) of the smallest of 50 normally distributed variables. But there is one simple method which gives roughly the right locations independently of the distribution used. Draw the cumulative distribution function (cdf) which you want to test against, and then draw between the zero line and the line $p = 1$, $n$ parallel lines which divide the unit strip into $n + 1$ equidistant strips. The intersection points of these $n$ lines with the cdf will roughly give the locations where the smallest, second smallest, etc., of a sample of $n$ normally distributed observations should be found.
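This recipe amounts to evaluating the quantile function of the reference distribution at the heights $i/(n+1)$, $i = 1, \ldots, n$. A minimal sketch for a Normal reference distribution follows; the function names and the sample are hypothetical, and statistical software may use slightly different plotting positions than $i/(n+1)$.

```python
from statistics import NormalDist

def plotting_positions(n):
    """Heights of the n parallel lines that cut the unit strip into n + 1 strips."""
    return [i / (n + 1) for i in range(1, n + 1)]

def qq_points(sample, dist=NormalDist()):
    """Pairs (theoretical quantile, ordered observation), one per observation."""
    n = len(sample)
    return [(dist.inv_cdf(p), x)
            for p, x in zip(plotting_positions(n), sorted(sample))]

sample = [5.0, 1.0, 3.0, 2.5, 4.1]  # hypothetical data
for theoretical, observed in qq_points(sample):
    print(theoretical, observed)
```

Plotting the second coordinate against the first gives the one-point-per-observation QQ plot described above.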

For a mathematical justification of this, make the following thought experiment. Assume you have $n$ observations from a uniform distribution on the unit interval. Where should you expect the smallest observation to be? The answer is given by the simple result that the expected value of the smallest observation is $1/(n + 1)$, the expected value of the second-smallest observation is $2/(n + 1)$, etc. In other words, on average, the $n$ observations cut the unit interval into $n + 1$ intervals of equal length.
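The result about uniform order statistics is easy to check by simulation. In the sketch below the sample size, the number of replications and the seed are arbitrary choices, and the function name is hypothetical.

```python
import random

def mean_order_statistics(n, reps=20000, seed=0):
    """Average of each order statistic over reps uniform samples of size n."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(reps):
        for j, value in enumerate(sorted(rng.random() for _ in range(n))):
            totals[j] += value
    return [t / reps for t in totals]

n = 4
simulated = mean_order_statistics(n)
expected = [i / (n + 1) for i in range(1, n + 1)]
for s, e in zip(simulated, expected):
    print(s, e)
```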

Therefore we do know where the first step of an empirical cumulative distribution function of a uniform random variable should be, and it is a very simple formula. But this can be transferred to the general case by the following fact: if one plugs any random variable into its cumulative distribution, one obtains a uniform distribution! These locations will therefore give, strictly speaking, the usual values of the smallest, second smallest, etc., observation of $F_x(x)$, but the usual values for $x$ itself cannot be far from this.

If one plots the data on the vertical axis versus the standard normal on the horizontal axis (the default for the R-function qqnorm), then an S-shaped plot indicates a light-tailed distribution, an inverse S says that the distribution is heavy-tailed (try qqnorm(rt(25,df=1)) as an example), a C is left-skewed, and an inverse C, a J, is right-skewed. A right-skewed, or positively skewed, distribution is one which has a long right tail, like the lognormal qqnorm(rlnorm(25)) or chisquare qqnorm(rchisq(25,df=3)).

The classic reference which everyone has read and which explains it all is [Gum58, pp. 28–34 and 46/47]. Also [WG68] is useful, with many examples.

47.11 Testing for Normality

[Vas76] is a test for Normality based on entropy.

CHAPTER 48

Measuring Economic Inequality

48.1 Web Resources about Income Inequality

• UNU/Wider-UNDP World Income Inequality Database (4500 Gini-coefficients) www.wider.unu.edu/wiid/wiid.htm

• Luxembourg Income Study (detailed household surveys) http://lissy.ceps.lu

• World Bank site on Inequality, Poverty, and Socio-economic Performance http://www.worldbank.org/poverty/inequal/index.htm

• World Bank Data on Poverty and Inequality http://www.worldbank.org/poverty/data/index.htm

• World Bank Living Standard Measurement Surveys http://www.worldbank.org/lsms/

• University of Texas Inequality Project (Theil indexes on manufacturing and industrial wages for 71 countries) http://utip.gov.utexas.edu/

• MacArthur Network on the Effects of Inequality on Economic Performance, Institute of International Studies, University of California, Berkeley http://globetrotter.berkeley.edu/macarthur/inequality/

• Inter-American Development Bank site on Poverty and Inequality http://www.iadb.org/sds/document.cfm/5/ENGLISH

48.2 Graphical Representations of Inequality

See [Cow77] (there is now a second edition out, but our library only has this one):

• Pen’s Parade: Suppose that everyone’s height is proportional to his or her income. Line everybody in the population up in order of height. Then the curve which their heights draw out is “Pen’s Parade.” It is the empirical quantile function of the sample. It highlights the presence of any extremely large incomes and, to a certain extent, abnormally small incomes. But the information on middle income receivers is not so obvious.

• Histogram or other estimates of the underlying density function: Suppose you are looking down on a field. On one side, there is a long straight fence marked off with income categories: the physical distance between any two points directly corresponds to the income differences they represent. Get the whole population to come into the field and line up in the strip of land marked off by the piece of fence corresponding to their income bracket. The shape you will see is a histogram of the income distribution. This shows
