
CHAPTER 2

Descriptive study of data

2.1 Histograms and their numerical characteristics

By descriptive study of data we refer to the summarisation and exposition (tabulation, grouping, graphical representation) of observed data, as well as the derivation of numerical characteristics such as measures of location, dispersion and shape.

Although the descriptive study of data is an important facet of modelling with real data in itself, in the present study it is mainly used to motivate the need for probability theory and statistical inference proper.

In order to make the discussion more specific, let us consider the after-tax personal income data of 23 000 households for 1979-80 in the UK. These data in raw form constitute 23 000 numbers between £1000 and £50 000. This presents us with a formidable task in attempting to understand how income is distributed among the 23 000 households represented in the data. The purpose of descriptive statistics is to help us make some sense of such data. A natural way to proceed is to summarise the data by allocating the numbers into classes (intervals). The number of intervals is chosen a priori and depends on the degree of summarisation needed. In the present case the income data are allocated into 15 intervals, as shown in Table 2.1 below (see National Income and Expenditure (1983)). The first column of the table shows the income intervals, the second column shows the number of incomes falling into each interval, and the third column the relative frequency for each interval. The relative frequency is calculated by dividing the number of observations in each interval by the total number of observations. Summarising the data in Table 2.1 enables us to get some idea of how income is distributed among the various classes.
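To make the grouping step concrete, here is a minimal sketch (in Python, with numpy) of how raw incomes can be allocated into 15 intervals and converted into relative frequencies. The incomes below are synthetic stand-ins and the equal-width endpoints are an assumption for illustration; the published table's intervals and figures are not reproduced here.

```python
import numpy as np

# Synthetic stand-in for the 23 000 after-tax incomes (in £'000);
# the actual survey data are not reproduced in this chapter.
rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=1.2, sigma=0.55, size=23_000).clip(1.0, 50.0)

# Allocate the observations into 15 intervals, as in Table 2.1
# (equal-width intervals are assumed here purely for illustration).
edges = np.linspace(1.0, 50.0, 16)             # 15 intervals need 16 endpoints
counts, _ = np.histogram(incomes, bins=edges)  # second column: frequencies
rel_freq = counts / counts.sum()               # third column: relative frequencies

for lo, hi, n, phi in zip(edges[:-1], edges[1:], counts, rel_freq):
    print(f"£{lo:5.1f}-{hi:5.1f}  {n:6d}  {phi:.3f}")
```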


Table 2.1 Personal income in the UK, 1979-80

(Fig. 2.1 The histogram and frequency polygon of the personal income data. Vertical axis: relative frequency; horizontal axis: income.)

If we plot the relative frequencies in a bar graph we get what is known as the histogram, shown in Fig. 2.1. The pictorial representation of the relative frequencies gives us a more vivid impression of the distribution of income. Looking at the histogram we can see that most households earn less than £4500, and in some sense we can separate them into two larger groups: those earning between £1000 and £4500 and those above £4500. The first impression is that the distribution of income inside these two larger groups appears to be rather similar.

For further information on the distribution of income we could calculate various numerical characteristics describing the histogram's location, dispersion and shape. Such measures can be calculated directly in terms of the raw data. However, in the present case it is more convenient for expositional purposes to use the grouped data. The main reason for this is to introduce various concepts which will be reinterpreted in the context of probability theory in Part II.

The mean as a measure of location takes the form

$$\bar{z} = \sum_{i=1}^{15} \hat{\phi}_i z_i, \qquad (2.1)$$

where $\hat{\phi}_i$ and $z_i$ refer to the relative frequency and the midpoint of interval $i$. The mode as a measure of location refers to the value of income that occurs most frequently in the data set; in the present case the mode belongs to the first interval, £1.0-1.5. Another measure of location is the median, referring to the value of income in the middle when incomes are arranged in ascending (or descending) order according to the size of income. The best way to calculate the median is to plot the cumulative frequency graph, which is more convenient for answering questions such as 'How many observations fall below a particular value of income?' (see Fig. 2.2). From the cumulative frequency graph we can see that the median belongs to the interval £3.0-3.5.

(Fig. 2.2 The cumulative histogram and ogive of the personal income data. Vertical axis: cumulative relative frequency; horizontal axis: income.)


Comparing the three measures of location, we can see that mode < median < mean, confirming the obvious asymmetry of the histogram.
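The three measures of location are easy to compute from grouped data. The sketch below implements the grouped mean of equation (2.1) and reads off the modal and median intervals from the relative frequencies; the midpoints and weights are hypothetical, not the Table 2.1 figures.

```python
import numpy as np

# Hypothetical grouped data: interval midpoints z_i (in £'000) and
# relative frequencies phi_i (NOT the actual Table 2.1 figures).
z   = np.array([1.25, 1.75, 2.25, 2.75, 3.25, 3.75, 4.25, 4.75,
                5.50, 6.50, 7.50, 9.00, 12.5, 17.5, 35.0])
phi = np.array([0.15, 0.13, 0.11, 0.10, 0.09, 0.08, 0.07, 0.06,
                0.06, 0.05, 0.04, 0.03, 0.02, 0.005, 0.005])

mean = np.sum(phi * z)                 # grouped mean, equation (2.1)
mode = z[np.argmax(phi)]               # midpoint of the most frequent interval
cum  = np.cumsum(phi)                  # cumulative relative frequencies
median = z[np.searchsorted(cum, 0.5)]  # first interval where the ogive reaches 0.5

print(f"mean = {mean:.2f}, modal interval midpoint = {mode}, "
      f"median interval midpoint = {median}")
```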

Another important feature of the histogram is the dispersion of the relative frequencies around a measure of central tendency. The most frequently used measure of dispersion is the variance, defined by

$$v^2 = \sum_{i=1}^{15} \hat{\phi}_i (z_i - \bar{z})^2 = 4.85, \qquad (2.2)$$

which is a measure of dispersion around the mean; $v$ is known as the standard deviation.

We can extend the concept of the variance to

$$m_r = \sum_{i=1}^{15} \hat{\phi}_i (z_i - \bar{z})^r, \quad r = 3, 4, \ldots, \qquad (2.3)$$

defining what are known as higher central moments. These higher moments can be used to get a better idea of the shape of the histogram. For example, the standardised forms of the third and fourth moments, defined by

$$SK = \frac{m_3}{v^3}, \qquad K = \frac{m_4}{v^4}, \qquad (2.4)$$

known as the skewness and kurtosis coefficients, measure the asymmetry and peakedness of the histogram, respectively. In the case of a symmetric histogram SK = 0, and the more peaked the histogram the greater the value of K. For the income data the skewness coefficient is positive, which confirms the asymmetry of the histogram (skewed to the right).

The above numerical characteristics referring to the location, dispersion and shape were calculated for the data set as a whole. It was argued above, however, that it may be preferable to separate the data into two larger groups and study those separately. Let us consider the groups £1.0-4.5 and £4.5-20.0 separately. The numerical characteristics for the two groups are

$$\bar{z}_1 = 2.5, \quad v_1 = 0.996, \quad SK_1 = 0.252, \quad K_1 = 1.77$$

and

$$\bar{z}_2 = 6.18, \quad v_2 = 3.814, \quad SK_2 = 2.55, \quad K_2 = 11.93,$$

respectively. Looking at these measures we can see that, although the two subsets of the income data seemed qualitatively rather similar, they actually differ substantially. The second group has much bigger dispersion, skewness and kurtosis coefficients.
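Since equations (2.1)-(2.4) share the same ingredients, they can be collected into a single helper. The following sketch computes the mean, standard deviation, skewness and kurtosis from grouped data; the subgroup midpoints and the equal weights are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

def grouped_moments(z, phi):
    """Location, dispersion and shape measures from grouped data,
    following equations (2.1)-(2.4)."""
    mean = np.sum(phi * z)                       # mean, (2.1)
    m = lambda r: np.sum(phi * (z - mean) ** r)  # r-th central moment, (2.3)
    v = np.sqrt(m(2))                            # standard deviation, from (2.2)
    return mean, v, m(3) / v**3, m(4) / v**4     # SK and K, (2.4)

# Hypothetical subgroups (midpoints in £'000, equal weights for brevity).
groups = {"group 1": np.array([1.25, 1.75, 2.25, 2.75, 3.25, 3.75, 4.25]),
          "group 2": np.array([4.75, 5.50, 6.50, 7.50, 9.00, 12.5, 17.5])}

for name, z in groups.items():
    mean, v, SK, K = grouped_moments(z, np.full(z.size, 1 / z.size))
    print(f"{name}: mean={mean:.2f}, v={v:.2f}, SK={SK:.2f}, K={K:.2f}")
```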

Returning to the numerical characteristics of the data set as a whole, we can see that these seem to represent an uneasy compromise between the above two subsets. This confirms our first intuitive reaction, based on the histogram, that it might be more appropriate to study the two larger groups separately.

Another form of graphical representation for time-series data is the time graph $(z_t, t)$, $t = 1, 2, \ldots, T$. The temporal pattern of an economic time series is important not only in the context of descriptive statistics but also plays an important role in econometric modelling in the context of statistical inference proper; see Part IV.

2.2 Frequency curves

Although the histogram can be a very useful way to summarise and study observed data, it is not a very convenient descriptor of data. This is because $m-1$ parameters $\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_{m-1}$ ($m$ being the number of intervals) are needed to describe it. Moreover, analytically the histogram is a cumbersome step function of the form

$$\hat{\phi}(z) = \sum_{i=1}^{m} \hat{\phi}_i \, \mathbb{1}_{[z_i, z_{i+1})}(z),$$

where $[z_i, z_{i+1})$ represents the $i$th half-closed interval and $\mathbb{1}(\cdot)$ is the indicator function

$$\mathbb{1}_{[z_i, z_{i+1})}(z) = \begin{cases} 1 & \text{for } z \in [z_i, z_{i+1}), \\ 0 & \text{for } z \notin [z_i, z_{i+1}). \end{cases}$$

Hence, the histogram is not an ideal descriptor, especially in relation to the modelling facet of observed data.
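For concreteness, the step function can be evaluated directly. This is a minimal sketch with hypothetical endpoints and relative frequencies; histogram_step is an illustrative helper, not a library routine.

```python
import numpy as np

def histogram_step(z, edges, phi):
    """Evaluate the histogram step function: phi_i on the half-closed
    interval [z_i, z_{i+1}), and zero outside the range of the data."""
    i = np.searchsorted(edges, z, side="right") - 1
    inside = (i >= 0) & (i < len(phi))
    return np.where(inside, phi[np.clip(i, 0, len(phi) - 1)], 0.0)

edges = np.array([1.0, 1.5, 2.0, 2.5, 3.0])  # interval endpoints z_i (£'000)
phi   = np.array([0.30, 0.28, 0.22, 0.20])   # hypothetical relative frequencies

print(histogram_step(np.array([1.2, 2.5, 9.9]), edges, phi))  # [0.3 0.2 0. ]
```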

The first step towards a more convenient descriptor of observed data is the so-called frequency polygon, which is a modified histogram. This is obtained by joining up the midpoints of the step function, as shown in Fig. 2.1, to get a continuous function.

An analogous graph for the cumulative frequency graph is known as the ogive (see Fig. 2.2). These two graphs can be interpreted as the histograms obtained by increasing the number of intervals. In summarising the data in the form of a histogram some information is lost; the greater the number of intervals, the smaller the information lost. This suggests that by increasing the number of intervals we might get more realistic descriptors for our data. Intuition suggests that if we keep on increasing the number of intervals to infinity we should get a much smoother frequency curve. Moreover, with a smooth frequency curve we should be able to describe it in some functional form with fewer than $m-1$ parameters.
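The relationship between the histogram, the frequency polygon and the ogive can be made explicit with a short plotting sketch (hypothetical grouped data again):

```python
import numpy as np
import matplotlib.pyplot as plt

edges = np.array([1.0, 1.5, 2.0, 2.5, 3.0])  # hypothetical endpoints (£'000)
phi   = np.array([0.30, 0.28, 0.22, 0.20])   # hypothetical relative frequencies
mid   = (edges[:-1] + edges[1:]) / 2         # interval midpoints

plt.step(edges, np.r_[phi, phi[-1]], where="post", label="histogram")
plt.plot(mid, phi, "o-", label="frequency polygon")       # join the midpoints
plt.plot(edges[1:], np.cumsum(phi), ".-", label="ogive")  # cumulative version
plt.xlabel("income (£'000)")
plt.ylabel("relative frequency")
plt.legend()
plt.show()
```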

For example, if we were to describe the two subsets of the data separately, we could conceivably express a smoothed version of the frequency polygons in polynomial form with one or two parameters. This line of reasoning led statisticians in the second half of the nineteenth century to suggest various families of frequency curves with a variety of shapes for describing observed data.

The Pearson family of frequency curves

In his attempt to derive a general family of frequency curves to describe observed data, Karl Pearson in the late 1890s suggested a family based on the differential equation

$$\frac{d \log \phi(z)}{dz} = \frac{-(z + a)}{b_0 + b_1 z + b_2 z^2}, \qquad (2.7)$$

which satisfies the condition that the curve touches the z-axis at $\phi(z) = 0$ and has an optimum at $z = -a$; that is, the curve has one mode. Clearly, the solution of the above equation depends on the roots of the denominator. By imposing different conditions on these roots and choosing different values for $a$, $b_0$, $b_1$ and $b_2$ we can generate numerous frequency curves: bell-shaped, U-shaped or even J-shaped.

The J-shaped member of the family, for instance, takes the form

$$\phi(z) = \frac{a z_0^a}{z^{a+1}}, \quad z \ge z_0 > 0. \qquad (2.8)$$

In the case of the income data above we can see that the J-shaped frequency curve (2.8) seems to be our best choice. As can be seen, it has only one parameter, $a$, and it is clearly a much more convenient descriptor (if appropriate) of the income data than the histogram. For $z_0$ equal to the lowest income value this is known as the Pareto frequency curve. Looking at Fig. 2.1 we can see that for incomes greater than £4.5 the Pareto frequency curve seems a very reasonable descriptor.
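As an illustration of how the single parameter might be chosen in practice, the sketch below fits $a$ to a synthetic upper tail by maximum likelihood, the standard estimator for a Pareto curve of the form in (2.8); pareto_curve is an illustrative helper and the data are not the actual incomes.

```python
import numpy as np

def pareto_curve(z, a, z0):
    """The J-shaped frequency curve (2.8): a * z0**a / z**(a+1) for z >= z0."""
    return np.where(z >= z0, a * z0**a / z**(a + 1.0), 0.0)

# Synthetic incomes above the £4.5 ('000) cut-off, standing in for the tail.
rng = np.random.default_rng(1)
z0 = 4.5
tail = z0 * (1.0 + rng.pareto(2.5, size=5_000))  # classical Pareto, scale z0

a_hat = tail.size / np.log(tail / z0).sum()      # maximum-likelihood estimate
print(f"fitted shape parameter: a = {a_hat:.2f}")
print(pareto_curve(np.array([5.0, 10.0, 20.0]), a_hat, z0))
```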

An important property of the Pearson family of frequency curves is that the parameters $a$, $b_0$, $b_1$ and $b_2$ are completely determined from knowledge of the first four moments. This implies that any frequency curve of the family can be fitted to the data using these moments (see Kendall and Stuart (1969)). At this point, instead of considering how such frequency curves can be fitted to observed data, we are going to leave the story unfinished, to be taken up in Parts III and IV, in order to look ahead to probability theory and statistical inference proper.
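As a small taste of such moment-based fitting before the story is set aside: one member of the Pearson family, the type III curve, is available in scipy and is parameterised directly by its first three moments. The moment values below are hypothetical.

```python
from scipy import stats

# Hypothetical first three moments of an income histogram (£'000).
mean, v, SK = 4.3, 2.2, 1.9

# scipy's Pearson type III curve is pinned down by mean, standard
# deviation and skewness, an instance of fitting by moments.
curve = stats.pearson3(skew=SK, loc=mean, scale=v)

# The fitted curve reproduces the moments it was given.
print(curve.mean(), curve.std(), float(curve.stats(moments="s")))
```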


2.3 Looking ahead

The most important drawback of descriptive statistics is that the study of the observed data enables us to draw certain conclusions which relate only to the data in hand. The temptation in analysing the above income data is to attempt to make generalisations beyond the data in hand, in particular about the distribution of income in the UK. This, however, is not possible in the descriptive statistics framework. In order to be able to generalise beyond the data in hand we need to 'model' the distribution of income in the UK and not just 'describe' the observed data in hand. Such a general 'model' is provided by probability theory, to be considered in Part II.

It turns out that the model provided by probability theory owes a lot to the earlier developed descriptive statistics. In particular, most of the concepts which form the basis of the probability model were motivated by the descriptive statistics concepts considered above. The concepts of measures of location, dispersion and shape, as well as the frequency curve, were transplanted into probability theory with renewed interpretations. The frequency curve, when reinterpreted, becomes a density function purporting to model observable real-world phenomena. In particular, the Pearson family of frequency curves can be reinterpreted as a family of density functions. As for the various measures, they will now be reinterpreted in terms of the density function.

Equipped with the probability model to be developed in Part II, we can go on to analyse observed data (now interpreted as generated by some assumed probability model) in the context of statistical inference proper, the subject matter of Part III. In such a context we can generalise beyond the observed data in hand. Probability theory and statistical inference will enable us to construct and analyse statistical models of particular interest in econometrics, the subject matter of Part IV.

In Chapter 3 we consider the axiomatic approach to probability, which forms the foundation for the discussion in Part II. Chapter 4 introduces the concept of a random variable and related notions, arguably the most widely used concept in the present book. In Chapters 5-10 we develop the mathematical framework in the context of which the probability model can be analysed, as a prelude to Part III.

Additional references

Bhattacharya and Johnson (1977); Haber and Runyon (1973); Johnson and Kotz (1970); Yeomans (1968).
