CHAPTER 2
Descriptive study of data
2.1 Histograms and their numerical characteristics
By descriptive study of data we refer to the summarisation and exposition (tabulation, grouping, graphical representation) of observed data, as well as the derivation of numerical characteristics such as measures of location, dispersion and shape.
Although the descriptive study of data is an important facet of modelling with real data in itself, in the present study it is mainly used to motivate the need for probability theory and statistical inference proper.
In order to make the discussion more specific, let us consider the after-tax personal income data of 23 000 households for 1979-80 in the UK. These data in raw form constitute 23 000 numbers between £1000 and £50 000. This presents us with a formidable task in attempting to understand how income is distributed among the 23 000 households represented in the data. The purpose of descriptive statistics is to help us make some sense of such data. A natural way to proceed is to summarise the data by allocating the numbers into classes (intervals). The number of intervals is chosen a priori and depends on the degree of summarisation needed. In the present case the income data are allocated into 15 intervals, as shown in Table 2.1 below (see National Income and Expenditure (1983)). The first column of the table shows the income intervals, the second column shows the number of incomes falling into each interval, and the third column the relative frequency for each interval. The relative frequency is calculated by dividing the number of observations in each interval by the total number of observations. Summarising the data in Table 2.1 enables us to get some idea of how income is distributed among the various classes. If we plot the relative frequencies in a bar graph we get what is known as the histogram, shown in Fig. 2.1 below.
Table 2.1 Personal income in the UK, 1979-80
Fig. 2.1 The histogram and frequency polygon of the personal income data (relative frequency plotted against income).
The pictorial representation of the relative frequencies gives us a more vivid impression of the distribution of income. Looking at the histogram we can see that most households earn less than £4500, and in some sense we can separate them into two larger groups: those earning between £1000 and £4500 and those above £4500. The first impression is that the distribution of income inside these two larger groups appears to be rather similar.
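To make the construction concrete, the following Python sketch bins a batch of incomes into intervals and computes the relative frequencies, exactly as described above. Both the simulated incomes and the interval edges are made up for illustration, since the actual entries of Table 2.1 are not reproduced here.

```python
import numpy as np

# Hypothetical stand-in for the raw data: the actual 23 000 observations
# behind Table 2.1 are not reproduced here, so we simulate incomes in
# pounds (thousands) for illustration only.
rng = np.random.default_rng(0)
incomes = 1.0 + rng.lognormal(mean=1.1, sigma=0.45, size=23_000)

# Fifteen income intervals (edges in thousands of pounds); the edges are
# also hypothetical, standing in for the first column of Table 2.1.
edges = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0,
                  6.0, 7.0, 8.0, 10.0, 12.0, 15.0, 20.0])

# Count the observations falling in each interval ...
counts, _ = np.histogram(incomes, bins=edges)

# ... and divide by the total binned to get the relative frequencies,
# the third column of Table 2.1.
rel_freq = counts / counts.sum()

for lo, hi, phi in zip(edges[:-1], edges[1:], rel_freq):
    print(f"£{lo:4.1f}-{hi:4.1f} thousand: {phi:.3f}")
```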
For further information on the distribution of income we could calculate various numerical characteristics describing the histogram's location, dispersion and shape. Such measures can be calculated directly in terms of the raw data. However, in the present case it is more convenient for expositional purposes to use the grouped data. The main reason for this is to introduce various concepts which will be reinterpreted in the context of probability theory in Part II.
The mean as a measure of location takes the form

$$\bar{z} = \sum_{i=1}^{15} \phi_i z_i, \qquad (2.1)$$

where $\phi_i$ and $z_i$ refer to the relative frequency and the midpoint of interval $i$. The mode as a measure of location refers to the value of income that occurs most frequently in the data set. In the present case the mode belongs to the first interval, £1.0-1.5. Another measure of location is the median, referring to the value of income in the middle when incomes are arranged in ascending (or descending) order of size. The best way to calculate the median is to plot the cumulative frequency graph, which is more convenient for answering questions such as 'How many observations fall below a particular value of income?' (see Fig. 2.2). From the cumulative frequency graph we can see that the median belongs to the interval £3.0-3.5.
Fig. 2.2 The cumulative histogram and ogive of the personal income data (cumulative relative frequency plotted against income).

Comparing the three measures of location we can see that
Trang 426 Descriptive study of data
mode < median < mean, confirming the obvious asymmetry of the histogram.
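As a quick illustration of how these three measures are computed from grouped data, here is a Python sketch using hypothetical relative frequencies and interval edges (the actual Table 2.1 figures are not reproduced above). The median uses the usual linear interpolation within the interval where the cumulative frequency crosses one half, which is what reading off the ogive does graphically.

```python
import numpy as np

# Hypothetical grouped data (interval edges in thousands of pounds and
# relative frequencies phi_i summing to one); not the actual Table 2.1.
edges = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0,
                  6.0, 7.0, 8.0, 10.0, 12.0, 15.0, 20.0])
phi = np.array([0.16, 0.13, 0.12, 0.11, 0.10, 0.09, 0.07, 0.05,
                0.07, 0.04, 0.02, 0.02, 0.01, 0.005, 0.005])

mid = (edges[:-1] + edges[1:]) / 2      # interval midpoints z_i
width = np.diff(edges)                  # interval widths

# Mean, equation (2.1): relative-frequency-weighted average of midpoints.
mean = np.sum(phi * mid)

# Modal interval: the one with the highest frequency density phi_i/width_i
# (the widths matter because the intervals are not all equal).
k = int(np.argmax(phi / width))
modal_interval = (edges[k], edges[k + 1])

# Median: find the interval where the cumulative frequency crosses 0.5,
# then interpolate linearly within it.
cum = np.cumsum(phi)
j = int(np.searchsorted(cum, 0.5))
below = cum[j - 1] if j > 0 else 0.0
median = edges[j] + (0.5 - below) / phi[j] * width[j]

print(f"mean = {mean:.2f}, modal interval = {modal_interval}, "
      f"median = {median:.2f}")
```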
Another important feature of the histogram is the dispersion of the relative frequencies around a measure of central tendency. The most frequently used measure of dispersion is the variance, defined by

$$v^2 = \sum_{i=1}^{15} \phi_i (z_i - \bar{z})^2 = 4.85, \qquad (2.2)$$
which is a measure of dispersion around the mean; $v$ is known as the standard deviation.
We can extend the concept of the variance to

$$m_r = \sum_{i=1}^{15} \phi_i (z_i - \bar{z})^r, \qquad r = 3, 4, \ldots, \qquad (2.3)$$

defining what are known as higher central moments. These higher moments can be used to get a better idea of the shape of the histogram. For example, the standardised forms of the third and fourth moments, defined by

$$SK = \frac{m_3}{v^3} \quad (2.4) \qquad \text{and} \qquad K = \frac{m_4}{v^4}, \quad (2.5)$$

known as the skewness and kurtosis coefficients, measure the asymmetry and peakedness of the histogram, respectively. In the case of a symmetric histogram SK = 0, and the less peaked the histogram the greater the value of
$K$. For the income data the skewness coefficient is positive, which confirms the asymmetry of the histogram (it is skewed to the right). The above numerical characteristics, referring to location, dispersion and shape, were calculated for the data set as a whole. It was argued above, however, that it may be preferable to separate the data into two larger groups and study them separately. Let us consider the groups £1.0-4.5 and £4.5-20.0 separately. The numerical characteristics for the two groups are
$\bar{z}_1 = 2.5$, $v_1 = 0.996$, $SK_1 = 0.252$, $K_1 = 1.77$

and

$\bar{z}_2 = 6.18$, $v_2 = 3.814$, $SK_2 = 2.55$, $K_2 = 11.93$,

respectively. Looking at these measures we can see that although the two subsets of the income data seemed qualitatively rather similar, they actually differ substantially. The second group has much bigger dispersion, skewness and kurtosis coefficients.
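The following sketch collects equations (2.1)-(2.5) into a small helper and applies it to a whole grouped data set and to its two subgroups, mirroring the comparison just made. The frequencies are the same illustrative ones used above, not the book's actual figures.

```python
import numpy as np

def grouped_moments(phi, mid):
    """Mean, standard deviation, skewness and kurtosis coefficients,
    equations (2.1)-(2.5), from relative frequencies and midpoints."""
    phi = phi / phi.sum()               # renormalise (needed for subgroups)
    mean = np.sum(phi * mid)            # (2.1)
    m = lambda r: np.sum(phi * (mid - mean) ** r)  # central moments (2.3)
    v = np.sqrt(m(2))                   # standard deviation, from (2.2)
    return mean, v, m(3) / v**3, m(4) / v**4       # SK (2.4), K (2.5)

# Same hypothetical grouped data as before (not the actual Table 2.1 values).
edges = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0,
                  6.0, 7.0, 8.0, 10.0, 12.0, 15.0, 20.0])
phi = np.array([0.16, 0.13, 0.12, 0.11, 0.10, 0.09, 0.07, 0.05,
                0.07, 0.04, 0.02, 0.02, 0.01, 0.005, 0.005])
mid = (edges[:-1] + edges[1:]) / 2

split = int(np.searchsorted(edges, 4.5))   # bins below/above the £4.5 edge
for label, sl in [("all", slice(None)),
                  ("£1.0-4.5", slice(0, split)),
                  ("£4.5-20.0", slice(split, None))]:
    mean, v, sk, k = grouped_moments(phi[sl], mid[sl])
    print(f"{label:>9}: mean={mean:5.2f}  v={v:5.2f}  SK={sk:5.2f}  K={k:5.2f}")
```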
Returning to the numerical characteristics of the data set as a whole, we can see that these seem to represent an uneasy compromise between the above two subsets. This confirms our first intuitive reaction, based on the histogram, that it might be more appropriate to study the two larger groups separately.
Another form of graphical representation, for time-series data, is the time graph $(z_t, t)$, $t = 1, 2, \ldots, T$. The temporal pattern of an economic time series is important not only in the context of descriptive statistics but also plays an important role in econometric modelling in the context of statistical inference proper; see Part IV.
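A minimal sketch of such a time graph, using a made-up series purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# A made-up economic time series z_t, t = 1, ..., T (a drifting random
# walk), standing in for real data.
T = 100
rng = np.random.default_rng(1)
z = np.cumsum(rng.normal(0.1, 1.0, size=T))

plt.plot(np.arange(1, T + 1), z)
plt.xlabel("t")
plt.ylabel("z_t")
plt.title("Time graph of the series")
plt.show()
```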
2.2 Frequency curves
Although the histogram can be a very useful way to summarise and study observed data, it is not a very convenient descriptor of data. This is because $m-1$ parameters $\phi_1, \phi_2, \ldots, \phi_{m-1}$ ($m$ being the number of intervals) are needed to describe it. Moreover, analytically the histogram is a cumbersome step function of the form

$$\phi(z) = \sum_{i=1}^{m} \phi_i \, \mathbb{1}_{[z_i, z_{i+1})}(z), \qquad (2.6)$$

where $[z_i, z_{i+1})$ represents the $i$th half-closed interval and $\mathbb{1}(\cdot)$ is the indicator function

$$\mathbb{1}_{[z_i, z_{i+1})}(z) = \begin{cases} 1 & \text{for } z \in [z_i, z_{i+1}), \\ 0 & \text{for } z \notin [z_i, z_{i+1}). \end{cases}$$
Hence, the histogram is not an ideal descriptor, especially in relation to the modelling facet of observed data.
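In code, the step function (2.6) is just a lookup of the interval containing $z$. A minimal sketch, again with the hypothetical intervals used above:

```python
import numpy as np

def histogram_step(z, edges, phi):
    """The step function (2.6): return phi_i when z lies in the half-closed
    interval [z_i, z_{i+1}), and 0 when z falls outside all intervals."""
    i = int(np.searchsorted(edges, z, side="right")) - 1
    return phi[i] if 0 <= i < len(phi) else 0.0

# With the hypothetical intervals used above:
edges = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0,
         6.0, 7.0, 8.0, 10.0, 12.0, 15.0, 20.0]
phi = [0.16, 0.13, 0.12, 0.11, 0.10, 0.09, 0.07, 0.05,
       0.07, 0.04, 0.02, 0.02, 0.01, 0.005, 0.005]

print(histogram_step(3.2, edges, phi))   # 0.10: the £3.0-3.5 interval
print(histogram_step(25.0, edges, phi))  # 0.0: outside the data range
```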
The first step towards a more convenient descriptor of observed data is the so-called frequency polygon, which is a modified histogram. This is obtained by joining up the midpoints of the step function, as shown in Fig. 2.1, to get a continuous function.
An analogous graph for the cumulative frequency graph is known as the ogive (see Fig. 2.2). These two graphs can be interpreted as the histograms obtained by increasing the number of intervals. In summarising the data in the form of a histogram some information is lost; the greater the number of intervals, the smaller the information lost. This suggests that by increasing the number of intervals we might get more realistic descriptors of our data. Intuition suggests that if we keep on increasing the number of intervals to infinity we should get a much smoother frequency curve. Moreover, with a smooth frequency curve we should be able to describe it in some functional form with fewer than $m-1$ parameters.
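As a quick numerical illustration of this refinement idea, the sketch below summarises the same simulated incomes (made up, as in the earlier sketch; not the actual data) with ever finer equal-width interval grids: the number of parameters $m-1$ grows while each individual relative frequency shrinks, which is what pushes us towards a smooth curve with only a few parameters.

```python
import numpy as np

# Simulated incomes, as in the earlier sketch (illustrative only).
rng = np.random.default_rng(0)
incomes = 1.0 + rng.lognormal(mean=1.1, sigma=0.45, size=23_000)

# Summarise the same data with ever finer equal-width interval grids.
for m in (5, 15, 60, 240):
    counts, _ = np.histogram(incomes, bins=m)
    phi = counts / counts.sum()
    print(f"m = {m:3d} intervals ({m - 1} free parameters), "
          f"largest phi_i = {phi.max():.3f}")
```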
For example, if we were to describe the two subsets of the data separately, we could conceivably be able to express a smoothed version of the frequency polygons in polynomial form with one or two parameters. This line of reasoning led statisticians in the second half of the nineteenth century to suggest various families of frequency curves with various shapes for describing observed data.
The Pearson family of frequency curves
In his attempt to derive a general family of frequency curves to describe observed data, Karl Pearson in the late 1890s suggested a family based on the differential equation

$$\frac{d\phi(z)}{dz} = \frac{(z + a)\,\phi(z)}{b_0 + b_1 z + b_2 z^2}, \qquad (2.7)$$

which satisfies the condition that the curve touches the z-axis at $\phi(z) = 0$ and has an optimum at $z = -a$; that is, the curve has one mode. Clearly, the solution of the above equation depends on the roots of the denominator. By imposing different conditions on these roots and choosing different values for $a$, $b_0$, $b_1$ and $b_2$ we can generate numerous frequency curves.
Depending on the parameter values, the curve can be bell-shaped, U-shaped or even J-shaped. A J-shaped member, for instance, takes the form

$$\phi(z) = \frac{a z_0^a}{z^{a+1}}, \qquad z \ge z_0. \qquad (2.8)$$
In the case of the income data above we can see that the J-shaped frequency curve (2.8) seems to be our best choice. As can be seen, it has only one parameter, $a$, and it is clearly a much more convenient descriptor (if appropriate) of the income data than the histogram. For $z_0$ equal to the lowest income value this is known as the Pareto frequency curve. Looking at Fig. 2.1 we can see that for incomes greater than £4.5 the Pareto frequency curve seems a very reasonable descriptor.
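To illustrate how convenient a one-parameter descriptor can be, the sketch below pins down $a$ for the upper-income group by a crude moment match: a Pareto curve with lower bound $z_0$ has mean $a z_0/(a-1)$ for $a > 1$, so $a = \text{mean}/(\text{mean} - z_0)$, ignoring the truncation of the data at £20.0. This fitting device is only our illustration; the book takes up proper curve fitting in Parts III and IV. The values $z_0 = 4.5$ and the group mean 6.18 come from the text above.

```python
# Crude, purely illustrative moment-matching fit of the Pareto parameter a
# to the upper-income group: set the Pareto mean a*z0/(a - 1) equal to the
# observed group mean and solve for a (truncation at £20.0 ignored).
z0 = 4.5            # lowest income of the upper group, in thousands
group_mean = 6.18   # the group mean reported in the text

a = group_mean / (group_mean - z0)
print(f"a = {a:.2f}")                    # about 3.68

def pareto(z):
    """The fitted Pareto frequency curve (2.8) for z >= z0."""
    return a * z0**a / z**(a + 1)

print(pareto(5.0), pareto(10.0))
```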
An important property of the Pearson family of frequency curves is that the parameters $a$, $b_0$, $b_1$ and $b_2$ are completely determined from knowledge of the first four moments. This implies that any frequency curve in the family can be fitted to the data using these moments (see Kendall and Stuart (1969)). At this point, instead of considering how such frequency curves can be fitted to observed data, we are going to leave the story unfinished, to be taken up in Parts III and IV, in order to look ahead to probability theory and statistical inference proper.
2.3 Looking ahead
The most important drawback of descriptive statistics is that the study of the observed data enables us to draw certain conclusions which relate only to the data in hand. The temptation in analysing the above income data is to attempt to make generalisations beyond the data in hand, in particular about the distribution of income in the UK. This, however, is not possible in the descriptive statistics framework. In order to be able to generalise beyond the data in hand we need to 'model' the distribution of income in the UK and not just 'describe' the observed data in hand. Such a general 'model' is provided by probability theory, to be considered in Part II. It turns out that the model provided by probability theory owes a lot to the earlier developed descriptive statistics. In particular, most of the concepts which form the basis of the probability model were motivated by the descriptive statistics concepts considered above. The concepts of measures of location, dispersion and shape, as well as the frequency curve, were transplanted into probability theory with renewed interpretations. The frequency curve, when reinterpreted, becomes a density function purporting to model observable real-world phenomena. In particular, the Pearson family of frequency curves can be reinterpreted as a family of density functions. As for the various measures, they will now be reinterpreted in terms of the density function.
Equipped with the probability model to be developed in Part II we can
go on to analyse observed data (now interpreted as being generated by some assumed probability model) in the context of statistical inference proper, the subject matter of Part III. In such a context we can generalise beyond the observed data in hand. Probability theory and statistical inference will enable us to construct and analyse statistical models of particular interest in econometrics, the subject matter of Part IV.
In Chapter 3 we consider the axiomatic approach to probability which forms the foundation for the discussion in Part II. Chapter 4 introduces the concept of a random variable and related notions, arguably the most widely used concept in the present book. In Chapters 5-10 we develop the mathematical framework in the context of which the probability model can be analysed, as a prelude to Part III.
Additional references
Bhattacharya and Johnson (1977); Haber and Runyon (1973); Johnson and Kotz (1970); Yeomans (1968).