Collecting, Managing, and Assessing Data Using Sample Surveys

Collecting, Managing, and Assessing Data Using Sample Surveys provides a thorough, step-by-step guide to the design and implementation of surveys. Beginning with a primer on basic statistics, the first half of the book takes readers on a comprehensive tour through the basics of survey design. Topics covered include the ethics of surveys, the design of survey procedures, the design of the survey instrument, how to write questions, and how to draw representative samples. Having shown readers how to design surveys, the second half of the book discusses a number of issues surrounding their implementation, including repetitive surveys, the economics of surveys, Web-based surveys, coding and data entry, data expansion and weighting, the issue of nonresponse, and the documenting and archiving of survey data. The book is an excellent introduction to the use of surveys for graduate students as well as a useful reference work for scholars and professionals.
Peter Stopher is Professor of Transport Planning at the Institute of Transport and Logistics Studies at the University of Sydney. He has also been a professor at Northwestern University, Cornell University, McMaster University, and Louisiana State University. Professor Stopher has developed a substantial reputation in the field of data collection, particularly for the support of travel forecasting and analysis. He pioneered the development of travel and activity diaries as a data collection mechanism, and has written extensively on issues of sample design, data expansion, nonresponse biases, and measurement issues.
Collecting, Managing, and Assessing Data Using Sample Surveys
Peter Stopher
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521681872
© Peter Stopher 2012
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2012
Printed in the United Kingdom at the University Press, Cambridge
A catalogue record for this publication is available from the British Library
ISBN 978-0-521-86311-7 Hardback
ISBN 978-0-521-68187-2 Paperback
Cambridge University Press has no responsibility for the persistence or
accuracy of URLs for external or third-party internet websites referred to
in this publication, and does not guarantee that any content on such
websites is, or will remain, accurate or appropriate.
To my wife, Carmen, with grateful thanks for your faith in me and your continuing support and encouragement.
Some useful properties of variances and standard deviations
Other measures of variability
3.1.1 A definition of sampling methodology
4.3.1 Retaining detail and confidentiality
6 Methods for conducting surveys of human populations
7.2.1 The size and number of focus groups
7.2.3 Analysing the focus group discussions
8.2.1 Classification and behaviour questions
8.4.3 Some general issues on question layout
Appearance of the survey
9.2.7 Use of categories and other responses
9.3.4 Avoid using ‘Tick all that apply’ formats
9.3.5 Develop response categories that are mutually exclusive
10.3 Stated response questions
10.3.3 Number of choice alternatives or scenarios
10.4 Some concluding comments on stated response survey design
11.4.2 Completeness of aggregate sampling units
11.7.5 External factors
Estimating population statistics and sampling errors
Sampling error of ratios and proportions
Stratified sampling with a uniform sampling fraction
Estimating population statistics and sampling errors
Equal allocation
Stratified sampling with variable sampling fraction
Estimating population statistics and sampling errors
Non-coincident study domains and strata
Optimum allocation and economic design
Estimating population values and sampling statistics
Equal clusters: population values and standard errors
Unequal clusters: population values and standard errors
Random selection of unequal clusters
14 Repetitive surveys
14.4 Subsampling on the second and subsequent occasions
14.6 Practical issues in designing and conducting panel surveys
14.8 Methods for administering practical panel surveys
15.3.2 Telephone recruitment with a postal survey with or
16.5.1 Frequently asked questions, fact sheet, or brochure
16.8.3 Repeated requests for callback
16.12 Summary comments on survey implementation
17.2 The internet as an optional response mechanism
17.3.1 Differences between paper and internet surveys
17.3.3 Ability to fill in the Web survey in multiple sittings
19 Data expansion and weighting
20.2.2 Reducing nonresponse and increasing response rates
Multiple imputation
21.5 Adherence to quality measures and guidance
22.3.1 A GPS survey as a potential substitute for a household
The effect of multiple observations of each respondent
23 Documenting and archiving
2.1 Scatter plot of odometer reading versus model year
2.7 Line graph of maximum and minimum temperatures for thirty days
2.8 Ogive of cumulative household income data from Figure 2.5
2.14 Distribution of maximum temperatures from Table 2.4
2.15 Distribution of minimum temperatures from Table 2.4
2.18 Box and whisker plot of income data from Table 2.5
2.19 Box and whisker plot of maximum temperatures
2.20 Box and whisker plot of minimum temperatures
2.21 Box and whisker plot of vehicles passing through the green phase
2.24 Comparison of normal distributions with different variances
2.25 Scatter plot of maximum versus minimum temperature
3.1 Extract of random numbers from the RAND Million Random Digits
4.2 First page of an example subject information sheet
4.3 Second page of the example subject information sheet
8.2 Example of an unacceptable questionnaire format
8.3 Example of an acceptable questionnaire format
8.4 Excerpt from a survey showing arrows to guide respondent
8.5 Extract from a questionnaire showing use of graphics
8.6 Columned layout for asking identical questions about multiple people
8.7 Inefficient and efficient structures for organising serial questions
8.8 Instructions placed at the point to which they refer
8.9 Example of an unacceptable questionnaire format with response codes
9.1 Example of a sequence of questions that do not require answers
9.2 Example of a sequence of questions that do require answers
9.4 Example of a belief question with a more vague response
9.5 Two alternative response category sets for the age question
9.7 Examples of questions with unordered response categories
9.8 An example of mixed ordered and unordered categories
9.10 An unordered alternative to the question in Figure 9.8
9.12 Example of a failure to achieve mutual exclusivity and exhaustiveness
9.13 Correction to mutual exclusivity and exhaustiveness
9.16 An alternative that keeps the wording of the measure
9.17 An alternative way to deal with a double-barrelled question
10.2 Example of a qualitative question using number categories
10.3 Example of unbalanced positive and negative categories
10.4 Example of balanced positive and negative categories
10.5 Example of placing the neutral option at the end
10.6 Example of distinguishing the neutral option from ‘No opinion’
10.7 Use of columned layout for repeated category responses
10.8 Alternative layout for repeated category responses
10.11 Rephrasing questions to remove requirement for ‘Agree’/‘Disagree’
11.1 Example of a postcard reminder for the first reminder
11.2 Framework for understanding respondent burden
14.1 Schematic of the four types of repetitive samples
14.2 Rotating panel showing recruitment, attrition, and rotation
18.1 An unordered set of responses requiring coding
18.2 A possible format for asking for an address
20.1 Illustration of the categorisation of response outcomes
2.1 Frequencies and proportions of vehicle types
2.2 Frequencies, proportions, and cumulative values for household
2.3 Minimum and maximum temperatures for a month (°C)
2.6 Growth rates of an investment fund, 1993–2004
2.9 Number of vehicles passing through the green phase of a traffic light
2.10 Sorted number of vehicles passing through the green phase
2.12 Deviations from the mean for the income data of Table 2.5
2.14 Sorted number of vehicles passing through the green phase
2.15 Deviations for vehicles passing through the green phase
2.16 Values of variance and standard deviation for values of p and q
2.17 Deviations for vehicles passing through the green phase raised to third
2.18 Deviations from the mean for children’s ages
2.19 Data on household size, annual income, and number of vehicles for
2.20 Deviations needed for covariance and correlation estimates
3.1 Heights of 100 (fictitious) university students (cm)
3.5 Random sample of ten students (in order drawn)
6.2 Mixed-mode survey types (based on Dillman and Tarnai, 1991)
13.1 Partial listing of households for a simple random sample
13.2 Excerpt of random numbers from the RAND Million Random Digits
13.3 Selection of sample of 100 members using four-digit groups from
13.4 Data from twenty respondents in a fictitious survey
13.6 Data for drawing an optimum household travel survey sample
13.7 Optimal allocation of the 2,000-household sample
13.8 Optimal allocation and expected sampling errors by stratum
13.9 Results of equal allocation for the household travel survey
13.10 Given information for economic design of the optimal allocation
13.11 Preliminary sample sizes and costs for economic design of the
13.12 Estimation of the final sample size and budget
13.13 Comparison of optimal allocation, equal allocation, and economic
13.14 Comparison of sampling errors from the three sample designs
13.15 Desired stratum sample sizes and results of recruitment calls
13.17 Two-stage sample of students from the university
13.18 Multistage sample using disproportionate sampling at the first stage
13.19 Calculations for standard error from sample in Table 13.18
13.23 Calculations for paired selections and successive differences
18.1 Potential complex codes for income categories
18.2 Example codes for use of the internet and mobile phones
19.1 Results of an hypothetical household survey
19.2 Calculation of weights for the hypothetical household survey
19.9 Results of an hypothetical household survey compared to
19.10 Two-way distribution of completed surveys by percentage
19.11 Results of factoring the rows of Table 19.10
19.12 Second iteration, in which columns are factored
19.13 Third iteration, in which rows are factored again
19.14 Weights derived from the iterative proportional fitting
20.1 Final disposition codes for RDD telephone surveys
23.1 Preservation metadata elements and description
As is always the case, many people have assisted in the process that has led to this book. First, I would like to acknowledge all those, too numerous to mention by name, who have helped me over the years, to learn and understand some of the basics of designing and implementing surveys. They have been many and they have taught me much of what I now know in this field. However, having said that, I would particularly like to acknowledge those whom I have worked with over the past fifteen years or more on the International Steering Committee for Travel Survey Conferences (ISCTSC), who have contributed enormously to broadening and deepening my own understandings of surveys. In particular, I would like to mention, in no particular order, Arnim Meyburg, Martin Lee-Gosselin, Johanna Zmud, Gerd Sammer, Chester Wilmot, Werner Brög, Juan de Dios Órtuzar, Manfred Wermuth, Kay Axhausen, Patrick Bonnel, Elaine Murakami, Tony Richardson, (the late) Pat van der Reis, Peter Jones, Alan Pisarski, Mary Lynn Tischer, Harry Timmermans, Marina Lombard, Cheryl Stecher, Jean-Loup Madre, Jimmy Armoogum, and (the late) Ryuichi Kitamura. All these individuals have inspired and helped me and contributed in various ways to this book, most of them, probably, without realising that they have done so.
I would also like to acknowledge the support I have received in this endeavour from the University of Sydney, and especially from the director of the Institute of Transport and Logistics Studies, Professor David Hensher. Both David and the university have provided a wide variety of support for the writing and production of this book, for which I am most grateful.
However, most importantly, I would like to acknowledge the enormous support and encouragement from my wife, Carmen, and her patience, as I have often spent long hours working on this book, and her unquestioning faith in me that I could do it. She has been an enduring source of strength and inspiration to me. Without her, I doubt that this book would have been written.
As always, a book can see the light of day only through the encouragement and support of a publisher and those assisting in the publishing process. I would like to acknowledge Chris Harrison of Cambridge University Press, who first thought that this book might be worth publishing and encouraged me to develop the outline for it, and then provided critical input that has helped to shape the book into what it has become. I would also like to thank profusely Mike Richardson, who carefully and thoroughly copy-edited the manuscript, improving immensely its clarity and completeness. I would also like to thank Joanna Breeze, the production editor at Cambridge. She has worked with me with all the delays I have caused in the book production, and has still got this book to publication in a very timely manner. However, as always, and in spite of the help of these people, any errors that remain in the book are entirely my responsibility.
Finally, I would like to acknowledge the contributions made by the many students I have taught over the years in this area of survey design. The interactions we have had, the feedback I have received, and the enjoyment I have had in being able to teach this material and see students understand and appreciate what good survey design entails have been most rewarding and have also contributed to the development of this book. I hope that they and future students will find this book to be of help to them and a continuing reference to some of those points that we have discussed.
Peter Stopher
Blackheath, New South Wales
August 2011
1.1 The purpose of this book
There are a number of books available that treat various aspects of survey design, sampling, survey implementation, and so forth (examples include Cochran, 1963; Dillman, 1978, 2000; Groves and Couper, 1998; Kish, 1965; Richardson, Ampt, and Meyburg, 1995; and Yates, 1965). However, there does not appear to be a single book that covers all aspects of a survey, from the inception of the survey itself through to archiving the data. This is the purpose of this book. The reader will find herein a complete treatment of all aspects of a survey, including all the elements of design, the requirements for testing and refinement, fielding the survey, coding and analysing the resulting data, documenting what happened, and archiving the data, so that nothing is lost from what is inevitably an expensive process.
This book concentrates on surveys of human populations, which are both more challenging generally and more difficult both to design and to implement than most surveys of non-human populations. In addition, because of the background of the author, examples are drawn mainly from surveys in the area of transport planning. However, the examples are purely illustrative; no background is needed in transport planning to understand the examples, and the principles explained are applicable to any survey that involves human response to a survey instrument. In spite of this focus on human participation in the survey process, there are occasional references to other types of surveys, especially observational and counting types of surveys.
In writing this book, the author has tried to make this as complete a treatment as possible. Although extensive references are included to numerous publications and books in various aspects of measuring data, the reader should be able to find all that he or she requires within the covers of this book. This includes a chapter on some basic aspects of statistics and probability that are used subsequently, particularly in the development of the statistical aspects of surveys.
In summary, then, the purpose of this book is to provide the reader with an extensive and, as far as possible, exhaustive treatment of issues involved in the design and execution of surveys of human populations. It is the intent that, whether the reader is a student, a professional who has been asked to design and implement a survey, or someone attempting to gain a level of knowledge about the survey process, all questions will be answered within these pages. This is undoubtedly a daunting task. The reader will be able to judge the extent to which this has been achieved. The book is also designed so that someone who has no prior knowledge of statistics, probability, surveys, or the purposes to which surveys may be put can pick up and read this book, gaining knowledge and expertise in doing so. At the same time, this book is designed as a reference book. To that end, an extensive index is provided, so that the user of this book who desires information on a particular topic can readily find that topic, either from the table of contents, or through the index.
1.2 Scope of the book
As noted in the previous section, the book starts with a treatment of some basic statistics and probability. The reader who is familiar with this material may find it appropriate to skip this chapter. However, for those who have already learnt material of this type but not used it for a while, as well as those who are unfamiliar with the material, it is recommended that this chapter be used as a means for review, refreshment, or even first-time learning. It is then followed by a chapter that outlines some basic issues of surveys, including a glossary of terms and definitions that will be found helpful in reading the remainder of the book. A number of fundamental issues, pertinent to overall survey design, are raised in this chapter. Chapter 4 introduces the topic of the ethics of surveys, and outlines a number of ethical issues and proposes a number of basic ethical standards to which surveys of human populations should adhere. The fifth chapter of the book discusses the primary issues of designing a survey. A major underlying theme of this chapter is that there is no such thing as an ‘all-purpose survey’. Experience has repeatedly demonstrated that only surveys designed with a clear purpose in mind can be successful.
The next nine chapters deal with all the various design issues in a survey, given that we have established the overall purpose or purposes of the survey. The first of these chapters (Chapter 6) discusses and describes all the current methods that are available for conducting surveys of human populations, in which people are asked to participate in the survey process. Mention is also made of some methods of dealing with other types of survey that are appropriate when the objects of the survey are observed in some way and do not participate in the process. In Chapter 7, the topic of focus groups is introduced, and potential uses of focus groups in designing quantitative and qualitative surveys are discussed. The chapter does not provide an exhaustive treatment of this topic, but does provide a significant amount of detail on how to organise and design focus groups. In Chapter 8, the design of survey instruments is discussed at some length. Illustrations of some principles of design are included, drawn principally from transport and related surveys. Chapters 9 and 10 deal with issues relating to question design and question wording and special issues relating to qualitative and preference surveys. Chapter 11 deals with the design of data collection procedures themselves, including such issues as item and unit nonresponse, what constitutes a complete response, the use of proxy reporting and its effects, and so forth. The seventh of this group of chapters (Chapter 12) deals with pilot surveys and pretests – a topic that is too often neglected in the design of surveys. A number of issues in designing and undertaking such surveys and tests are discussed. Chapter 13 deals with the topic of sample design and sampling issues. In this chapter, there is extensive treatment of the statistics of sampling, including estimation of sampling errors and determination of sample sizes. The chapter describes most of the available methods of sampling, including simple random samples, stratified samples, multistage samples, cluster samples, systematic samples, choice-based samples, and a number of sampling methods that are often considered but that should be avoided in most instances, such as quota samples, judgemental samples, and haphazard samples.
Chapter 14 addresses the topic of repetitive surveys. Many surveys are intended to be done as a ‘one-off’ activity. For such surveys, the material covered in the preceding chapters is adequate. However, there are many surveys that are intended to be repeated from time to time. This chapter deals with such issues as repeated cross-sectional surveys, panel surveys, overlapping samples, and continuous surveys. In particular, this chapter provides the reader with a means to compare the advantages and disadvantages of the different methods, and it also assists in determining which is appropriate to apply in a given situation.
Chapter 15 builds on the material in the preceding chapters and deals with the issue of survey economics. This is one of the most troublesome areas, because, as many companies have found out, it is all too easy to be bankrupted by a survey that is undertaken without a real understanding and accounting of the costs of a survey. While information on actual costs will date very rapidly, this chapter attempts to provide relative data on costs, which should help the reader estimate the costs of different survey strategies. This chapter also deals with many of the potential trade-offs in the design of surveys.
Chapter 16 delves into some of the issues relating to the actual survey implementation process. This includes issues relating to training survey interviewers and monitoring the performance of interviewers, and the chapter discusses some of the danger signs to look for during implementation. This chapter also deals with issues regarding the ethics of survey implementation, especially the relationships between the survey firm, the client for the survey, and the members of the public who are the respondents to the survey. Chapter 17 introduces a topic that is becoming of increasing interest: Web-based surveys. Although this is a field that is as yet quite young, there are an increasing number of aspects that have been researched and from which the reader can benefit. Chapter 18 deals with the process of coding and data entry. A major issue in this topic is the geographic coding of places that may be requested in a survey.
Chapter 19 addresses the topics of data expansion and weighting. Data expansion is outlined as a function of the sampling method, and statistical procedures for expanding each of the different types of sample are provided in this chapter. Weighting relates to problems of survey bias, resulting either from incomplete coverage of the population in the sampling process or from nonresponse by some members of the subject population. This is an increasingly problematic area for surveys of human populations, resulting from a myriad of issues relating to voluntary participation. Chapter 20 addresses the issue of nonresponse more completely. Here, issues of who is likely to respond and who is not are discussed. Methods to increase response rates are described, and reference is made again to the economics of the survey design. The question of computing response rates is also addressed in this chapter. This is usually the most widely recognised statistic for assessing the quality of a survey, but it is also a statistic that is open to numerous methods of computation, and there is considerable doubt as to just what it really means.
Chapter 21 deals with a range of other measures of data quality, some that are general and some, by way of example, that are specific to surveys in transport. These measures are provided as a way to illustrate how survey-specific measures of quality can be devised, depending on the purposes of the survey. Chapter 22 discusses some issues of the future of human population surveys, especially in the light of emerging technologies and their potential application and misapplication to the survey task.
Chapter 23, the final chapter in the book, covers the issues of documenting and archiving the data. This all too often neglected area of measuring data is discussed at some length. A list of headings for the final report on the survey is provided, along with suggestions as to what should be included under the headings. The issue of archiving data is also addressed at some length. Data are expensive to collect and are rarely archived appropriately. The result is that many expensive surveys are effectively lost soon after the initial analyses are undertaken. In addition, knowledge about the survey is often lost when those who were most centrally involved in the survey move on to other assignments, or leave to work elsewhere.
1.3 Survey statistics
Statistics in general, and survey statistics in particular, constitute a relatively young area of theory and practice. The earliest instance of the use of statistics is probably in the middle of the sixteenth century, and related to the start of data collection in France regarding births, marriages, and deaths, and in England to the collection of data on deaths in London each week (Berntson et al., 2005). It was then not until the middle of the eighteenth century that publications began to appear advancing some of the earliest theories in statistics and probability. However, much of the modern development of statistics did not take place until the late nineteenth and early twentieth centuries (Berntson et al., 2005):
Beginning around 1880, three famous mathematicians, Karl Pearson, Francis Galton and Edgeworth, created a statistical revolution in Europe. Of the three mathematicians, it was Karl Pearson, along with his ambition and determination, that led people to consider him the founder of the twentieth-century science of statistics.
It was only in the early twentieth century that most of the now famous names in statistics made their contributions to the field. These included such statisticians as Karl Pearson, Francis Galton, C. R. Rao, R. A. Fisher, E. S. Pearson, and Jerzy Neyman, among many others, who all made major contributions to what we know today as the science of statistics and probability.
Survey sampling statistics is of even more recent vintage. Among the most notable names in this field of study are those of R. A. Fisher, Frank Yates, Leslie Kish, and W. G. Cochran. Fisher may have given survey sampling its birth, both through his own contributions and through his appointment of Frank Yates as assistant statistician at Rothamsted Experimental Station in 1931. In this post, Yates developed, often in collaboration with Fisher, what may be regarded as the beginnings of survey sampling in the form of experimental designs (O’Connor and Robertson, 1997). His book Sampling Methods for Censuses and Surveys was first published in 1949, and it appears to be the first book on statistical sampling designs.
Leslie Kish, who founded the Survey Research Institute at the University of Michigan, is also regarded as one of the founding fathers of modern survey sampling methods, and he published his seminal work, called Survey Sampling, in 1965. Close in time to Kish, W. G. Cochran published his seminal work, Sampling Techniques, in 1963.
Based on these efforts, the science of survey sampling cannot be considered to be much over fifty years old – a very new scientific endeavour. As a result of this relative recency, there is still much to be done in developing the topic of survey sampling, while technologies for undertaking surveys have undergone and continue to undergo rapid evolution. The fact that most of the fundamental books on the topic are about forty years old suggests that it is time to undertake an updated treatise on the topic. Hence, this book has been undertaken.
2.1 Some definitions in statistics
Statistics is defined by the Oxford Dictionary of English Etymology as ‘the political science concerned with the facts of a state or community’, and the word is derived from the German statistisch. The beginning of modern statistics was in the sixteenth century, when large amounts of data began to be collected on the populations of countries in Europe, and the task was to make sense of these vast amounts of data. As statistics has evolved from this beginning, it has become a science concerned with handling large quantities of data, but also with using much smaller amounts of data in an effort to represent entire populations, when the task of handling data on the entire population is too large or expensive. The science of statistics is concerned with providing inputs to political decision making, to the testing of hypotheses (understanding what would happen if …), drawing inferences from limited data, and, considering the data limitations, doing all these things under conditions of uncertainty.
A word used commonly in statistics and surveys is population. The population is defined as the entire collection of elements of concern in a given situation. It is also sometimes referred to as a universe. Thus, if the elements of concern are pre-school children in a state, then the population is all the pre-school children in the state at the time of the study. If the elements of concern are elephants in Africa, then the population consists of all the elephants currently in Africa. If the elements of concern are the vehicles using a particular freeway on a specified day, then the population is all the vehicles that use that particular freeway on that specific day.
It is very clear that statistics is the study of data. Therefore, it is necessary to understand what is meant by data. The word data is a plural noun from the Latin datum, meaning given facts. As used in English, the word means given facts from which other facts may be inferred. Data are fundamental to the analysis and modelling of real-world phenomena, such as human populations, the behaviour of firms, weather systems, astronomical processes, sociological processes, genetics, etc. Therefore, one may state that statistics is the process for handling and analysing data, such that useful conclusions can be drawn, decisions made, and new knowledge accumulated.
Another word used in connection with statistics is observation. An observation may be defined as the information that can be seen about a member of a subject population. An observation comprises data about relevant characteristics of the member of the population. This population may be people, households, galaxies, private firms, etc. Another way of thinking of this is that an observation represents an appropriate grouping of data, in which each observation consists of a set of data items describing one member of the population.
A parameter is a quantity that describes some property of the population. Parameters may be given as numbers, proportions, or percentages. For example, the number of male pre-school children in the state might be 16,897, and this number is a parameter. The proportion of baby elephants in Africa might be 0.39, indicating that 39 per cent of all elephants in Africa at this time are babies. This is also a parameter. Sometimes, one can define a particular parameter as being critical to a decision. This would then be called a decision parameter. For example, suppose that a decision is to be made as to whether or not to close a primary school. The decision parameter might be the number of schoolchildren that would be expected to attend that school in, say, the next five years.
A sample is some subset of a population. It may be a large proportion of the population, or a very small proportion of the population. For example, a survey of Sydney households, which comprise a population of about 1,300,000, might consist of 130,000 households (a 10 per cent sample) or 300 households (a 0.023 per cent sample).
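The sample percentages quoted for Sydney can be checked with a short calculation. The following is a minimal sketch in Python; the function name is an assumption introduced here for illustration and does not come from the text.

```python
def sampling_fraction_percent(sample_size: int, population_size: int) -> float:
    """Return the sample size as a percentage of the population size."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 <= sample_size <= population_size:
        raise ValueError("sample_size must lie between 0 and population_size")
    return 100.0 * sample_size / population_size


SYDNEY_HOUSEHOLDS = 1_300_000  # approximate population quoted in the text

large = sampling_fraction_percent(130_000, SYDNEY_HOUSEHOLDS)
small = sampling_fraction_percent(300, SYDNEY_HOUSEHOLDS)

print(f"130,000 households: {large:.1f} per cent sample")
print(f"300 households: {small:.3f} per cent sample")
```

Both samples are subsets of the same population; only the fraction of the population observed differs.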
A statistic is a numerical quantity that describes a sample. It is therefore the equivalent of a parameter, but for a sample rather than the population. For example, a survey of 130,000 households in Sydney might have shown that 52 per cent of households own their own home or are buying it. This would be a statistic. If, on the other hand, a figure of 54 per cent was determined from a census of the 1,300,000 households, then this figure would be a parameter.
Statistical inference is the process of making statements about a population based on limited evidence from a sample study. Thus, if a sample of 130,000 households in Sydney was drawn, and it was determined that 52 per cent of these owned or were purchasing their homes, then statistical inference would lead one to propose that this might mean that 676,000 (52 per cent of 1,300,000) households in Sydney own or are purchasing their homes.
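The distinction between a parameter, a statistic, and an inference can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is synthetic and scaled down to 100,000 households, the 54 per cent ownership rate is borrowed from the example in the text, and the seed is arbitrary.

```python
import random

random.seed(42)  # arbitrary fixed seed so the sketch is repeatable

POPULATION_SIZE = 100_000  # hypothetical, scaled down from 1,300,000
TRUE_PROPORTION = 0.54     # census figure used as an example in the text

# Synthetic population: True means the household owns or is buying its home.
owners = round(POPULATION_SIZE * TRUE_PROPORTION)
population = [True] * owners + [False] * (POPULATION_SIZE - owners)

# Parameter: computed from every member of the population (a census).
parameter = sum(population) / len(population)

# Statistic: computed from a 10 per cent simple random sample.
sample = random.sample(population, POPULATION_SIZE // 10)
statistic = sum(sample) / len(sample)

# Inference: expand the sample proportion to the whole population.
estimated_owners = round(statistic * POPULATION_SIZE)

# The expansion quoted in the text: 52 per cent of 1,300,000 households.
text_estimate = 1_300_000 * 52 // 100

print(f"parameter = {parameter:.4f}, statistic = {statistic:.4f}")
print(f"estimated owning households = {estimated_owners}")
print(f"text example = {text_estimate}")
```

The statistic will usually differ a little from the parameter, which is why the inference is phrased as a proposal rather than a certainty.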
2.1.1 Censuses and surveys
Of particular relevance to this book is the fact that there are two methods for collecting data about a population of interest. The first of these is a census, which involves making observations of every member of the population. Censuses of the human population have been undertaken in most countries of the world for many years. There are references in the Bible to censuses taken among the early Hebrews, and later by the Romans at the time of the birth of Christ. In Europe, most censuses began in the eighteenth century, although a few began earlier than that. In the United States of America, censuses began in 1790. Many countries undertake a census once in each decade, either in the year ending in zero or in one. Some countries, such as Australia, undertake a census twice in each decade. A census may be as simple as a head count (enumerating the total size of the population) or it may be more complex, by collecting data on a number of characteristics of each member of the population, such as name, address, age, country of birth, etc.
A survey is similar to a census, except that it is conducted on a subset of the population, not the entire population. A survey may involve a large percentage of the population or may be restricted to a very small sample of the population. Much of the science of survey statistics has to do with how one makes a small sample represent the entire population. This is discussed in much more detail in the next chapter. A survey, by definition, always involves a sample of the population. Therefore, to speak of a 100 per cent sample is contradictory; if it is a sample, it must be less than 100 per cent of the population.
2.2 Describing data
One of the first challenges for statistics is to describe data. Obviously, one can provide a complete set of data to a decision maker. However, the human mind is not capable of utilising such information efficiently and effectively. For example, a census of the United States would produce observations on over 300 million people, while one of India would produce observations on over 1 billion people. A listing of those observations represents something that most human beings would be incapable of utilising. What is required, then, is to find some ways to simplify and describe data, so that useful information is preserved but the sheer magnitude of the underlying data is hidden, thereby not distracting the human analyst or decision maker.
Before examining ways in which data might be presented or described, such that the mind can grasp the essential information contained therein, it is important to understand the nature of different types of data that can be collected. To do this, it seems useful to consider the measurement of a human population, especially since that is the main topic of the balance of this book.
In mathematical statistics, we refer to things called variables. A variable is a characteristic of the population that may take on differing or varying values for different members of the population. Thus, variables that could be used to describe members of a human population may include such characteristics as name, address, age or date of birth, place of birth, height, weight, eye colour, hair colour, and shoe size. Each of these characteristics provides differing levels of information that can be used in various ways. We can divide these characteristics into four different types of scales, a scale representing a way of measuring the characteristic.
be ordered alphabetically or can be ordered in any of a number of arbitrary ways, such as the order in which data are collected on individuals. However, no information is provided by changing the order of the names. Therefore, the only thing that the name provides is a label for each member of the population. This is called a nominal scale. A nominal scale is the least informative of the different types of scales that can be used to measure characteristics, but its lack of other information does not render it of less value. Other examples of nominal data are the colours of hair or eyes of the members of the population, bus route numbers, the numbers assigned to census collection districts, names of firms listed on a country’s stock exchange, and the names of magazines stocked by a newsagency.
Ordinal scales
Each person in the population has an address. The address will usually include a house number and a street name, along with the name of the town or suburb in which the house is located. The address clearly also represents a label, just as does the person’s name. However, in the case of the address, there is more information provided. If the addresses are sorted by number and by street, in most places in the world this will provide additional information. These sorted addresses will actually help an investigator to locate each home, in that it is expected that the houses are arranged in numerical order along the street, and probably with odd numbers on one side of the street and even numbers on the other side. As a result, there is order information provided in the address. It is, therefore, known as an ordinal scale. However, if it is known that one person lives at 27 Main Street, and another person lives at 35 Main Street, this does not indicate how far apart these two people live. In some countries, they could be next door to each other, while in others there might be three houses between them or even seven houses between them (if numbering goes down one side of the street and back on the other). The only thing that would be known is that, starting at the first house on Main Street, one would arrive at 27 before one would arrive at 35. Therefore, order is the only additional information provided by this scale. Other examples of ordinal scales would be the list of months in the year, censor ratings of movies, and a list of runners in the order in which they finished a race.
a size represents a constant increase in the length of the shoe. Thus, the difference between a size nine and a size ten shoe for a man is the same as the difference between a size eight and a size nine, and so on for any two adjacent numbers. In other words, there is a constant interval between each shoe size. On the other hand, there is no natural zero in this scale (in fact, a size of zero generally does not exist), and it is not true that a size five is half the length of a size ten. Therefore, shoe size may be considered to be an interval scale. Women’s dress sizes in a number of countries also represent an interval scale, in which each increment in dress size represents a constant interval of increase in size of the dress, but a size sixteen dress is not twice as large as a size eight. In many cases, the sizing of an item of clothing as small, medium, large, etc. also represents an interval scale. Another example of an interval scale is the normal scale of temperature in either degrees Celsius or degrees Fahrenheit. An interval of one degree represents the same increase or decrease in temperature, whether it is between 40 and 41 or 90 and 91. However, we are not able to state that 60 degrees is twice as hot as 30 degrees. There is also not a natural zero on either the Celsius scale or the Fahrenheit scale. Indeed, the Celsius scale sets the temperature at which water freezes as 0, but the Fahrenheit scale sets this at 32, and there is not a particular physical property of the zero on the Fahrenheit scale.
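The temperature example can be made concrete by converting the same readings between Celsius and Fahrenheit. The sketch below uses the standard conversion formula; the variable names are assumptions chosen for illustration. It suggests why ratios carry no meaning on an interval scale (the ratio changes with the unit), while equal intervals remain equal.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


# The same two temperatures expressed on both scales.
low_c, high_c = 30.0, 60.0
low_f = celsius_to_fahrenheit(low_c)    # 86 degrees Fahrenheit
high_f = celsius_to_fahrenheit(high_c)  # 140 degrees Fahrenheit

# Ratios are not preserved: 2.0 in Celsius, about 1.63 in Fahrenheit.
ratio_c = high_c / low_c
ratio_f = high_f / low_f

# Equal intervals stay equal: one Celsius degree is the same step
# in Fahrenheit whether it is taken at 40 or at 90.
step_low_f = celsius_to_fahrenheit(41.0) - celsius_to_fahrenheit(40.0)
step_high_f = celsius_to_fahrenheit(91.0) - celsius_to_fahrenheit(90.0)

print(f"ratio in Celsius: {ratio_c:.2f}, ratio in Fahrenheit: {ratio_f:.2f}")
print(f"one-degree steps in Fahrenheit: {step_low_f:.2f} and {step_high_f:.2f}")
```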
in these measures. There is ratio information. In other words, we know that a person who is 180 centimetres tall is twice as tall as a person who is 90 centimetres tall, and that a person weighing 45 kilograms is only half the weight of a person weighing 90 kilograms. There are two important new pieces of information provided by these measures. First, there is a natural zero in the measurement scale. Both weight and height have a zero point, which represents the absence of weight or the absence of height. Second, there is a multiplicative relationship among the measures on the scale, not just an additive one. Therefore, both weight and height are described as ratio scales. Other examples of ratio scales are distance or length measures, measures of speed, measures of elapsed time, and so forth. However, it should be noted that measurement of clock time is interval-scaled (there is no natural zero, and 5 a.m. is not a half of 10 a.m.), while elapsed time is ratio-scaled, because zero minutes represents the absence of any elapsed time, and twenty minutes is twice as long as ten minutes, for example.
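The difference between clock time and elapsed time happens to be mirrored in Python's standard `datetime` module, which can serve as an informal illustration. In this sketch the calendar date is arbitrary and chosen only so that the clock times can be constructed.

```python
from datetime import datetime, timedelta

# Two clock times on an arbitrary, hypothetical date.
five_am = datetime(2024, 1, 1, 5, 0)
ten_am = datetime(2024, 1, 1, 10, 0)

# Subtracting clock times is meaningful and yields an elapsed time.
elapsed = ten_am - five_am

# Elapsed times are ratio-scaled: dividing one by another is defined.
ratio = timedelta(minutes=20) / timedelta(minutes=10)

# Clock times are interval-scaled: dividing one by another is not defined,
# and Python raises TypeError if it is attempted.
try:
    ten_am / five_am
    clock_ratio_defined = True
except TypeError:
    clock_ratio_defined = False

print(f"elapsed: {elapsed}, ratio of elapsed times: {ratio}")
print(f"ratio of clock times defined: {clock_ratio_defined}")
```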
contains the information of the previous type of scale, and then adds new information content. Thus an ordinal scale also has nominal information, but adds to that information on order; an interval scale has both nominal and ordinal information, but adds to that a consistent interval of measurement; and a ratio scale contains all nominal, ordinal, and interval information, but adds ratio relationships to them.
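The cumulative relationship among the four types of scale can be sketched as an ordered enumeration. The class, dictionary, and function names below are assumptions introduced for illustration; the example variables are the ones used in the text.

```python
from enum import IntEnum


class Scale(IntEnum):
    """The four scale types, ordered from least to most informative."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# The information each scale adds to the scale below it.
ADDED_INFORMATION = {
    Scale.NOMINAL: "label",
    Scale.ORDINAL: "order",
    Scale.INTERVAL: "constant interval",
    Scale.RATIO: "ratio and natural zero",
}

# Example variables from the text and the scale each is measured on.
EXAMPLES = {
    "name": Scale.NOMINAL,
    "address": Scale.ORDINAL,
    "shoe size": Scale.INTERVAL,
    "height": Scale.RATIO,
}


def information_content(scale: Scale) -> list[str]:
    """List everything a scale provides, including what it inherits."""
    return [ADDED_INFORMATION[s] for s in Scale if s <= scale]


for variable, scale in EXAMPLES.items():
    print(f"{variable}: {', '.join(information_content(scale))}")
```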
There are two other ways in which scales can be described, because most scales can be measured in different ways. The first of these relates to whether the scale is continuous or discrete. A continuous scale is one in which the measurement can be made to any degree of precision desired. For example, we can measure elapsed time to the nearest hour, or minute, or second, or nanosecond, etc. Indeed, the only thing that limits the precision by which we can measure this scale is the precision of our instruments for measurement. However, there is no natural limit to precision in such cases. This is a continuous scale. A discrete scale, on the other hand, cannot be subdivided beyond a certain point. For example, shoe sizes are a discrete scale. Many shoe manufacturers will provide shoes in half-size increments, while others will provide them only in whole-size increments. Subdivision below half sizes simply is not done. Similarly, any measurement that involves counting objects, such as counting the number of members of a population, is a discrete scale. We cannot have fractional people, fractional houses, or fractional cars, for example.
The second descriptor of a scale is whether it is inherently exact or approximate. By their nature, all continuous scales are approximate. This is so because we can always increase the precision of measurement. Generally, numbers obtained from counting are exact, unless the counting mechanism is capable of error. However, other discrete scales may be approximate or exact. In most clothing or shoe sizes, the measure would be considered approximate, because sizes often differ between manufacturers, and between countries. A size nine shoe is not the same size in the United States and in the United Kingdom, for example, nor is it necessarily the same size from two different shoe makers in the same country.
It is important to recognise what type of a scale we are dealing with, when information is measured on scales, because the type of scale will also often either dictate how the information can be presented or restrict the analyst to certain ways of presentation. Similarly, whether the measure is discrete or continuous will also affect the presentation of the data, as will, in some cases, whether the data are approximate or exact.
2.2.2 Data presentation: graphics
It is appropriate to start with some simple rules about graphical presentations. There are four principal types of graphical presentation: scatter plots, pie charts, histograms or bar charts, and line graphs.
A scatter plot is a plot of the frequency with which specific values of a pair of variables occur in the data. Thus, the X-axis of the plot will contain the values of one of the variables that are found in the data, and the Y-axis will contain the values of the other variable. As such, any type of measure can be presented on a scatter plot. However, if all values occur only once – i.e., are unique to an observation – then a scatter plot is of no particular interest. Therefore, although any data can theoretically be plotted on a scatter plot, data that represent unique values, or continuous data that will probably have frequencies of only one or two at most for any pair of values, will not be illuminated by a scatter plot.
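The frequencies that underlie a scatter plot can be tabulated before anything is drawn. The following sketch counts how often each pair of values occurs and checks whether every pair is unique. The observations are hypothetical pairs of coded values invented for illustration, not data from any survey.

```python
from collections import Counter

# Hypothetical observations: (body type code, fuel type code) per vehicle.
observations = [
    (1, 1), (1, 1), (1, 2), (2, 1),
    (2, 2), (3, 4), (1, 1), (6, 1),
]

# Frequency with which each pair of values occurs in the data.
pair_counts = Counter(observations)

# If every pair occurs exactly once, each plotted point stands for a
# single observation and the frequencies themselves show nothing.
all_unique = all(count == 1 for count in pair_counts.values())

for pair, count in sorted(pair_counts.items()):
    print(f"pair {pair}: {count} observation(s)")
print(f"every pair unique: {all_unique}")
```

A plain scatter plot draws one marker per distinct pair, so the counts computed here would be lost unless marker size or a label were used to convey them.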
An example of a scatter plot is provided in Figure 2.1, which shows a scatter plot of odometer readings of cars versus the model year of the vehicle. The Y-axis is a ratio-scaled variable, and the X-axis is an interval-scaled variable. The scatter plot indicates that there probably is a relationship between odometer readings and model year, such that the higher the model year value, the lower the odometer reading, as would be expected. This is a useful scatter plot.
Figure 2.2 illustrates a scatter plot of two nominal-scaled variables: fuel type versus body type. It is not a very useful illustration of the data. First, we cannot tell how many points fall at each combination of values. Second, all it really tells us is that there are no taxis (body type 5) in this data set, that all vehicle types use petrol (fuel type 1), that all except motorcycles (body type 6) use diesel (fuel type 2), and that only cars (body type 1), four-wheel drive (4WD) vehicles (body type 2), and utility/van/panel vans (body type 3) use dual fuel (fuel type 4). This illustrates that nominal data – both fuel type and body type are nominal scales – may not produce a useful scatter plot.
A pie chart is a circle that is divided into segments representing specific values in the data, with the length of the segment along the circumference of the circle indicating how frequently the value occurs in the data. Again, pie charts can be used with any type of data, when the information to be presented is the frequency of occurrence. However, they will generally not work with continuous data, unless the data are first grouped and converted to discrete categories. An example of a pie chart is provided in Figure 2.3. This shows that the pie chart works well for nominal data, in this case the vehicle body type from a survey of households.
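The size of each pie segment follows directly from the frequencies: each category receives a share of the 360 degrees of the circle equal to its share of the observations. A minimal sketch, using hypothetical counts that are not taken from the survey behind Figure 2.3:

```python
# Hypothetical counts of vehicles by body type (not the survey's figures).
counts = {
    "Car": 60,
    "4WD": 20,
    "Utility vehicle": 12,
    "Motorcycle": 8,
}

total = sum(counts.values())

# Each segment's angle is proportional to its frequency of occurrence.
angles = {body: 360.0 * count / total for body, count in counts.items()}

for body, angle in angles.items():
    share = 100.0 * counts[body] / total
    print(f"{body}: {share:.0f} per cent of vehicles, {angle:.1f} degrees")
```

Because arc length is proportional to the angle, the same proportions govern the length of each segment along the circumference.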
Figure 2.4 shows a pie chart for category data – i.e., discrete data. The data are reported household incomes from a survey of households. The categories were those used in the survey. Income, being measured in dollars and with a natural zero, is actually a ratio scale. In the categories collected, income is a ratio-scaled discrete measure. Again, the pie chart provides a good representation of the data.
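Where income is recorded as an amount rather than as a category, it has to be grouped before it can be shown on a pie chart. The sketch below shows one way this might be done. The category boundaries, labels, and incomes are all hypothetical and are not the categories used in the survey behind Figure 2.4.

```python
import bisect
from collections import Counter

# Hypothetical category boundaries in dollars, with matching labels.
EDGES = [25_000, 50_000, 100_000]
LABELS = ["under 25,000", "25,000-49,999", "50,000-99,999", "100,000 and over"]


def income_group(income: float) -> str:
    """Map a reported income to the label of the category it falls in."""
    if income < 0:
        raise ValueError("income cannot be negative")
    # bisect_right places a value equal to a boundary in the higher group.
    return LABELS[bisect.bisect_right(EDGES, income)]


# Hypothetical reported household incomes.
incomes = [18_500, 25_000, 47_200, 63_000, 99_999, 100_000, 250_000]

group_counts = Counter(income_group(income) for income in incomes)

for label in LABELS:
    print(f"{label}: {group_counts[label]} household(s)")
```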
A histogram or bar chart is used for presenting discrete data. Such data will be interval- or ratio-scaled data. Histograms can be constructed in several different ways. When presenting complex information, bars can be stacked, showing how different
Figure 2.3 Pie chart of vehicle body types (legend: 4WD, car, motorcycle, other, taxi, truck, utility vehicle)
Figure 2.4 Pie chart of household income groups