2012 (EBOOK) collecting, managing, and assessing data using sample surveys

Collecting, Managing, and Assessing Data Using Sample Surveys provides a thorough, step-by-step guide to the design and implementation of surveys. Beginning with a primer on basic statistics, the first half of the book takes readers on a comprehensive tour through the basics of survey design. Topics covered include the ethics of surveys, the design of survey procedures, the design of the survey instrument, how to write questions, and how to draw representative samples. Having shown readers how to design surveys, the second half of the book discusses a number of issues surrounding their implementation, including repetitive surveys, the economics of surveys, Web-based surveys, coding and data entry, data expansion and weighting, the issue of nonresponse, and the documenting and archiving of survey data. The book is an excellent introduction to the use of surveys for graduate students as well as a useful reference work for scholars and professionals.

Peter Stopher is Professor of Transport Planning at the Institute of Transport and Logistics Studies at the University of Sydney. He has also been a professor at Northwestern University, Cornell University, McMaster University, and Louisiana State University. Professor Stopher has developed a substantial reputation in the field of data collection, particularly for the support of travel forecasting and analysis. He pioneered the development of travel and activity diaries as a data collection mechanism, and has written extensively on issues of sample design, data expansion, nonresponse biases, and measurement issues.

Collecting, Managing, and Assessing Data Using Sample Surveys

Peter Stopher

Cambridge University Press

The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org

Information on this title: www.cambridge.org/9780521681872

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2012

Printed in the United Kingdom at the University Press, Cambridge

A catalogue record for this publication is available from the British Library

ISBN 978-0-521-86311-7 Hardback

ISBN 978-0-521-68187-2 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

To my wife, Carmen, with grateful thanks for your faith in me and your continuing support and encouragement.

Contents

Some useful properties of variances and standard deviations
Other measures of variability
3.1.1 A definition of sampling methodology
4.3.1 Retaining detail and confidentiality
6 Methods for conducting surveys of human populations
7.2.1 The size and number of focus groups
7.2.3 Analysing the focus group discussions
8.2.1 Classification and behaviour questions
8.4.3 Some general issues on question layout
Appearance of the survey
9.2.7 Use of categories and other responses
9.3.4 Avoid using ‘Tick all that apply’ formats
9.3.5 Develop response categories that are mutually exclusive
10.3 Stated response questions
10.3.3 Number of choice alternatives or scenarios
10.4 Some concluding comments on stated response survey design
11.4.2 Completeness of aggregate sampling units
11.7.5 External factors
Estimating population statistics and sampling errors
Sampling error of ratios and proportions
Stratified sampling with a uniform sampling fraction
Estimating population statistics and sampling errors
Equal allocation
Stratified sampling with variable sampling fraction
Estimating population statistics and sampling errors
Non-coincident study domains and strata
Optimum allocation and economic design
Estimating population values and sampling statistics
Equal clusters: population values and standard errors
Unequal clusters: population values and standard errors
Random selection of unequal clusters
14 Repetitive surveys
14.4 Subsampling on the second and subsequent occasions
14.6 Practical issues in designing and conducting panel surveys
14.8 Methods for administering practical panel surveys
15.3.2 Telephone recruitment with a postal survey with or
16.5.1 Frequently asked questions, fact sheet, or brochure
16.8.3 Repeated requests for callback
16.12 Summary comments on survey implementation
17.2 The internet as an optional response mechanism
17.3.1 Differences between paper and internet surveys
17.3.3 Ability to fill in the Web survey in multiple sittings
19 Data expansion and weighting
20.2.2 Reducing nonresponse and increasing response rates
Multiple imputation
21.5 Adherence to quality measures and guidance
22.3.1 A GPS survey as a potential substitute for a household
The effect of multiple observations of each respondent
23 Documenting and archiving

Figures

2.1 Scatter plot of odometer reading versus model year
2.7 Line graph of maximum and minimum temperatures for thirty days
2.8 Ogive of cumulative household income data from Figure 2.5
2.14 Distribution of maximum temperatures from Table 2.4
2.15 Distribution of minimum temperatures from Table 2.4
2.18 Box and whisker plot of income data from Table 2.5
2.19 Box and whisker plot of maximum temperatures
2.20 Box and whisker plot of minimum temperatures
2.21 Box and whisker plot of vehicles passing through the green phase
2.24 Comparison of normal distributions with different variances
2.25 Scatter plot of maximum versus minimum temperature
3.1 Extract of random numbers from the RAND Million Random Digits
4.2 First page of an example subject information sheet
4.3 Second page of the example subject information sheet
8.2 Example of an unacceptable questionnaire format
8.3 Example of an acceptable questionnaire format
8.4 Excerpt from a survey showing arrows to guide respondent
8.5 Extract from a questionnaire showing use of graphics
8.6 Columned layout for asking identical questions about multiple people
8.7 Inefficient and efficient structures for organising serial questions
8.8 Instructions placed at the point to which they refer
8.9 Example of an unacceptable questionnaire format with response codes
9.1 Example of a sequence of questions that do not require answers
9.2 Example of a sequence of questions that do require answers
9.4 Example of a belief question with a more vague response
9.5 Two alternative response category sets for the age question
9.7 Examples of questions with unordered response categories
9.8 An example of mixed ordered and unordered categories
9.10 An unordered alternative to the question in Figure 9.8
9.12 Example of a failure to achieve mutual exclusivity and exhaustiveness
9.13 Correction to mutual exclusivity and exhaustiveness
9.16 An alternative that keeps the wording of the measure
9.17 An alternative way to deal with a double-barrelled question
10.2 Example of a qualitative question using number categories
10.3 Example of unbalanced positive and negative categories
10.4 Example of balanced positive and negative categories
10.5 Example of placing the neutral option at the end
10.6 Example of distinguishing the neutral option from ‘No opinion’
10.7 Use of columned layout for repeated category responses
10.8 Alternative layout for repeated category responses
10.11 Rephrasing questions to remove requirement for ‘Agree’/‘Disagree’
11.1 Example of a postcard reminder for the first reminder
11.2 Framework for understanding respondent burden
14.1 Schematic of the four types of repetitive samples
14.2 Rotating panel showing recruitment, attrition, and rotation
18.1 An unordered set of responses requiring coding
18.2 A possible format for asking for an address
20.1 Illustration of the categorisation of response outcomes

Tables

2.1 Frequencies and proportions of vehicle types
2.2 Frequencies, proportions, and cumulative values for household
2.3 Minimum and maximum temperatures for a month (°C)
2.6 Growth rates of an investment fund, 1993–2004
2.9 Number of vehicles passing through the green phase of a traffic light
2.10 Sorted number of vehicles passing through the green phase
2.12 Deviations from the mean for the income data of Table 2.5
2.14 Sorted number of vehicles passing through the green phase
2.15 Deviations for vehicles passing through the green phase
2.16 Values of variance and standard deviation for values of p and q
2.17 Deviations for vehicles passing through the green phase raised to third
2.18 Deviations from the mean for children’s ages
2.19 Data on household size, annual income, and number of vehicles for
2.20 Deviations needed for covariance and correlation estimates
3.1 Heights of 100 (fictitious) university students (cm)
3.5 Random sample of ten students (in order drawn)
6.2 Mixed-mode survey types (based on Dillman and Tarnai, 1991)
13.1 Partial listing of households for a simple random sample
13.2 Excerpt of random numbers from the RAND Million Random Digits
13.3 Selection of sample of 100 members using four-digit groups from
13.4 Data from twenty respondents in a fictitious survey
13.6 Data for drawing an optimum household travel survey sample
13.7 Optimal allocation of the 2,000-household sample
13.8 Optimal allocation and expected sampling errors by stratum
13.9 Results of equal allocation for the household travel survey
13.10 Given information for economic design of the optimal allocation
13.11 Preliminary sample sizes and costs for economic design of the
13.12 Estimation of the final sample size and budget
13.13 Comparison of optimal allocation, equal allocation, and economic
13.14 Comparison of sampling errors from the three sample designs
13.15 Desired stratum sample sizes and results of recruitment calls
13.17 Two-stage sample of students from the university
13.18 Multistage sample using disproportionate sampling at the first stage
13.19 Calculations for standard error from sample in Table 13.18
13.23 Calculations for paired selections and successive differences
18.1 Potential complex codes for income categories
18.2 Example codes for use of the internet and mobile phones
19.1 Results of an hypothetical household survey
19.2 Calculation of weights for the hypothetical household survey
19.9 Results of an hypothetical household survey compared to
19.10 Two-way distribution of completed surveys by percentage
19.11 Results of factoring the rows of Table 19.10
19.12 Second iteration, in which columns are factored
19.13 Third iteration, in which rows are factored again
19.14 Weights derived from the iterative proportional fitting
20.1 Final disposition codes for RDD telephone surveys
23.1 Preservation metadata elements and description

As is always the case, many people have assisted in the process that has led to this book. First, I would like to acknowledge all those, too numerous to mention by name, who have helped me over the years to learn and understand some of the basics of designing and implementing surveys. They have been many and they have taught me much of what I now know in this field. However, having said that, I would particularly like to acknowledge those whom I have worked with over the past fifteen years or more on the International Steering Committee for Travel Survey Conferences (ISCTSC), who have contributed enormously to broadening and deepening my own understandings of surveys. In particular, I would like to mention, in no particular order, Arnim Meyburg, Martin Lee-Gosselin, Johanna Zmud, Gerd Sammer, Chester Wilmot, Werner Brög, Juan de Dios Ortúzar, Manfred Wermuth, Kay Axhausen, Patrick Bonnel, Elaine Murakami, Tony Richardson, (the late) Pat van der Reis, Peter Jones, Alan Pisarski, Mary Lynn Tischer, Harry Timmermans, Marina Lombard, Cheryl Stecher, Jean-Loup Madre, Jimmy Armoogum, and (the late) Ryuichi Kitamura. All these individuals have inspired and helped me and contributed in various ways to this book, most of them, probably, without realising that they have done so.

I would also like to acknowledge the support I have received in this endeavour from the University of Sydney, and especially from the director of the Institute of Transport and Logistics Studies, Professor David Hensher. Both David and the university have provided a wide variety of support for the writing and production of this book, for which I am most grateful.

However, most importantly, I would like to acknowledge the enormous support and encouragement from my wife, Carmen, and her patience, as I have often spent long hours working on this book, and her unquestioning faith in me that I could do it. She has been an enduring source of strength and inspiration to me. Without her, I doubt that this book would have been written.

As always, a book can see the light of day only through the encouragement and support of a publisher and those assisting in the publishing process. I would like to acknowledge Chris Harrison of Cambridge University Press, who first thought that this book might be worth publishing and encouraged me to develop the outline for it, and then provided critical input that has helped to shape the book into what it has become. I would also like to thank profusely Mike Richardson, who carefully and thoroughly copy-edited the manuscript, improving immensely its clarity and completeness. I would also like to thank Joanna Breeze, the production editor at Cambridge. She has worked with me with all the delays I have caused in the book production, and has still got this book to publication in a very timely manner. However, as always, and in spite of the help of these people, any errors that remain in the book are entirely my responsibility.

Finally, I would like to acknowledge the contributions made by the many students I have taught over the years in this area of survey design. The interactions we have had, the feedback I have received, and the enjoyment I have had in being able to teach this material and see students understand and appreciate what good survey design entails have been most rewarding and have also contributed to the development of this book. I hope that they and future students will find this book to be of help to them and a continuing reference to some of those points that we have discussed.

Peter Stopher

Blackheath, New South Wales

August 2011

1.1 The purpose of this book

There are a number of books available that treat various aspects of survey design, sampling, survey implementation, and so forth (examples include Cochran, 1963; Dillman, 1978, 2000; Groves and Couper, 1998; Kish, 1965; Richardson, Ampt, and Meyburg, 1995; and Yates, 1965). However, there does not appear to be a single book that covers all aspects of a survey, from the inception of the survey itself through to archiving the data. This is the purpose of this book. The reader will find herein a complete treatment of all aspects of a survey, including all the elements of design, the requirements for testing and refinement, fielding the survey, coding and analysing the resulting data, documenting what happened, and archiving the data, so that nothing is lost from what is inevitably an expensive process.

This book concentrates on surveys of human populations, which are both more challenging generally and more difficult both to design and to implement than most surveys of non-human populations. In addition, because of the background of the author, examples are drawn mainly from surveys in the area of transport planning. However, the examples are purely illustrative; no background is needed in transport planning to understand the examples, and the principles explained are applicable to any survey that involves human response to a survey instrument. In spite of this focus on human participation in the survey process, there are occasional references to other types of surveys, especially observational and counting types of surveys.

In writing this book, the author has tried to make this as complete a treatment as possible. Although extensive references are included to numerous publications and books on various aspects of measuring data, the reader should be able to find all that he or she requires within the covers of this book. This includes a chapter on some basic aspects of statistics and probability that are used subsequently, particularly in the development of the statistical aspects of surveys.

In summary, then, the purpose of this book is to provide the reader with an extensive and, as far as possible, exhaustive treatment of issues involved in the design and execution of surveys of human populations. It is the intent that, whether the reader is a student, a professional who has been asked to design and implement a survey, or someone attempting to gain a level of knowledge about the survey process, all questions will be answered within these pages. This is undoubtedly a daunting task. The reader will be able to judge the extent to which this has been achieved. The book is also designed so that someone who has no prior knowledge of statistics, probability, surveys, or the purposes to which surveys may be put can pick up and read this book, gaining knowledge and expertise in doing so. At the same time, this book is designed as a reference book. To that end, an extensive index is provided, so that the user of this book who desires information on a particular topic can readily find that topic, either from the table of contents or through the index.

1.2 Scope of the book

As noted in the previous section, the book starts with a treatment of some basic statistics and probability. The reader who is familiar with this material may find it appropriate to skip this chapter. However, for those who have already learnt material of this type but not used it for a while, as well as those who are unfamiliar with the material, it is recommended that this chapter be used as a means for review, refreshment, or even first-time learning. It is then followed by a chapter that outlines some basic issues of surveys, including a glossary of terms and definitions that will be found helpful in reading the remainder of the book. A number of fundamental issues, pertinent to overall survey design, are raised in this chapter. Chapter 4 introduces the topic of the ethics of surveys, and outlines a number of ethical issues and proposes a number of basic ethical standards to which surveys of human populations should adhere. The fifth chapter of the book discusses the primary issues of designing a survey. A major underlying theme of this chapter is that there is no such thing as an ‘all-purpose survey’. Experience has repeatedly demonstrated that only surveys designed with a clear purpose in mind can be successful.

The next nine chapters deal with all the various design issues in a survey, given that we have established the overall purpose or purposes of the survey. The first of these chapters (Chapter 6) discusses and describes all the current methods that are available for conducting surveys of human populations, in which people are asked to participate in the survey process. Mention is also made of some methods of dealing with other types of survey that are appropriate when the objects of the survey are observed in some way and do not participate in the process. In Chapter 7, the topic of focus groups is introduced, and potential uses of focus groups in designing quantitative and qualitative surveys are discussed. The chapter does not provide an exhaustive treatment of this topic, but does provide a significant amount of detail on how to organise and design focus groups. In Chapter 8, the design of survey instruments is discussed at some length. Illustrations of some principles of design are included, drawn principally from transport and related surveys. Chapters 9 and 10 deal with issues relating to question design and question wording and special issues relating to qualitative and preference surveys. Chapter 11 deals with the design of data collection procedures themselves, including such issues as item and unit nonresponse, what constitutes a complete response, the use of proxy reporting and its effects, and so forth. The seventh of this group of chapters (Chapter 12) deals with pilot surveys and pretests, a topic that is too often neglected in the design of surveys. A number of issues in designing and undertaking such surveys and tests are discussed. Chapter 13 deals with the topic of sample design and sampling issues. In this chapter, there is extensive treatment of the statistics of sampling, including estimation of sampling errors and determination of sample sizes. The chapter describes most of the available methods of sampling, including simple random samples, stratified samples, multistage samples, cluster samples, systematic samples, choice-based samples, and a number of sampling methods that are often considered but that should be avoided in most instances, such as quota samples, judgemental samples, and haphazard samples.

Chapter 14 addresses the topic of repetitive surveys. Many surveys are intended to be done as a ‘one-off’ activity. For such surveys, the material covered in the preceding chapters is adequate. However, there are many surveys that are intended to be repeated from time to time. This chapter deals with such issues as repeated cross-sectional surveys, panel surveys, overlapping samples, and continuous surveys. In particular, this chapter provides the reader with a means to compare the advantages and disadvantages of the different methods, and it also assists in determining which is appropriate to apply in a given situation.

Chapter 15 builds on the material in the preceding chapters and deals with the issue of survey economics. This is one of the most troublesome areas, because, as many companies have found out, it is all too easy to be bankrupted by a survey that is undertaken without a real understanding and accounting of the costs of a survey. While information on actual costs will date very rapidly, this chapter attempts to provide relative data on costs, which should help the reader estimate the costs of different survey strategies. This chapter also deals with many of the potential trade-offs in the design of surveys.

Chapter 16 delves into some of the issues relating to the actual survey implementation process. This includes issues relating to training survey interviewers and monitoring the performance of interviewers, and the chapter discusses some of the danger signs to look for during implementation. This chapter also deals with issues regarding the ethics of survey implementation, especially the relationships between the survey firm, the client for the survey, and the members of the public who are the respondents to the survey. Chapter 17 introduces a topic that is becoming of increasing interest: Web-based surveys. Although this is a field that is as yet quite young, there are an increasing number of aspects that have been researched and from which the reader can benefit. Chapter 18 deals with the process of coding and data entry. A major issue in this topic is the geographic coding of places that may be requested in a survey.

Chapter 19 addresses the topics of data expansion and weighting. Data expansion is outlined as a function of the sampling method, and statistical procedures for expanding each of the different types of sample are provided in this chapter. Weighting relates to problems of survey bias, resulting either from incomplete coverage of the population in the sampling process or from nonresponse by some members of the subject population. This is an increasingly problematic area for surveys of human populations, resulting from a myriad of issues relating to voluntary participation. Chapter 20 addresses the issue of nonresponse more completely. Here, issues of who is likely to respond and who is not are discussed. Methods to increase response rates are described, and reference is made again to the economics of the survey design. The question of computing response rates is also addressed in this chapter. This is usually the most widely recognised statistic for assessing the quality of a survey, but it is also a statistic that is open to numerous methods of computation, and there is considerable doubt as to just what it really means.

Chapter 21 deals with a range of other measures of data quality, some that are general and some, by way of example, that are specific to surveys in transport. These measures are provided as a way to illustrate how survey-specific measures of quality can be devised, depending on the purposes of the survey. Chapter 22 discusses some issues of the future of human population surveys, especially in the light of emerging technologies and their potential application and misapplication to the survey task.

Chapter 23, the final chapter in the book, covers the issues of documenting and archiving the data. This all too often neglected area of measuring data is discussed at some length. A list of headings for the final report on the survey is provided, along with suggestions as to what should be included under the headings. The issue of archiving data is also addressed at some length. Data are expensive to collect and are rarely archived appropriately. The result is that many expensive surveys are effectively lost soon after the initial analyses are undertaken. In addition, knowledge about the survey is often lost when those who were most centrally involved in the survey move on to other assignments, or leave to work elsewhere.

1.3 Survey statistics

Statistics in general, and survey statistics in particular, constitute a relatively young area of theory and practice. The earliest instance of the use of statistics is probably in the middle of the sixteenth century, and related to the start of data collection in France regarding births, marriages, and deaths, and in England to the collection of data on deaths in London each week (Berntson et al., 2005). It was then not until the middle of the eighteenth century that publications began to appear advancing some of the earliest theories in statistics and probability. However, much of the modern development of statistics did not take place until the late nineteenth and early twentieth centuries (Berntson et al., 2005):

Beginning around 1880, three famous mathematicians, Karl Pearson, Francis Galton and Edgeworth, created a statistical revolution in Europe. Of the three mathematicians, it was Karl Pearson, along with his ambition and determination, that led people to consider him the founder of the twentieth-century science of statistics.

It was only in the early twentieth century that most of the now famous names in statistics made their contributions to the field. These included such statisticians as Karl Pearson, Francis Galton, C. R. Rao, R. A. Fisher, E. S. Pearson, and Jerzy Neyman, among many others, who all made major contributions to what we know today as the science of statistics and probability.

Survey sampling statistics is of even more recent vintage. Among the most notable names in this field of study are those of R. A. Fisher, Frank Yates, Leslie Kish, and W. G. Cochran. Fisher may have given survey sampling its birth, both through his own contributions and through his appointment of Frank Yates as assistant statistician at Rothamsted Experimental Station in 1931. In this post, Yates developed, often in collaboration with Fisher, what may be regarded as the beginnings of survey sampling in the form of experimental designs (O’Connor and Robertson, 1997). His book Sampling Methods for Censuses and Surveys was first published in 1949, and it appears to be the first book on statistical sampling designs.

Leslie Kish, who founded the Survey Research Institute at the University of Michigan, is also regarded as one of the founding fathers of modern survey sampling methods, and he published his seminal work, called Survey Sampling, in 1965. Close in time to Kish, W. G. Cochran published his seminal work, Sampling Techniques, in 1963.

Based on these efforts, the science of survey sampling cannot be considered to be much over fifty years old, which makes it a very new scientific endeavour. As a result of this relative recency, there is still much to be done in developing the topic of survey sampling, while technologies for undertaking surveys have undergone and continue to undergo rapid evolution. The fact that most of the fundamental books on the topic are about forty years old suggests that it is time to undertake an updated treatise on the topic. Hence, this book has been undertaken.

2.1 Some definitions in statistics

Statistics is defined by the Oxford Dictionary of English Etymology as ‘the political science concerned with the facts of a state or community’, and the word is derived from the German statistisch. The beginning of modern statistics was in the sixteenth century, when large amounts of data began to be collected on the populations of countries in Europe, and the task was to make sense of these vast amounts of data. As statistics has evolved from this beginning, it has become a science concerned with handling large quantities of data, but also with using much smaller amounts of data in an effort to represent entire populations, when the task of handling data on the entire population is too large or expensive. The science of statistics is concerned with providing inputs to political decision making, to the testing of hypotheses (understanding what would happen if …), drawing inferences from limited data, and, considering the data limitations, doing all these things under conditions of uncertainty.

A word used commonly in statistics and surveys is population. The population is defined as the entire collection of elements of concern in a given situation. It is also sometimes referred to as a universe. Thus, if the elements of concern are pre-school children in a state, then the population is all the pre-school children in the state at the time of the study. If the elements of concern are elephants in Africa, then the population consists of all the elephants currently in Africa. If the elements of concern are the vehicles using a particular freeway on a specified day, then the population is all the vehicles that use that particular freeway on that specific day.

It is very clear that statistics is the study of data. Therefore, it is necessary to understand what is meant by data. The word data is a plural noun from the Latin datum, meaning given facts. As used in English, the word means given facts from which other facts may be inferred. Data are fundamental to the analysis and modelling of real-world phenomena, such as human populations, the behaviour of firms, weather systems, astronomical processes, sociological processes, genetics, etc. Therefore, one may state that statistics is the process for handling and analysing data, such that useful conclusions can be drawn, decisions made, and new knowledge accumulated.


Another word used in connection with statistics is observation. An observation may be defined as the information that can be seen about a member of a subject population. An observation comprises data about relevant characteristics of the member of the population. This population may be people, households, galaxies, private firms, etc. Another way of thinking of this is that an observation represents an appropriate grouping of data, in which each observation consists of a set of data items describing one member of the population.

A parameter is a quantity that describes some property of the population. Parameters may be given as numbers, proportions, or percentages. For example, the number of male pre-school children in the state might be 16,897, and this number is a parameter. The proportion of baby elephants in Africa might be 0.39, indicating that 39 per cent of all elephants in Africa at this time are babies. This is also a parameter. Sometimes, one can define a particular parameter as being critical to a decision. This would then be called a decision parameter. For example, suppose that a decision is to be made as to whether or not to close a primary school. The decision parameter might be the number of schoolchildren that would be expected to attend that school in, say, the next five years.

A sample is some subset of a population. It may be a large proportion of the population, or a very small proportion of the population. For example, a survey of Sydney households, which comprise a population of about 1,300,000, might consist of 130,000 households (a 10 per cent sample) or 300 households (a 0.023 per cent sample).

A statistic is a numerical quantity that describes a sample. It is therefore the equivalent of a parameter, but for a sample rather than the population. For example, a survey of 130,000 households in Sydney might have shown that 52 per cent of households own their own home or are buying it. This would be a statistic. If, on the other hand, a figure of 54 per cent was determined from a census of the 1,300,000 households, then this figure would be a parameter.

Statistical inference is the process of making statements about a population based on limited evidence from a sample study. Thus, if a sample of 130,000 households in Sydney was drawn, and it was determined that 52 per cent of these owned or were purchasing their homes, then statistical inference would lead one to propose that this might mean that 676,000 (52 per cent of 1,300,000) households in Sydney own or are purchasing their homes.
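The arithmetic behind the Sydney example can be sketched in a few lines of Python. This is only an illustration using the figures quoted in the text; the function names are hypothetical, and the simple multiplication shown here ignores sampling error and the weighting issues taken up later in the book.

```python
# Figures quoted in the text for the Sydney household example.
POPULATION_HOUSEHOLDS = 1_300_000
SAMPLE_HOUSEHOLDS = 130_000
SAMPLE_PROPORTION_OWNING = 0.52  # the statistic observed in the sample


def sampling_fraction(sample_size: int, population_size: int) -> float:
    """Return the share of the population that is included in the sample."""
    return sample_size / population_size


def expand_to_population(proportion: float, population_size: int) -> int:
    """Apply a sample proportion to the whole population (simple expansion)."""
    return round(proportion * population_size)


fraction = sampling_fraction(SAMPLE_HOUSEHOLDS, POPULATION_HOUSEHOLDS)
estimate = expand_to_population(SAMPLE_PROPORTION_OWNING, POPULATION_HOUSEHOLDS)

print(f"Sampling fraction: {fraction:.1%}")
print(f"Estimated households owning or purchasing: {estimate:,}")
```

The same hypothetical `sampling_fraction` function applied to a sample of 300 households gives roughly 0.023 per cent, matching the smaller sample mentioned earlier.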

2.1.1 Censuses and surveys

Of particular relevance to this book is the fact that there are two methods for collecting data about a population of interest. The first of these is a census, which involves making observations of every member of the population. Censuses of the human population have been undertaken in most countries of the world for many years. There are references in the Bible to censuses taken among the early Hebrews, and later by the Romans at the time of the birth of Christ. In Europe, most censuses began in the eighteenth century, although a few began earlier than that. In the United States of America, censuses began in 1790. Many countries undertake a census once in each decade, either in the year ending in zero or in one. Some countries, such as Australia, undertake a census twice in each decade. A census may be as simple as a head count (enumerating the total size of the population) or it may be more complex, by collecting data on a number of characteristics of each member of the population, such as name, address, age, country of birth, etc.

A survey is similar to a census, except that it is conducted on a subset of the population, not the entire population. A survey may involve a large percentage of the population or may be restricted to a very small sample of the population. Much of the science of survey statistics has to do with how one makes a small sample represent the entire population. This is discussed in much more detail in the next chapter. A survey, by definition, always involves a sample of the population. Therefore, to speak of a 100 per cent sample is contradictory; if it is a sample, it must be less than 100 per cent of the population.

2.2 Describing data

One of the first challenges for statistics is to describe data. Obviously, one can provide a complete set of data to a decision maker. However, the human mind is not capable of utilising such information efficiently and effectively. For example, a census of the United States would produce observations on over 300 million people, while one of India would produce observations of over 1 billion people. A listing of those observations represents something that most human beings would be incapable of utilising. What is required, then, is to find some ways to simplify and describe data, so that useful information is preserved but the sheer magnitude of the underlying data is hidden, thereby not distracting the human analyst or decision maker.

Before examining ways in which data might be presented or described, such that the mind can grasp the essential information contained therein, it is important to understand the nature of different types of data that can be collected. To do this, it seems useful to consider the measurement of a human population, especially since that is the main topic of the balance of this book.

In mathematical statistics, we refer to things called variables. A variable is a characteristic of the population that may take on differing or varying values for different members of the population. Thus, variables that could be used to describe members of a human population may include such characteristics as name, address, age or date of birth, place of birth, height, weight, eye colour, hair colour, and shoe size. Each of these characteristics provides differing levels of information that can be used in various ways. We can divide these characteristics into four different types of scales, a scale representing a way of measuring the characteristic.


be ordered alphabetically or can be ordered in any of a number of arbitrary ways, such as the order in which data are collected on individuals. However, no information is provided by changing the order of the names. Therefore, the only thing that the name provides is a label for each member of the population. This is called a nominal scale. A nominal scale is the least informative of the different types of scales that can be used to measure characteristics, but its lack of other information does not render it of less value. Other examples of nominal data are the colours of hair or eyes of the members of the population, bus route numbers, the numbers assigned to census collection districts, names of firms listed on a country’s stock exchange, and the names of magazines stocked by a newsagency.

Ordinal scales

Each person in the population has an address. The address will usually include a house number and a street name, along with the name of the town or suburb in which the house is located. The address clearly also represents a label, just as does the person’s name. However, in the case of the address, there is more information provided. If the addresses are sorted by number and by street, in most places in the world this will provide additional information. These sorted addresses will actually help an investigator to locate each home, in that it is expected that the houses are arranged in numerical order along the street, and probably with odd numbers on one side of the street and even numbers on the other side. As a result, there is order information provided in the address. It is, therefore, known as an ordinal scale. However, if it is known that one person lives at 27 Main Street, and another person lives at 35 Main Street, this does not indicate how far apart these two people live. In some countries, they could be next door to each other, while in others there might be three houses between them or even seven houses between them (if numbering goes down one side of the street and back on the other). The only thing that would be known is that, starting at the first house on Main Street, one would arrive at 27 before one would arrive at 35. Therefore, order is the only additional information provided by this scale. Other examples of ordinal scales would be the list of months in the year, censor ratings of movies, and a list of runners in the order in which they finished a race.

a size represents a constant increase in the length of the shoe. Thus, the difference between a size nine and a size ten shoe for a man is the same as the difference between a size eight and a size nine, and so on for any two adjacent numbers. In other words, there is a constant interval between each shoe size. On the other hand, there is no natural zero in this scale (in fact, a size of zero generally does not exist), and it is not true that a size five is half the length of a size ten. Therefore, shoe size may be considered to be an interval scale. Women’s dress sizes in a number of countries also represent an interval scale, in which each increment in dress size represents a constant interval of increase in size of the dress, but a size sixteen dress is not twice as large as a size eight. In many cases, the sizing of an item of clothing as small, medium, large, etc. also represents an interval scale. Another example of an interval scale is the normal scale of temperature in either degrees Celsius or degrees Fahrenheit. An interval of one degree represents the same increase or decrease in temperature, whether it is between 40 and 41 or 90 and 91. However, we are not able to state that 60 degrees is twice as hot as 30 degrees. There is also not a natural zero on either the Celsius scale or the Fahrenheit scale. Indeed, the Celsius scale sets the temperature at which water freezes as 0, but the Fahrenheit scale sets this at 32, and there is not a particular physical property of the zero on the Fahrenheit scale.

in these measures. There is ratio information. In other words, we know that a person who is 180 centimetres tall is twice as tall as a person who is 90 centimetres tall, and that a person weighing 45 kilograms is only half the weight of a person weighing 90 kilograms. There are two important new pieces of information provided by these measures. First, there is a natural zero in the measurement scale. Both weight and height have a zero point, which represents the absence of weight or the absence of height. Second, there is a multiplicative relationship among the measures on the scale, not just an additive one. Therefore, both weight and height are described as ratio scales. Other examples of ratio scales are distance or length measures, measures of speed, measures of elapsed time, and so forth. However, it should be noted that measurement of clock time is interval-scaled (there is no natural zero, and 5 a.m. is not half of 10 a.m.), while elapsed time is ratio-scaled, because zero minutes represents the absence of any elapsed time, and twenty minutes is twice as long as ten minutes, for example.


contains the information of the previous type of scale, and then adds new information content. Thus an ordinal scale also has nominal information, but adds to that information on order; an interval scale has both nominal and ordinal information, but adds to that a consistent interval of measurement; and a ratio scale contains all nominal, ordinal, and interval information, but adds ratio relationships to them.
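One way to see the difference between interval and ratio scales is to change the units of measurement and check whether a ratio survives. The following Python sketch is offered only as an illustration: the dictionary of scale types and the function names are assumptions made here, while the temperatures and heights are the ones used in the text.

```python
# Cumulative information carried by each scale type, as summarised in the text.
SCALE_INFORMATION = {
    "nominal": {"label"},
    "ordinal": {"label", "order"},
    "interval": {"label", "order", "constant interval"},
    "ratio": {"label", "order", "constant interval", "ratio"},
}


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature between two interval scales with different zeros."""
    return celsius * 9 / 5 + 32


def centimetres_to_inches(centimetres: float) -> float:
    """Convert a height between two ratio scales that share a natural zero."""
    return centimetres / 2.54


# Interval scale: the ratio of 60 to 30 degrees depends on the unit chosen.
temperature_ratio_celsius = 60.0 / 30.0
temperature_ratio_fahrenheit = (
    celsius_to_fahrenheit(60.0) / celsius_to_fahrenheit(30.0)
)

# Ratio scale: 180 cm is twice 90 cm whatever the unit.
height_ratio_centimetres = 180.0 / 90.0
height_ratio_inches = centimetres_to_inches(180.0) / centimetres_to_inches(90.0)

print(f"Temperature ratio: {temperature_ratio_celsius:.2f} (Celsius), "
      f"{temperature_ratio_fahrenheit:.2f} (Fahrenheit)")
print(f"Height ratio: {height_ratio_centimetres:.2f} (cm), "
      f"{height_ratio_inches:.2f} (inches)")
```

The temperature ratio changes when the unit changes, so "twice as hot" has no meaning on these scales, whereas the height ratio is the same in centimetres and inches.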

There are two other ways in which scales can be described, because most scales can be measured in different ways. The first of these relates to whether the scale is continuous or discrete. A continuous scale is one in which the measurement can be made to any degree of precision desired. For example, we can measure elapsed time to the nearest hour, or minute, or second, or nanosecond, etc. Indeed, the only thing that limits the precision by which we can measure this scale is the precision of our instruments for measurement. However, there is no natural limit to precision in such cases. This is a continuous scale. A discrete scale, on the other hand, cannot be subdivided beyond a certain point. For example, shoe sizes are a discrete scale. Many shoe manufacturers will provide shoes in half-size increments, while others will provide them only in whole-size increments. Subdivision below half sizes simply is not done. Similarly, any measurement that involves counting objects, such as counting the number of members of a population, is a discrete scale. We cannot have fractional people, fractional houses, or fractional cars, for example.

The second descriptor of a scale is whether it is inherently exact or approximate. By their nature, all continuous scales are approximate. This is so because we can always increase the precision of measurement. Generally, numbers obtained from counting are exact, unless the counting mechanism is capable of error. However, other discrete scales may be approximate or exact. In most clothing or shoe sizes, the measure would be considered approximate, because sizes often differ between manufacturers, and between countries. A size nine shoe is not the same size in the United States and in the United Kingdom, for example, nor is it necessarily the same size from two different shoe makers in the same country.

It is important to recognise what type of a scale we are dealing with, when information is measured on scales, because the type of scale will also often either dictate how the information can be presented or restrict the analyst to certain ways of presentation. Similarly, whether the measure is discrete or continuous will also affect the presentation of the data, as will, in some cases, whether the data are approximate or exact.

2.2.2 Data presentation: graphics

It is appropriate to start with some simple rules about graphical presentations. There are four principal types of graphical presentation: scatter plots, pie charts, histograms or bar charts, and line graphs.

A scatter plot is a plot of the frequency with which specific values of a pair of variables occur in the data. Thus, the X-axis of the plot will contain the values of one of the variables that are found in the data, and the Y-axis will contain the values of the other variable. As such, any type of measure can be presented on a scatter plot. However, if all values occur only once – i.e., are unique to an observation – then a scatter plot is of no particular interest. Therefore, although any data can theoretically be plotted on a scatter plot, data that represent unique values, or data that are continuous, and also will probably have frequencies of only one or two at most for any pair of values, will not be illuminated by a scatter plot.

An example of a scatter plot is provided in Figure 2.1, which shows a scatter plot of odometer readings of cars versus the model year of the vehicle. The Y-axis is a ratio-scaled variable, and the X-axis is an interval-scaled variable. The scatter plot indicates that there probably is a relationship between odometer readings and model year, such that the higher the model year value, the lower the odometer reading, as would be expected. This is a useful scatter plot.

Figure 2.2 illustrates a scatter plot of two nominal-scaled variables: fuel type versus body type. It is not a very useful illustration of the data. First, we cannot tell how many points fall at each combination of values. Second, all it really tells us is that there are no taxis (body type 5) in this data set, that all vehicle types use petrol (fuel type 1), that all except motorcycles (body type 6) use diesel (fuel type 2), and that only cars (body type 1), four-wheel drive (4WD) vehicles (body type 2), and utility/van/panel vans (body type 3) use dual fuel (fuel type 4). This illustrates that nominal data – both fuel type and body type are nominal scales – may not produce a useful scatter plot.
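The first weakness noted above, that a scatter plot of nominal codes hides how many points fall at each combination of values, can be addressed by counting the pairs directly. The following is a minimal sketch in Python. The records are invented purely for illustration and are not the survey data behind Figure 2.2; only the code meanings stated in the text are assumed.

```python
from collections import Counter

# Hypothetical (body_type, fuel_type) code pairs, invented for illustration.
# Codes follow the text: body type 1 = car, 2 = 4WD, 3 = utility/van,
# 6 = motorcycle; fuel type 1 = petrol, 2 = diesel, 4 = dual fuel.
records = [(1, 1), (1, 1), (1, 2), (2, 2), (3, 4), (6, 1), (1, 1), (2, 1)]

# Count how often each combination occurs; this is the frequency
# information that overlapping points on a scatter plot conceal.
pair_counts = Counter(records)

for (body_type, fuel_type), count in sorted(pair_counts.items()):
    print(f"body type {body_type}, fuel type {fuel_type}: {count}")
```

A table of such counts conveys the frequencies that the scatter plot cannot show when both variables are nominal.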

A pie chart is a circle that is divided into segments representing specific values in the data, with the length of the segment along the circumference of the circle indicating how frequently the value occurs in the data. Again, pie charts can be used with any type of data, when the information to be presented is the frequency of occurrence. However, they will generally not work with continuous data, unless the data are first grouped and converted to discrete categories. An example of a pie chart is provided in Figure 2.3. This shows that the pie chart works well for nominal data, in this case the vehicle body type from a survey of households.

Figure 2.4 shows a pie chart for category data – i.e., discrete data. The data are reported household incomes from a survey of households. The categories were those used in the survey. Income, being measured in dollars and with a natural zero, is actually a ratio scale. In the categories collected, income is a ratio-scaled discrete measure. Again, the pie chart provides a good representation of the data.
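Since each segment of a pie chart is sized in proportion to the frequency of its value, the angle of each segment follows directly from the counts. The sketch below shows this in Python. The function name and the counts are hypothetical and are not taken from the survey behind Figures 2.3 and 2.4.

```python
def segment_angles(frequencies: dict[str, int]) -> dict[str, float]:
    """Return the angle in degrees of each pie segment, given category counts."""
    total = sum(frequencies.values())
    if total == 0:
        raise ValueError("at least one observation is needed to draw a pie chart")
    return {
        category: 360.0 * count / total
        for category, count in frequencies.items()
    }


# Hypothetical counts of vehicles by body type, invented for illustration.
body_type_counts = {"Car": 60, "4WD": 20, "Utility vehicle": 15, "Motorcycle": 5}
angles = segment_angles(body_type_counts)

for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

Continuous data would first have to be grouped into categories, as the text notes, before counts of this kind could be formed.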

A histogram or bar chart is used for presenting discrete data. Such data will be interval- or ratio-scaled data. Histograms can be constructed in several different ways. When presenting complex information, bars can be stacked, showing how different

Figure 2.3 Pie chart of vehicle body types (legend: 4WD, car, motorcycle, other, taxi, truck, utility vehicle)

Figure 2.4 Pie chart of household income groups
