PROBABILITY AND STATISTICS FOR ENGINEERS AND SCIENTISTS
Fourth Edition
of the accompanying code ("the product") cannot and do not warrant the performance or results that may be obtained by using the product. The product is sold "as is" without warranty of merchantability or fitness for any particular purpose. AP warrants only that the magnetic diskette(s) on which the code is recorded is free from defects in material and faulty workmanship under the normal use and service for a period of ninety (90) days from the date the product is delivered. The purchaser's sole and exclusive remedy in the event of a defect is expressly limited to either replacement of the diskette(s) or refund of the purchase price, at AP's sole discretion.

In no event, whether as a result of breach of contract, warranty, or tort (including negligence), will AP or anyone who has been involved in the creation or production of the product be liable to purchaser for any damages, including any lost profits, lost savings, or other incidental or consequential damages arising out of the use or inability to use the product or any modifications thereof, or due to the contents of the code, even if AP has been advised of the possibility of such damages, or for any claim by any other party.

Any request for replacement of a defective diskette must be postage prepaid and must be accompanied by the original defective diskette, your mailing address and telephone number, and proof of date of purchase and purchase price. Send such requests, stating the nature of the problem, to Academic Press Customer Service, 6277 Sea Harbor Drive, Orlando, FL 32887, 1-800-321-5068. AP shall have no obligation to refund the purchase price or to replace a diskette based on claims of defects in the nature or operation of the product.

Some states do not allow limitation on how long an implied warranty lasts, nor exclusions or limitations of incidental or consequential damage, so the above limitations and exclusions may not apply to you. This warranty gives you specific legal rights, and you may also have other rights, which vary from jurisdiction to jurisdiction.

The re-export of United States original software is subject to the United States laws under the Export Administration Act of 1969 as amended. Any further sale of the product shall be in compliance with the United States Department of Commerce Administration regulations. Compliance with such regulations is your responsibility and not the responsibility of AP.
PROBABILITY AND STATISTICS FOR ENGINEERS AND SCIENTISTS
■ Fourth Edition ■
Sheldon M. Ross, Department of Industrial Engineering and Operations Research
University of California, Berkeley
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Academic Press is an imprint of Elsevier
84 Theobald's Road, London WC1X 8RR, UK
This book is printed on acid-free paper ∞
Copyright © 2009, Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: permissions@elsevier.co.uk. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting "Customer Support" and then "Obtaining Permissions."
Library of Congress Cataloging-in-Publication Data
Application submitted
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 13: 978-0-12-370483-2
For all information on all Elsevier Academic Press publications
visit our Web site at www.elsevierdirect.com
Typeset by diacriTech, India.
Printed in Canada
09 10 9 8 7 6 5 4 3 2 1
Preface xiii
Chapter 1 Introduction to Statistics 1
1.1 Introduction 1
1.2 Data Collection and Descriptive Statistics 1
1.3 Inferential Statistics and Probability Models 2
1.4 Populations and Samples 3
1.5 A Brief History of Statistics 3
Problems 7
Chapter 2 Descriptive Statistics 9
2.1 Introduction 9
2.2 Describing Data Sets 9
2.2.1 Frequency Tables and Graphs 10
2.2.2 Relative Frequency Tables and Graphs 10
2.2.3 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots 14
2.3 Summarizing Data Sets 17
2.3.1 Sample Mean, Sample Median, and Sample Mode 17
2.3.2 Sample Variance and Sample Standard Deviation 22
2.3.3 Sample Percentiles and Box Plots 24
2.4 Chebyshev’s Inequality 27
2.5 Normal Data Sets 31
2.6 Paired Data Sets and the Sample Correlation Coefficient 33
Problems 41
Chapter 3 Elements of Probability 55
3.1 Introduction 55
3.2 Sample Space and Events 56
3.3 Venn Diagrams and the Algebra of Events 58
3.4 Axioms of Probability 59
3.5 Sample Spaces Having Equally Likely Outcomes 61
3.6 Conditional Probability 67
3.7 Bayes’ Formula 70
3.8 Independent Events 76
Problems 80
Chapter 4 Random Variables and Expectation 89
4.1 Random Variables 89
4.2 Types of Random Variables 92
4.3 Jointly Distributed Random Variables 95
4.3.1 Independent Random Variables 101
*4.3.2 Conditional Distributions 105
4.4 Expectation 107
4.5 Properties of the Expected Value 111
4.5.1 Expected Value of Sums of Random Variables 115
4.6 Variance 118
4.7 Covariance and Variance of Sums of Random Variables 121
4.8 Moment Generating Functions 125
4.9 Chebyshev’s Inequality and the Weak Law of Large Numbers 127
Problems 130
Chapter 5 Special Random Variables 141
5.1 The Bernoulli and Binomial Random Variables 141
5.1.1 Computing the Binomial Distribution Function 147
5.2 The Poisson Random Variable 148
5.2.1 Computing the Poisson Distribution Function 155
5.3 The Hypergeometric Random Variable 156
5.4 The Uniform Random Variable 160
5.5 Normal Random Variables 168
5.6 Exponential Random Variables 176
*5.6.1 The Poisson Process 180
*5.7 The Gamma Distribution 183
5.8 Distributions Arising from the Normal 186
5.8.1 The Chi-Square Distribution 186
5.8.2 The t-Distribution 190
5.8.3 The F-Distribution 192
*5.9 The Logistics Distribution 193
Problems 195
Chapter 6 Distributions of Sampling Statistics 203
6.1 Introduction 203
6.2 The Sample Mean 204
6.3 The Central Limit Theorem 206
6.3.1 Approximate Distribution of the Sample Mean 212
6.3.2 How Large a Sample Is Needed? 214
6.4 The Sample Variance 215
6.5 Sampling Distributions from a Normal Population 216
6.5.1 Distribution of the Sample Mean .217
6.5.2 Joint Distribution of X and S2 217
6.6 Sampling from a Finite Population 219
Problems 223
Chapter 7 Parameter Estimation 231
7.1 Introduction 231
7.2 Maximum Likelihood Estimators 232
*7.2.1 Estimating Life Distributions 240
7.3 Interval Estimates 242
7.3.1 Confidence Interval for a Normal Mean When the Variance Is Unknown 248
7.3.2 Confidence Intervals for the Variance of a Normal Distribution 253
7.4 Estimating the Difference in Means of Two Normal Populations 255
7.5 Approximate Confidence Interval for the Mean of a Bernoulli Random Variable 262
*7.6 Confidence Interval of the Mean of the Exponential Distribution 267
*7.7 Evaluating a Point Estimator 268
*7.8 The Bayes Estimator 274
Problems 279
Chapter 8 Hypothesis Testing 293
8.1 Introduction 294
8.2 Significance Levels 294
8.3 Tests Concerning the Mean of a Normal Population 295
8.3.1 Case of Known Variance 295
8.3.2 Case of Unknown Variance: The t-Test 307
8.4 Testing the Equality of Means of Two Normal Populations 314
8.4.1 Case of Known Variances 314
8.4.2 Case of Unknown Variances 316
8.4.3 Case of Unknown and Unequal Variances 320
8.4.4 The Paired t-Test 321
8.5 Hypothesis Tests Concerning the Variance of a Normal Population 323
8.5.1 Testing for the Equality of Variances of Two Normal Populations 324
8.6 Hypothesis Tests in Bernoulli Populations 325
8.6.1 Testing the Equality of Parameters in Two Bernoulli Populations 329
8.7 Tests Concerning the Mean of a Poisson Distribution 332
8.7.1 Testing the Relationship Between Two Poisson Parameters 333
Problems 336
Chapter 9 Regression 353
9.1 Introduction 353
9.2 Least Squares Estimators of the Regression Parameters 355
9.3 Distribution of the Estimators 357
9.4 Statistical Inferences About the Regression Parameters 363
9.4.1 Inferences Concerning β 364
9.4.2 Inferences Concerning α 372
9.4.3 Inferences Concerning the Mean Response α + βx0 373
9.4.4 Prediction Interval of a Future Response 375
9.4.5 Summary of Distributional Results 377
9.5 The Coefficient of Determination and the Sample Correlation Coefficient 378
9.6 Analysis of Residuals: Assessing the Model 380
9.7 Transforming to Linearity 383
9.8 Weighted Least Squares 386
9.9 Polynomial Regression 393
*9.10 Multiple Linear Regression 396
9.10.1 Predicting Future Responses 407
9.11 Logistic Regression Models for Binary Output Data 412
Problems 415
Chapter 10 Analysis of Variance 441
10.1 Introduction 441
10.2 An Overview 442
10.3 One-Way Analysis of Variance 444
10.3.1 Multiple Comparisons of Sample Means 452
10.3.2 One-Way Analysis of Variance with Unequal Sample Sizes 454
10.4 Two-Factor Analysis of Variance: Introduction and Parameter Estimation 456
10.5 Two-Factor Analysis of Variance: Testing Hypotheses 460
10.6 Two-Way Analysis of Variance with Interaction 465
Problems 473
Chapter 11 Goodness of Fit Tests and Categorical Data Analysis 485
11.1 Introduction 485
11.2 Goodness of Fit Tests When All Parameters Are Specified 486
11.2.1 Determining the Critical Region by Simulation 492
11.3 Goodness of Fit Tests When Some Parameters Are Unspecified 495
11.4 Tests of Independence in Contingency Tables 497
11.5 Tests of Independence in Contingency Tables Having Fixed Marginal Totals 501
*11.6 The Kolmogorov–Smirnov Goodness of Fit Test for Continuous Data 506
Problems 510
Chapter 12 Nonparametric Hypothesis Tests 517
12.1 Introduction 517
12.2 The Sign Test 517
12.3 The Signed Rank Test 521
12.4 The Two-Sample Problem 527
*12.4.1 The Classical Approximation and Simulation 531
12.5 The Runs Test for Randomness 535
Problems 539
Chapter 13 Quality Control 547
13.1 Introduction 547
13.2 Control Charts for Average Values: The X-Control Chart 548
13.2.1 Case of Unknown μ and σ 551
13.3 S-Control Charts 556
13.4 Control Charts for the Fraction Defective 559
13.5 Control Charts for Number of Defects 561
13.6 Other Control Charts for Detecting Changes in the Population Mean 565
13.6.1 Moving-Average Control Charts 565
13.6.2 Exponentially Weighted Moving-Average Control Charts 567
13.6.3 Cumulative Sum Control Charts 573
Problems 575
Chapter 14* Life Testing 583
14.1 Introduction 583
14.2 Hazard Rate Functions 583
14.3 The Exponential Distribution in Life Testing 586
14.3.1 Simultaneous Testing — Stopping at the rth Failure 586
14.3.2 Sequential Testing 592
14.3.3 Simultaneous Testing — Stopping by a Fixed Time 596
14.3.4 The Bayesian Approach 598
14.4 A Two-Sample Problem 600
14.5 The Weibull Distribution in Life Testing 602
14.5.1 Parameter Estimation by Least Squares 604
Problems 606
Chapter 15 Simulation, Bootstrap Statistical Methods, and
Permutation Tests 613
15.1 Introduction 613
15.2 Random Numbers 614
15.2.1 The Monte Carlo Simulation Approach 616
15.3 The Bootstrap Method 617
15.4 Permutation Tests 624
15.4.1 Normal Approximations in Permutation Tests 627
15.4.2 Two-Sample Permutation Tests 631
15.5 Generating Discrete Random Variables 632
15.6 Generating Continuous Random Variables 634
15.6.1 Generating a Normal Random Variable 636
15.7 Determining the Number of Simulation Runs in a Monte Carlo Study 637
Problems 638
Appendix of Tables 641
Index 647
∗ Denotes optional material.
PREFACE

The fourth edition of this book continues to demonstrate how to apply probability theory to gain insight into real, everyday statistical problems and situations. As in the previous editions, carefully developed coverage of probability motivates probabilistic models of real phenomena and the statistical procedures that follow. This approach ultimately results in an intuitive understanding of statistical procedures and strategies most often used by practicing engineers and scientists.
This book has been written for an introductory course in statistics or in probability and statistics for students in engineering, computer science, mathematics, statistics, and the natural sciences. As such it assumes knowledge of elementary calculus.
ORGANIZATION AND COVERAGE
Chapter 1 presents a brief introduction to statistics, describing its two branches of descriptive and inferential statistics, and a short history of the subject and some of the people whose early work provided a foundation for work done today.
The subject matter of descriptive statistics is then considered in Chapter 2. Graphs and tables that describe a data set are presented in this chapter, as are quantities that are used to summarize certain of the key properties of the data set.
To be able to draw conclusions from data, it is necessary to have an understanding of the data's origination. For instance, it is often assumed that the data constitute a "random sample" from some population. To understand exactly what this means and what its consequences are for relating properties of the sample data to properties of the entire population, it is necessary to have some understanding of probability, and that is the subject of Chapter 3. This chapter introduces the idea of a probability experiment, explains the concept of the probability of an event, and presents the axioms of probability.
Our study of probability is continued in Chapter 4, which deals with the important concepts of random variables and expectation, and in Chapter 5, which considers some special types of random variables that often occur in applications. Such random variables as the binomial, Poisson, hypergeometric, normal, uniform, gamma, chi-square, t, and F are presented.
In Chapter 6, we study the probability distribution of such sampling statistics as the sample mean and the sample variance. We show how to use a remarkable theoretical result of probability, known as the central limit theorem, to approximate the probability distribution of the sample mean. In addition, we present the joint probability distribution of the sample mean and the sample variance in the important special case in which the underlying data come from a normally distributed population.
Chapter 7 shows how to use data to estimate parameters of interest. For instance, a scientist might be interested in determining the proportion of Midwestern lakes that are afflicted by acid rain. Two types of estimators are studied. The first of these estimates the quantity of interest with a single number (for instance, it might estimate that 47 percent of Midwestern lakes suffer from acid rain), whereas the second provides an estimate in the form of an interval of values (for instance, it might estimate that between 45 and 49 percent of lakes suffer from acid rain). These latter estimators also tell us the "level of confidence" we can have in their validity. Thus, for instance, whereas we can be pretty certain that the exact percentage of afflicted lakes is not 47, it might very well be that we can be, say, 95 percent confident that the actual percentage is between 45 and 49.
Chapter 8 introduces the important topic of statistical hypothesis testing, which is concerned with using data to test the plausibility of a specified hypothesis. For instance, such a test might reject the hypothesis that fewer than 44 percent of Midwestern lakes are afflicted by acid rain. The concept of the p-value, which measures the degree of plausibility of the hypothesis after the data have been observed, is introduced. A variety of hypothesis tests concerning the parameters of both one and two normal populations are considered. Hypothesis tests concerning Bernoulli and Poisson parameters are also presented.
Chapter 9 deals with the important topic of regression. Both simple linear regression — including such subtopics as regression to the mean, residual analysis, and weighted least squares — and multiple linear regression are considered.
Chapter 10 introduces the analysis of variance. Both one-way and two-way (with and without the possibility of interaction) problems are considered.
Chapter 11 is concerned with goodness of fit tests, which can be used to test whether a proposed model is consistent with data. In it we present the classical chi-square goodness of fit test and apply it to test for independence in contingency tables. The final section of this chapter introduces the Kolmogorov–Smirnov procedure for testing whether data come from a specified continuous probability distribution.
Chapter 12 deals with nonparametric hypothesis tests, which can be used when one is unable to suppose that the underlying distribution has some specified parametric form (such as normal).
Chapter 13 considers the subject matter of quality control, a key statistical technique in manufacturing and production processes. A variety of control charts, including not only the Shewhart control charts but also more sophisticated ones based on moving averages and cumulative sums, are considered.
Chapter 14 deals with problems related to life testing. In this chapter, the exponential, rather than the normal, distribution plays the key role.
In Chapter 15 (new to the fourth edition), we consider the statistical inference techniques of bootstrap statistical methods and permutation tests. We first show how probabilities can be obtained by simulation and then how to utilize simulation in these statistical inference approaches.

About the CD
Packaged along with the text is a PC disk that can be used to solve most of the statistical problems in the text. For instance, the disk computes the p-values for most of the hypothesis tests, including those related to the analysis of variance and to regression. It can also be used to obtain probabilities for most of the common distributions. (For those students without access to a personal computer, tables that can be used to solve all of the problems in the text are provided.)
One program on the disk illustrates the central limit theorem. It considers random variables that take on one of the values 0, 1, 2, 3, 4, and allows the user to enter the probabilities for these values along with an integer n. The program then plots the probability mass function of the sum of n independent random variables having this distribution. By increasing n, one can "see" the mass function converge to the shape of a normal density function.
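The computation behind such a demonstration can be sketched in a few lines. This is not the disk's own code, just a hypothetical re-creation of the same idea; the probabilities and the value of n are made-up inputs:

```python
def convolve(p, q):
    """pmf of the sum of two independent variables with pmfs p and q on 0,1,2,..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def pmf_of_sum(p, n):
    """pmf of the sum of n i.i.d. variables, each with pmf p on 0..len(p)-1."""
    result = [1.0]              # pmf of an "empty" sum, which is always 0
    for _ in range(n):
        result = convolve(result, p)
    return result

p = [0.1, 0.2, 0.4, 0.2, 0.1]   # user-entered probabilities for the values 0..4
dist = pmf_of_sum(p, 10)        # n = 10; plotting dist shows the bell shape emerge
```

Increasing n and plotting `dist` reproduces the convergence toward a normal density that the program lets the user "see."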
ACKNOWLEDGMENTS
We thank the following people for their helpful comments on the Fourth Edition:
• Charles F. Dunkl, University of Virginia, Charlottesville
• Gabor Szekely, Bowling Green State University
• Krzysztof M. Ostaszewski, Illinois State University
• Michael Ratliff, Northern Arizona University
• Wei-Min Huang, Lehigh University
• Youngho Lee, Howard University
• Jacques Rioux, Drake University
• Lisa Gardner, Bradley University
• Murray Lieb, New Jersey Institute of Technology
• Philip Trotter, Cornell University
INTRODUCTION TO STATISTICS

1.1 INTRODUCTION

It has become accepted in today's world that in order to learn about something, you must first collect data. Statistics is the art of learning from data. It is concerned with the collection of data, its subsequent description, and its analysis, which often leads to the drawing of conclusions.
1.2 DATA COLLECTION AND DESCRIPTIVE STATISTICS
Sometimes a statistical analysis begins with a given set of data. For instance, the government regularly collects and publicizes data concerning yearly precipitation totals, earthquake occurrences, the unemployment rate, the gross domestic product, and the rate of inflation. Statistics can be used to describe, summarize, and analyze these data.
In other situations, data are not yet available; in such cases statistical theory can be used to design an appropriate experiment to generate data. The experiment chosen should depend on the use that one wants to make of the data. For instance, suppose that an instructor is interested in determining which of two different methods for teaching computer programming to beginners is most effective. To study this question, the instructor might divide the students into two groups, and use a different teaching method for each group. At the end of the class the students can be tested and the scores of the members of the different groups compared. If the data, consisting of the test scores of members of each group, are significantly higher in one of the groups, then it might seem reasonable to suppose that the teaching method used for that group is superior.

It is important to note, however, that in order to be able to draw a valid conclusion from the data, it is essential that the students were divided into groups in such a manner that neither group was more likely to have the students with greater natural aptitude for programming. For instance, the instructor should not have let the male class members be one group and the females the other. For if so, then even if the women scored significantly higher than the men, it would not be clear whether this was due to the method used to teach them, or to the fact that women may be inherently better than men at learning programming skills. The accepted way of avoiding this pitfall is to divide the class members into the two groups "at random." This term means that the division is done in such a manner that all possible choices of the members of a group are equally likely.
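As an illustration, such a random division can be carried out with a random shuffle. This is only a minimal sketch, and the student labels are hypothetical:

```python
import random

students = ["student%02d" % i for i in range(20)]   # hypothetical class roster

assignment = list(students)      # copy, then shuffle the copy in place
random.shuffle(assignment)       # every ordering is equally likely, so every
group_a = assignment[:10]        # possible 10-person group is equally likely
group_b = assignment[10:]        # to end up as group A
```

Because the shuffle makes all orderings equally likely, every possible choice of the members of a group is equally likely, which is exactly the "at random" requirement described above.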
At the end of the experiment, the data should be described. For instance, the scores of the two groups should be presented. In addition, summary measures such as the average score of members of each of the groups should be presented. This part of statistics, concerned with the description and summarization of data, is called descriptive statistics.
1.3 INFERENTIAL STATISTICS AND PROBABILITY MODELS
To be able to draw a conclusion from the data, we must take into account the possibility of chance. For instance, suppose that the average score of members of the first group is quite a bit higher than that of the second. Can we conclude that this increase is due to the teaching method used? Or is it possible that the teaching method was not responsible for the increased scores but rather that the higher scores of the first group were just a chance occurrence? For instance, the fact that a coin comes up heads 7 times in 10 flips does not necessarily mean that the coin is more likely to come up heads than tails in future flips. Indeed, it could be a perfectly ordinary coin that, by chance, just happened to land heads 7 times out of the total of 10 flips. (On the other hand, if the coin had landed heads 47 times out of 50 flips, then we would be quite certain that it was not an ordinary coin.)
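The two coin outcomes can be quantified directly. Assuming an ordinary coin (heads probability 1/2), the chance of at least k heads in n flips follows from the binomial distribution, which is treated in Chapter 5:

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(at least k heads in n independent flips, where P(heads) = p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

print(prob_at_least(7, 10))    # about 0.17: unremarkable for an ordinary coin
print(prob_at_least(47, 50))   # about 2e-11: essentially never happens by chance
```

The contrast between the two probabilities is exactly why 7 heads in 10 flips proves nothing, while 47 heads in 50 flips is convincing evidence against an ordinary coin.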
To be able to draw logical conclusions from data, we usually make some assumptions about the chances (or probabilities) of obtaining the different data values. The totality of these assumptions is referred to as a probability model for the data.
Sometimes the nature of the data suggests the form of the probability model that is assumed. For instance, suppose that an engineer wants to find out what proportion of computer chips, produced by a new method, will be defective. The engineer might select a group of these chips, with the resulting data being the number of defective chips in this group. Provided that the chips selected were "randomly" chosen, it is reasonable to suppose that each one of them is defective with probability p, where p is the unknown proportion of all the chips produced by the new method that will be defective. The resulting data can then be used to make inferences about p.
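A sketch of that inference, with made-up counts: the natural point estimate of the unknown p is the observed fraction of defectives (estimators of this kind are studied in Chapter 7):

```python
n_sampled = 500      # chips randomly selected for inspection (hypothetical count)
n_defective = 12     # defectives found among them (hypothetical count)

p_hat = n_defective / n_sampled                         # point estimate of p
std_error = (p_hat * (1 - p_hat) / n_sampled) ** 0.5    # rough measure of its accuracy
print(p_hat)         # 0.024
```

The standard-error line anticipates the interval estimates of Chapter 7, which quantify how far p_hat is likely to be from the true p.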
In other situations, the appropriate probability model for a given data set will not be readily apparent. However, careful description and presentation of the data sometimes enable us to infer a reasonable model, which we can then try to verify with the use of additional data.
Because the basis of statistical inference is the formulation of a probability model to describe the data, an understanding of statistical inference requires some knowledge of the theory of probability. In other words, statistical inference starts with the assumption that important aspects of the phenomenon under study can be described in terms of probabilities; it then draws conclusions by using data to make inferences about these probabilities.
1.4 POPULATIONS AND SAMPLES

In statistics, we are interested in obtaining information about a total collection of elements, which we will refer to as the population. The population is often too large for us to examine each of its members. For instance, we might have all the residents of a given state, or all the television sets produced in the last year by a particular manufacturer, or all the households in a given community. In such cases, we try to learn about the population by choosing and then examining a subgroup of its elements. This subgroup of a population is called a sample.
If the sample is to be informative about the total population, it must be, in some sense, representative of that population. For instance, suppose that we are interested in learning about the age distribution of people residing in a given city, and we obtain the ages of the first 100 people to enter the town library. If the average age of these 100 people is 46.2 years, are we justified in concluding that this is approximately the average age of the entire population? Probably not, for we could certainly argue that the sample chosen in this case is probably not representative of the total population because usually more young students and senior citizens use the library than do working-age citizens.
In certain situations, such as the library illustration, we are presented with a sample and must then decide whether this sample is reasonably representative of the entire population. In practice, a given sample generally cannot be assumed to be representative of a population unless that sample has been chosen in a random manner. This is because any specific nonrandom rule for selecting a sample often results in one that is inherently biased toward some data values as opposed to others.
Thus, although it may seem paradoxical, we are most likely to obtain a representative sample by choosing its members in a totally random fashion without any prior considerations of the elements that will be chosen. In other words, we need not attempt to deliberately choose the sample so that it contains, for instance, the same gender percentage and the same percentage of people in each profession as found in the general population. Rather, we should just leave it up to "chance" to obtain roughly the correct percentages. Once a random sample is chosen, we can use statistical inference to draw conclusions about the entire population by studying the elements of the sample.
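In code, "leaving it up to chance" is exactly what simple random sampling does. In this minimal sketch the population is a hypothetical list of resident ages:

```python
import random

# Hypothetical population of resident ages (400 people in all)
population = [20, 25, 31, 38, 44, 52, 67, 71] * 50

# random.sample draws without replacement, making every possible
# 100-member subgroup of the population equally likely to be chosen
sample = random.sample(population, 100)
sample_mean_age = sum(sample) / len(sample)   # estimate of the population mean age
```

No attempt is made to balance the sample by age group; the random draw alone is what makes the sample mean a reasonable estimate of the population mean.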
1.5 A BRIEF HISTORY OF STATISTICS
A systematic collection of data on the population and the economy was begun in the Italian city-states of Venice and Florence during the Renaissance. The term statistics, derived from the word state, was used to refer to a collection of facts of interest to the state. The idea of collecting data spread from Italy to the other countries of Western Europe. Indeed, by the first half of the 16th century it was common for European governments to require parishes to register births, marriages, and deaths. Because of poor public health conditions this last statistic was of particular interest.
The high mortality rate in Europe before the 19th century was due mainly to epidemic diseases, wars, and famines. Among epidemics, the worst were the plagues. Starting with the Black Plague in 1348, plagues recurred frequently for nearly 400 years. In 1562, as a way to alert the King's court to consider moving to the countryside, the City of London began to publish weekly bills of mortality. Initially these mortality bills listed the places of death and whether a death had resulted from plague. Beginning in 1625 the bills were expanded to include all causes of death.
In 1662 the English tradesman John Graunt published a book entitled Natural and Political Observations Made upon the Bills of Mortality. Table 1.1, which notes the total number of deaths in England and the number due to the plague for five different plague years, is taken from this book.

TABLE 1.1 Total Deaths in England
Source: John Graunt, Observations Made upon the Bills of Mortality, 3rd ed. London: John Martyn and James Allestry (1st ed. 1662).
Graunt used London bills of mortality to estimate the city's population. For instance, to estimate the population of London in 1660, Graunt surveyed households in certain London parishes (or neighborhoods) and discovered that, on average, there were approximately 3 deaths for every 88 people. Dividing by 3 shows that, on average, there was roughly 1 death for every 88/3 people. Because the London bills cited 13,200 deaths in London for that year, Graunt estimated the London population to be about

13,200 × 88/3 = 387,200

Graunt used this estimate to project a figure for all England. In his book he noted that these figures would be of interest to the rulers of the country, as indicators of both the number of men who could be drafted into an army and the number who could be taxed.
Graunt also used the London bills of mortality — and some intelligent guesswork as to what diseases killed whom and at what age — to infer ages at death. (Recall that the bills of mortality listed only causes and places of death, not the ages of those dying.) Graunt then used this information to compute tables giving the proportion of the population that would die at various ages.

TABLE 1.2 John Graunt's Mortality Table
Age at Death    Number of Deaths per 100 Births
Graunt's estimates of the ages at which people were dying were of great interest to those in the business of selling annuities. Annuities are the opposite of life insurance in that one pays in a lump sum as an investment and then receives regular payments for as long as one lives.
Graunt's work on mortality tables inspired further work by Edmund Halley in 1693. Halley, the discoverer of the comet bearing his name (and also the man who was most responsible, by both his encouragement and his financial support, for the publication of Isaac Newton's famous Principia Mathematica), used tables of mortality to compute the odds that a person of any age would live to any other particular age. Halley was influential in convincing the insurers of the time that an annual life insurance premium should depend on the age of the person being insured.
Following Graunt and Halley, the collection of data steadily increased throughout the remainder of the 17th and on into the 18th century. For instance, the city of Paris began collecting bills of mortality in 1667, and by 1730 it had become common practice throughout Europe to record ages at death.
The term statistics, which was used until the 18th century as a shorthand for the descriptive science of states, became in the 19th century increasingly identified with numbers. By the 1830s the term was almost universally regarded in Britain and France as being synonymous with the "numerical science" of society. This change in meaning was caused by the large availability of census records and other tabulations that began to be systematically collected and published by the governments of Western Europe and the United States beginning around 1800.
Throughout the 19th century, although probability theory had been developed by such mathematicians as Jacob Bernoulli, Karl Friedrich Gauss, and Pierre-Simon Laplace, its use in studying statistical findings was almost nonexistent, because most social statisticians at the time were content to let the data speak for themselves. In particular, statisticians of that time were not interested in drawing inferences about individuals, but rather were concerned with the society as a whole. Thus, they were not concerned with sampling but rather tried to obtain censuses of the entire population. As a result, probabilistic inference from samples to a population was almost unknown in 19th-century social statistics.
It was not until the late 1800s that statistics became concerned with inferring conclusions from numerical data. The movement began with Francis Galton's work on analyzing hereditary genius through the uses of what we would now call regression and correlation analysis (see Chapter 9), and obtained much of its impetus from the work of Karl Pearson. Pearson, who developed the chi-square goodness of fit tests (see Chapter 11), was the first director of the Galton Laboratory, endowed by Francis Galton in 1904. There Pearson originated a research program aimed at developing new methods of using statistics in inference. His laboratory invited advanced students from science and industry to learn statistical methods that could then be applied in their fields. One of his earliest visiting researchers was W. S. Gosset, a chemist by training, who showed his devotion to Pearson by publishing his own works under the name "Student." (A famous story has it that Gosset was afraid to publish under his own name for fear that his employers, the Guinness brewery, would be unhappy to discover that one of its chemists was doing research in statistics.) Gosset is famous for his development of the t-test (see Chapter 8).
Two of the most important areas of applied statistics in the early 20th century were population biology and agriculture. This was due to the interest of Pearson and others at his laboratory and also to the remarkable accomplishments of the English scientist Ronald A. Fisher. The theory of inference developed by these pioneers, including among others
TABLE 1.3 The Changing Definition of Statistics
Statistics has then for its object that of presenting a faithful representation of a state at a determined epoch. (Quetelet, 1849)

Statistics are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man. (Galton, 1889)

Statistics may be regarded (i) as the study of populations, (ii) as the study of variation, and (iii) as the study of methods of the reduction of data. (Fisher, 1925)

Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology. (E. Pearson, 1936)

Statistics is the name for that science and art which deals with uncertain inferences — which uses numbers to find out something about nature and experience. (Weaver, 1952)

Statistics has become known in the 20th century as the mathematical tool for analyzing experimental and observational data. (Porter, 1986)

Statistics is the art of learning from data. (this book, 2009)
Karl Pearson's son Egon and the Polish-born mathematical statistician Jerzy Neyman, was general enough to deal with a wide range of quantitative and practical problems. As a result, after the early years of the 20th century a rapidly increasing number of people in science, business, and government began to regard statistics as a tool that was able to provide quantitative solutions to scientific and practical problems (see Table 1.3).

Nowadays the ideas of statistics are everywhere. Descriptive statistics are featured in every newspaper and magazine. Statistical inference has become indispensable to public health and medical research, to engineering and scientific studies, to marketing and quality control, to education, to accounting, to economics, to meteorological forecasting, to polling and surveys, to sports, to insurance, to gambling, and to all research that makes any claim to being scientific. Statistics has indeed become ingrained in our intellectual heritage.
Problems
1. An election will be held next week and, by polling a sample of the voting population, we are trying to predict whether the Republican or Democratic candidate will prevail. Which of the following methods of selection is likely to yield a representative sample?
(a) Poll all people of voting age attending a college basketball game.
(b) Poll all people of voting age leaving a fancy midtown restaurant.
(c) Obtain a copy of the voter registration list, randomly choose 100 names, and question them.
(d) Use the results of a television call-in poll, in which the station asked its listeners to call in and name their choice.
(e) Choose names from the telephone directory and call these people.
2. The approach used in Problem 1(e) led to a disastrous prediction in the 1936 presidential election, in which Franklin Roosevelt defeated Alfred Landon by a landslide. A Landon victory had been predicted by the Literary Digest. The magazine based its prediction on the preferences of a sample of voters chosen from lists of automobile and telephone owners.
(a) Why do you think the Literary Digest's prediction was so far off?
(b) Has anything changed between 1936 and now that would make you believe that the approach used by the Literary Digest would work better today?
3. A researcher is trying to discover the average age at death for people in the United States today. To obtain data, the obituary columns of the New York Times are read for 30 days, and the ages at death of people in the United States are noted. Do you think this approach will lead to a representative sample?
4. To determine the proportion of people in your town who are smokers, it has been decided to poll people at one of the following local spots:
(a) the pool hall;
(b) the bowling alley;
(c) the shopping mall;
$75,000
(a) Would the university be correct in thinking that $75,000 was a good approximation to the average salary level of all of its graduates? Explain the reasoning behind your answer.
(b) If your answer to part (a) is no, can you think of any set of conditions relating to the group that returned questionnaires for which it would be a good approximation?
6. An article reported that a survey of clothing worn by pedestrians killed at night in traffic accidents revealed that about 80 percent of the victims were wearing dark-colored clothing and 20 percent were wearing light-colored clothing. The conclusion drawn in the article was that it is safer to wear light-colored clothing at night.
(a) Is this conclusion justified? Explain.
(b) If your answer to part (a) is no, what other information would be needed before a final conclusion could be drawn?
7. Critique Graunt's method for estimating the population of London. What implicit assumption is he making?
8. The London bills of mortality listed 12,246 deaths in 1658. Supposing that a survey of London parishes showed that roughly 2 percent of the population died that year, use Graunt's method to estimate London's population in 1658.
9. Suppose you were a seller of annuities in 1662 when Graunt's book was published. Explain how you would make use of his data on the ages at which people were dying.
10. Based on Graunt’s mortality table:
(a) What proportion of people survived to age 6?
(b) What proportion survived to age 46?
(c) What proportion died between the ages of 6 and 36?
DESCRIPTIVE STATISTICS
In this chapter we introduce the subject matter of descriptive statistics, and in doing so learn ways to describe and summarize a set of data. Section 2.2 deals with ways of describing a data set. Subsections 2.2.1 and 2.2.2 indicate how data that take on only a relatively few distinct values can be described by using frequency tables or graphs, whereas Subsection 2.2.3 deals with data whose set of values is grouped into different intervals. Section 2.3 discusses ways of summarizing data sets by use of statistics, which are numerical quantities whose values are determined by the data. Subsection 2.3.1 considers three statistics that are used to indicate the "center" of the data set: the sample mean, the sample median, and the sample mode. Subsection 2.3.2 introduces the sample variance and its square root, called the sample standard deviation. These statistics are used to indicate the spread of the values in the data set. Subsection 2.3.3 deals with sample percentiles, which are statistics that tell us, for instance, which data value is greater than 95 percent of all the data. In Section 2.4 we present Chebyshev's inequality for sample data. This famous inequality gives a lower bound to the proportion of the data that can differ from the sample mean by more than k times the sample standard deviation. Whereas Chebyshev's inequality holds for all data sets, we can in certain situations, which are discussed in Section 2.5, obtain more precise estimates of the proportion of the data that is within k sample standard deviations of the sample mean. In Section 2.5 we note that when a graph of the data follows a bell-shaped form the data set is said to be approximately normal, and more precise estimates are given by the so-called empirical rule. Section 2.6 is concerned with situations in which the data consist of paired values. A graphical technique, called the scatter diagram, for presenting such data is introduced, as is the sample correlation coefficient, a statistic that indicates the degree to which a large value of the first member of the pair tends to go along with a large value of the second.
2.2 DESCRIBING DATA SETS
The numerical findings of a study should be presented clearly, concisely, and in such a manner that an observer can quickly obtain a feel for the essential characteristics of the data. Over the years it has been found that tables and graphs are particularly useful ways of presenting data, often revealing important features such as the range, the degree of concentration, and the symmetry of the data. In this section we present some common graphical and tabular ways for presenting data.
2.2.1 Frequency Tables and Graphs
A data set having a relatively small number of distinct values can be conveniently presented in a frequency table. For instance, Table 2.1 is a frequency table for a data set consisting of the starting yearly salaries (to the nearest thousand dollars) of 42 recently graduated students with B.S. degrees in electrical engineering. Table 2.1 tells us, among other things, that the lowest starting salary of $47,000 was received by four of the graduates, whereas the highest salary of $60,000 was received by a single student. The most common starting salary was $52,000, and was received by 10 of the students.
TABLE 2.1 Starting Yearly Salaries
Data from a frequency table can be graphically represented by a line graph that plots the distinct data values on the horizontal axis and indicates their frequencies by the heights of vertical lines. A line graph of the data presented in Table 2.1 is shown in Figure 2.1.
When the lines in a line graph are given added thickness, the graph is called a bar graph. Figure 2.2 presents a bar graph.
Another type of graph used to represent a frequency table is the frequency polygon, which plots the frequencies of the different data values on the vertical axis, and then connects the plotted points with straight lines. Figure 2.3 presents a frequency polygon for the data of Table 2.1.
FIGURE 2.3 Frequency polygon for starting salary data.

2.2.2 Relative Frequency Tables and Graphs

Consider a data set consisting of n values. If f is the frequency of a particular value, then the ratio f/n is called its relative frequency. That is, the relative frequency of a data value is the proportion of the data that have that value. The relative frequencies can be represented graphically by a relative frequency line or bar graph or by a relative frequency polygon. Indeed, these relative frequency graphs will look like the corresponding graphs of the absolute frequencies except that the labels on the vertical axis are now the old labels (that gave the frequencies) divided by the total number of data points.
EXAMPLE 2.2a Table 2.2 is a relative frequency table for the data of Table 2.1. The relative frequencies are obtained by dividing the corresponding frequencies of Table 2.1 by 42, the size of the data set. ■
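A relative frequency table like Table 2.2 can be produced mechanically from a frequency table by dividing each count by n. The counts below are hypothetical placeholders, since the full contents of Table 2.1 are not reproduced here; only the division by n matters.

```python
# Hypothetical frequency table: distinct salary value -> frequency
# (illustrative counts, not the actual Table 2.1 data)
freq = {47: 4, 50: 5, 52: 10, 60: 1}

n = sum(freq.values())  # total number of observations
rel_freq = {value: f / n for value, f in freq.items()}
```

Whatever the counts, the relative frequencies necessarily sum to 1, which is a useful sanity check on any relative frequency table.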
A pie chart is often used to indicate relative frequencies when the data are not numerical in nature. A circle is constructed and then sliced into different sectors, one for each distinct type of data value. The relative frequency of a data value is indicated by the area of its sector, this area being equal to the total area of the circle multiplied by the relative frequency of the data value.

EXAMPLE 2.2b The following data relate to the different types of cancers affecting the 200 most recent patients to enroll at a clinic specializing in cancer. These data are represented in the pie chart presented in Figure 2.4. ■
Type of Cancer      Number of New Cases      Relative Frequency
Lung                        42                     0.21
Breast                      50                     0.25
Colon                       32                     0.16
Prostate                    55                     0.275
Bladder                     12                     0.06
…                            …                      …

FIGURE 2.4 Pie chart of the cancer data.
2.2.3 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots
As seen in Subsection 2.2.2, using a line or a bar graph to plot the frequencies of data values is often an effective way of portraying a data set. However, for some data sets the number of distinct values is too large to utilize this approach. Instead, in such cases, it is useful to divide the values into groupings, or class intervals, and then plot the number of data values falling in each class interval. The number of class intervals chosen should be a trade-off between (1) choosing too few classes at a cost of losing too much information about the actual data values in a class and (2) choosing too many classes, which will result in the frequencies of each class being too small for a pattern to be discernible. Although 5 to 10 class intervals are typical, the appropriate number is a subjective choice, and of course, you can try different numbers of class intervals to see which of the resulting charts appears to be most revealing about the data. It is common, although not essential, to choose class intervals of equal length.

TABLE 2.3 Life in Hours of 200 Incandescent Lamps
The endpoints of a class interval are called the class boundaries. We will adopt the left-end inclusion convention, which stipulates that a class interval contains its left-end but not its right-end boundary point. Thus, for instance, the class interval 20–30 contains all values that are both greater than or equal to 20 and less than 30.
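The left-end inclusion convention is easy to mishandle in code. A minimal binning sketch, with made-up data, that applies half-open intervals of the form [lo, hi):

```python
def class_frequencies(data, boundaries):
    """Count data values in each class [boundaries[i], boundaries[i+1]),
    following the left-end inclusion convention."""
    counts = [0] * (len(boundaries) - 1)
    for v in data:
        for i in range(len(boundaries) - 1):
            if boundaries[i] <= v < boundaries[i + 1]:
                counts[i] += 1
                break
    return counts

# Boundary value 20 falls in 20-30 (not 10-20); 30 falls in 30-40
counts = class_frequencies([15, 20, 29.9, 30], [10, 20, 30, 40])
```

Note how the boundary values 20 and 30 each land in the interval whose left end they equal, exactly as the convention stipulates.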
Table 2.3 presents the lifetimes of 200 incandescent lamps. A class frequency table for the data of Table 2.3 is presented in Table 2.4. The class intervals are of length 100, with the first one starting at 500.

TABLE 2.4 A Class Frequency Table

Class Interval      Frequency (Number of Data Values in the Interval)
FIGURE 2.5 A frequency histogram.
FIGURE 2.6 A cumulative frequency plot.

A bar graph plot of class data, with the bars placed adjacent to each other, is called a histogram. The vertical axis of a histogram can represent either the class frequency or the relative class frequency; in the former case the graph is called a frequency histogram and in the latter a relative frequency histogram. Figure 2.5 presents a frequency histogram of the data in Table 2.4.

We are sometimes interested in plotting a cumulative frequency (or cumulative relative frequency) graph. A point on the horizontal axis of such a graph represents a possible data value; its corresponding vertical plot gives the number (or proportion) of the data whose values are less than or equal to it. A cumulative relative frequency plot of the data of Table 2.3 is given in Figure 2.6. We can conclude from this figure that 100 percent of the data values are less than 1,500, approximately 40 percent are less than or equal to 900, approximately 80 percent are less than or equal to 1,100, and so on. A cumulative frequency plot is called an ogive.
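A single point on an ogive of relative frequencies is just the proportion of the data at or below a given value. A sketch with illustrative lifetimes (not the actual Table 2.3 data):

```python
def cumulative_relative_frequency(data, x):
    """Proportion of data values less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)

# Ten hypothetical lamp lifetimes, in hours
data = [700, 800, 900, 900, 1000, 1100, 1200, 1300, 1400, 1500]

p = cumulative_relative_frequency(data, 900)
```

Evaluating this function over a grid of x values and connecting the points would produce the cumulative relative frequency plot itself.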
An efficient way of organizing a small- to moderate-sized data set is to utilize a stem and leaf plot. Such a plot is obtained by first dividing each data value into two parts — its stem and its leaf. For instance, if the data are all two-digit numbers, then we could let the stem part of a data value be its tens digit and let the leaf be its ones digit. Thus, for instance, the value 62 is expressed as

Stem    Leaf
6       2
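For two-digit data, the stem-and-leaf split can be sketched by dividing each value at the tens digit; the sample values here are arbitrary:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group two-digit values into stems (tens digit) and sorted leaves (ones digit)."""
    plot = defaultdict(list)
    for v in sorted(data):
        plot[v // 10].append(v % 10)  # stem = tens digit, leaf = ones digit
    return dict(plot)

plot = stem_and_leaf([62, 64, 71, 58, 67])
```

Printing each stem followed by its list of leaves reproduces the plot on paper; for larger values one simply divides by a larger power of 10 (the mice data later in this chapter use stems in units of hundreds of days).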
EXAMPLE 2.2c Table 2.5 gives the monthly and yearly average daily minimum temperatures in selected cities.
2.3 SUMMARIZING DATA SETS
Modern-day experiments often deal with huge sets of data. For instance, in an attempt to learn about the health consequences of certain common practices, in 1951 the medical statisticians R. Doll and A. B. Hill sent questionnaires to all doctors in the United Kingdom and received approximately 40,000 replies. Their questions dealt with age, eating habits, and smoking habits. The respondents were then tracked for the ensuing 10 years and the causes of death for those who died were monitored. To obtain a feel for such a large amount of data, it is useful to be able to summarize it by some suitably chosen measures. In this section we present some summarizing statistics, where a statistic is a numerical quantity whose value is determined by the data.
2.3.1 Sample Mean, Sample Median, and Sample Mode
In this section we introduce some statistics that are used for describing the center of a set of data values. To begin, suppose that we have a data set consisting of the n numerical values x_1, x_2, …, x_n. The sample mean, denoted by x̄, is the arithmetic average of these values:

x̄ = (x_1 + x_2 + ··· + x_n)/n = ∑_{i=1}^{n} x_i / n
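The defining formula translates directly into code; the five values used here are arbitrary:

```python
def sample_mean(data):
    """Arithmetic average of the data values."""
    return sum(data) / len(data)

xbar = sample_mean([3, 4, 6, 7, 10])
```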
TABLE 2.5 Normal Daily Minimum Temperature — Selected Cities
[In Fahrenheit degrees. Airport data except as noted. Based on standard 30-year period, 1961 through 1990]

State  Station               Jan   Feb   Mar   Apr   May   June  July  Aug   Sept  Oct   Nov   Dec   Annual avg.
GA     Atlanta               31.5  34.5  42.5  50.2  58.7  66.2  69.5  69.0  63.5  51.9  42.8  35.0  51.3
HI     Honolulu              65.6  65.4  67.2  68.7  70.3  72.2  73.5  74.2  73.5  72.3  70.3  67.0  70.0
ID     Boise                 21.6  27.5  31.9  36.7  43.9  52.1  57.7  56.8  48.2  39.0  31.1  22.5  39.1
IL     Chicago               12.9  17.2  28.5  38.6  47.7  57.5  62.6  61.6  53.9  42.2  31.6  19.1  39.5
IL     Peoria                13.2  17.7  29.8  40.8  50.9  60.7  65.4  63.1  55.2  43.1  32.5  19.3  41.0
MN     Duluth                −2.2   2.8  15.7  28.9  39.6  48.5  55.1  53.3  44.5  35.1  21.5   4.9  29.0
MN     Minneapolis-St. Paul   2.8   9.2  22.7  36.2  47.6  57.6  63.1  60.3  50.3  38.8  25.2  10.2  35.3
If for constants a and b we let y_i = a x_i + b, i = 1, …, n, then the sample mean of the data set y_1, …, y_n is

ȳ = ∑_{i=1}^{n} (a x_i + b)/n = a ∑_{i=1}^{n} x_i/n + b = a x̄ + b
EXAMPLE 2.3a The winning scores in the U.S. Masters golf tournament in the years from 1999 to 2008 were as follows:

280, 278, 272, 276, 281, 279, 276, 281, 289, 280

Find the sample mean of these scores.

SOLUTION Rather than directly adding these values, it is easier to first subtract 280 from each one to obtain the new values y_i = x_i − 280:

0, −2, −8, −4, 1, −1, −4, 1, 9, 0

Because the arithmetic average of the transformed data set is

ȳ = −8/10 = −0.8

it follows that

x̄ = ȳ + 280 = 279.2 ■
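The shortcut in Example 2.3a (transform, average, then undo the transformation) can be checked numerically against the direct computation:

```python
scores = [280, 278, 272, 276, 281, 279, 276, 281, 289, 280]

# Transform: y_i = x_i - 280, so ybar = xbar - 280
transformed = [x - 280 for x in scores]
ybar = sum(transformed) / len(transformed)

# Recover the original mean: xbar = ybar + 280
xbar = ybar + 280
```

Both routes give the same answer, of course; the transformed values are simply easier to add by hand.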
Sometimes we want to determine the sample mean of a data set that is presented in a frequency table listing the k distinct values v_1, …, v_k having corresponding frequencies f_1, …, f_k. Since such a data set consists of n = ∑_{i=1}^{k} f_i observations, with the value v_i appearing f_i times, for each i = 1, …, k, it follows that the sample mean of these n data values is

x̄ = ∑_{i=1}^{k} v_i f_i / n

That is, the sample mean is a weighted average of the distinct values, where the weight given to v_i is f_i/n, the fraction of the data equal to v_i.
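The frequency-table formula for the sample mean is a one-liner; the table below is a hypothetical example:

```python
def mean_from_frequency_table(values, freqs):
    """Sample mean of data given distinct values v_i with frequencies f_i."""
    n = sum(freqs)
    return sum(v * f for v, f in zip(values, freqs)) / n

# Hypothetical table: value 1 appears twice, 2 three times, 4 five times
m = mean_from_frequency_table([1, 2, 4], [2, 3, 5])
```

This avoids expanding the table back into the full list of n observations, which matters when the frequencies are large.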
EXAMPLE 2.3b The following is a frequency table giving the ages of members of a symphony orchestra for young adults.
Another statistic used to indicate the center of a data set is the sample median; loosely speaking, it is the middle value when the data set is arranged in increasing order.
Definition
Order the values of a data set of size n from smallest to largest. If n is odd, the sample median is the value in position (n + 1)/2; if n is even, it is the average of the values in positions n/2 and n/2 + 1.
Thus the sample median of a set of three values is the second smallest; of a set of four values, it is the average of the second and third smallest.
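The definition translates directly into code; note the care needed in converting the 1-indexed positions of the definition to 0-indexed list positions:

```python
def sample_median(data):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(data)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # position (n + 1)/2, 1-indexed
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of positions n/2 and n/2 + 1

odd_median = sample_median([7, 1, 5])      # second smallest of three values
even_median = sample_median([9, 1, 5, 7])  # average of second and third smallest
```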
EXAMPLE 2.3c Find the sample median for the data described in Example 2.3b.
SOLUTION Since there are 54 data values, it follows that when the data are put in increasing order, the sample median is the average of the values in positions 27 and 28.

The sample mean and sample median are both useful statistics for describing the central tendency of a data set. The sample mean makes use of all the data values and is affected by extreme values that are much larger or smaller than the others; the sample median makes use of only one or two of the middle values and is thus not affected by extreme values. Which of them is more useful depends on what one is trying to learn from the data. For instance, if a city government has a flat rate income tax and is trying to estimate its total revenue from the tax, then the sample mean of its residents' income would be a more useful statistic. On the other hand, if the city was thinking about constructing middle-income housing, and wanted to determine the proportion of its population able to afford it, then the sample median would probably be more useful.
EXAMPLE 2.3d In a study reported in Hoel, D. G., "A representation of mortality data by competing risks," Biometrics, 28, pp. 475–488, 1972, a group of 5-week-old mice were each given a radiation dose of 300 rad. The mice were then divided into two groups; the first group was kept in a germ-free environment, and the second in conventional laboratory conditions. The numbers of days until death were then observed. The data for those whose death was due to thymic lymphoma are given in the following stem and leaf plots (whose stems are in units of hundreds of days); the first plot is for mice living in the germ-free conditions and the second for mice living under ordinary laboratory conditions.

Determine the sample means and the sample medians for the two sets of mice.
SOLUTION It is clear from the stem and leaf plots that the sample mean for the set of mice put in the germ-free setting is larger than the sample mean for the set of mice in the usual laboratory setting; indeed, a calculation gives that the former sample mean is 344.07, whereas the latter one is 292.32. On the other hand, since there are 29 data values for the germ-free mice, the sample median is the 15th largest data value, namely, 259; similarly, the sample median for the other set of mice is the 10th largest data value, namely, 265. Thus, whereas the sample mean is quite a bit larger for the first data set, the sample medians are approximately equal. The reason for this is that whereas the sample mean for the first set is greatly affected by the five data values greater than 500, these values have a much smaller effect on the sample median. Indeed, the sample median would remain unchanged if these values were replaced by any other five values greater than or equal to 259. It appears from the stem and leaf plots that the germ-free conditions probably improved the life spans of the five longest living mice, but it is unclear what, if any, effect they had on the life spans of the other mice. ■
Another statistic that has been used to indicate the central tendency of a data set is the sample mode, defined to be the value that occurs with the greatest frequency. If no single value occurs most frequently, then all the values that occur at the highest frequency are called modal values.
EXAMPLE 2.3e The following frequency table gives the values obtained in 40 rolls of a die.

Find (a) the sample mean, (b) the sample median, and (c) the sample mode.
SOLUTION (a) The sample mean is

x̄ = (9 + 16 + 15 + 20 + 30 + 42)/40 = 132/40 = 3.3

(b) The sample median is the average of the 20th and 21st smallest values, and is thus equal to 3. (c) The sample mode is 1, the value that occurred most frequently. ■
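A sketch reconstructing the computation in Example 2.3e. The individual frequencies are inferred from the products in the solution's sum, since face i contributes i·f_i to the numerator (9 + 16 + 15 + 20 + 30 + 42 gives f = 9, 8, 5, 5, 6, 7 for faces 1 through 6):

```python
# Frequencies implied by the solution's arithmetic
freqs = {1: 9, 2: 8, 3: 5, 4: 5, 5: 6, 6: 7}

# Expand the frequency table into the sorted list of 40 rolls
data = sorted(v for v, f in freqs.items() for _ in range(f))
n = len(data)

mean = sum(data) / n
median = (data[n // 2 - 1] + data[n // 2]) / 2  # average of 20th and 21st values
mode = max(freqs, key=freqs.get)                # most frequent value
```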
2.3.2 Sample Variance and Sample Standard Deviation
Whereas we have presented statistics that describe the central tendencies of a data set, we are also interested in ones that describe the spread or variability of the data values. A statistic that could be used for this purpose would be one that measures the average value of the squares of the distances between the data values and the sample mean. This is accomplished by the sample variance, which for technical reasons divides the sum of the squares of the differences by n − 1 rather than n, where n is the size of the data set:

s² = ∑_{i=1}^{n} (x_i − x̄)² / (n − 1)
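A direct implementation of the sample variance with the n − 1 divisor:

```python
def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

s2 = sample_variance([3, 4, 6, 7, 10])
```

The five values used here are the data set A of the example that follows, so the result can be compared with the hand computation there.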
SOLUTION As the sample mean for data set A is x̄ = (3 + 4 + 6 + 7 + 10)/5 = 6, it follows that its sample variance is

s² = [(−3)² + (−2)² + 0² + 1² + 4²]/4 = 7.5

The sample mean for data set B is also 6; its sample variance is

s² = [(−26)² + (−1)² + 9² + 18²]/3 ≈ 360.67

Thus, although both data sets have the same sample mean, there is a much greater variability in the values of the B set than in the A set. ■
The following algebraic identity is often useful for computing the sample variance:

∑_{i=1}^{n} (x_i − x̄)² = ∑_{i=1}^{n} x_i² − n x̄²

Also, if for constants a and b we let y_i = a + b x_i, i = 1, …, n, then, since adding the constant a does not change the deviations from the sample mean, it follows that if s_y² and s_x² are the respective sample variances, then

s_y² = b² s_x²
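Both the algebraic identity and the scaling property can be checked numerically on a small data set:

```python
def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

data = [3, 4, 6, 7, 10]
n = len(data)
xbar = sum(data) / n

# Algebraic identity: sum (x_i - xbar)^2 = sum x_i^2 - n * xbar^2
lhs = sum((x - xbar) ** 2 for x in data)
rhs = sum(x * x for x in data) - n * xbar ** 2

# Scaling property: if y_i = a + b * x_i, then s_y^2 = b^2 * s_x^2
a, b = 5, 3
scaled = [a + b * x for x in data]
```

The shift a drops out entirely, which is the numerical counterpart of the observation that adding a constant does not change the deviations from the mean.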