INTRODUCTION TO
BIOSTATISTICS
SECOND EDITION
State University of New York at Stony Brook
DOVER PUBLICATIONS, INC.
Mineola, New York
Copyright © 1969, 1973, 1981, 1987 by Robert R. Sokal and F. James Rohlf
All rights reserved.
Bibliographical Note
This Dover edition, first published in 2009, is an unabridged republication of
the work originally published in 1969 by W. H. Freeman and Company, New
York. The authors have prepared a new Preface for this edition.
Library of Congress Cataloging-in-Publication Data
Sokal, Robert R.
Introduction to Biostatistics / Robert R. Sokal and F. James Rohlf.
Dover ed.
p. cm.
Originally published: 2nd ed. New York: W.H. Freeman, 1969.
Includes bibliographical references and index.
Manufactured in the United States of America
Dover Publications, Inc., 31 East 2nd Street, Mineola, N.Y. 11501
to Julie and Janice
CONTENTS
1.2 The development of biostatistics
1.3 The statistical frame of mind
2.5 Frequency distributions
3.7 Sample statistics and parameters
3.8 Practical methods for computing mean and standard deviation
3.9 The coefficient of variation
INTRODUCTION TO PROBABILITY DISTRIBUTIONS:
4.1 Probability, random sampling, and hypothesis testing
4.2 The binomial distribution
5.1 Frequency distributions of continuous variables
5.2 Derivation of the normal distribution
5.3 Properties of the normal distribution
5.4 Applications of the normal distribution
5.5 Departures from normality: Graphic methods
6.1 Distribution and variance of means
6.2 Distribution and variance of other statistics
6.3 Introduction to confidence limits
6.4 Student's t distribution
6.5 Confidence limits based on sample statistics
6.6 The chi-square distribution
6.7 Confidence limits for variances
6.8 Introduction to hypothesis testing
6.9 Tests of simple hypotheses employing the t distribution
6.10 Testing the hypothesis H₀: σ² = σ₀²
7.1 The variances of samples and their means
7.3 The hypothesis H₀: σ₁² = σ₂²
7.4 Heterogeneity among sample means
7.5 Partitioning the total sum of squares and degrees of freedom
9.2 Two-way anova: Significance testing
9.3 Two-way anova without replication
10.3 Nonparametric methods in lieu of anova
11.3 The linear regression equation
11.4 More than one value of Y for each value of X
11.5 Tests of significance in regression
11.7 Residuals and transformations in regression
11.8 A nonparametric test for regression
12.2 The product-moment correlation coefficient
12.3 Significance tests in correlation
12.4 Applications of correlation
12.5 Kendall's coefficient of rank correlation
Preface to the Dover Edition
We are pleased and honored to see the re-issue of the second edition of our Introduction to Biostatistics by Dover Publications. On reviewing the copy, we find there is little in it that needs changing for an introductory textbook of biostatistics for an advanced undergraduate or beginning graduate student. The book furnishes an introduction to most of the statistical topics such students are likely to encounter in their courses and readings in the biological and biomedical sciences.
The reader may wonder what we would change if we were to write this book anew. Because of the vast changes that have taken place in modalities of computation in the last twenty years, we would deemphasize computational formulas that were designed for pre-computer desk calculators (an age before spreadsheets and comprehensive statistical computer programs) and refocus the reader's attention to structural formulas that not only explain the nature of a given statistic, but are also less prone to rounding error in calculations performed by computers. In this spirit, we would omit the equation (3.8) on page 39 and draw the readers' attention to equation (3.7) instead. Similarly, we would use structural formulas in Boxes 3.1 and 3.2 on pages 41 and 42, respectively; on page 161 and in Box 8.1 on pages 163/164, as well as in Box 12.1 on pages 278/279.
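The point about rounding error can be illustrated with a minimal Python sketch. Equations (3.7) and (3.8) are not reproduced in this part of the text, so the sketch assumes the usual structural form of the sum of squares, the sum of squared deviations from the mean, and the usual desk-calculator form, the sum of squared values minus the squared sum divided by n. The function names and the data are hypothetical and chosen only for illustration.

```python
def ss_structural(values):
    """Structural form: sum of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values)


def ss_computational(values):
    """Desk-calculator form: sum(Y^2) - (sum(Y))^2 / n."""
    n = len(values)
    return sum(y * y for y in values) - sum(values) ** 2 / n


# Made-up readings; both forms agree on small values.
small = [4.0, 7.0, 13.0, 16.0]

# The same readings shifted by a large constant. The spread is unchanged,
# so the sum of squares should be unchanged as well.
shifted = [y + 1e9 for y in small]

print("small:  ", ss_structural(small), ss_computational(small))
print("shifted:", ss_structural(shifted), ss_computational(shifted))
```

With the shifted data the structural form still works from small deviations, whereas the desk-calculator form subtracts two very large, nearly equal numbers and loses the answer in floating-point rounding.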
Secondly, we would put more emphasis on permutation tests and resampling methods. Permutation tests and bootstrap estimates are now quite practical. We have found this approach to be not only easier for students to understand but in many cases preferable to the traditional parametric methods that are emphasized in this book.
Robert R. Sokal
F. James Rohlf
November 2008
Preface
The favorable reception that the first edition of this book received from teachers and students encouraged us to prepare a second edition. In this revised edition, we provide a thorough foundation in biological statistics for the undergraduate student who has a minimal knowledge of mathematics. We intend Introduction to Biostatistics to be used in comprehensive biostatistics courses, but it can also be adapted for short courses in medical and professional schools; thus, we include examples from the health-related sciences.
We have extracted most of this text from the more-inclusive second edition of our own Biometry. We believe that the proven pedagogic features of that book, such as its informal style, will be valuable here.
We have modified some of the features from Biometry; for example, in Introduction to Biostatistics we provide detailed outlines for statistical computations but we place less emphasis on the computations themselves. Why? Students in many undergraduate courses are not motivated to and have few opportunities to perform lengthy computations with biological research material; also, such computations can easily be made on electronic calculators and microcomputers. Thus, we rely on the course instructor to advise students on the best computational procedures to follow.
We present material in a sequence that progresses from descriptive statistics to fundamental distributions and the testing of elementary statistical hypotheses; we then proceed immediately to the analysis of variance and the familiar t test (which is treated as a special case of the analysis of variance and relegated to several sections of the book). We do this deliberately for two reasons: (1) since today's biologists all need a thorough foundation in the analysis of variance, students should become acquainted with the subject early in the course; and (2) if analysis of variance is understood early, the need to use the t distribution is reduced. (One would still want to use it for the setting of confidence limits and in a few other special situations.) All t tests can be carried out directly as analyses of variance, and the amount of computation of these analyses of variance is generally equivalent to that of t tests.
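The statement that a t test can be carried out as an analysis of variance can be checked numerically. The following Python sketch assumes the two-sample t test with a pooled variance and a single-classification anova on the same two groups; for that case the square of t equals the F ratio. The function names and the data are made up for illustration.

```python
import math


def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((y - m1) ** 2 for y in a)
    ss2 = sum((y - m2) ** 2 for y in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def anova_f(groups):
    """F ratio (among-groups over within-groups mean square)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_among = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    ms_among = ss_among / (len(groups) - 1)
    ms_within = ss_within / (n_total - len(groups))
    return ms_among / ms_within


# Hypothetical measurements on two groups of unequal size.
group_a = [5.1, 4.8, 6.0, 5.5]
group_b = [6.9, 7.2, 6.4, 7.7, 7.0]

t = pooled_t(group_a, group_b)
f = anova_f([group_a, group_b])
print("t squared:", t ** 2)
print("F ratio:  ", f)
```

The two printed values agree up to floating-point rounding, which is the sense in which the t test is a special case of the analysis of variance.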
This larger second edition includes the Kolmogorov-Smirnov two-sample test, nonparametric regression, stem-and-leaf diagrams, hanging histograms, and the Bonferroni method of multiple comparisons. We have rewritten the chapter on the analysis of frequencies in terms of the G statistic rather than X², because the former has been shown to have more desirable statistical properties. Also, because of the availability of logarithm functions on calculators, the computation of the G statistic is now easier than that of the earlier chi-square test. Thus, we reorient the chapter to emphasize log-likelihood-ratio tests. We have also added new homework exercises.
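For readers who want to see the two statistics side by side, here is a minimal Python sketch for a goodness-of-fit problem. It assumes the standard definitions, G as twice the sum of observed frequencies times the natural logarithm of observed over expected, and X² as the sum of squared differences divided by expected. The function names and the counts (a made-up test of a 3:1 ratio in 100 individuals) are hypothetical.

```python
import math


def x2_statistic(observed, expected):
    """Chi-square statistic: sum of (f - f_hat)^2 / f_hat."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))


def g_statistic(observed, expected):
    """Log-likelihood-ratio statistic: 2 * sum of f * ln(f / f_hat).

    Classes with an observed frequency of zero contribute nothing.
    """
    return 2 * sum(f * math.log(f / e) for f, e in zip(observed, expected) if f > 0)


# Made-up counts tested against a 3:1 expectation.
observed = [84, 16]
expected = [75.0, 25.0]

x2 = x2_statistic(observed, expected)
g = g_statistic(observed, expected)
print("X2:", x2)
print("G: ", g)
```

The two statistics are of similar size on these counts; both would be compared with the same chi-square distribution.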
We call special double-numbered tables "boxes." They can be used as convenient guides for computation because they show the computational methods for solving various types of biostatistical problems. They usually contain all the steps necessary to solve a problem, from the initial setup to the final result. Thus, students familiar with material in the book can use them as quick summary reminders of a technique.
We found in teaching this course that we wanted students to be able to refer to the material now in these boxes. We discovered that we could not cover even half as much of our subject if we had to put this material on the blackboard during the lecture, and so we made up and distributed boxes and asked students to refer to them during the lecture. Instructors who use this book may wish to use the boxes in a similar manner.
We emphasize the practical applications of statistics to biology in this book; thus, we deliberately keep discussions of statistical theory to a minimum. Derivations are given for some formulas, but these are consigned to Appendix A1, where they should be studied and reworked by the student. Statistical tables to which the reader can refer when working through the methods discussed in this book are found in Appendix A2.
We are grateful to K. R. Gabriel, R. C. Lewontin, and M. Kabay for their [...] E. Russek-Cohen, and M. Singh for comments on an early draft of this book. We also appreciate the work of our secretaries, Resa Chapey and Cheryl Daly, with preparing the manuscripts, and of Donna DiGiovanni, Patricia Rohlf, and Barbara Thomson with proofreading.
Robert R. Sokal
F. James Rohlf
INTRODUCTION TO BIOSTATISTICS
CHAPTER 1
Introduction
This chapter sets the stage for your study of biostatistics. In Section 1.1, we define the field itself. We then cast a necessarily brief glance at its historical development in Section 1.2. Then in Section 1.3 we conclude the chapter with a discussion of the attitudes that the person trained in statistics brings to biological research.
We shall define biostatistics as the application of statistical methods to the solution of biological problems. The biological problems of this definition are those arising in the basic biological sciences as well as in such applied areas as the health-related sciences and the agricultural sciences. Biostatistics is also called biological statistics or biometry.
The definition of biostatistics leaves us somewhat up in the air: "statistics" [...] layman. The number of definitions you can find for it is limited only by the number of books you wish to consult. We might define statistics in its modern sense as the scientific study of numerical data based on natural phenomena. All parts of this definition are important and deserve emphasis:
[...] validity of scientific evidence. We must always be objective in presentation and evaluation of data and adhere to the general ethical code of scientific methodology, or we may find that the old saying that "figures never lie, only statisticians do" applies to us.
[...] hence it deals with quantities of information, not with a single datum. Thus, the measurement of a single animal or the response from a single biochemical test will generally not be of interest.
Numerical: Unless data of a study can be quantified in one way or another, they will not be amenable to statistical analysis. Numerical data can be measurements (the length or width of a structure or the amount of a chemical in a body fluid, for example) or counts (such as the number of bristles or teeth).
[...] those events in animate and inanimate nature that take place outside the control of human beings, but also those evoked by scientists and partly under their control, as in experiments. Different biologists will concern themselves with different levels of natural phenomena; other kinds of scientists, with yet different ones. But all would agree that the chirping of crickets, the number of peas in a pod, and the age of a woman at menopause are natural phenomena. The heartbeat of rats in response to adrenalin, the mutation rate in maize after [...] may still be considered natural, even though scientists have interfered with the phenomenon through their intervention. The average biologist would not consider the number of stereo sets bought by persons in different states in a given year to be a natural phenomenon. Sociologists or human ecologists, however, might so consider it and deem it worthy of study. The qualification "natural phenomena" is included in the definition of statistics mostly to make certain that the phenomena studied are not arbitrary ones that are entirely under the [...] an experiment.
The word "statistics" is also used in another, though related, way. It can be the plural of the noun statistic, which refers to any one of many computed or estimated statistical quantities, such as the mean, the standard deviation, or the correlation coefficient. Each one of these is a statistic.
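To make the second meaning concrete, the three statistics just named can each be computed from a sample. The Python sketch below uses the standard library's `statistics` module for the mean and the standard deviation and a small hypothetical helper, `correlation`, for the product-moment correlation coefficient. The tail lengths and body weights are made-up values for five mice.

```python
import math
import statistics


def correlation(x, y):
    """Product-moment correlation coefficient of paired readings."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical paired readings on five mice.
tail_length_mm = [52.0, 55.0, 49.0, 58.0, 51.0]
body_weight_g = [20.1, 22.3, 18.9, 24.0, 19.8]

mean_tail = statistics.mean(tail_length_mm)   # a statistic
sd_tail = statistics.stdev(tail_length_mm)    # another statistic
r = correlation(tail_length_mm, body_weight_g)  # and a third

print("mean:", mean_tail)
print("standard deviation:", sd_tail)
print("correlation coefficient:", r)
```

Each printed value is a single statistic; taken together they are statistics in the plural sense described above.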
1.2 The development of biostatistics
Modern statistics appears to have developed from two sources as far back as the seventeenth century. The first source was political science; a form of statistics developed as a quantitative description of the various aspects of the affairs of a government or state (hence the term "statistics"). This subject also became known as political arithmetic. Taxes and insurance caused people to become interested in problems of censuses, longevity, and mortality. Such considerations assumed increasing importance, especially in England as the country prospered during the development of its empire. John Graunt (1620-1674) and William Petty (1623-1687) were early students of vital statistics, and others followed in their footsteps.
At about the same time, the second source of modern statistics developed: the mathematical theory of probability engendered by the interest in games of chance among the leisure classes of the time. Important contributions to this theory were made by Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665), both Frenchmen. Jacques Bernoulli (1654-1705), a Swiss, laid the foundation of modern probability theory in Ars Conjectandi. Abraham de Moivre (1667-1754), a Frenchman living in England, was the first to combine the statistics of his day with probability theory in working out annuity values and to approximate the important normal distribution through the expansion of the binomial.
A later stimulus for the development of statistics came from the science of astronomy, in which many individual observations had to be digested into a coherent theory. Many of the famous astronomers and mathematicians of the eighteenth century, such as Pierre Simon Laplace (1749-1827) in France and Karl Friedrich Gauss (1777-1855) in Germany, were among the leaders in this field. The latter's lasting contribution to statistics is the development of the method of least squares.
Perhaps the earliest important figure in biostatistical thought was Adolphe Quetelet (1796-1874), a Belgian astronomer and mathematician, who in his work combined the theory and practical methods of statistics and applied them to problems of biology, medicine, and sociology. Francis Galton (1822-1911), a cousin of Charles Darwin, has been called the father of biostatistics and eugenics. The inadequacy of Darwin's genetic theories stimulated Galton to try to solve the problems of heredity. Galton's major contribution to biology was his application of statistical methodology to the analysis of biological variation, particularly through the analysis of variability and through his study of regression and correlation in biological measurements. His hope of unraveling the laws of genetics through these procedures was in vain. He started with the most difficult material and with the wrong assumptions. However, his methodology has become the foundation for the application of statistics to biology.
Karl Pearson (1857-1936), at University College, London, became interested in the application of statistical methods to biology, particularly in the demonstration of natural selection. Pearson's interest came about through the influence of W. F. R. Weldon (1860-1906), a zoologist at the same institution. Weldon, incidentally, is credited with coining the term "biometry" for the type of studies he and Pearson pursued. Pearson continued in the tradition of Galton and laid the foundation for much of descriptive and correlational statistics.
The dominant figure in statistics and biometry in the twentieth century has been Ronald A. Fisher (1890-1962). His many contributions to statistical theory will become obvious even to the cursory reader of this book.
Statistics today is a broad and extremely active field whose applications touch almost every science and even the humanities. New applications for statistics are constantly being found, and no one can predict from what branch of statistics new applications to biology will be made.
1.3 The statistical frame of mind
A brief perusal of almost any biological journal reveals how pervasive the use of statistics has become in the biological sciences. Why has there been such a marked increase in the use of statistics in biology? Apparently, because biologists have found that the interplay of biological causal and response variables does not fit the classic mold of nineteenth-century physical science. In that century, biologists such as Robert Mayer, Hermann von Helmholtz, and others tried to demonstrate that biological processes were nothing but physicochemical phenomena. In so doing, they helped create the impression that the experimental methods and natural philosophy that had led to such dramatic progress in the physical sciences should be imitated fully in biology.
Many biologists, even to this day, have retained the tradition of strictly mechanistic and deterministic concepts of thinking (while physicists, interestingly enough, as their science has become more refined, have begun to resort to statistical approaches). In biology, most phenomena are affected by many causal factors, uncontrollable in their variation and often unidentifiable. Statistics is needed to measure such variable phenomena, to determine the error of measurement, and to ascertain the reality of minute but important differences.
A misunderstanding of these principles and relationships has given rise to the attitude of some biologists that if differences induced by an experiment, or observed in nature, are not clear on plain inspection (and therefore are in need of statistical analysis), they are not worth investigating. There are few legitimate fields of inquiry, however, in which, from the nature of the phenomena studied, statistical investigation is unnecessary.
Statistical thinking is not really different from ordinary disciplined scientific thinking, in which we try to quantify our observations. In statistics we express our degree of belief or disbelief as a probability rather than as a vague, general statement. For example, a statement that individuals of species A are larger than those of species B or that women suffer more often from disease X than do men is of a kind commonly made by biological and medical scientists. Such statements can and should be more precisely expressed in quantitative form.
In many ways the human mind is a remarkable statistical machine, absorbing many facts from the outside world, digesting these, and regurgitating them in simple summary form. From our experience we know certain events to occur frequently, others rarely. "Man smoking cigarette" is a frequently observed event, "Man slipping on banana peel," rare. We know from experience that Japanese are on the average shorter than Englishmen and that Egyptians are on the average darker than Swedes. We associate thunder with lightning almost always, flies with garbage cans in the summer frequently, but snow with the southern Californian desert extremely rarely. All such knowledge comes to us as a result of experience, both our own and that of others, which we learn about by direct communication or through reading. All these facts have been processed by that remarkable computer, the human brain, which furnishes an abstract. This abstract is constantly under revision, and though occasionally faulty and biased, it is on the whole astonishingly sound; it is our knowledge of the moment.
Although statistics arose to satisfy the needs of scientific research, the development of its methodology in turn affected the sciences in which statistics is applied. Thus, through positive feedback, statistics, created to serve the needs of natural science, has itself affected the content and methods of the biological sciences. To cite an example: Analysis of variance has had a tremendous effect in influencing the types of experiments researchers carry out. The whole field of quantitative genetics, one of whose problems is the separation of environmental from genetic effects, depends upon the analysis of variance for its realization, and many of the concepts of quantitative genetics have been directly built around the designs inherent in the analysis of variance.
CHAPTER 2
Data in Biostatistics
In Section 2.1 we explain the statistical meaning of the terms "sample" and "population," which we shall be using throughout this book. Then, in Section 2.2, we come to the types of observations that we obtain from biological research material; we shall see how these correspond to the different kinds of variables upon which we perform the various computations in the rest of this book. In Section 2.3 we discuss the degree of accuracy necessary for recording data and the procedure for rounding off figures. We shall then be ready to consider in Section 2.4 certain kinds of derived data frequently used in biological science, among them ratios and indices, and the peculiar problems of accuracy and distribution they present us. Knowing how to arrange data in frequency distributions is important because such arrangements give an overall impression of the general pattern of the variation present in a sample and also facilitate further computational procedures. Frequency distributions, as well as the presentation of numerical data, are discussed in Section 2.5. In Section 2.6 we briefly describe the computational handling of data.
2.1 Samples and populations
We shall now define a number of important terms necessary for an [...] individual observations. They are observations or measurements taken on the smallest sampling unit. These smallest sampling units frequently, but not necessarily, are also individuals in the ordinary biological sense. If we measure weight in 100 rats, then the weight of each rat is an individual observation; the hundred rat weights together represent the sample of observations, defined as a collection of individual observations selected by a specified procedure. In this instance, one [...] sense, that is, one rat. However, if we had studied weight in a single rat over a period of time, the sample of individual observations would be the weights [...] in a study of ant colonies, where each colony is a basic sampling unit, each temperature reading for one colony is an individual observation, and the sample of observations is the temperatures for all the colonies considered. If we consider an estimate of the DNA content of a single mammalian sperm cell to be an individual observation, the sample of observations may be the estimates of DNA content of all the sperm cells studied in one individual mammal.
We have carefully avoided so far specifying what particular variable was being studied, because the terms "individual observation" and "sample of observations" as used above define only the structure but not the nature of the [...] statistics is "variable." However, in biology the word "character" is frequently used synonymously. More than one variable can be measured on each smallest sampling unit. Thus, in a group of 25 mice we might measure the blood pH and the erythrocyte count. Each mouse (a biological individual) is the smallest sampling unit, blood pH and red cell count would be the two variables studied, the pH readings and cell counts are individual observations, and two samples of 25 observations (on pH and on erythrocyte count) would result. Or we might speak of a bivariate sample of 25 observations, each referring to a pH reading paired with an erythrocyte count.
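The structure described in the mouse example can be sketched in Python. The class name `Mouse`, its field names, the units, and the three made-up records (standing in for the 25 mice of the example) are all hypothetical; the sketch only shows how one smallest sampling unit carries two variables, and how the same records yield either two samples or one bivariate sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mouse:
    """One smallest sampling unit, with two variables measured on it."""
    blood_ph: float
    erythrocyte_count: float  # hypothetical units: millions of cells per cubic mm


# Made-up readings for three mice.
mice = [
    Mouse(blood_ph=7.35, erythrocyte_count=9.1),
    Mouse(blood_ph=7.41, erythrocyte_count=8.7),
    Mouse(blood_ph=7.38, erythrocyte_count=9.4),
]

# Two samples of observations, one per variable.
ph_sample = [m.blood_ph for m in mice]
count_sample = [m.erythrocyte_count for m in mice]

# One bivariate sample: each observation pairs a pH reading with a count.
bivariate_sample = [(m.blood_ph, m.erythrocyte_count) for m in mice]

print(ph_sample)
print(count_sample)
print(bivariate_sample)
```

Either view contains the same individual observations; the bivariate form simply keeps the two readings from each mouse together.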
Next we define population. The biological definition of this term is well known. It refers to all the individuals of a given species (perhaps of a given life-history stage or sex) found in a circumscribed area at a given time. In statistics, population always means the totality of individual observations about which inferences are to be made, existing anywhere in the world or at least within a definitely specified sampling area limited in space and time. If you take five men and study the number of leucocytes in their peripheral blood and you are prepared to draw conclusions about all men from this sample of five, then the population from which the sample has been drawn represents the leucocyte counts of all extant males of the species Homo sapiens. If, on the other hand, you restrict yourself to a more narrowly specified sample, such as five male Chinese, aged 20, and you are restricting your conclusions to this particular group, then the population from which you are sampling will be leucocyte numbers of all Chinese males of age 20.
A common misuse of statistical methods is to fail to define the statistical population about which inferences can be made. A report on the analysis of a sample from a restricted population should not imply that the results hold in general. The population in this statistical sense is sometimes referred to as [...]
A population may represent variables of a concrete collection of objects or creatures, such as the tail lengths of all the white mice in the world, the leucocyte counts of all the Chinese men in the world of age 20, or the DNA content of all the hamster sperm cells in existence; or it may represent the outcomes of experiments, such as all the heartbeat frequencies produced in guinea pigs by injections of adrenalin. In cases of the first kind the population is generally finite. Although in practice it would be impossible to collect, count, and examine all hamster sperm cells, all Chinese men of age 20, or all white mice in the world, these populations are in fact finite. Certain smaller populations, such as all the whooping cranes in North America or all the recorded cases of a rare but easily diagnosed disease X, may well lie within reach of a total census. By contrast, an experiment can be repeated an infinite number of times (at least in theory). A given experiment, such as the administration of adrenalin to guinea pigs, could be repeated as long as the experimenter could obtain material and his or her health and patience held out. The sample of experiments actually performed is a sample from an infinite number that could be performed.
Some of the statistical methods to be developed later make a distinction between sampling from finite and from infinite populations. However, though populations are theoretically finite in most applications in biology, they are generally so much larger than samples drawn from them that they can be considered de facto infinite-sized populations.
2.2 Variables in biostatistics
Each biological discipline has its own set of variables, which may include conventional morphological measurements; concentrations of chemicals in body fluids; rates of certain biological processes; frequencies of certain events as in genetics, epidemiology, and radiation biology; physical readings of optical or electronic machinery used in biological research; and many more.
We have already referred to biological variables in a general way, but we have not yet defined them. We shall define a variable as a property with respect [...] does not differ within a sample at hand or at least among the samples being studied, it cannot be of statistical interest. Length, height, weight, number of teeth, vitamin C content, and genotypes are examples of variables in ordinary, genetically and phenotypically diverse groups of organisms. Warm-bloodedness in a group of mammals is not, since mammals are all alike in this regard, although body temperature of individual mammals would, of course, be a variable.
We can divide variables as follows:
Variables
    Measurement variables
        Continuous variables
        Discontinuous variables
    Ranked variables
    Attributes
[...] of values between any two fixed points. For example, between the two length measurements 1.5 and 1.6 cm there are an infinite number of lengths that could be measured if one were so inclined and had a precise enough method of calibration. Any given reading of a continuous variable, such as a length of 1.57 mm, is therefore an approximation to the exact reading, which in practice is unknowable. Many of the variables studied in biology are continuous variables. Examples are lengths, areas, volumes, weights, angles, temperatures, periods of time, percentages, concentrations, and rates.
Contrasted with continuous variables are the discontinuous variables, also known as meristic or discrete variables. These are variables that have only certain fixed numerical values, with no intermediate values possible in between. Thus the number of segments in a certain insect appendage may be 4 or 5 or 6 but never 5½ or 4.3. Examples of discontinuous variables are numbers of a given structure (such as segments, bristles, teeth, or glands), numbers of offspring, numbers of colonies of microorganisms or animals, or numbers of plants in a given quadrat.
Some variables cannot be measured but at least can be ordered or ranked by their magnitude. Thus, in an experiment one might record the rank order of emergence of ten pupae without specifying the exact time at which each pupa emerged. In such cases we code the data as a ranked variable, the order of emergence. Special methods for dealing with such variables have been developed, and several are furnished in this book. By expressing a variable as a series of ranks, such as 1, 2, 3, 4, 5, we do not imply that the difference in magnitude between, say, ranks 1 and 2 is identical to or even proportional to the difference between ranks 2 and 3.
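The last point, that ranks carry order but not spacing, can be seen in a short Python sketch. The helper `rank_order` and the emergence times (in hours, for five pupae rather than ten) are made up for illustration, and the helper assumes there are no tied values.

```python
def rank_order(values):
    """Return 1-based ranks (smallest value gets rank 1); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical emergence times in hours, one per pupa.
emergence_hours = [30.5, 12.0, 47.25, 13.0, 29.0]
ranks = rank_order(emergence_hours)

for hours, rank in sorted(zip(emergence_hours, ranks), key=lambda pair: pair[1]):
    print(f"rank {rank}: emerged after {hours} hours")
```

In these made-up data the pupae ranked 1 and 2 emerged one hour apart, while those ranked 2 and 3 emerged sixteen hours apart, yet both pairs differ by exactly one rank.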
Variables that cannot be measured but must be expressed qualitatively are called attributes, or nominal variables. These are all properties, such as black or white, pregnant or not pregnant, dead or alive, male or female. When such attributes are combined with frequencies, they can be treated statistically. Of 80 mice, we may, for instance, state that four were black, two agouti, and the rest gray. When attributes are combined with frequencies into tables suitable for statistical analysis, they are referred to as enumeration data. Thus the enumeration data on color in mice would be arranged as follows:

Color                   Frequency
Black                        4
Agouti                       2
Gray                        74
Total number of mice        80

In some cases attributes can be changed into measurement variables if this is desired. Thus colors can be changed into wavelengths or color-chart values. Certain other attributes that can be ranked or ordered can be coded to become ranked variables. For example, three attributes referring to a structure as "poorly developed," "well developed," and "hypertrophied" could be coded 1, 2, and 3.
A term that has not yet been explained is variate. In this book we shall use it as a single reading, score, or observation of a given variable. Thus, if we have measurements of the length of the tails of five mice, tail length will be a continuous variable, and each of the five readings of length will be a variate. In this text we identify variables by capital letters, the most common symbol being Y. Thus Y may stand for tail length of mice. A variate will refer to a given length measurement; Yᵢ is the measurement of tail length of the ith mouse, and Y₄ is the measurement of tail length of the fourth mouse in our sample.
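The enumeration data on mouse color, and the coding of ordered attributes as ranks, can be sketched in a few lines of Python. The sketch uses `collections.Counter` from the standard library; the variable names and the three observed developmental states are hypothetical, while the color frequencies are those of the 80 mice in the text.

```python
from collections import Counter

# One attribute value per mouse: 4 black, 2 agouti, 74 gray.
coat_colors = ["black"] * 4 + ["agouti"] * 2 + ["gray"] * 74

# Combining the attribute with frequencies gives enumeration data.
frequencies = Counter(coat_colors)
total = sum(frequencies.values())

for color, count in frequencies.items():
    print(f"{color:<8}{count:>4}")
print(f"{'total':<8}{total:>4}")

# Ordered attributes can be coded to become a ranked variable.
DEVELOPMENT_CODES = {"poorly developed": 1, "well developed": 2, "hypertrophied": 3}
observed_states = ["well developed", "poorly developed", "hypertrophied"]  # made up
ranked = [DEVELOPMENT_CODES[state] for state in observed_states]
print(ranked)
```

The coat colors have no natural order, so they stay a nominal attribute with frequencies; the developmental states do have an order, which is what makes the numeric coding meaningful.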
Most continuous variables, however, are approximate. We mean by this that the exact value of the single measurement, the variate, is unknown and probably unknowable. The last digit of the measurement stated should imply precision; that is, it should indicate the limits on the measurement scale between which we believe the true measurement to lie. Thus, a length measurement of 12.3 mm implies that the true length of the structure lies somewhere between 12.25 and 12.35 mm. Exactly where between these implied limits the real length is we do not know. But where would a true measurement of 12.25 fall? Would it not equally likely fall in either of the two classes 12.2 and 12.3, clearly an unsatisfactory state of affairs? Such an argument is correct, but when we record a number as either 12.2 or 12.3, we imply that the decision whether to put it into the higher or lower class has already been taken. This decision was not taken arbitrarily, but presumably was based on the best available measurement. If the scale of measurement is so precise that a value of 12.25 would clearly have been recognized, then the measurement should have been recorded originally to four significant figures. Implied limits, therefore, always carry one more figure beyond the last significant one measured by the observer.
Hence, it follows that if we record the measurement as 12.32, we are implying that the true value lies between 12.315 and 12.325. Unless this is what we mean, there would be no point in adding the last decimal figure to our original measurements. If we do add another figure, we must imply an increase in precision.
We see, therefore, that accuracy and precision in numbers are not absolute cepts, but are relative Assuming there is no bias, a number becomes increasinglymore accurate as we are able to write more significant figures for it (increase itsprecision) To illustrate this concept of the relativity of accuracy, consider thefollowing three numbers:
Meristic variates, though ordinarily exact, may be recorded approximatelywhen large numbers are involved Thus when counts are reported to the nearestthousand, a count of 36,000 insects in a cubic meter of soil, for example, impliesthat the true number varies somewhere from 35,500 to 36,500 insects
To how many significant figures should we record measurements? If we array
2.3 Accuracy and precision of data
"Accuracy" and "precision" are used synonymously in everyday speech, but in statistics we define them more rigorously. Accuracy is the closeness of a measured or computed value to its true value. Precision is the closeness of repeated measurements. A biased but sensitive scale might yield inaccurate but precise weight. By chance, an insensitive scale might result in an accurate reading, which would, however, be imprecise, since a repeated weighing would be unlikely to yield an equally accurate weight. Unless there is bias in a measuring instrument, precision will lead to accuracy. We need therefore mainly be concerned with the former.
Precise variates are usually, but not necessarily, whole numbers. Thus, when we count four eggs in a nest, there is no doubt about the exact number of eggs in the nest if we have counted correctly; it is 4, not 3 or 5, and clearly it could not be 4 plus or minus a fractional part. Meristic, or discontinuous, variables are generally measured as exact numbers. Seemingly continuous variables derived from meristic ones can under certain conditions also be exact numbers. For instance, ratios between exact numbers are themselves also exact. If in a colony of animals there are 18 females and 12 males, the ratio of females to males (a
Number    Implied limits
193       192.5-193.5
192.8     192.75-192.85
192.76    192.755-192.765
one, an easy rule to remember is that the number of unit steps from the smallest to the largest measurement in an array should usually be between 30 and 300. Thus, if we are measuring a series of shells to the nearest millimeter and the largest is 8 mm and the smallest is 4 mm wide, there are only four unit steps between the largest and the smallest measurement. Hence, we should measure our shells to one more significant decimal place. Then the two extreme measurements might be 8.2 mm and 4.1 mm, with 41 unit steps between them (counting the last significant digit as the unit); this would be an adequate number of unit steps. The reason for such a rule is that an error of 1 in the last significant digit of a reading of 4 mm would constitute an inadmissible error of 25%, but an error of 1 in the last digit of 4.1 is less than 2.5%. Similarly, if we measured the height of the tallest of a series of plants as 173.2 cm and that of the shortest of these plants as 26.6 cm, the difference between these limits would comprise 1466 unit steps (of 0.1 cm), which are far too many. It would therefore be advisable to record the heights to the nearest centimeter as follows: 173 cm for the tallest and 27 cm for the shortest. This would yield 146 unit steps. Using the rule we have stated for the number of unit steps, we shall record two or three digits for most measurements.
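The unit-step rule can be checked mechanically. The following is a minimal sketch, not a prescribed procedure: it assumes that both extreme measurements are supplied as strings written to the same number of decimal places, and the function names `unit_steps` and `within_rule` are our own illustrative choices.

```python
from decimal import Decimal

def unit_steps(smallest: str, largest: str) -> int:
    """Count unit steps between two recorded measurements.

    The unit is one step in the last recorded digit, so both values
    must be written to the same number of decimal places.
    """
    lo, hi = Decimal(smallest), Decimal(largest)
    if lo.as_tuple().exponent != hi.as_tuple().exponent:
        raise ValueError("record both measurements to the same precision")
    unit = Decimal(1).scaleb(lo.as_tuple().exponent)  # e.g. 0.1 for "4.1"
    return int((hi - lo) / unit)

def within_rule(steps: int, low: int = 30, high: int = 300) -> bool:
    """Apply the 30-to-300 rule of thumb stated in the text."""
    return low <= steps <= high

# The shell and plant examples from the text.
for lo, hi in [("4", "8"), ("4.1", "8.2"), ("26.6", "173.2"), ("27", "173")]:
    steps = unit_steps(lo, hi)
    print(f"{lo} to {hi}: {steps} unit steps, acceptable: {within_rule(steps)}")
```

Exact decimal arithmetic is used here because binary floating point can misrepresent values such as 0.1 and would occasionally give a step count that is off by one.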
The last digit should always be significant; that is, it should imply a range for the true measurement of from half a "unit step" below to half a "unit step" above the recorded score, as illustrated earlier. This applies to all digits, zero included. Zeros should therefore not be written at the end of approximate numbers to the right of the decimal point unless they are meant to be significant digits. Thus 7.80 must imply the limits 7.795 to 7.805. If 7.75 to 7.85 is implied, the measurement should be recorded as 7.8.
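To make the notion of implied limits concrete, here is a small sketch. It assumes that a reading is passed as a string exactly as it was recorded, so that trailing zeros are preserved; the function name `implied_limits` is ours and purely illustrative.

```python
from decimal import Decimal

def implied_limits(recorded: str) -> tuple[Decimal, Decimal]:
    """Return the limits half a unit step below and above a recorded value."""
    value = Decimal(recorded)
    # One unit step in the last recorded digit, halved.
    half_step = Decimal(1).scaleb(value.as_tuple().exponent) / 2
    return value - half_step, value + half_step

for reading in ["12.3", "12.32", "7.80", "7.8", "193"]:
    low, high = implied_limits(reading)
    print(f"{reading}: {low} to {high}")
```

A count reported to the nearest thousand can be written in exponent form, for example `implied_limits("36E3")`, so that the last significant digit is the thousands digit.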
When the number of significant digits is to be reduced, we carry out the process of rounding off numbers. The rules for rounding off are very simple. A digit to be rounded off is not changed if it is followed by a digit less than 5. If the digit to be rounded off is followed by a digit greater than 5 or by 5 followed by other nonzero digits, it is increased by 1. When the digit to be rounded off is followed by a 5 standing alone or a 5 followed by zeros, it is unchanged if it is even but increased by 1 if it is odd. The reason for this last rule is that when such numbers are summed in a long series, we should have as many digits raised as are being lowered, on the average; these changes should therefore balance out. Practice the above rules by rounding off the following numbers to the indicated number of significant digits:
Number    Significant digits desired
Most pocket calculators or larger computers round off their displays using a different rule: they increase the preceding digit when the following digit is a 5 standing alone or with trailing zeros. However, since most of the machines usable for statistics also retain eight or ten significant figures internally, the accumulation of rounding errors is minimized. Incidentally, if two calculators give answers with slight differences in the final (least significant) digits, suspect a different number of significant digits in memory as a cause of the disagreement.
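The two rounding rules just described can be compared directly. The sketch below uses Python's standard `decimal` module, whose `ROUND_HALF_EVEN` mode matches the even-digit rule given above and whose `ROUND_HALF_UP` mode matches the rule attributed to calculators. The helper name `round_significant` and the sample numbers are our own assumptions, chosen only for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_significant(number: str, digits: int, rounding=ROUND_HALF_EVEN) -> Decimal:
    """Round a decimal string to the given number of significant digits."""
    value = Decimal(number)
    if value == 0:
        return value
    # Power of ten of the last digit that is to be kept.
    exponent = value.adjusted() - digits + 1
    return value.quantize(Decimal(1).scaleb(exponent), rounding=rounding)

# Illustrative inputs (not taken from the text's practice table).
for number, digits in [("0.03725", 3), ("0.03715", 3), ("2.5", 1), ("26.58", 2)]:
    even = round_significant(number, digits)
    up = round_significant(number, digits, ROUND_HALF_UP)
    print(f"{number} to {digits} digits: half-even {even}, half-up {up}")
```

The two rules differ only when the discarded part is exactly a 5 (alone or followed by zeros) and the preceding digit is even, as with 0.03725 and 2.5 here.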
2.4 Derived variables
The majority of variables in biometric work are observations recorded as direct measurements or counts of biological material or as readings that are the output of various types of instruments. However, there is an important class of variables in biological research that we may call the derived or computed variables. These are generally based on two or more independently measured variables whose relations are expressed in a certain way. We are referring to ratios, percentages, concentrations, indices, rates, and the like.
A ratio expresses as a single value the relation that two variables have, one to the other. In its simplest form, a ratio is expressed as in 64:24, which may represent the number of wild-type versus mutant individuals, the number of males versus females, a count of parasitized individuals versus those not parasitized, and so on. These examples imply ratios based on counts. A ratio based on a continuous variable might be similarly expressed as 1.2:1.8, which may represent the ratio of width to length in a sclerite of an insect or the ratio between the concentrations of two minerals contained in water or soil. Ratios may also be expressed as fractions; thus, the two ratios above could be expressed as $\frac{64}{24}$ and $\frac{1.2}{1.8}$. However, for computational purposes it is more useful to express the ratio as a quotient. The two ratios cited would therefore be 2.666... and 0.666..., respectively. These are pure numbers, not expressed in measurement units of any kind. It is this form for ratios that we shall consider further.
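As a brief illustration of the fraction and quotient forms of the two ratios above, the following sketch uses Python's standard `fractions` module; the variable names are ours.

```python
from fractions import Fraction

# Ratio based on counts, 64:24.
count_ratio = Fraction(64, 24)
# Ratio based on a continuous variable, 1.2:1.8 (parsed exactly from strings).
measure_ratio = Fraction("1.2") / Fraction("1.8")

for label, ratio in [("64:24", count_ratio), ("1.2:1.8", measure_ratio)]:
    print(f"{label} reduces to {ratio}, quotient {float(ratio):.3f}")
```

Parsing the measurements from strings keeps the arithmetic exact, so the reduced fractions and the repeating quotients agree with the values quoted in the text.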
Percentages are also a type of ratio. Ratios, percentages, and concentrations are basic quantities in much biological research, widely used and generally familiar.
An index is the ratio of the value of one variable to the value of a so-called standard one. A well-known example of an index in this sense is the cephalic index in physical anthropology. Conceived in the wide sense, an index could be the average of two measurements, either simply, such as $\frac{1}{2}$(length of A + length of B), or in weighted fashion, such as $\frac{1}{3}$[(2 × length of A) + length of B].
Rates are important in many experimental fields of biology. The amount of a substance liberated per unit weight or volume of biological material, weight gain per unit time, reproductive rates per unit population size and time (birth rates), and death rates would fall in this category.
The use of ratios and percentages is deeply ingrained in scientific thought. Often ratios may be the only meaningful way to interpret and understand certain types of biological problems. If the biological process being investigated operates on the ratio of the variables studied, one must examine this ratio to understand the process. Thus, Sinnott and Hammond (1935) found that interpreted through a form index based on a length-width ratio, but not through the independent dimensions of shape. By similar methods of investigation, we should be able to find selection affecting body proportions to exist in the evolution of almost any organism.
FIGURE 2.1 Sampling from a population of birth weights of infants (a continuous variable). A. A sample of 25. B. A sample of 100. C. A sample of 500. D. A sample of 2000.
There are several disadvantages to using ratios. First, they are relatively inaccurate. Let us return to the ratio $\frac{1.2}{1.8}$ mentioned above and recall from the previous section that a measurement of 1.2 implies a true range of measurement of the variable from 1.15 to 1.25; similarly, a measurement of 1.8 implies a range from 1.75 to 1.85. We realize, therefore, that the true ratio may vary anywhere from $\frac{1.15}{1.85}$ to $\frac{1.25}{1.75}$, or from 0.622 to 0.714. We note a possible maximal error of 4.2% if 1.2 is an original measurement: (1.25 - 1.2)/1.2; the corresponding maximal error for the ratio is 7.0%: (0.714 - 0.667)/0.667. Furthermore, the best estimate of a ratio is not usually the midpoint between its possible ranges. Thus, in our example the midpoint between the implied limits is 0.668 and the ratio based on $\frac{1.2}{1.8}$ is 0.666...; while this is only a slight difference, the discrepancy may be greater in other instances.
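The arithmetic behind this loss of accuracy can be retraced in a few lines. This is a sketch under the assumption that both measurements are recorded to one decimal place, so that each has implied limits of plus or minus 0.05; the function name `ratio_range` is ours.

```python
def ratio_range(numerator: float, denominator: float, half_step: float = 0.05):
    """Smallest and largest ratio compatible with the implied limits."""
    lowest = (numerator - half_step) / (denominator + half_step)
    highest = (numerator + half_step) / (denominator - half_step)
    return lowest, highest

ratio = 1.2 / 1.8
low, high = ratio_range(1.2, 1.8)
midpoint = (low + high) / 2

# Maximal relative error of the single measurement 1.2 and of the ratio.
measurement_error = (1.25 - 1.2) / 1.2
ratio_error = (high - ratio) / ratio

print(f"ratio {ratio:.3f}, range {low:.3f} to {high:.3f}, midpoint {midpoint:.3f}")
print(f"measurement error {measurement_error:.1%}, ratio error {ratio_error:.1%}")
```

Carried out without intermediate rounding, the maximal error of the ratio comes to about 7.1%; the 7.0% quoted above results from using the rounded values 0.714 and 0.667.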
A second disadvantage to ratios and percentages is that they may not be approximately normally distributed (see Chapter 5) as required by many statistical tests. This difficulty can frequently be overcome by transformation of the variable (as discussed in Chapter 10). A third disadvantage of ratios is that in using them one loses information about the relationships between the two variables except for the information about the ratio itself.
2.5 Frequency distributions
If we were to sample a population of birth weights of infants, we could represent each sampled measurement by a point along an axis denoting magnitude of birth weight. This is illustrated in Figure 2.1A, for a sample of 25 birth weights. If we sample repeatedly from the population and obtain 100 birth weights, we shall probably have to place some of these points on top of other points in order to record them all correctly (Figure 2.1B). As we continue sampling, the assemblage of points will continue to increase in size but will assume a fairly definite shape. The outline of the mound of points approximates the distribution of the variable. Remember that a continuous variable such as birth weight can assume an infinity of values between any two points on the abscissa. The refinement of our measurements will determine how fine the number of recorded divisions between any two points along the axis will be.
The distribution of a variable is of considerable biological interest. If we find that the distribution is asymmetrical and drawn out in one direction, it tells us that there is, perhaps, selection that causes organisms to fall preferentially in one of the tails of the distribution, or possibly that the scale of measurement
The above is an example of a quantitative frequency distribution, since Y is clearly a measurement variable. However, arrays and frequency distributions need not be limited to such variables. We can make frequency distributions of attributes, called qualitative frequency distributions. In these, the various classes are listed in some logical or arbitrary order. For example, in genetics we might have a qualitative frequency distribution as follows:
Phenotype    f
A-           86
aa           32
FIGURE 2.2 Bar diagram of the number of plants of the sedge Carex flacca per quadrat in 500 quadrats. Data from Table 2.2; originally from Archibald (1950).
TABLE 2.1 Distribution of melanoma over body regions in 4599 men and 4786 women.
This tells us that there are two classes of individuals, those identified by the A- phenotype, of which 86 were found, and those comprising the homozygote recessive aa, of which 32 were seen in the sample.
An example of a more extensive qualitative frequency distribution is given in Table 2.1, which shows the distribution of melanoma (a type of skin cancer) over body regions in men and women. This table tells us that the trunk and limbs are the most frequent sites for melanomas and that the buccal cavity, the rest of the gastrointestinal tract, and the genital tract are rarely afflicted by this
Observed frequency
Men    Women
chosen is such as to bring about a distortion of the distribution. If, in a sample of immature insects, we discover that the measurements are bimodally distributed (with two peaks), this would indicate that the population is dimorphic. This means that different species or races may have become intermingled in our sample. Or the dimorphism could have arisen from the presence of both sexes or of different instars.
There are several characteristic shapes of frequency distributions. The most common is the symmetrical bell shape (approximated by the bottom graph in Figure 2.1), which is the shape of the normal frequency distribution discussed in Chapter 5. There are also skewed distributions (drawn out more at one tail than the other), L-shaped distributions as in Figure 2.2, U-shaped distributions, and others, all of which impart significant information about the relationships they represent. We shall have more to say about the implications of various types of distributions in later chapters and sections.
After researchers have obtained data in a given study, they must arrange the data in a form suitable for computation and interpretation. We may assume that variates are randomly ordered initially or are in the order in which the measurements have been taken. A simple arrangement would be an array of the data by order of magnitude. Thus, for example, the variates 7, 6, 5, 7, 8, 9, 6, 7, 4, 6, 7 could be arrayed in order of decreasing magnitude as follows: 9, 8, 7, 7, 7, 7, 6, 6, 6, 5, 4. Where there are some variates of the same value, such as the 6's and 7's in this fictitious example, a time-saving device might immediately have occurred to you, namely, to list a frequency for each of the recurring variates; thus: 9, 8, 7(4×), 6(3×), 5, 4. Such a shorthand notation is one way to represent a frequency distribution, which is simply an arrangement of the classes of variates with the frequency of each class indicated. Conventionally, a frequency distribution is stated in tabular form; for our example, this is done as follows:
Variable Y    Frequency f
9             1
8             1
7             4
6             3
5             1
4             1
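The same array and frequency distribution can be produced with a few lines of code. This is only a sketch using Python's standard library; the variable names are ours.

```python
from collections import Counter

# The fictitious variates from the text, in the order in which they were obtained.
variates = [7, 6, 5, 7, 8, 9, 6, 7, 4, 6, 7]

# Array: the variates in order of decreasing magnitude.
array = sorted(variates, reverse=True)
print(array)

# Frequency distribution: each class of variate with its frequency f.
frequencies = Counter(variates)
print("Y  f")
for value in sorted(frequencies, reverse=True):
    print(value, "", frequencies[value])
print("Sum of f:", sum(frequencies.values()))
```

The sum of the frequencies should always equal the number of variates, which gives a quick check that no observation was lost in tallying.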
[Table 2.1 body: observed frequencies by body region for men and women, ending with the rows Eye and Total cases; total for women 4786.]
TABLE 2.2
A meristic frequency distribution.
Number of plants of the sedge Carex flacca found in 500 quadrats.
Source: Data from Archibald (1950).
type of cancer. We often encounter other examples of qualitative frequency distributions in ecology in the form of tables, or species lists, of the inhabitants of a sampled ecological area. Such tables catalog the inhabitants by species or at a higher taxonomic level and record the number of specimens observed for each. The arrangement of such tables is usually alphabetical, or it may follow a special convention, as in some botanical species lists.
A quantitative frequency distribution based on meristic variates is shown in Table 2.2. This is an example from plant ecology: the number of plants per quadrat sampled is listed at the left in the variable column; the observed frequency is shown at the right.
Quantitative frequency distributions based on a continuous variable are the most commonly employed frequency distributions; you should become thoroughly familiar with them. An example is shown in Box 2.1. It is based on 25 femur lengths measured in an aphid population. The 25 readings are shown at the top of Box 2.1 in the order in which they were obtained as measurements. (They could have been arrayed according to their magnitude.) The data are next set up in a frequency distribution. The variates increase in magnitude by unit steps of 0.1. The frequency distribution is prepared by entering each variate in turn on the scale and indicating a count by a conventional tally mark. When all of the items have been tallied in the corresponding class, the tallies are converted into numerals indicating frequencies in the next column. Their sum is indicated by $\sum f$.
What have we achieved in summarizing our data? The original 25 variates are now represented by only 15 classes. We find that variates 3.6, 3.8, and 4.3 have the highest frequencies. However, we also note that there are several classes, such as 3.4 or 3.7, that are not represented by a single aphid. This gives the
No. of plants per quadrat    Observed frequency
Y                            f
0                            181
1                            118
2                            97
3                            54
4                            32
5                            9
6                            5
7                            3
8                            1
Total                        500
entire frequency distribution a drawn-out and scattered appearance. The reason for this is that we have only 25 aphids, too few to put into a frequency distribution with 15 classes. To obtain a more cohesive and smooth-looking distribution, we have to condense our data into fewer classes. This process is known as grouping and is described in the following paragraphs.
We should realize that grouping individual variates into classes of wider range is only an extension of the same process that took place when we obtained the initial measurement. Thus, as we have seen in Section 2.3, when we measure an aphid and record its femur length as 3.3 units, we imply thereby that the true measurement lies between 3.25 and 3.35 units, but that we were unable to measure to the second decimal place. In recording the measurement initially as 3.3 units, we estimated that it fell within this range. Had we estimated that it exceeded the value of 3.35, for example, we would have given it the next higher score, 3.4. Therefore, all the measurements between 3.25 and 3.35 were in fact grouped into the class identified by the class mark 3.3. Our class interval was 0.1 units. If we now wish to make wider class intervals, we are doing nothing but extending the range within which measurements are placed into one class.
Reference to Box 2.1 will make this process clear. We group the data twice in order to impress upon the reader the flexibility of the process. In the first example of grouping, the class interval has been doubled in width; that is, it has been made to equal 0.2 units. If we start at the lower end, the implied class limits will now be from 3.25 to 3.45, the limits for the next class from 3.45 to 3.65, and so forth.
Our next task is to find the class marks. This was quite simple in the frequency distribution shown at the left side of Box 2.1, in which the original measurements were used as class marks. However, now we are using a class interval twice as wide as before, and the class marks are calculated by taking the midpoint of the new class intervals. Thus, to find the class mark of the first class, we take the midpoint between 3.25 and 3.45, which turns out to be 3.35. We note that the class mark has one more decimal place than the original measurements. We should not now be led to believe that we have suddenly achieved greater precision. Whenever we designate a class interval whose last significant digit is even (0.2 in this case), the class mark will carry one more decimal place than the original measurements. On the right side of the table in Box 2.1 the data are grouped once again, using a class interval of 0.3. Because of the odd last significant digit, the class mark now shows as many decimal places as the original variates, the midpoint between 3.25 and 3.55 being 3.4.
Once the implied class limits and the class mark for the first class have been correctly found, the others can be written down by inspection without any special computation. Simply add the class interval repeatedly to each of the values. Thus, starting with the lower limit 3.25, by adding 0.2 we obtain 3.45, 3.65, 3.85, and so forth; similarly for the class marks, we obtain 3.35, 3.55, 3.75, and so forth. It should be obvious that the wider the class intervals, the more compact the data become but also the less precise. However, looking at
BOX 2.1
Preparation of frequency distribution and grouping into fewer classes with wider class intervals.
Twenty-five femur lengths of the aphid Pemphigus. Measurements are in mm × $10^{-1}$.
Last two classes of the grouping with class interval 0.2: implied limits 4.45-4.65 (class mark 4.55) and 4.65-4.85 (class mark 4.75).
Source: Data from R. R. Sokal.
Histogram of the original frequency distribution shown above and of the grouped distribution with 5 classes. Line below abscissa shows class marks for the grouped frequency distribution. Shaded bars represent original frequency distribution; hollow bars represent grouped distribution.
Y (femur length, in units of 0.1 mm)
For a detailed account of the process of grouping, see Section 2.5.
When the shape of a frequency distribution is of particular interest, we may wish to present the distribution in graphic form when discussing the results. This is generally done by means of frequency diagrams, of which there are two common types. For a distribution of meristic data we employ a bar diagram,
We then put the next digit of the first variate (a "leaf") at that level of the stem corresponding to its leading digit(s). The first observation in our sample is 4.9. We therefore place a 9 next to the 4. The next variate is 4.6. It is entered by finding the stem level for the leading digit 4 and recording a 6 next to the 9 that is already there. Similarly, for the third variate, 5.5, we record a 5 next to the leading digit 5. We continue in this way until all 15 variates have been entered (as "leaves") in sequence along the appropriate leading digits of the stem. The completed array is the equivalent of a frequency distribution and has the appearance of a histogram or bar diagram (see the illustration). Moreover, it permits the efficient ordering of the variates. Thus, from the completed array it becomes obvious that the appropriate ordering of the 15 variates is 2.3, 3.6, 3.7, 4.4, 4.6, 4.9, 5.5, 6.4, 7.1, 7.3, 9.1, 9.8, 12.7, 16.3, 18.0. The median can easily be read off the stem-and-leaf display. It is clearly 6.4. For very large samples, stem-and-leaf displays may become awkward. In such cases a conventional frequency distribution as in Box 2.1 would be preferable.
[Illustration: stem-and-leaf display at intermediate steps and as the completed array (Step 15).]
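A stem-and-leaf display is easy to build programmatically. The sketch below is a simplified version under two assumptions: each variate is given as a string with exactly one decimal digit, and the 15 variates are entered in the ordered sequence quoted above, because the original order of entry is known only for the first three. The function name `stem_and_leaf` is ours.

```python
from collections import defaultdict
from statistics import median

# The 15 variates, as ordered in the text (original entry order not fully known).
variates = ["2.3", "3.6", "3.7", "4.4", "4.6", "4.9", "5.5", "6.4",
            "7.1", "7.3", "9.1", "9.8", "12.7", "16.3", "18.0"]

def stem_and_leaf(values):
    """Map each stem (leading digits) to the list of its leaves (last digit)."""
    display = defaultdict(list)
    for value in values:
        stem, leaf = value.split(".")
        display[int(stem)].append(leaf)
    return dict(display)

display = stem_and_leaf(variates)
# Print every stem level, including empty ones, so the shape is preserved.
for stem in range(min(display), max(display) + 1):
    print(f"{stem:>2} | {''.join(display.get(stem, []))}")

print("Median:", median(float(v) for v in variates))
```

Printing the empty stem levels as well keeps the vertical scale uniform, which is what gives the display its histogram-like appearance.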
the frequency distribution of aphid femur lengths in Box 2.1, we notice that the initial rather chaotic structure is being simplified by grouping. When we group the frequency distribution into five classes with a class interval of 0.3 units, it becomes notably bimodal (that is, it possesses two peaks of frequencies).
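The rule of writing down implied class limits and class marks by repeated addition of the class interval can be sketched as follows. The starting limit 3.25 and the intervals 0.2 and 0.3 are those of the text; the function name `group_classes` and the number of classes generated are our own illustrative choices.

```python
from decimal import Decimal

def group_classes(lower_limit: str, interval: str, n_classes: int):
    """Return (lower limit, upper limit, class mark) for successive classes."""
    lower = Decimal(lower_limit)
    width = Decimal(interval)
    classes = []
    for k in range(n_classes):
        low = lower + k * width
        high = low + width
        classes.append((low, high, (low + high) / 2))  # class mark = midpoint
    return classes

for interval in ["0.2", "0.3"]:
    print(f"Class interval {interval}:")
    for low, high, mark in group_classes("3.25", interval, 3):
        print(f"  limits {low} to {high}, class mark {mark}")
```

With the even interval 0.2 the class marks carry two decimal places, whereas with the odd interval 0.3 they fall on values with a single significant decimal, as described above.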
In setting up frequency distributions, from 12 to 20 classes should be established. This rule need not be slavishly adhered to, but it should be employed with some of the common sense that comes from experience in handling statistical data. The number of classes depends largely on the size of the sample studied. Samples of less than 40 or 50 should rarely be given as many as 12 classes, since that would provide too few frequencies per class. On the other hand, samples of several thousand may profitably be grouped into more than 20 classes. If the aphid data of Box 2.1 need to be grouped, they should probably not be grouped into more than 6 classes.
If the original data provide us with fewer classes than we think we should have, then nothing can be done if the variable is meristic, since this is the nature of the data in question. However, with a continuous variable a scarcity of classes would indicate that we probably had not made our measurements with sufficient precision. If we had followed the rules on number of significant digits for measurements stated in Section 2.3, this could not have happened.
Whenever we come up with more than the desired number of classes, grouping should be undertaken. When the data are meristic, the implied limits of continuous variables are meaningless. Yet with many meristic variables, such as a bristle number varying from a low of 13 to a high of 81, it would probably be wise to group the variates into classes, each containing several counts. This can best be done by using an odd number as a class interval so that the class mark representing the data will be a whole rather than a fractional number. Thus, if we were to group the bristle numbers 13, 14, 15, and 16 into one class, the class mark would have to be 14.5, a meaningless value in terms of bristle number. It would therefore be better to use a class ranging over 3 bristles or 5 bristles, giving the integral value 14 or 15 as a class mark.
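The effect of an odd versus an even class interval on the class mark of meristic data can be verified with a short sketch; the function name `class_mark` is ours.

```python
def class_mark(first_count: int, width: int) -> float:
    """Midpoint of a class holding `width` consecutive counts."""
    last_count = first_count + width - 1
    return (first_count + last_count) / 2

# Bristle numbers starting at 13, grouped 4, 3, or 5 counts to a class.
for width in (4, 3, 5):
    mark = class_mark(13, width)
    print(f"{width} counts per class: class mark {mark}, whole number: {mark.is_integer()}")
```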
Grouping data into frequency distributions was necessary when computations were done by pencil and paper. Nowadays even thousands of variates can be processed efficiently by computer without prior grouping. However, frequency distributions are still extremely useful as a tool for data analysis. This is especially true in an age in which it is all too easy for a researcher to obtain a numerical result from a computer program without ever really examining the data for outliers or for other ways in which the sample may not conform to the assumptions of the statistical methods.
Rather than using tally marks to set up a frequency distribution, as was done in Box 2.1, we can employ Tukey's stem-and-leaf display. This technique is an improvement, since it not only results in a frequency distribution of the variates of a sample but also permits easy checking of the variates and ordering them into an array (neither of which is possible with tally marks). This technique will therefore be useful in computing the median of a sample (see Section 3.3) and in computing various tests that require ordered arrays of the sample variates.
[Figure 2.3 abscissa: birth weight (in oz.)]
2.6 The handling of data
Data must be handled skillfully and expeditiously so that statistics can be practiced successfully. Readers should therefore acquaint themselves with the var-
var-the variable (in our case, var-the number of plants per quadrat), and var-the ordinate
represents the frequencies The important point about such a diagram is that
the bars do not touch each other, which indicates that the variable is not
con-tinuous By contrast, continuous variables, such as the frequency distribution
of the femur lengths of aphid stem mothers, are graphed as a histogrum In a
histogram the width of each bar along the abscissa represents a class interval
of the frequency distribution and the bars touch each other to show that the
actual limits of the classes are contiguous The midpoint of the bar corresponds
to the class mark At the bottom of Box 2.1 are shown histograms of the
frc-quency distribution of the aphid data ungrouped and grouped The height of
each bar represents the frequency of the corresponding class
To illustrate that histograms are appropriate approximations to the continuous distributions found in nature, we may take a histogram and make the class intervals more narrow, producing more classes. The histogram would then clearly have a closer fit to a continuous distribution. We can continue this process until the class intervals become infinitesimal in width. At this point the histogram becomes the continuous distribution of the variable.
Occasionally the class intervals of a grouped continuous frequency distribution are unequal. For instance, in a frequency distribution of ages we might have more detail on the different ages of young individuals and less accurate identification of the ages of old individuals. In such cases, the class intervals for the older age groups would be wider, those for the younger age groups narrower. In representing such data, the bars of the histogram are drawn with different widths.
Figure 2.3 shows another graphical mode of representation of a frequency distribution of a continuous variable (in this case, birth weight in infants). As we shall see later, the shapes of distributions seen in such frequency polygons can reveal much about the biological situations affecting the given variable.
*For information or to order, contact Exeter Software. Website: http://www.exetersoftware.com. E-mail: sales@exetersoftware.com. These programs are compatible with Windows XP and Vista.
In this book we ignore "pencil-and-paper" short-cut methods for computations, found in earlier textbooks of statistics, since we assume that the student has access to a calculator or a computer. Some statistical methods are very easy to use because special tables exist that provide answers for standard statistical problems; thus, almost no computation is involved. An example is Finney's table, a 2-by-2 contingency table containing small frequencies that is used for the test of independence (Pearson and Hartley, 1958, Table 38). For small problems, Finney's table can be used in place of Fisher's method of finding exact probabilities, which is very tedious. Other statistical techniques are so easy to carry out that no mechanical aids are needed. Some are inherently simple, such as the sign test (Section 10.3). Other methods are only approximate but can often serve the purpose adequately; for example, we may sometimes substitute an easy-to-evaluate median (defined in Section 3.3) for the mean (described in Sections 3.1 and 3.2), which requires computation.
We can use many new types of equipment to perform statistical computations, many more than we could have when Introduction to Biostatistics was first published. The once-standard electrically driven mechanical desk calculator has completely disappeared. Many new electronic devices, from small pocket calculators to larger desk-top computers, have replaced it. Such devices are so diverse that we will not try to survey the field here. Even if we did, the rate of advance in this area would be so rapid that whatever we might say would soon become obsolete.
We cannot really draw the line between the more sophisticated electronic calculators on the one hand and digital computers on the other. There is no abrupt increase in capabilities between the more versatile programmable calculators and the simpler microcomputers, just as there is none as we progress from microcomputers to minicomputers and so on up to the large computers that one associates with the central computation center of a large university or research laboratory. All can perform computations automatically and be controlled by a set of detailed instructions prepared by the user. Most of these devices, including programmable small calculators, are adequate for all of the computations described in this book, even for large sets of data.
The material in this book consists of relatively standard statistical computations that are available in many statistical programs. BIOMstat* is a statistical software package that includes most of the statistical methods covered in this book.
The use of modern data processing procedures has one inherent danger. One can all too easily either feed in erroneous data or choose an inappropriate program. Users must select programs carefully to ensure that those programs perform the desired computations, give numerically reliable results, and are as free from error as possible. When using a program for the first time, one should test it using data from textbooks with which one is familiar. Some programs
FIGURE 2.3
Frequency polygon. Birth weights of 9465 male infants. Chinese third-class patients in Singapore, 1950 and 1951. Data from Millis and Seng (1954).
are notorious because the programmer has failed to guard against excessive rounding errors or other problems. Users of a program should carefully check the data being analyzed so that typing errors are not present. In addition, programs should help users identify and remove bad data values and should provide them with transformations so that they can make sure that their data satisfy the assumptions of various analyses.
Exercises
7815.01, 2.9149, and 20.1500. What are the implied limits before and after rounding? Round these same numbers to one decimal place.
ANS. For the first value: 107; 106.545-106.555; 106.5-107.5; 106.6
(a) Statistical and biological populations. (b) Variate and individual. (c) Accuracy and precision (repeatability). (d) Class interval and class mark. (e) Bar diagram and histogram. (f) Abscissa and ordinate.
them into a frequency distribution? Give class limits as well as class marks.
domestic pigeons into a frequency distribution and draw its histogram (data from Olson and Miller, 1958). Measurements are in millimeters.
How precisely should you measure the wing length of a species of mosquitoes in a study of geographic variation if the smallest specimen has a length of about
Transform the 40 measurements in Exercise 2.4 into common logarithms (use a table or calculator) and make a frequency distribution of these transformed variates. Comment on the resulting change in the pattern of the frequency distribution from that found before.
For the data of Tables 2.1 and 2.2 identify the individual observations, samples, populations, and variables.
Make a stem-and-leaf display of the data given in Exercise 2.4.
The distribution of ages of striped bass captured by hook and line from the East River and the Hudson River during 1980 was reported as follows (Young, 1981):
Show this distribution in the form of a bar diagram.
An early and fundamental stage in any science is the descriptive stage. Until phenomena can be accurately described, an analysis of their causes is premature. The question "What?" comes before "How?" Unless we know something about
pigs, as well as its fluctuations from day to day and within days, we shall be unable to ascertain the effect of a given dose of a drug upon this variable. In a sizable sample it would be tedious to obtain our knowledge of the material by contemplating each individual observation. We need some form of summary to permit us to deal with the data in manageable form, as well as to be able to share our findings with others in scientific talks and publications. A histogram or bar diagram of the frequency distribution would be one type of summary. However, for most purposes, a numerical summary is needed to describe concisely, yet accurately, the properties of the observed frequency
statistics. This chapter will introduce you to some of them and show how they are computed.
Two kinds of descriptive statistics will be discussed in this chapter: statistics of location and statistics of dispersion. The statistics of location (also known as measures of central tendency) describe the position of a sample along a given dimension representing a variable. For example, after we measure the length of the animals within a sample, we will then want to know whether the animals are closer, say, to 2 cm or to 20 cm. To express a representative value for the sample of observations (for the length of the animals) we use a statistic of location. But statistics of location will not describe the shape of a frequency distribution. The shape may be long or very narrow, may be humped or U-shaped, may contain two humps, or may be markedly asymmetrical. Quantitative measures of such aspects of frequency distributions are required. To this end we need to define and study the statistics of dispersion.
The arithmetic mean, described in Section 3.1, is undoubtedly the most important single statistic of location, but others (the geometric mean, the harmonic mean, the median, and the mode) are briefly mentioned in Sections 3.2, 3.3, and 3.4. A simple statistic of dispersion (the range) is briefly noted in Section 3.5, and the standard deviation, the most common statistic for describing dispersion, is explained in Section 3.6. Our first encounter with contrasts between sample statistics and population parameters occurs in Section 3.7, in connection with statistics of location and dispersion. In Section 3.8 there is a description of practical methods for computing the mean and standard deviation. The coefficient of variation (a statistic that permits us to compare the relative amount of dispersion in different samples) is explained in the last section (Section 3.9).
The techniques that will be at your disposal after you have mastered this chapter will not be very powerful in solving biological problems, but they will be indispensable tools for any further work in biostatistics. Other descriptive statistics, of both location and dispersion, will be taken up in later chapters.
An important note: We shall first encounter the use of logarithms in this chapter. To avoid confusion, common logarithms have been consistently abbreviated as log, and natural logarithms as ln. Thus, log x means $\log_{10} x$ and ln x means $\log_e x$.
3.1 The arithmetic mean
The most common statistic of location is familiar to everyone. It is the arithmetic mean, commonly called the mean or average. The mean is calculated by summing all the individual observations or items of a sample and dividing this sum by the number of items in the sample. For instance, as the result of a gas analysis in a respirometer an investigator obtains the following four readings of oxygen percentages and sums them:
is symbolized by $Y_i$, which stands for the ith observation in the sample. Four observations could be written symbolically as follows:
$$Y_1, Y_2, Y_3, Y_4$$
We shall define n, the sample size, as the number of items in a sample. In this particular instance, the sample size n is 4. Thus, in a large sample, we can symbolize the array from the first to the nth item as follows:
$$Y_1, Y_2, \ldots, Y_n$$
When we wish to sum items, we use the following notation:
$$\sum_{i=1}^{i=n} Y_i = Y_1 + Y_2 + \cdots + Y_n$$
The capital Greek sigma, $\sum$, simply means the sum of the items indicated. The i = 1 means that the items should be summed, starting with the first one and ending with the nth one, as indicated by the i = n above the $\sum$. The subscript and superscript are necessary to indicate how many items should be summed. The "i =" in the superscript is usually omitted as superfluous. For instance, if we had wished to sum only the first three items, we would have written $\sum_{i=1}^{3} Y_i$. On the other hand, had we wished to sum all of them except the first one, we would have written $\sum_{i=2}^{n} Y_i$. With some exceptions (which will appear in later chapters), it is desirable to omit subscripts and superscripts, which generally add to the apparent complexity of the formula and, when they are unnecessary, distract the student's attention from the important relations expressed by the formula. Below are seen increasing simplifications of the complete summation notation shown at the extreme left:
formula is written as follows:
$$\bar{Y} = \frac{\sum Y}{n}$$
This formula tells us, "Sum all the (n) items and divide the sum by n."
sheet of cardboard and then cut out the histogram and lay it flat against a blackboard, supporting it with a pencil beneath, chances are that it would be out of balance, toppling to either the left or the right. If you moved the supporting pencil point to a position about which the histogram would exactly balance, this point of balance would correspond to the arithmetic mean.
We often must compute averages of means or of other statistics that may differ in their reliabilities because they are based on different sample sizes. At other times we may wish the individual items to be averaged to have different
average. A general formula for calculating the weighted average of a set of
of $Y_i$ in such cases are unlikely to represent variates. They are more likely to be sample means $\bar{Y}_i$ or some other statistics of different reliabilities.
variates but are means. Thus, if the following three means are based on differing sample sizes, as shown,
their weighted average will be
Note that in this example, computation of the weighted mean is exactly equivalent to adding up all the original measurements and dividing the sum by the
having the highest mean, will influence the weighted average in proportion to
3.2 Other means
of the logarithms of variable Y. Since addition of logarithms is equivalent to multiplication of their antilogarithms, there is another way of representing this quantity: it is
$$GM_Y = \sqrt[n]{Y_1 Y_2 Y_3 \cdots Y_n} \qquad (3.4)$$
The geometric mean permits us to become familiar with another operator symbol: capital pi, $\prod$, which may be read as "product." Just as $\sum$ symbolizes
the items that follow it. The subscripts and superscripts have exactly the same meaning as in the summation case. Thus, Expression (3.4) for the geometric mean can be rewritten more compactly as follows:
$$GM_Y = \left(\prod_{i=1}^{n} Y_i\right)^{1/n} \qquad (3.4a)$$
The computation of the geometric mean by Expression (3.4a) is quite tedious. In practice, the geometric mean has to be computed by transforming the variates into logarithms.
The reciprocal of the arithmetic mean of reciprocals is called the harmonic mean. If we symbolize it by $H_Y$, the formula for the harmonic mean can be written in concise form (without subscripts and superscripts) as
$$\frac{1}{H_Y} = \frac{1}{n} \sum \frac{1}{Y}$$
You may wish to convince yourself that the geometric mean and the harmonic mean of the four oxygen percentages are 14.65% and 14.09%, respectively. Unless the individual items do not vary, the geometric mean is always less than the arithmetic mean, and the harmonic mean is always less than the geometric mean.
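The four kinds of mean discussed so far can be sketched in a few lines of Python. This is only an illustration, not the book's own procedure. The four oxygen readings are not reproduced in this copy, so the values below are an assumption, chosen because they agree with the mean, range, geometric mean, and harmonic mean quoted in the text. The weighted average uses the usual definition (sum of weight times value, divided by the sum of the weights), and its inputs are hypothetical.

```python
import math


def arithmetic_mean(values):
    """Sum all n items and divide the sum by n."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def geometric_mean(values):
    """Antilogarithm of the mean of the logarithms (values must be positive)."""
    return math.exp(sum(math.log(y) for y in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / y for y in values)


# Assumed readings, consistent with the summary values quoted in the text.
oxygen = [14.9, 10.8, 12.3, 23.3]
print(round(arithmetic_mean(oxygen), 3))  # 15.325
print(round(geometric_mean(oxygen), 2))   # 14.65
print(round(harmonic_mean(oxygen), 2))    # 14.09

# Hypothetical sample means (3.0 and 5.0) weighted by hypothetical sample sizes.
print(weighted_mean([3.0, 5.0], [1, 3]))  # 4.5
```

Computing the geometric mean through logarithms, as the text recommends, also avoids the overflow that a direct product of many variates could cause.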
Some beginners in statistics have difficulty in accepting the fact that measures of location or central tendency other than the arithmetic mean are permissible or even desirable. They feel that the arithmetic mean is the "logical" average and that any other mean would be a distortion. This whole problem relates to the proper scale of measurement for representing data; this scale is not always the linear scale familiar to everyone, but is sometimes by preference a logarithmic or reciprocal scale. If you have doubts about this question, we shall try to allay them in Chapter 10, where we discuss the reasons for transforming variables.
3.3 The median
The median M is a statistic of location occasionally useful in biological research. It is defined as that value of the variable (in an ordered array) that has an equal number of items on either side of it. Thus, the median divides a frequency distribution into two halves. In the following sample of five measurements,
14, 15, 16, 19, 23
M = 16, since the third observation has an equal number of observations on both sides of it. We can visualize the median easily if we think of an array from largest to smallest, for example, a row of men lined up by their heights. The median individual will then be that man having an equal number of men on his right and left sides. His height will be the median height of the sample considered. This quantity is easily evaluated from a sample array with an odd number of individuals. When the number in the sample is even, the median is conventionally calculated as the midpoint between the (n/2)th and the [(n/2) + 1]th variate. Thus, for the sample of four measurements
14, 15, 16, 19
the median would be the midpoint between the second and third items, or 15.5.
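The rule for odd and even sample sizes translates directly into code. The following is a minimal sketch; the function name is ours, and the two samples are the ones given above.

```python
def median(values):
    """Middle item of the ordered array, or the midpoint of the two middle items."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: one item has equally many items on either side of it.
        return ordered[middle]
    # Even n: midpoint between the (n/2)th and the [(n/2) + 1]th variate.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([14, 15, 16, 19, 23]))  # 16
print(median([14, 15, 16, 19]))      # 15.5
```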
Whenever any one value of a variate occurs more than once, problems may develop in locating the median. Computation of the median item becomes more involved because all the members of a given class in which the median item is located will have the same class mark. The median then is the (n/2)th variate in the frequency distribution. It is usually computed as that point between the class limits of the median class where the median individual would be located (assuming the individuals in the class were evenly distributed).
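One common way to carry out this interpolation is sketched below. It assumes classes of equal width described by their lower class limits, and it places the (n/2)th individual within the median class by linear interpolation. The function name and the small frequency distribution are hypothetical.

```python
def grouped_median(lower_limits, class_width, frequencies):
    """Median of a frequency distribution by interpolation within the median class."""
    n = sum(frequencies)
    target = n / 2  # position of the median individual
    cumulative = 0  # number of individuals below the current class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= target:
            # Individuals are assumed to be evenly spread through the class.
            return lower + (target - cumulative) / f * class_width
        cumulative += f
    raise ValueError("frequencies must contain at least one observation")


# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 2, 4, 2.
print(grouped_median([0, 10, 20], 10, [2, 4, 2]))  # 15.0
```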
The median is just one of a family of statistics dividing a frequency distribution into equal areas. It divides the distribution into two halves. The three quartiles cut the distribution at the 25, 50, and 75% points, that is, at points dividing the distribution into first, second, third, and fourth quarters by area (and frequencies). The second quartile is, of course, the median. (There are also quintiles, deciles, and percentiles, dividing the distribution into 5, 10, and 100 equal portions, respectively.)
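As an illustration, Python's standard library offers `statistics.quantiles`, which returns the cut points for any number of equal portions. The data below are hypothetical, and the interpolation convention is that of the library's default method, which need not coincide with the convention of a particular textbook.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # hypothetical measurements

# n=4 gives the three quartiles; the second one is the median.
quartiles = statistics.quantiles(data, n=4)
print(quartiles)  # [2.0, 4.0, 6.0]

# n=10 gives the nine deciles, n=100 the ninety-nine percentiles.
deciles = statistics.quantiles(data, n=10)
print(len(deciles))  # 9
```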
Medians are most often used for distributions that do not conform to the standard probability models, so that nonparametric methods (see Chapter 10) must be used. Sometimes the median is a more representative measure of location than the arithmetic mean. Such instances almost always involve asymmetric distributions. An often quoted example from economics would be a suitable measure of location for the "typical" salary of an employee of a corporation. The very high salaries of the few senior executives would shift the arithmetic mean, the center of gravity, toward a completely unrepresentative value. The median, on the other hand, would be little affected by a few high salaries; it would give the particular point on the salary scale above which lie 50% of the salaries in the corporation, the other half being lower than this figure.
In biology an example of the preferred application of a median over the arithmetic mean may be in populations showing skewed distribution, such as weights. Thus a median weight of American males 50 years old may be a more meaningful statistic than the average weight. The median is also of importance in cases where it may be difficult or impossible to obtain and measure all the items of a sample. For example, suppose an animal behaviorist is studying the time it takes for a sample of animals to perform a certain behavioral step. The variable he is measuring is the time from the beginning of the experiment until each individual has performed. What he wants to obtain is an average time of performance. Such an average time, however, can be calculated only after records have been obtained on all the individuals. It may take a long time for the slowest animals to complete their performance, longer than the observer wishes to spend. (Some of them may never respond appropriately, making the computation of a mean impossible.) Therefore, a convenient statistic of location to describe these animals may be the median time of performance. Thus, so long as the observer knows what the total sample size is, he need not have measurements for the right-hand tail of his distribution. Similar examples would be the responses to a drug or poison in a group of individuals (the median lethal or effective dose, LD$_{50}$ or ED$_{50}$) or the median time for a mutation to appear in a number of lines of a species.
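The behaviorist's situation can be sketched as follows. The function is a hypothetical helper, not a standard routine. It assumes that every animal without a recorded time would have taken longer than the slowest animal that was recorded, so that only the right-hand tail is missing.

```python
def median_with_unfinished(observed_times, total_n):
    """Median time when only the fastest animals have been timed.

    observed_times: times of the animals that completed the step.
    total_n: total sample size, including animals not yet finished.
    """
    ordered = sorted(observed_times)
    middle = total_n // 2
    if len(ordered) <= middle:
        raise ValueError("too few completed observations to locate the median")
    if total_n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical experiment: 8 animals, of which only 5 have performed so far.
print(median_with_unfinished([12, 15, 19, 22, 30], total_n=8))  # 26.0
```

A mean could not be computed from these data, since three of the eight times are unknown.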
3.4 The mode
The mode refers to the value represented by the greatest number of individuals. When seen on a frequency distribution, the mode is the value of the variable at which the curve peaks. In grouped frequency distributions the mode as a point has little meaning. It usually suffices to identify the modal class. In biology, the mode does not have many applications.
Distributions having two peaks (equal or unequal in height) are called bimodal; those with more than two peaks are multimodal. In those rare distributions that are U-shaped, we refer to the low point at the middle of the distribution as an antimode.
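A short sketch may make these terms concrete. Both helpers are hypothetical. The second treats a class as a peak only when its frequency strictly exceeds that of both neighbors, so flat-topped peaks are not counted; that is a simplification rather than a general definition.

```python
from collections import Counter


def modal_values(values):
    """All values represented by the greatest number of individuals."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def count_peaks(frequencies):
    """Number of classes whose frequency exceeds that of both neighbors."""
    padded = [0] + list(frequencies) + [0]
    return sum(
        1
        for left, f, right in zip(padded, padded[1:], padded[2:])
        if f > left and f > right
    )


print(modal_values([3, 1, 2, 3, 2, 3]))   # [3]
print(count_peaks([1, 4, 9, 4, 1]))       # 1 (unimodal)
print(count_peaks([2, 8, 3, 1, 6, 2]))    # 2 (bimodal)
```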
In evaluating the relative merits of the arithmetic mean, the median, and the mode, a number of considerations have to be kept in mind. The mean is generally preferred in statistics, since it has a smaller standard error than other statistics of location (see Section 6.2), it is easier to work with mathematically, and it has an additional desirable property (explained in Section 6.1): it will tend to be normally distributed even if the original data are not. The mean is markedly affected by outlying observations; the median and mode are not. The mean is generally more sensitive to changes in the shape of a frequency distribution, and if it is desired to have a statistic reflecting such changes, the mean may be preferred.
In symmetrical, unimodal distributions the mean, the median, and the mode are all identical. A prime example of this is the well-known normal distribution of Chapter 5. In a typical asymmetrical distribution, such as the one shown in Figure 3.1, the relative positions of the mode, median, and mean are generally these: the mean is closest to the drawn-out tail of the distribution, the mode is farthest, and the median is between these. An easy way to remember this sequence is to recall that they occur in alphabetical order from the longer tail of the distribution.
FIGURE 3.1
An asymmetrical frequency distribution (skewed to the right) showing location of the mean, median, and mode. Percent butterfat in 120 samples of milk (from a Canadian cattle breeders' record book).
3.5 The range
We now turn to measures of dispersion. Figure 3.2 demonstrates that radically different-looking distributions may possess the identical arithmetic mean. It is
One simple measure of dispersion is the range, which is defined as the difference between the largest and the smallest items in a sample. Thus, the range of the four oxygen percentages listed earlier (Section 3.1) is
Range = 23.3 - 10.8 = 12.5%
and the range of the aphid femur lengths (Box 2.1) is
Range = 4.7 - 3.3 = 1.4 units of 0.1 mm
Since the range is a measure of the span of the variates along the scale of the variable, it is in the same units as the original measurements. The range is clearly affected by even a single outlying value and for this reason is only a
3.6 The standard deviation
We desire that a measure of dispersion take all items of a distribution into consideration, weighting each item by its distance from the center of the distribution. We shall now try to construct such a statistic. In Table 3.1 we show a
shows the variates in the order in which they were reported. The computation of the mean is shown below the table. The mean neutrophil count turns out to be 7.713.
The distance of each variate from the mean is computed as the following deviation:
$$y = Y - \bar{Y}$$
Each individual deviation, or deviate, is by convention computed as the
Deviates are symbolized by lowercase letters corresponding to the capital letters
manner.
We now wish to calculate an average deviation that will sum all the deviates and divide them by the number of deviates in the sample. But note that when we sum our deviates, negative and positive deviates cancel out, as is shown by the sum at the bottom of column (2); this sum appears to be unequal to zero only because of a rounding error. Deviations from the arithmetic mean always sum to zero because the mean is the center of gravity. Consequently, an average based on the sum of deviations would also always equal zero. You are urged to study Appendix A1.1, which demonstrates that the sum of deviations around the mean of a sample is equal to zero.
TABLE 3.1
The standard deviation. Long method, not recommended for hand or calculator computations but shown here to illustrate the meaning of the standard deviation. The data are blood neutrophil counts (divided by 1000) per microliter, in 15 patients with nonhematological tumors.
Squaring the deviates gives us column (3) of Table 3.1 and enables us to reach a result other than zero. (Squaring the deviates also holds other mathematical advantages, which we shall take up in Sections 7.5 and 11.3.) The sum of the squared deviates (in this case, 308.7770) is a very important quantity in statistics. It is called the sum of squares and is identified symbolically as $\sum y^2$. Another common symbol for the sum of squares is SS.
resulting quantity is known as the variance, or the mean square:
$$\text{Variance} = \frac{\sum y^2}{n} = \frac{308.7770}{15} = 20.5851$$
The variance is a measure of fundamental importance in statistics, and we shall employ it throughout this book. At the moment, we need only remember that because of the squaring of the deviations, the variance is expressed in squared units. To undo the effect of the squaring, we now take the positive square root of the variance and obtain the standard deviation:
$$\text{Standard deviation} = +\sqrt{\frac{\sum y^2}{n}} = +\sqrt{20.5851} = 4.537$$
Thus, standard deviation is again expressed in the original units of measurement, since it is a square root of the squared units of the variance.
An important note: The technique just learned and illustrated in Table 3.1 is not the simplest for direct computation of a variance and standard deviation. However, it is often used in computer programs, where accuracy of computations is an important consideration. Alternative and simpler computational methods are given in Section 3.8.
The observant reader may have noticed that we have avoided assigning any symbol to either the variance or the standard deviation. We shall explain why in the next section.
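The long method of this section can be sketched as a short program. The neutrophil counts of Table 3.1 are not reproduced in this copy, so the data below are hypothetical, and the function name is ours. As in this section, the sum of squares is divided by n.

```python
import math


def long_method(values):
    """Mean, deviates, sum of squares, variance (divisor n), standard deviation."""
    n = len(values)
    mean = sum(values) / n
    deviates = [y - mean for y in values]          # y = Y - mean
    sum_of_squares = sum(d * d for d in deviates)  # sum of squared deviates
    variance = sum_of_squares / n
    return {
        "mean": mean,
        "sum_of_deviates": sum(deviates),
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "standard_deviation": math.sqrt(variance),
    }


# Hypothetical sample of eight variates.
result = long_method([2, 4, 4, 4, 5, 5, 7, 9])
print(result["sum_of_deviates"])     # 0.0
print(result["sum_of_squares"])      # 32.0
print(result["standard_deviation"])  # 2.0
```

With less convenient data the sum of the deviates will usually differ from zero by a small rounding error, as noted for column (2) of Table 3.1.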
3.7 Sample statistics and parameters
Up to now we have calculated statistics from samples without giving too much thought to what these statistics represent. When correctly calculated, a mean and standard deviation will always be absolutely true measures of location and dispersion for the samples on which they are based. Thus the true mean of the four oxygen percentage readings in Section 3.1 is 15.325%. The standard deviation of the 15 neutrophil counts is 4.537. However, only rarely in biology (or
contrast to using these as estimates of the population parameters. There are also the rare cases in which the investigator possesses data on the entire population; in such cases division by n is perfectly justified, because then the investigator is not estimating a parameter but is in fact evaluating it. Thus the variance of the wing lengths of all adult whooping cranes would be a parametric value; similarly, if the heights of all winners of the Nobel Prize in physics had been measured, their variance would be a parameter since it would be based on the entire population.
3.8 Practical methods for computing mean and standard deviation
Three steps are necessary for computing the standard deviation: (1) find $\sum y^2$, the sum of squares; (2) divide by n - 1 to give the variance; and (3) take the square root of the variance to obtain the standard deviation. The procedure used to compute the sum of squares in Section 3.6 can be expressed by the following formula:
$$\sum y^2 = \sum (Y - \bar{Y})^2 \qquad (3.7)$$
This formulation explains most clearly the meaning of the sum of squares, although it may be inconvenient for computation by hand or calculator, since one must first compute the mean before one can square and sum the deviations. A quicker computational formula for this quantity is
$$\sum y^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} \qquad (3.8)$$
Let us see exactly what this formula represents. The first term on the right side of the equation, $\sum Y^2$, is the sum of all individual Y's, each squared, as follows:
$$\sum Y^2 = Y_1^2 + Y_2^2 + \cdots + Y_n^2$$
Y's; that is, all the Y's are first summed and this sum is then squared. In general, this quantity is different from $\sum Y^2$, which first squares the Y's and then sums them. These two terms are identical only if all the Y's are equal. If you are not certain about this, you can convince yourself of this fact by calculating these two quantities for a few numbers.
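Such a calculation for a few numbers might look as follows. The sketch reads Expression (3.7) as the sum of squared deviations from the mean and Expression (3.8) as the sum of the squared Y's minus the squared sum of the Y's divided by n; the function names and the data are hypothetical.

```python
def sum_of_squares_deviations(values):
    """Expression (3.7): square and sum the deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def sum_of_squares_shortcut(values):
    """Expression (3.8): sum of Y squared minus (sum of Y) squared over n."""
    n = len(values)
    return sum(y * y for y in values) - sum(values) ** 2 / n


values = [1, 2, 3, 4]
print(sum(y * y for y in values))         # 30, squares first, then sums
print(sum(values) ** 2)                   # 100, sums first, then squares
print(sum_of_squares_deviations(values))  # 5.0
print(sum_of_squares_shortcut(values))    # 5.0

# When all the Y's are equal, the two terms of (3.8) coincide.
equal = [3, 3, 3]
print(sum(y * y for y in equal), sum(equal) ** 2 / len(equal))  # 27 27.0
```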
The disadvantage of Expression (3.8) is that the quantities $\sum Y^2$ and $(\sum Y)^2/n$ may both be quite large, so that accuracy may be lost in computing their difference unless one takes the precaution of carrying sufficient significant figures.
Why is Expression (3.8) identical with Expression (3.7)? The proof of this identity is very simple and is given in Appendix A1.2. You are urged to work
only as descriptive summaries of the samples we have studied. Almost always we are interested in the populations from which the samples have been taken. What we want to know is not the mean of the particular four oxygen percentages, but rather the true oxygen percentage of the universe of readings from which the four readings have been sampled. Similarly, we would like to know the true mean neutrophil count of the population of patients with nonhematological tumors, not merely the mean of the 15 individuals measured. When studying dispersion we generally wish to learn the true standard deviations of the populations and not those of the samples. These population statistics, however, are unknown and (generally speaking) are unknowable. Who would be able to collect all the patients with this particular disease and measure their neutrophil counts? Thus we need to use sample statistics as estimators of population statistics or parameters.
It is conventional in statistics to use Greek letters for population parameters
the parametric mean of the population. Similarly, a sample variance, symbolized by $s^2$, estimates a parametric variance, symbolized by $\sigma^2$. Such estimators should
from a population with a known parameter should give sample statistics that, when averaged, will give the parametric value. An estimator that does not do so is called biased.
However, the sample variance as computed in Section 3.6 is not unbiased. On the average, it will underestimate the magnitude of the population variance $\sigma^2$. To overcome this bias, mathematical statisticians have shown that when sums of squares are divided by n - 1 rather than by n the resulting sample variances will be unbiased estimators of the population variance. For this reason, it is
formula for the standard deviation is therefore customarily given as follows:
$$s = +\sqrt{\frac{\sum y^2}{n - 1}} \qquad (3.6)$$
For the neutrophil counts, $s = \sqrt{308.7770/14} = 4.696$. We note that this value is slightly larger than our previous estimate of 4.537.
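The bias can be checked exactly on a very small scale. The sketch below is our own illustration, not part of the original text: it takes a hypothetical population of three values, enumerates every possible sample of size n drawn with replacement, and averages the sample variances obtained with a given divisor. Exact fractions are used so that no rounding obscures the comparison.

```python
from fractions import Fraction
from itertools import product


def average_sample_variance(population, n, divisor):
    """Average, over all samples of size n, of (sum of squares) / divisor."""
    samples = list(product(population, repeat=n))
    total = Fraction(0)
    for sample in samples:
        mean = Fraction(sum(sample), n)
        total += sum((y - mean) ** 2 for y in sample) / divisor
    return total / len(samples)


population = [1, 2, 3]  # hypothetical population
mu = Fraction(sum(population), len(population))
sigma2 = sum((y - mu) ** 2 for y in population) / len(population)
print("parametric variance:", sigma2)

for n in (2, 3, 4):
    by_n = average_sample_variance(population, n, n)
    by_n_minus_1 = average_sample_variance(population, n, n - 1)
    print(n, by_n, by_n_minus_1)
```

In this sketch, division by n - 1 reproduces the parametric variance on the average for every sample size, whereas division by n falls short of it by an amount that shrinks as n grows.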
Of course, the greater the sample size, the less difference there will be between
it refers to a variance obtained by division of the sum of squares by the degrees
Division of the sum of squares by n is appropriate only when the interest of the investigator is limited to the sample at hand and to its variance and
The resulting class marks are values such as 0, 8, 16, 24, 32, and so on. They are then divided by 8, which changes them to 0, 1, 2, 3, 4, and so on, which is the desired format. The details of the computation can be learned from the box.
When checking the results of calculations, it is frequently useful to have an approximate method for estimating statistics so that gross errors in computation can be detected. A simple method for estimating the mean is to average the largest and smallest observation to obtain the so-called midrange. For the neutrophil counts of Table 3.1, this value is (2.3 + 18.0)/2 = 10.15 (not a very good estimate). Standard deviations can be estimated from ranges by appropriate division of the range as follows:
BOX 3.1
Calculation of $\bar{Y}$ and s from unordered data.
Neutrophil counts, unordered as shown in Table 3.1.
through it to build up your confidence in handling statistical symbols and formulas.
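The caution about carrying sufficient significant figures in Expression (3.8) applies to computers as well as to calculators. The following sketch uses hypothetical data and ordinary double-precision arithmetic; it reads Expression (3.7) as the sum of squared deviations and Expression (3.8) as the shortcut formula. Adding a large constant to every variate leaves the true sum of squares unchanged, yet the shortcut formula then loses it entirely.

```python
def ss_deviations(values):
    """Sum of squares from deviations about the mean, as in Expression (3.7)."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def ss_shortcut(values):
    """Sum of squares from the shortcut formula, as in Expression (3.8)."""
    n = len(values)
    return sum(y * y for y in values) - sum(values) ** 2 / n


small = [4.0, 7.0, 13.0, 16.0]
large = [y + 1e9 for y in small]  # same spread, much larger magnitude

print(ss_deviations(small), ss_shortcut(small))  # 90.0 90.0
print(ss_deviations(large))                      # 90.0
print(ss_shortcut(large))                        # no longer 90.0
```

This is one reason why the long method, although slower by hand, is often preferred in computer programs.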
It is sometimes possible to simplify computations by recoding variates into simpler form. We shall use the term additive coding for the addition or subtraction of a constant (since subtraction is only addition of a negative number). We shall similarly use multiplicative coding to refer to the multiplication or division by a constant (since division is multiplication by the reciprocal of the divisor). We shall use the term combination coding to mean the application of both additive and multiplicative coding to the same set of data. In Appendix A1.3 we examine the consequences of the three types of coding in the computation of means, variances, and standard deviations.
For the case of means, the formula for combination coding and decoding is the most generally applicable one. If the coded variable is $Y_c = D(Y + C)$, then
$$\bar{Y} = \frac{\bar{Y}_c}{D} - C$$
where C is an additive code and D is a multiplicative code.
On considering the effects of coding variates on the values of variances and
squares, variances, or standard deviations. The mathematical proof is given in Appendix A1.3, but we can see this intuitively, because an additive code has no effect on the distance of an item from its mean. The distance from an item of 15 to its mean of 10 would be 5. If we were to code the variates by subtracting a constant of 10, the item would now be 5 and the mean zero. The difference between them would still be 5. Thus, if only additive coding is employed, the only statistic in need of decoding is the mean. But multiplicative coding does have an effect on sums of squares, variances, and standard deviations. The standard deviations have to be divided by the multiplicative code, just as had to be done for the mean. However, the sums of squares or variances have to be divided by the multiplicative codes squared, because they are squared terms, and the multiplicative factor becomes squared during the operations. In combination coding the additive code can be ignored.
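The decoding rules can be checked numerically. In the sketch below the function names and the data are hypothetical; the coded variable is taken to be D(Y + C), with C the additive and D the multiplicative code, and the absolute value of D is used when decoding the standard deviation so that a negative multiplicative code would also be handled.

```python
import math
import statistics


def code(values, c, d):
    """Combination coding: add C, then multiply by D."""
    return [d * (y + c) for y in values]


def decode_mean(coded_mean, c, d):
    """Divide by the multiplicative code, then remove the additive code."""
    return coded_mean / d - c


def decode_sd(coded_sd, d):
    """Standard deviations need only the multiplicative code."""
    return coded_sd / abs(d)


def decode_variance(coded_variance, d):
    """Variances are divided by the multiplicative code squared."""
    return coded_variance / d ** 2


# Hypothetical class marks: subtract 59.5, then divide by 8.
values = [59.5, 67.5, 67.5, 75.5, 83.5]
c, d = -59.5, 1 / 8
coded = code(values, c, d)
print(coded)  # [0.0, 1.0, 1.0, 2.0, 3.0]

print(decode_mean(statistics.mean(coded), c, d), statistics.mean(values))
print(decode_sd(statistics.stdev(coded), d), statistics.stdev(values))
```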
When the data are unordered, the computation of the mean and standard deviation proceeds as in Box 3.1, which is based on the unordered neutrophil-count data shown in Table 3.1. We chose not to apply coding to these data, since it would not have simplified the computations appreciably.
When the data are arrayed in a frequency distribution, the computations can be made much simpler. When computing the statistics, you can often avoid the need for manual entry of large numbers of individual variates if you first set up a frequency distribution. Sometimes the data will come to you already in the form of a frequency distribution, having been grouped by the researcher. The computation of $\bar{Y}$ and s from a frequency distribution is illustrated in Box 3.2. The data are the birth weights of male Chinese children, first encountered in Figure 2.3. The calculation is simplified by coding to remove the awkward class marks. This is done by subtracting 59.5, the lowest class mark of the array.
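The computation from a frequency distribution can be sketched as follows. Box 3.2 itself is not reproduced in this copy, so the class marks and frequencies below are hypothetical, and the function is only one plausible way of organizing the work: each class mark is weighted by its frequency, the sum of squares is obtained by the shortcut formula, and the variance uses the divisor n - 1.

```python
import math
import statistics


def mean_and_sd_from_frequencies(class_marks, frequencies):
    """Mean and standard deviation (divisor n - 1) of grouped data."""
    n = sum(frequencies)
    sum_fy = sum(f * y for f, y in zip(frequencies, class_marks))
    sum_fy2 = sum(f * y * y for f, y in zip(frequencies, class_marks))
    mean = sum_fy / n
    sum_of_squares = sum_fy2 - sum_fy ** 2 / n
    return mean, math.sqrt(sum_of_squares / (n - 1))


# Hypothetical distribution: class marks 1, 2, 3 with frequencies 1, 2, 1.
mean, sd = mean_and_sd_from_frequencies([1, 2, 3], [1, 2, 1])
print(mean, sd)

# The same statistics from the individual variates, for comparison.
print(statistics.mean([1, 2, 2, 3]), statistics.stdev([1, 2, 2, 3]))
```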
Divide the range by
3, 4, 5, 6, 6½
BOX 3.2
Calculation of $\bar{Y}$, s, and V from a frequency distribution.
Birth weights of male Chinese in ounces.
The range of the neutrophil counts is 15.7. When this value is divided by 4, we get an estimate for the standard deviation of 3.925, which compares with the calculated value of 4.696 in Box 3.1. However, when we estimate mean and standard deviation of the aphid femur lengths of Box 2.1 in this manner, we obtain 4.0 and 0.35, respectively. These are good estimates of the actual values of 4.004 and 0.3656, the sample mean and standard deviation.
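These rough checks are easily reproduced. The sketch below uses only the smallest and largest observations quoted in the text, and the divisor 4 that the text applies to both data sets; the function names are ours.

```python
import math


def midrange(smallest, largest):
    """Average of the largest and smallest observation."""
    return (smallest + largest) / 2


def sd_from_range(smallest, largest, divisor):
    """Rough standard deviation: the range divided by a tabled divisor."""
    return (largest - smallest) / divisor


# Neutrophil counts: smallest 2.3, largest 18.0.
print(midrange(2.3, 18.0), sd_from_range(2.3, 18.0, 4))

# Aphid femur lengths: smallest 3.3, largest 4.7.
print(midrange(3.3, 4.7), sd_from_range(3.3, 4.7, 4))
```

Such estimates are meant only to expose gross errors of computation; they are no substitute for the statistics themselves.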
3.9 The coefficient of variation
Having obtained the standard deviation as a measure of the amount of variation in the data, you may be led to ask, "Now what?" At this stage in our comprehension of statistical theory, nothing really useful comes of the computations we have carried out. However, the skills just learned are basic to all later statistical work. So far, the only use that we might have for the standard deviation is as an estimate of the amount of variation in a population. Thus we may wish to compare the magnitudes of the standard deviations of similar populations and see whether population A is more or less variable than population B.
When populations differ appreciably in their means, the direct comparison of their variances or standard deviations is less useful, since larger organisms usually vary more than smaller ones. For instance, the standard deviation of the tail lengths of elephants is obviously much greater than the entire tail length of a mouse. To compare the relative amounts of variation in populations having different means, the coefficient of variation, symbolized by V (or occasionally CV), has been developed. This is simply the standard deviation expressed as a percentage of the mean. Its formula is
$$V = \frac{s \times 100}{\bar{Y}}$$
For example, the coefficient of variation of the birth weights in Box 3.2 is 12.37%, as shown at the bottom of that box. The coefficient of variation is independent of the unit of measurement and is expressed as a percentage.
Coefficients of variation are used when one wishes to compare the variation of two populations without considering the magnitude of their means. (It is probably of little interest to discover whether the birth weights of the Chinese children are more or less variable than the femur lengths of the aphid stem mothers. However, we can calculate V for the latter as (0.3656 × 100)/4.004 = 9.13%, which would suggest that the birth weights are more variable.) Often, we shall wish to test whether a given biological sample is more variable for one character than for another. Thus, for a sample of rats, is body weight more variable than blood sugar content? A second, frequent type of comparison, especially in systematics, is among different populations for the same character. Thus, we may have measured wing length in samples of birds from several localities. We wish to know whether any one of these populations is more variable than the others. An answer to this question can be obtained by examining the coefficients of variation of wing length in these samples.
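As a small check on the arithmetic, the coefficient of variation of the aphid femur lengths can be recomputed from the mean and standard deviation quoted in the text. The function name below is our own; the sketch assumes only that V is the standard deviation expressed as a percentage of the mean.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return 100.0 * sd / mean

# Aphid femur lengths: s = 0.3656, mean = 4.004 (values quoted in the text).
v_aphid = coefficient_of_variation(0.3656, 4.004)
print(f"V = {v_aphid:.2f}%")  # prints "V = 9.13%"
```

Because the units of s and of the mean cancel, the same function can be applied to characters measured on different scales.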
EXERCISES
3.1 Find Ȳ, s, V, and the median for the following data (mg of glycine per mg of creatinine in the urine of 37 chimpanzees; from Gartler, Firschein, and Dobzhansky, 1956). ANS. Ȳ = 0.115, s = 0.10404.
3.2 Find the mean, standard deviation, and coefficient of variation for the pigeon data given in Exercise 2.4. Group the data into ten classes, recompute Ȳ and s, and compare them with the results obtained from ungrouped data. Compute the median for the grouped data.
3.3 The following are percentages of butterfat from 120 registered three-year-old Ayrshire cows selected at random from a Canadian stock record book.
(a) Calculate Ȳ, s, and V directly from the data.
(b) Group the data in a frequency distribution and again calculate Ȳ, s, and V. Compare the results with those of (a). How much precision has been lost by grouping? Also calculate the median.
.055 .100 .050 .019
.135 .120 .080 .100
.052 .110 .110 .100
.077 .100 .110 .116
.026 .350 .120
.440 .100 .133
.300 .300 .100
3.4 What effect would adding a constant 5.2 to all observations have upon the numerical values of the following statistics: Ȳ, s, V, average deviation, median ... by 8.0 first and then added 5.2?
3.5 Estimate μ and σ using the midrange and the range (see Section 3.8) for the data in Exercises 3.1, 3.2, and 3.3. How well do these estimates agree with the estimates given by Ȳ and s? ANS. Estimates of μ and σ for Exercise 3.2 are 0.224 and 0.1014.
3.6 Show that the equation for the variance can also be written as
$$s^2 = \frac{\sum Y^2 - n\bar{Y}^2}{n - 1}$$
3.7 Using the striped bass age distribution given in Exercise 2.9, compute the following statistics: Ȳ, s², s, V, median, and mode. ANS. Ȳ = 3.043, s² = 1.2661, s = 1.125, V = 36.98%, median = 2.948, mode = 3.
3.8 Use a calculator and compare the results of using Equations 3.7 and 3.8 to compute s² for the following artificial data sets:
(a) 1, 2, 3, 4, 5
(b) 9001, 9002, 9003, 9004, 9005
(c) 90001, 90002, 90003, 90004, 90005
(d) 900001, 900002, 900003, 900004, 900005
Compare your results with those of one or more computer programs. What is the correct answer? Explain your results.
Introduction to Probability
Distributions: The Binomial and
Poisson Distributions
In Section 2.5 we first encountered frequency distributions. For example, Table 2.2 shows a distribution for a meristic, or discrete (discontinuous), variable, the number of sedge plants per quadrat. Examples of distributions for continuous variables are the femur lengths of aphids in Box 2.1 and the human birth weights in Box 3.2. Each of these distributions informs us about the absolute frequency of any given class and permits us to compute the relative frequencies of any class of variable. Thus, most of the quadrats contained either no sedges or one or two plants. In the 139.5-oz class of birth weights, we find only 201 out of the total of 9465 babies recorded; that is, approximately only 2.1% of the infants are in that birth weight class.
We realize, of course, that these frequency distributions are only samples from given populations. The birth weights, for example, represent a population of male Chinese infants from a given geographical area. But if we knew our sample to be representative of that population, we could make all sorts of predictions based upon the sample frequency distribution. For instance, we could say that the probability that the weight at birth of any one baby in this population will be in the 139.5-oz birth class is quite low. If all of the 9465 weights were mixed up in a hat and a single one pulled out, the probability that we would pull out one of the 201 in the 139.5-oz class would be only about 2.1%. It would be much more probable that we would sample an infant of 107.5 or 115.5 oz, since the infants in these classes are represented by frequencies 2240 and 2007, respectively. Finally, if we were to sample from an unknown population of babies and find that the very first individual sampled had a birth weight of 170 oz, we would probably reject any hypothesis that the unknown population was the same as that sampled in Box 3.2. We would arrive at this conclusion because hardly any of the infants recorded in Box 3.2 had a birth weight that high. Though it is possible that we could have sampled from the population of male Chinese babies and obtained a birth weight of 170 oz, the probability that the first individual sampled would have such a value is very low indeed. It seems much more reasonable to suppose that the unknown population from which we are sampling has a larger mean than the one sampled in Box 3.2.
We have used this empirical frequency distribution to make certain predictions (with what frequency a given event will occur) or to make judgments and decisions (is it likely that an infant of a given birth weight belongs to this population?). In many cases in biology, however, we shall make such predictions not from empirical distributions, but on the basis of theoretical considerations that in our judgment are pertinent. We may feel that the data should be distributed in a certain way because of basic assumptions about the nature of the phenomenon under study. If the observed data do not conform sufficiently to the values expected on the basis of these assumptions, we shall have serious doubts about our assumptions. This is a common use of frequency distributions in biology. The assumptions being tested generally lead to a theoretical frequency distribution known also as a probability distribution. This may be a simple two-valued distribution, such as the 3:1 ratio in a Mendelian cross; or it may be a more complicated function. If the observed data do not fit the expectations on the basis of theory, we are often led to the discovery of some biological mechanism causing this deviation from expectation. The phenomena of linkage in genetics, of preferential mating between different phenotypes in animal behavior, of congregation of animals at certain favored places or, conversely, their territorial dispersion are cases in point. We shall thus make use of probability theory to test our assumptions about the laws of occurrence of certain biological phenomena. We should point out to the reader, however, that probability theory underlies the entire structure of statistics, since, owing to the nonmathematical orientation of this book, this may not be entirely obvious.
In this chapter we shall first discuss probability, in Section 4.1, but only to the extent necessary for comprehension of the sections that follow at the intended level of mathematical sophistication. Next, in Section 4.2, we shall take up the binomial frequency distribution, which is not only important in certain types of studies, such as genetics, but also fundamental to an understanding of the various kinds of probability distributions to be discussed in this book.
The Poisson distribution, which follows in Section 4.3, is of wide applicability in biology, especially for tests of randomness of occurrence of certain events. Both the binomial and Poisson distributions are discrete probability distributions. The most common continuous probability distribution is the normal frequency distribution, discussed in Chapter 5.
4.1 Probability, random sampling, and hypothesis testing
We shall start this discussion with an example that is not biometrical or biological in the strict sense. We have often found it pedagogically effective to introduce new concepts through situations thoroughly familiar to the student, even if the example is not relevant to the general subject matter of biostatistics.
Let us betake ourselves to Matchless University, a state institution somewhere between the Appalachians and the Rockies. Looking at its enrollment figures, we notice the following breakdown of the student body: 70% of the students are American undergraduates (AU) and 26% are American graduate students (AG); the remaining 4% are from abroad. Of these, 1% are foreign undergraduates (FU) and 3% are foreign graduate students (FG). In much of our work we shall use proportions rather than percentages as a useful convention. Thus the enrollment consists of 0.70 AU's, 0.26 AG's, 0.01 FU's, and 0.03 FG's. The total student body, corresponding to 100%, is therefore represented by the figure 1.0.
If we were to sample 100 students at random, we would intuitively expect that, on the average, 3 would be foreign graduate students. The actual outcome might vary. There might not be a single FG student among the 100 sampled, or there might be quite a few more than 3. The ratio of the number of foreign graduate students sampled divided by the total number of students sampled might therefore vary from zero to considerably greater than 0.03. If we increased our sample size to 500 or 1000, it is less likely that the ratio would fluctuate widely around 0.03. The greater the sample taken, the closer the ratio of FG students sampled to the total students sampled will approach 0.03. In fact, the probability of sampling a foreign graduate student can be defined as the limit, as sample size keeps increasing, of the ratio of foreign graduate students to the total number of students sampled. Thus we may formally summarize the situation by stating that the probability that a student at Matchless University will be a foreign graduate student is P[FG] = 0.03.
Now let us imagine the following experiment: We try to sample a student at random from among the student body at Matchless University. This is not as easy a task as might be imagined. If we wanted to do this operation physically, we would have to set up a collection or trapping station somewhere on campus. And to make certain that the sample was truly random with respect to the entire student population, we would have to know the ecology of students on campus very thoroughly. We should try to locate our trap at some station where each student had an equal probability of passing. Few, if any, such places can be found in a university. Some locations are likely to be frequented more by independent and foreign students, less by those living in organized houses and dormitories. Fewer foreign and graduate students might be found along fraternity row. Clearly, we would not wish to place our trap near the International Club or House, because our probability of sampling a foreign student would be greatly enhanced. In front of the bursar's window we might sample students paying tuition. But those on scholarships might not be found there. We do not know whether the proportion of scholarships among foreign or graduate students is the same as or different from that among the American or undergraduate students. Athletic events, political rallies, dances, and the like would all draw a differential spectrum of the student body; indeed, no easy solution seems in sight. The time of sampling is equally important, in the seasonal as well as the diurnal cycle.
Those among the readers who are interested in sampling organisms from nature will already have perceived parallel problems in their work. If we were to sample only students wearing turbans or saris, their probability of being foreign students would be almost 1. We could no longer speak of a random sample. In the familiar ecosystem of the university these violations of proper sampling procedure are obvious to all of us, but they are not nearly so obvious in real biological instances where we are unfamiliar with the true nature of the environment. How should we proceed to obtain a random sample of leaves from a tree, of insects from a field, or of mutations in a culture? In sampling at random, we are attempting to permit the frequencies of various events occurring in nature to be reproduced unalteredly in our records; that is, we hope that on the average the frequencies of these events in our sample will be the same as they are in the natural situation. Another way of saying this is that in a random sample every individual in the population being sampled has an equal probability of being included in the sample.
We might go about obtaining a random sample by using records representing the student body, such as the student directory, selecting a page from it at random and a name at random from the page. Or we could assign an arbitrary number to each student, write each on a chip or disk, put these in a large container, stir well, and then pull out a number.
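Drawing numbers from a well-stirred container is easy to imitate on a computer. The following Python sketch is an illustration under stated assumptions: it uses the enrollment proportions quoted for Matchless University, treats the student body as effectively infinite (each draw is independent of the others), and uses function names of our own choosing. It shows how the proportion of FG students in a sample settles near 0.03 as the sample grows.

```python
import random

# Enrollment proportions quoted in the text.
PROPORTIONS = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

def sample_students(n, rng):
    """Draw n students at random, each type with its enrollment probability."""
    kinds = list(PROPORTIONS)
    weights = list(PROPORTIONS.values())
    return rng.choices(kinds, weights=weights, k=n)

def fg_ratio(n, rng):
    """Proportion of foreign graduate students in a random sample of size n."""
    return sample_students(n, rng).count("FG") / n

rng = random.Random(1)  # fixed seed so that the run can be repeated
for n in (100, 1000, 100_000):
    print(n, fg_ratio(n, rng))
```

Small samples may contain no FG student at all or considerably more than 3%; only for large samples does the ratio stay close to 0.03.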
Imagine now that we sample a single student physically by the trapping method, after carefully planning the placement of the trap in such a way as to make sampling random. What are the possible outcomes? Clearly, the student could be either an AU, AG, FU, or FG. The set of these four possible outcomes exhausts the possibilities of this experiment. This set, which we can represent as {AU, AG, FU, FG}, is called the sample space. Any single trial of the experiment described above would result in only one of the four possible outcomes (elements) in the set. A single element in a sample space is called a simple event. It is distinguished from an event, which is any subset of the sample space. Thus, in the sample space defined above, {AU}, {AG}, {FU}, and {FG} are each simple events. The following sampling results are some of the possible events: {AU, AG, FU}, {AU, AG, FG}, {AG, FG}, {AU, FG}, ... By the definition of "event," simple events as well as the entire sample space are also events. The meaning of these events should be clarified. Thus {AU, AG, FU} implies being either an American or an undergraduate, or both.
The event A = {AU, AG} encompasses all possible outcomes in the space yielding an American student. Similarly, the event B = {AG, FG} encompasses all possible outcomes yielding a graduate student. The intersection of events A and B, written A ∩ B, describes only those events that are shared by A and B. Clearly only AG qualifies, as can be seen below:
A = {AU, AG}
B = {AG, FG}
Thus, A ∩ B is that event in the sample space giving rise to the sampling of an American graduate student. When the intersection of two events is empty, the events are mutually exclusive; there is no common element in these two events in the sampling space.
We may also define events that are unions of two other events in the sample space. Thus A ∪ B, the union of A and B, describes all students who are American students, graduate students, or American graduate students.
Why are we concerned with defining sample spaces and events? Because these concepts lead us to useful definitions and operations regarding the probability of various outcomes. If we can assign a number p, where 0 ≤ p ≤ 1, to each simple event in a sample space such that the sum of these p's over all simple events in the space equals unity, then the space becomes a (finite) probability space. In our example above, the following numbers were associated with the appropriate simple events in the sample space:
{AU, AG, FU, FG}
{0.70, 0.26, 0.01, 0.03}
Given this probability space, we are now able to make statements regarding the probability of given events. For example, what is the probability that a student sampled at random will be an American graduate student? Clearly, it is P[{AG}] = 0.26. What is the probability that a student is either American or a graduate student? In terms of the events defined earlier, this is
$$P[A \cup B] = P[\{\mathrm{AU}, \mathrm{AG}\}] + P[\{\mathrm{AG}, \mathrm{FG}\}] - P[\{\mathrm{AG}\}] = 0.96 + 0.29 - 0.26 = 0.99$$
We subtract P[{AG}] from the sum because if we did not do so it would be included twice, once in P[A] and once in P[B], and would lead to the absurd result of a probability greater than 1.
Now let us assume that we have sampled our single student from the student body of Matchless University and that student turns out to be a foreign graduate student. What can we conclude from this? By chance alone, this result would happen only 0.03, or 3%, of the time. The assumption that we have sampled at random should probably be rejected, since if we accept the hypothesis of random sampling, the outcome of the experiment is improbable. Please note that we said improbable, not impossible. It is obvious that we could have chanced upon an FG as the very first one to be sampled. However, it is not very likely. The probability is 0.97 that a single student sampled will be a non-FG. If we could be certain that our sampling method was random (as when drawing student numbers out of a container), we would have to decide that an improbable event has occurred. The decisions of this paragraph are all based on our definite knowledge that the proportion of students at Matchless University is indeed as specified by the probability space. If we were uncertain about this, we would be led to assume a higher proportion of foreign graduate students as a consequence of the outcome of our sampling experiment.
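Events, intersections, and unions map directly onto set operations. The sketch below is an illustration only; it takes A to be the event "American" and B the event "graduate student," uses the probabilities of the Matchless University probability space, and the helper name `prob` is our own.

```python
# Probability space for a single sampled student (values from the text).
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

def prob(event):
    """Probability of an event: the sum over the simple events it contains."""
    return sum(P[outcome] for outcome in event)

A = {"AU", "AG"}  # the student is American
B = {"AG", "FG"}  # the student is a graduate student

intersection = A & B   # only the American graduate students
union_direct = prob(A | B)
union_formula = prob(A) + prob(B) - prob(A & B)

print(sorted(intersection))
print(round(union_direct, 2), round(union_formula, 2))
```

Summing over the union directly and applying the rule P[A] + P[B] − P[A ∩ B] should agree, which is the reason for subtracting the probability of the shared element.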
We shall now extend our experiment and sample two students rather than just one. What are the possible outcomes of this sampling experiment? The new sampling space can best be depicted by a diagram (Figure 4.1) that shows the set of the 16 possible simple events as points in a lattice. The simple events are the following possible combinations. Ignoring which student was sampled first, they are (AU, AU), (AU, AG), (AU, FU), (AU, FG), (AG, AG), (AG, FU), (AG, FG), (FU, FU), (FU, FG), and (FG, FG).
[Figure 4.1: lattice diagram of the 16 simple events for two sampled students, each point labeled with its probability.]
What are the expected probabilities of these outcomes? We know the expected outcomes for sampling one student from the former probability space, but what will be the probability space corresponding to the new sampling space of 16 elements? Now the nature of the sampling procedure becomes quite important. We may sample with or without replacement: we may return the first student sampled to the population (that is, replace the first student), or we may keep him or her out of the pool of the individuals to be sampled. If we do not replace the first individual sampled, the probability of sampling a foreign graduate student will no longer be exactly 0.03. This is easily seen. Let us assume that Matchless University has 10,000 students. Then, since 3% are foreign graduate students, there must be 300 FG students at the university. After sampling a foreign graduate student first, this number is reduced to 299 out of 9999 students. Consequently, the probability of sampling an FG student now becomes 299/9999 = 0.0299, slightly lower than the value of 0.03 for the first student. If, on the other hand, we return the original foreign student to the student population and make certain that the population is thoroughly randomized before being sampled again (that is, give the student a chance to lose him- or herself among the campus crowd or, in drawing student numbers out of a container, mix up the disks with the numbers on them), the probability of sampling a second FG student is the same as before, 0.03. In fact, if we keep on replacing the sampled individuals in the original population, we can sample from it as though it were an infinite-sized population.
Biological populations are, of course, finite, but they are frequently so large that for purposes of sampling experiments we can consider them effectively infinite whether we replace sampled individuals or not. After all, even in this relatively small population of 10,000 students, the probability of sampling a second foreign graduate student (without replacement) is only minutely different from 0.03. For the rest of this section we shall consider sampling to be with replacement, so that the probability level of obtaining a foreign student does not change.
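The difference between the two sampling schemes can be made exact with rational arithmetic. This Python sketch assumes, as in the text, a student body of 10,000 with 300 FG students and a first draw that produced an FG student; the function name is hypothetical.

```python
from fractions import Fraction

def p_second_fg(total, n_fg, replace):
    """Probability that the second student is an FG, given the first one was."""
    if replace:
        # The first student was returned to the pool, so nothing has changed.
        return Fraction(n_fg, total)
    # One FG student is missing from both the FG count and the total.
    return Fraction(n_fg - 1, total - 1)

with_replacement = p_second_fg(10_000, 300, replace=True)
without_replacement = p_second_fg(10_000, 300, replace=False)

print(with_replacement, float(with_replacement))
print(without_replacement, float(without_replacement))
```

The two values differ only in the fourth decimal place, which is why a large population can be treated as effectively infinite.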
There is a second potential source of difficulty in this design. We have to assume not only that the probability of sampling a second foreign student is equal to that of the first, but also that it is independent of it. By independence of events we mean that the probability that one event occurs is not affected by whether or not another event has occurred. In the case of the students, if we have sampled one foreign student, is it more or less likely that a second student sampled in the same manner will also be a foreign student? Independence of the events may depend on where we sample the students or on the method of sampling. If we have sampled students on campus, it is quite likely that the events are not independent; that is, if one foreign student has been sampled, the probability that the second student will be foreign is increased, since foreign students tend to congregate. Thus, at Matchless University the probability that a student walking with a foreign graduate student is also an FG will be greater than 0.03.
Events D and E in a sample space will be defined as independent whenever P[D ∩ E] = P[D]P[E]. The probability values assigned to the sixteen points in the sample-space lattice of Figure 4.1 have been computed to satisfy the above condition. Thus, letting P[D] equal the probability that the first student will be an AU, that is, P[{AU₁AU₂, AU₁AG₂, AU₁FU₂, AU₁FG₂}], and letting P[E] equal the corresponding probability for the type of the second student, the condition P[D ∩ E] = P[D]P[E] has been imposed upon all points in the probability space. Therefore, if the sampling probabilities for the second student are independent of the type of student sampled first, we can compute the probabilities of the outcomes simply as the product of the independent probabilities. Thus the probability of obtaining two FG students is P[{FG}]P[{FG}] = 0.03 × 0.03 = 0.0009.
The probability of obtaining one AU and one FG student in the sample should be the product 0.70 × 0.03. However, it is in fact twice that probability. There is only one way of obtaining two FG students, namely, by sampling first one FG and then again another FG. Similarly, there is only one way to sample two AU students. However, sampling one of each type of student can be done by sampling first an AU and then an FG or by sampling first an FG and then an AU. Thus the probability is
2P[{AU}]P[{FG}] = 2 × 0.70 × 0.03 = 0.0420
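The whole probability space for two students can be generated by multiplying the single-student probabilities, which is exactly what the independence condition permits. The following Python sketch is illustrative; it assumes sampling with replacement and independent draws, and the variable names are our own.

```python
from itertools import product

# Probability space for a single sampled student (values from the text).
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

# The 16 ordered outcomes (first student, second student).
ordered = {(a, b): P[a] * P[b] for a, b in product(P, repeat=2)}

# Ignoring the order of sampling merges mirror-image outcomes such as
# (AU, FG) and (FG, AU), leaving 10 distinct combinations.
unordered = {}
for (a, b), p in ordered.items():
    key = tuple(sorted((a, b)))
    unordered[key] = unordered.get(key, 0.0) + p

print(len(ordered), len(unordered))
print(round(unordered[("FG", "FG")], 4))   # two FG students
print(round(unordered[("AU", "FG")], 4))   # one AU and one FG, in either order
```

Outcomes with two students of the same type occur in one way only, whereas mixed outcomes occur in two ways and therefore carry twice the product of the single probabilities.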
If we conducted such an experiment and obtained a sample of two FG students, we would be led to the following conclusions. Only 0.0009 of the samples (9/100 of 1%, or 9 out of 10,000 cases) would be expected to consist of two foreign graduate students. It is quite improbable to obtain such a result by chance alone. Given P[{FG}] = 0.03 as a fact, we would therefore suspect that sampling was not random or that the events were not independent (or that both assumptions, random sampling and independence of events, were incorrect).
Random sampling is sometimes confused with randomness in nature. The former is the faithful representation in the sample of the distribution of the events in nature; the latter is the independence of the events in nature. The first of these generally is or should be under the control of the experimenter and is related to the strategy of good sampling. The second generally describes an innate property of the objects being sampled and thus is of greater biological interest. The confusion between random sampling and independence of events arises because lack of either can yield observed frequencies of events differing from expectation. We have already seen how lack of independence in samples of foreign students can be interpreted from both points of view in our illustrative example from Matchless University.
The above account of probability is adequate for our present purposes but far too sketchy to convey an understanding of the field. Readers interested in extending their knowledge of the subject are referred to Mosimann (1968) for a simple introduction.
4.2 The binomial distribution
For purposes of the discussion to follow we shall simplify our sample space to consist of only two elements, foreign and American students, and ignore whether the students are undergraduates or graduates; we shall represent the sample space by the set {F, A}. Let us symbolize the probability space by {p, q}, where p = P[F], the probability that the student is foreign, and q = P[A], the probability that the student is American. As before, we can compute the probability space of samples of two students as follows:
{FF, FA, AA}
{p², 2pq, q²}
If we were to sample three students independently, the probability space of samples of three students would be as follows:
{FFF, FFA, FAA, AAA}
{p³, 3p²q, 3pq², q³}
Samples of three foreign or three American students can again be obtained in only one way, and their probabilities are p³ and q³, respectively. However, in samples of three there are three ways of obtaining two students of one kind and one student of the other. As before, if A stands for American and F stands for foreign, then the sampling sequence can be AFF, FAF, FFA for two foreign students and one American. Thus the probability of this outcome will be 3p²q. Similarly, the probability for two Americans and one foreign student is 3pq².
A convenient way to summarize these results is by means of the binomial expansion, which is applicable to samples of any size from populations in which objects occur independently in only two classes: students who may be foreign or American, or individuals who may be dead or alive, male or female, black or white, rough or smooth, and so forth. This is accomplished by expanding the binomial term $(p + q)^k$, where k equals sample size, p equals the probability of occurrence of the first class, and q equals the probability of occurrence of the second class. By definition, p + q = 1; hence q is a function of p: q = 1 − p.
For samples of 1, (p + q)¹ = p + q
For samples of 2, (p + q)² = p² + 2pq + q²
For samples of 3, (p + q)³ = p³ + 3p²q + 3pq² + q³
These expressions yield the same probability spaces discussed previously. The coefficients (the numbers before the powers of p and q) express the number of ways a particular outcome is obtained. An easy method for evaluating the coefficients of the expanded terms of the binomial expression is through the use of Pascal's triangle:
k
1            1   1
2          1   2   1
3        1   3   3   1
4      1   4   6   4   1
5    1   5  10  10   5   1
Pascal's triangle provides the coefficients of the binomial expression, that is, the number of possible outcomes of the various combinations of events. For k = 1 the coefficients are 1 and 1. For the second line (k = 2), write 1 at the left-hand margin of the line. The 2 in the middle of this line is the sum of the values to the left and right of it in the line above. The line is concluded with a 1. Similarly, the values at the beginning and end of the third line are 1, and the other numbers are sums of the values to their left and right in the line above; thus 3 is the sum of 1 and 2. This principle continues for every line. You can work out the coefficients for any size sample in this manner. The line for k = 6 would consist of the following coefficients: 1, 6, 15, 20, 15, 6, 1. The p and q values bear powers in a consistent pattern, which should be easy to imitate for any value of k. We give it here for k = 4:
$$p^4q^0 + p^3q^1 + p^2q^2 + p^1q^3 + p^0q^4$$
The power of p decreases from 4 to 0 (k to 0 in the general case) as the power of q increases from 0 to 4 (0 to k in the general case). Since any value to the power 0 is 1 and any term to the power 1 is simply itself, we can simplify this expression as shown below and at the same time provide it with the coefficients from Pascal's triangle for the case k = 4:
$$p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4$$
Thus we are able to write down almost by inspection the expansion of the binomial to any reasonable power. Let us now practice our newly learned ability to expand the binomial.
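The rule for building Pascal's triangle line by line is short enough to program. The Python sketch below follows the verbal rule (each inner value is the sum of the two values above it); the function name is our own, and the final line merely cross-checks the result against the binomial coefficients supplied by Python's standard library.

```python
from math import comb

def pascal_row(k):
    """Coefficients of the expansion of (p + q)**k, built line by line."""
    if k < 1:
        raise ValueError("k must be at least 1")
    row = [1, 1]  # the line for k = 1
    for _ in range(k - 1):
        # Each inner value is the sum of its two neighbors in the line above.
        inner = [left + right for left, right in zip(row, row[1:])]
        row = [1] + inner + [1]
    return row

for k in range(1, 7):
    print(k, pascal_row(k))

# Cross-check the line for k = 6 against the closed-form coefficients.
assert pascal_row(6) == [comb(6, y) for y in range(7)]
```

For large k the closed form is quicker, but the line-by-line construction mirrors the way the triangle is written out by hand.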
Suppose we have a population of insects, exactly 40% of which are infected with a given virus X. If we take samples of k = 5 insects each and examine each insect separately for presence of the virus, what distribution of samples could we expect if the probability of infection of each insect in a sample were independent of that of the other insects in the sample? In this case p = 0.4, the proportion infected, and q = 0.6, the proportion not infected. It is assumed that the population is so large that the question of whether sampling is with or without replacement is irrelevant for practical purposes. The expected proportions would be the expansion of the binomial:
$$(p + q)^k = (0.4 + 0.6)^5$$
With the aid of Pascal's triangle this expansion is
$$p^5 + 5p^4q + 10p^3q^2 + 10p^2q^3 + 5pq^4 + q^5$$
or
$$(0.4)^5 + 5(0.4)^4(0.6) + 10(0.4)^3(0.6)^2 + 10(0.4)^2(0.6)^3 + 5(0.4)(0.6)^4 + (0.6)^5$$
representing the expected proportions of samples of five infected insects, four infected and one noninfected insects, three infected and two noninfected insects, and so on.
The reader has probably realized by now that the terms of the binomial expansion actually yield a type of frequency distribution for these different outcomes. Associated with each outcome, such as "five infected insects," there is a probability of occurrence, in this case (0.4)⁵ = 0.01024. This is a theoretical frequency distribution or probability distribution of events that can occur in two classes. The probability distribution described here is known as the binomial distribution, and the binomial expansion yields the expected frequencies of the classes of the binomial distribution.
A convenient layout for presentation and computation of a binomial distribution is shown in Table 4.1. The first column lists the number of infected insects per sample, columns (2) and (3) give the corresponding powers of p and q, and the binomial coefficients from Pascal's triangle are shown in column (4). The relative expected frequencies, which are the probabilities of the various outcomes, are shown in column (5). We label such expected frequencies $\hat{f}_{\mathrm{rel}}$. They are simply the product of columns (2), (3), and (4). Their sum is equal to 1.0, since the events listed in column (1) exhaust the possible outcomes. We see from column (5) that only about 1% of the samples are expected to consist of 5 infected insects, and 25.9% are expected to contain 1 infected and 4 noninfected insects. We shall test whether these predictions hold in an actual experiment.
TABLE 4.1
Expected frequencies of infected insects in samples of 5 insects sampled from an infinitely large population with an assumed infection rate of 40%.
Columns: (1) number of infected insects per sample; (2) powers of p; (3) powers of q; (4) binomial coefficients; (5) relative expected frequencies; (6) absolute expected frequencies; (7) observed frequencies.
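The relative expected frequencies can be recomputed as the product of the power of p, the power of q, and the binomial coefficient. The Python sketch below is an illustration with k = 5 and p = 0.4; the function name is our own, and the coefficients are taken from Python's standard library rather than from a hand-built Pascal's triangle.

```python
from math import comb

def binomial_distribution(k, p):
    """Relative expected frequency of each number of infected insects, k down to 0."""
    q = 1.0 - p
    # coefficient * (power of p) * (power of q), from k infected down to none
    return {y: comb(k, y) * p**y * q**(k - y) for y in range(k, -1, -1)}

rel = binomial_distribution(5, 0.4)
for y, f_rel in rel.items():
    print(y, round(f_rel, 5))

print("sum:", round(sum(rel.values()), 5))
```

Multiplying each relative expected frequency by the number of samples taken converts it into an absolute expected frequency.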
Experiment 4.1. Simulate the sampling of infected insects by using a table of random numbers such as Table I in Appendix A2. These are randomly chosen one-digit numbers in which each digit 0 through 9 has an equal probability of appearing. The numbers are grouped in blocks of 25 for convenience. Such numbers can also be obtained from random number keys on some pocket calculators and by means of pseudorandom number-generating algorithms in computer programs. (In fact, this entire experiment can be programmed and performed automatically even on a small computer.) Since there is an equal probability for any one digit to appear, you can let any four digits (say, 0, 1, 2, 3) stand for the infected insects and the remaining digits (4, 5, 6, 7, 8, 9) stand for the noninfected insects. The probability that any one digit selected from the table will represent an infected insect (that is, will be a 0, 1, 2, or 3) is therefore 40%, or 0.4, since these are four of the ten possible digits. Also, successive digits are assumed to be independent of the values of previous digits. Thus the assumptions of the binomial distribution should be met in this experiment. Enter the table of random numbers at an arbitrary point (not always at the beginning!) and look at successive groups of five digits, noting in each group how many of the digits are 0, 1, 2, or 3. Take as many groups of five as you can find time to do, but no fewer than 100 groups.
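One way the experiment might be programmed is sketched below in Python. It is an illustration only: a pseudorandom number generator stands in for the printed table of random digits, the number of groups (1000) and the seed are arbitrary choices, and the function name is our own.

```python
import random
from collections import Counter

def simulate_experiment(n_groups, rng, group_size=5, infected_digits=(0, 1, 2, 3)):
    """Tally how many 'infected' digits occur in each group of random digits."""
    tally = Counter()
    for _ in range(n_groups):
        digits = [rng.randrange(10) for _ in range(group_size)]  # digits 0 through 9
        n_infected = sum(d in infected_digits for d in digits)
        tally[n_infected] += 1
    return tally

observed = simulate_experiment(1000, random.Random(42))
for y in range(5, -1, -1):
    print(y, observed[y])
```

The observed frequencies will differ from run to run (or from seed to seed), but with many groups they should lie close to the expected binomial frequencies for p = 0.4.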
Column (7) of Table 4.1 shows the results of one such experiment carried out one year by a biostatistics class. A total of 2423 samples of five numbers were obtained from the table of random numbers; the distribution of the four digits simulating the percentage of infection is shown in this column. The observed frequencies are labeled f. To calculate the expected frequencies for this actual example we multiplied the relative frequencies $\hat{f}_{\mathrm{rel}}$ of column (5) times n = 2423, the number of samples taken. This gives the absolute expected frequencies shown in column (6). When we compare the observed frequencies in column (7) with the expected frequencies in column (6) we note general agreement between the two columns of figures. The two distributions are also illustrated in Figure 4.2. If the observed frequencies did not fit expected frequencies, we might believe that the lack of fit was due to chance alone. Or we might be led to reject one or more of the following hypotheses: (1) that the true proportion of digits 0, 1, 2, and 3 is 0.4 (rejection of this hypothesis would normally not be reasonable, for we may rely on the fact that the proportion of digits 0, 1, 2, and 3 in a table of random numbers is 0.4 or very close to it); (2) that sampling was at random; and (3) that events were independent.
These statements can be reinterpreted in terms of the original infectionmodel with which we started this discussion.If,instead of a sampling experiment
of digits by a biostatistics class, this had been a real sampling experiment ofinsects, we would conclude that the insects had indeed been randomly sampled
[Figure 4.2. Legend: observed frequencies; expected frequencies.]
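TABLE 4.2
Artificial distributions to illustrate clumping and repulsion. Expected frequencies from Table 4.1.
Column headings: number of infected insects per sample; absolute expected frequencies; clumped (contagious) frequencies; deviation from expectation; repulsed frequencies; deviation from expectation.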
and that we had no evidence to reject the hypothesis that the proportion of infected insects was 40%. If the observed frequencies had not fitted the expected frequencies, the lack of fit might be attributed to chance, or to the conclusion that the true proportion of infection was not 0.4; or we would have had to reject one or both of the following assumptions: (1) that sampling was at random and (2) that the occurrences of infected insects in these samples were independent events.
How could we simulate a sampling procedure in which the occurrences of the digits 0, 1, 2, and 3 were not independent? We could, for example, instruct the sampler to sample as indicated previously, but, every time a 3 was found among the first four digits of a sample, to replace the following digit with another one of the four digits standing for infected individuals. Thus, once a 3 was found, the probability would be 1.0 that another one of the indicated digits would be included in the sample. After repeated samples, this would result in higher frequencies of samples of two or more indicated digits and in lower frequencies than expected (on the basis of the binomial distribution) of samples of one such digit. A variety of such different sampling schemes could be devised. It should be quite clear to the reader that the probability of the second event's occurring would be different from that of the first and dependent on it.
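The nonindependent sampling scheme just described can be sketched in a few lines. This is one possible reading of the instructions, not a definitive version: the replacement digit is drawn at random from 0, 1, 2, 3 (so it may itself be a 3 and trigger a further replacement), and the function names, seed, and number of samples are hypothetical choices made for illustration.

```python
import random
from math import comb

INFECTED = (0, 1, 2, 3)   # digits standing for infected insects

def dependent_sample(rng, k=5):
    """One sample of k digits in which a 3 among the first k-1 digits forces
    the following digit to be one of the infected digits."""
    digits = [rng.randrange(10) for _ in range(k)]
    for i in range(k - 1):                 # scan the first four digits in order
        if digits[i] == 3:
            digits[i + 1] = rng.choice(INFECTED)   # assumed replacement rule
    return sum(d in INFECTED for d in digits)

def tally(n_samples, seed=2):
    rng = random.Random(seed)
    counts = [0] * 6
    for _ in range(n_samples):
        counts[dependent_sample(rng)] += 1
    return counts

n = 20000
counts = tally(n)
rel_observed = [c / n for c in counts]
rel_binomial = [comb(5, y) * 0.4**y * 0.6**(5 - y) for y in range(6)]

for y in range(6):
    print(f"Y={y}  simulated={rel_observed[y]:.4f}  binomial={rel_binomial[y]:.4f}")
```

Under this rule a sample can contain exactly one indicated digit only if no 3 occurs among its first four digits, which is why the simulated frequency for Y = 1 falls below the binomial expectation while the classes with two or more indicated digits gain.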
How would we interpret a large departure of the observed frequencies from expectation? We have not as yet learned techniques for testing whether observed frequencies differ from those expected by more than can be attributed to chance alone. This will be taken up in Chapter 13. Assume that such a test has been carried out and that it has shown us that our observed frequencies are significantly different from expectation. Two main types of departure from expectation can be characterized: (1) clumping and (2) repulsion, shown in fictitious examples in Table 4.2. In actual examples we would have no a priori notions about the magnitude of p, the probability of one of the two possible outcomes.
Therefore, the hypotheses tested are whether the samples are random and the events independent.
The clumped frequencies in Table 4.2 have an excess of observations at the tails of the frequency distribution and consequently a shortage of observations at the center. Such a distribution is also said to be contagious. (Remember that the total number of items must be the same in both observed and expected frequencies in order to make them comparable.) In the repulsed frequency distribution there are more observations than expected at the center of the distribution and fewer at the tails. These discrepancies are most easily seen in columns (4) and (6) of Table 4.2, where the deviations of observed from expected frequencies are shown as plus or minus signs.
What do these phenomena imply? In the clumped frequencies, more samples were entirely infected (or largely infected), and similarly, more samples were entirely noninfected (or largely noninfected) than you would expect if probabilities of infection were independent. This could be due to poor sampling design. If, for example, the investigator in collecting samples of five insects always
such a result would likely appear. But if the sampling design is sound, the results become more interesting. Clumping would then mean that the samples of five are in some way related so that if one insect is infected, others in the
same sample are more likely to be infected. This could be true if they come from adjacent locations in a situation in which neighbors are easily infected. Or they could be siblings jointly exposed to a source of infection. Or possibly the infection might spread among members of a sample between the time that the insects are sampled and the time they are examined.
The opposite phenomenon, repulsion, is more difficult to interpret biologically. There are fewer homogeneous groups and more mixed groups in such a distribution. This involves the idea of a compensatory phenomenon: if some of the insects in a sample are infected, the others in the sample are less likely
immunity to their associates in the sample, such a situation could arise logically, but it is biologically improbable. A more reasonable interpretation of such a finding is that for each sampling unit, there were only a limited number of pathogens available; then once several of the insects have become infected, the others go free of infection, simply because there is no more infectious agent. This is an unlikely situation in microbial infections, but in situations in which a limited number of parasites enter the body of the host, repulsion may be more reasonable.
From the expected and observed frequencies in Table 4.1, we may calculate the mean and standard deviation of the number of infected insects per sample. These values are given at the bottom of columns (5), (6), and (7) in Table 4.1. We note that the means and standard deviations in columns (5) and (6) are almost identical and differ only trivially because of rounding errors. Column (7), being a sample from a population whose parameters are the same as those of the expected frequency distribution in column (5) or (6), differs somewhat. The mean is slightly smaller and the standard deviation is slightly greater than in the expected frequencies. If we wish to know the mean and standard deviation of expected binomial frequency distributions, we need not go through the computations shown in Table 4.1. The mean and standard deviation of a binomial frequency distribution are, respectively,
$\mu = kp \qquad \sigma = \sqrt{kpq}$
Substituting the values k = 5, p = 0.4, and q = 0.6 of the above example, we
from column (5) in Table 4.1. Note that we use the Greek parametric notation here because $\mu$ and $\sigma$ are parameters of an expected frequency distribution, not sample statistics, as are the mean and standard deviation in column (7). The
should be distinguished from sample proportions. In fact, in later chapters we resort to $\hat{p}$ and $\hat{q}$ for parametric proportions (rather than $\pi$, which conventionally is used as the ratio of the circumference to the diameter of a circle). Here, however, we prefer to keep our notation simple. If we wish to express our variable as a proportion rather than as a count (that is, to indicate mean incidence of infection in the insects as 0.4, rather than as 2 per sample of 5), we can use other formulas for the mean and standard deviation in a binomial distribution:
$\mu = p \qquad \sigma = \sqrt{pq/k}$
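The agreement between the shortcut formulas and the longer computation from a frequency table can be checked numerically. The sketch below assumes the insect example (k = 5, p = 0.4) and computes the mean and standard deviation twice: once from the relative expected frequencies, as one would from a table, and once from the formulas for counts and for proportions. Variable names are illustrative.

```python
from math import comb, sqrt

k, p = 5, 0.4
q = 1 - p

# Relative expected frequencies for Y = 0..k, the terms of (p + q)^k
rel_expected = {y: comb(k, y) * p**y * q**(k - y) for y in range(k + 1)}

# Mean and standard deviation computed the long way, from the frequencies
mean_from_table = sum(y * f for y, f in rel_expected.items())
var_from_table = sum((y - mean_from_table) ** 2 * f for y, f in rel_expected.items())
sd_from_table = sqrt(var_from_table)

# The same parameters from the shortcut formulas (variable expressed as a count)
mean_formula = k * p
sd_formula = sqrt(k * p * q)

# Variable expressed as a proportion rather than a count
mean_proportion = p
sd_proportion = sqrt(p * q / k)

print(mean_from_table, sd_from_table)
print(mean_formula, sd_formula)
print(mean_proportion, sd_proportion)
```

The proportion formulas are simply the count formulas divided by k, since a proportion is a count divided by the sample size.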
repulsed frequency distributions of Table 4.2. We note that the clumped distribution has a standard deviation greater than expected, and that of the repulsed one is less than expected. Comparison of sample standard deviations with their expected values is a useful measure of dispersion in such instances.
We shall now employ the binomial distribution to solve a biological problem. On the basis of our knowledge of the cytology and biology of species A, we expect the sex ratio among its offspring to be 1:1. The study of a litter in nature reveals that of 17 offspring 14 were females and 3 were males. What
of being a female offspring) = 0.5 and that this probability is independent among the members of the sample, the pertinent probability distribution is the binomial for sample size k = 17. Expanding the binomial to the power 17 is a formidable task, which, as we shall see, fortunately need not be done in its entirety. However, we must have the binomial coefficients, which can be obtained either from an expansion of Pascal's triangle (fairly tedious unless once obtained and stored for future use) or by working out the expected frequencies for any given class of Y from the general formula for any term of the binomial distribution
formed from k items taken Y at a time. This can be evaluated as k!/[Y!(k - Y)!], where ! means "factorial." In mathematics k factorial is the product of all the
convention, 0! = 1. In working out fractions containing factorials, note that any factorial will always cancel against a higher factorial. Thus 5!/3! = (5 × 4 × 3!)/
(5 × 4)/2 = 10
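Both routes to the binomial coefficients mentioned above, the factorial expression and Pascal's triangle, are easy to sketch. The function names below are illustrative; the standard-library function `math.comb` is used only as an independent check.

```python
from math import comb, factorial

def binomial_coefficient(k, y):
    """C(k, Y) = k! / [Y! (k - Y)!], evaluated directly from factorials."""
    return factorial(k) // (factorial(y) * factorial(k - y))

def pascal_row(k):
    """Row k of Pascal's triangle, built by repeated addition of neighbors."""
    row = [1]
    for _ in range(k):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

print(binomial_coefficient(5, 2))    # the coefficient worked out in the text: 10
print(binomial_coefficient(17, 14))  # coefficient for 14 females among 17 offspring
print(pascal_row(5))                 # all coefficients for k = 5
```

Building the triangle row by row yields every coefficient for a given k at once, whereas the factorial expression gives a single coefficient without the intermediate rows.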
power 4). Since we require the probability of 14 females, we note that for the
and 4 males. Calculating the relative expected frequencies in column (6), we note that the probability of 14 females and 3 males is 0.005,188,40, a very small value. If we add to this value all "worse" outcomes (that is, all outcomes that are even more unlikely than 14 females and 3 males on the assumption of a 1:1 hypothesis), we obtain a probability of 0.006,363,42, still a very small value. (In statistics, we often need to calculate the probability of observing a deviation as large as or larger than a given value.)
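The two probabilities quoted above can be recomputed directly from the general term of the binomial. The sketch below assumes p = 0.5 for a female offspring and k = 17; the names are illustrative. Exact arithmetic may differ from the printed figures in the later decimal places, presumably because the printed values were rounded during computation, so agreement to about five decimals is all that should be expected.

```python
from math import comb

def binomial_term(k, y, p):
    """Probability of exactly y 'successes' in k independent trials."""
    return comb(k, y) * p**y * (1 - p) ** (k - y)

k, p_female = 17, 0.5

# Probability of exactly 14 females and 3 males
p_exactly_14 = binomial_term(k, 14, p_female)

# Add all "worse" outcomes in the same direction: 15, 16, and 17 females
p_one_tailed = sum(binomial_term(k, y, p_female) for y in range(14, k + 1))

print(f"P(14 females)          = {p_exactly_14:.8f}")
print(f"P(14 or more females)  = {p_one_tailed:.8f}")
```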
TABLE 4.3
Some expected frequencies of males and females for samples of 17 offspring on the assumption that the sex ratio is 1:1 [$p_{♀} = 0.5$, $q_{♂} = 0.5$; $(p_{♀} + q_{♂})^k = (0.5 + 0.5)^{17}$].
Relative expected
On the basis of these findings one or more of the following assumptions is unlikely: (1) that the true sex ratio in species A is 1:1, (2) that we have sampled at random in the sense of obtaining an unbiased sample, or (3) that the sexes of the offspring are independent of one another. Lack of independence of events may mean that although the average sex ratio is 1:1, the individual sibships, or litters, are largely unisexual, so that the offspring from a given mating would tend to be all (or largely) females or all (or largely) males. To confirm this hypothesis, we would need to have more samples and then examine the distribution of samples for clumping, which would indicate a tendency for unisexual sibships.
We must be very precise about the questions we ask of our data. There are really two questions we could ask about the sex ratio. First, are the sexes unequal in frequency so that females will appear more often than males? Second, are the sexes unequal in frequency? It may be that we know from past experience that in this particular group of organisms the males are never more frequent than females; in that case, we need be concerned only with the first of these two questions, and the reasoning followed above is appropriate. However, if we know very little about this group of organisms, and if our question is simply whether the sexes among the offspring are unequal in frequency, then we have to consider both tails of the binomial frequency distribution; departures from the 1:1 ratio could occur in either direction. We should then consider not only the probability of samples with 14 females and 3 males (and all worse cases) but also the probability of samples of 14 males and 3 females (and all worse cases in that direction). Since this probability distribution is symmetrical (because
obtained previously, which results in 0.012,726,84. This new value is still very small, making it quite unlikely that the true sex ratio is 1:1.
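The two-tailed reasoning can be sketched as a sum over both tails. The function below is an illustrative construction, not the book's procedure: it adds the probabilities of all outcomes lying at least as far from the expected count kp as the observed one. With p = 0.5 the distribution is symmetrical, so this sum equals twice the one-tailed probability; for other values of p the two tails are unequal and other conventions for a two-tailed probability exist.

```python
from math import comb

def binomial_term(k, y, p):
    """Probability of exactly y 'successes' in k independent trials."""
    return comb(k, y) * p**y * (1 - p) ** (k - y)

def two_tailed_probability(k, y_observed, p=0.5):
    """Sum the probabilities of all outcomes at least as far from the
    expected count k*p as the observed count, in either direction."""
    expected = k * p
    distance = abs(y_observed - expected)
    return sum(
        binomial_term(k, y, p)
        for y in range(k + 1)
        if abs(y - expected) >= distance
    )

k = 17
one_tailed = sum(binomial_term(k, y, 0.5) for y in range(14, k + 1))
two_tailed = two_tailed_probability(k, 14)

print(f"one-tailed: {one_tailed:.8f}")
print(f"two-tailed: {two_tailed:.8f}")
```

For 14 females among 17 offspring the outcomes counted are 14 to 17 females and 0 to 3 females, which are the two sets of cases described in the text.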
This is your first experience with one of the most important applications of statistics: hypothesis testing. A formal introduction to this field will be deferred until Section 6.8. We may simply point out here that the two approaches followed above are known appropriately as one-tailed tests and two-tailed tests, respectively. Students sometimes have difficulty knowing which of the two tests to apply. In future examples we shall try to point out in each case why a one-tailed or a two-tailed test is being used.
We have said that a tendency for unisexual sibships would result in a clumped distribution of observed frequencies. An actual case of this nature is a classic in the literature, the sex ratio data obtained by Geissler (1889) from hospital records in Saxony. Table 4.4 reproduces sex ratios of 6115 sibships of 12 children each from the more extensive study by Geissler. All columns of the table should by now be familiar. The expected frequencies were not calculated on the basis of a 1:1 hypothesis, since it is known that in human populations the sex ratio at birth is not 1:1. As the sex ratio varies in different human populations, the best estimate of it for the population in Saxony was simply obtained using the mean proportion of males in these data. This can be obtained by calculating the average number of males per sibship ($\bar{Y}$ = 6.230,58) for the 6115 sibships and converting this into a proportion. This value turns out to be
deviations of the observed frequencies from the absolute expected frequencies shown in column (9) of Table 4.4, we notice considerable clumping. There are many more instances of families with all male or all female children (or nearly so) than independent probabilities would indicate. The genetic basis for this is not clear, but it is evident that there are some families which "run to girls" and similarly those which "run to boys." Evidence of clumping can also be seen from the fact that $s^2$ is much larger than we would expect on the basis of the binomial distribution ($\sigma^2 = kpq$ = 12(0.519,215)(0.480,785) = 2.995,57).
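The arithmetic behind the expected variance for the Geissler data is short enough to verify directly. The sketch below uses only the figures quoted in the text (sibships of 12 children and a mean of 6.230,58 males per sibship); the variable names are illustrative.

```python
k = 12                       # children per sibship
mean_males = 6.23058         # average number of males per sibship, from the text

# Convert the mean count into a proportion, then take its complement
p_male = mean_males / k
q_female = 1 - p_male

# Variance expected if the sexes of sibs were independent: sigma^2 = kpq
expected_variance = k * p_male * q_female

print(f"p = {p_male:.6f}, q = {q_female:.6f}")
print(f"expected variance kpq = {expected_variance:.5f}")
```

An observed variance well above this value is the signature of clumping described in the text.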
There is a distinct contrast between the data in Table 4.1 and those in Table 4.4. In the insect infection data of Table 4.1 we had a hypothetical proportion of infection based on outside knowledge. In the sex ratio data of Table 4.4 we had no such knowledge; we used an empirical value of p obtained from the
whose importance will become apparent later. In the sex ratio data of Table 4.3, as in much work in Mendelian genetics, a hypothetical value of p is used.
4.3 The Poisson distribution
In the typical application of the binomial we had relatively small samples (2 students, 5 insects, 17 offspring, 12 siblings) in which two alternative states occurred at varying frequencies (American and foreign, infected and noninfected, male and female). Quite frequently, however, we study cases in which sample size k is very large and one of the events (represented by probability q) is very much more frequent than the other (represented by probability p). We have
cases we are generally interested in one tail of the distribution only. This is the
tail represented by the terms
$p^0 q^k,\quad C(k, 1)p^1 q^{k-1},\quad C(k, 2)p^2 q^{k-2},\quad C(k, 3)p^3 q^{k-3},\quad \ldots$
The first term represents no rare events and k frequent events in a sample of k events. The second term represents one rare event and k - 1 frequent events. The third term represents two rare events and k - 2 frequent events, and so forth. The expressions of the form C(k, i) are the binomial coefficients, represented by the combinatorial terms discussed in the previous section. Although the desired tail of the curve could be computed by this expression, as long as sufficient decimal accuracy is maintained, it is customary in such cases to compute another distribution, the Poisson distribution, which closely approximates the desired results. As a rule of thumb, we may use the Poisson distribution to approximate the binomial distribution when the probability of the rare event p is less than 0.1 and the product kp (sample size × probability) is less than 5.
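The rule of thumb can be illustrated numerically. The sketch below compares binomial terms with the corresponding Poisson terms, using the standard Poisson expression $e^{-\mu}\mu^Y/Y!$ with $\mu = kp$. The two cases are hypothetical: k = 1000 with p = 0.002 satisfies the rule, whereas the insect example (k = 5, p = 0.4) has the same product kp = 2 but violates the condition on p.

```python
from math import comb, exp, factorial

def binomial_term(k, y, p):
    """Term of the binomial for y rare events among k."""
    return comb(k, y) * p**y * (1 - p) ** (k - y)

def poisson_term(y, mu):
    """Poisson probability of y events when the mean is mu."""
    return exp(-mu) * mu**y / factorial(y)

def max_discrepancy(k, p):
    """Largest absolute difference between binomial and Poisson terms."""
    mu = k * p
    return max(abs(binomial_term(k, y, p) - poisson_term(y, mu))
               for y in range(k + 1) if y <= 20)   # terms beyond 20 are negligible here

rare_case = max_discrepancy(1000, 0.002)   # p < 0.1 and kp = 2 < 5
common_case = max_discrepancy(5, 0.4)      # kp = 2 as well, but p is not small

print(f"k=1000, p=0.002: largest difference = {rare_case:.5f}")
print(f"k=5,    p=0.4  : largest difference = {common_case:.5f}")
```

The comparison suggests that a small product kp alone is not sufficient; p itself must be small for the approximation to be close.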
The Poisson distribution is also a discrete frequency distribution of the number of times a rare event occurs. But, in contrast to the binomial distribution, the Poisson distribution applies to cases where the number of times that an event does not occur is infinitely large. For purposes of our treatment here, a Poisson variable will be studied in samples taken over space or time. An example of the first would be the number of moss plants in a sampling quadrat on a hillside or the number of parasites on an individual host; an example of a temporal sample is the number of mutations occurring in a genetic strain in the time interval of one month or the reported cases of influenza in one town during one week. The Poisson variable Y will be the number of events per sample. It can assume discrete values from 0 on up. To be distributed in Poisson fashion the variable must have two properties: (1) Its mean must be small relative to the maximum possible number of events per sampling unit. Thus the event should be "rare." But this means that our sampling unit of space or time must be large enough to accommodate a potentially substantial number of events. For example, a quadrat in which moss plants are counted must be large enough that a substantial number of moss plants could occur there physically if the biological conditions were such as to favor the development of numerous moss plants in the quadrat. A quadrat consisting of a 1-cm square would be far too small for mosses to be distributed in Poisson fashion. Similarly, a time span of 1 minute would be unrealistic for reporting new influenza cases in a town, but within 1 week a great many such cases could occur. (2) An occurrence of the event must be independent of prior occurrences within the sampling unit. Thus, the presence of one moss plant in a quadrat must not enhance or diminish the probability that other moss plants are developing in the quadrat. Similarly, the fact that one influenza case has been reported must not affect the probability of reporting subsequent influenza cases. Events that meet these conditions (rare and random events) should be distributed in Poisson fashion.
The purpose of fitting a Poisson distribution to numbers of rare events in nature is to test whether the events occur independently with respect to each