
Introduction to Biostatistics, 2nd ed., R. Sokal and F. Rohlf (Dover, 2009)



INTRODUCTION TO

BIOSTATISTICS

SECOND EDITION

State University of New York at Stony Brook

DOVER PUBLICATIONS, INC.

Mineola, New York


Copyright © 1969, 1973, 1981, 1987 by Robert R. Sokal and F. James Rohlf

All rights reserved.

Bibliographical Note

This Dover edition, first published in 2009, is an unabridged republication of the work originally published in 1969 by W. H. Freeman and Company, New York. The authors have prepared a new Preface for this edition.

Library of Congress Cataloging-in-Publication Data

Sokal, Robert R.

Introduction to Biostatistics / Robert R. Sokal and F. James Rohlf.

Dover ed.

p. cm.

Originally published: 2nd ed. New York: W. H. Freeman, 1969.

Includes bibliographical references and index.

Manufactured in the United States of America

Dover Publications, Inc., 31 East 2nd Street, Mineola, N.Y. 11501

to Julie and Janice


CONTENTS

1.2 The development of biostatistics
1.3 The statistical frame of mind
2.5 Frequency distributions
3.7 Sample statistics and parameters
3.8 Practical methods for computing mean and standard deviation
3.9 The coefficient of variation

INTRODUCTION TO PROBABILITY DISTRIBUTIONS:
4.1 Probability, random sampling, and hypothesis testing
4.2 The binomial distribution

5.1 Frequency distributions of continuous variables
5.2 Derivation of the normal distribution
5.3 Properties of the normal distribution
5.4 Applications of the normal distribution
5.5 Departures from normality: Graphic methods

6.1 Distribution and variance of means
6.2 Distribution and variance of other statistics
6.3 Introduction to confidence limits
6.4 Student's t distribution
6.5 Confidence limits based on sample statistics
6.6 The chi-square distribution
6.7 Confidence limits for variances
6.8 Introduction to hypothesis testing
6.9 Tests of simple hypotheses employing the t distribution
6.10 Testing the hypothesis H0: σ² = σ0²

7.1 The variances of samples and their means
7.3 The hypothesis H0: σ1² = σ2²
7.4 Heterogeneity among sample means
7.5 Partitioning the total sum of squares and degrees of freedom

9.2 Two-way anova: Significance testing
9.3 Two-way anova without replication

10.3 Nonparametric methods in lieu of anova

11.3 The linear regression equation
11.4 More than one value of Y for each value of X
11.5 Tests of significance in regression
11.7 Residuals and transformations in regression
11.8 A nonparametric test for regression

12.2 The product-moment correlation coefficient
12.3 Significance tests in correlation
12.4 Applications of correlation
12.5 Kendall's coefficient of rank correlation


Preface to the Dover Edition

We are pleased and honored to see the re-issue of the second edition of our Introduction to Biostatistics by Dover Publications. On reviewing the copy, we find there is little in it that needs changing for an introductory textbook of biostatistics for an advanced undergraduate or beginning graduate student. The book furnishes an introduction to most of the statistical topics such students are likely to encounter in their courses and readings in the biological and biomedical sciences.

The reader may wonder what we would change if we were to write this book anew. Because of the vast changes that have taken place in modalities of computation in the last twenty years, we would deemphasize computational formulas that were designed for pre-computer desk calculators (an age before spreadsheets and comprehensive statistical computer programs) and refocus the reader's attention to structural formulas that not only explain the nature of a given statistic, but are also less prone to rounding error in calculations performed by computers. In this spirit, we would omit the equation (3.8) on page 39 and draw the readers' attention to equation (3.7) instead. Similarly, we would use structural formulas in Boxes 3.1 and 3.2 on pages 41 and 42, respectively; on page 161 and in Box 8.1 on pages 163/164, as well as in Box 12.1 on pages 278/279.
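The rounding-error point can be illustrated with a small sketch. It assumes that the "structural" form of a sum of squares is the sum of squared deviations from the mean, and that the "computational" form is the desk-calculator shortcut Σy² − (Σy)²/n; equations (3.7) and (3.8) are not reproduced in this excerpt, so that pairing is an assumption. The function names and the data are hypothetical.

```python
def ss_structural(values):
    """Sum of squares as squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values)


def ss_computational(values):
    """Desk-calculator shortcut: sum(y^2) - (sum(y))^2 / n."""
    n = len(values)
    return sum(y * y for y in values) - sum(values) ** 2 / n


# Hypothetical data: a small amount of variation on top of a very large offset.
data = [1e9 + 1.0, 1e9 + 2.0, 1e9 + 3.0]

print(ss_structural(data))     # the exact sum of squares is 2.0
print(ss_computational(data))  # not 2.0 in double precision
```

The shortcut subtracts two numbers near 3 × 10^18, so the digits that carry the answer are lost before the subtraction happens; the deviation form never builds such large intermediate values.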

Secondly, we would put more emphasis on permutation tests and resampling methods. Permutation tests and bootstrap estimates are now quite practical. We have found this approach to be not only easier for students to understand but in many cases preferable to the traditional parametric methods that are emphasized in this book.
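To give a sense of what a permutation test involves, here is a minimal sketch of a two-sample test on the difference between means. It is one common form of the idea rather than a procedure taken from this book; the function name, the number of relabelings, the seed, and any data passed in are all hypothetical choices.

```python
import random


def permutation_test(sample_a, sample_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the proportion of random
    relabelings whose absolute difference is at least as large.
    """
    rng = random.Random(seed)
    n_a = len(sample_a)
    observed = sum(sample_a) / n_a - sum(sample_b) / len(sample_b)

    pooled = list(sample_a) + list(sample_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of group labels
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations
```

Calling `permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])` returns the observed difference of -10.0 together with a small proportion, since very few relabelings separate the two groups as completely as the original labels do.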

Robert R. Sokal

F. James Rohlf

November 2008


Preface

The favorable reception that the first edition of this book received from teachers and students encouraged us to prepare a second edition. In this revised edition, we provide a thorough foundation in biological statistics for the undergraduate student who has a minimal knowledge of mathematics. We intend Introduction to Biostatistics to be used in comprehensive biostatistics courses, but it can also be adapted for short courses in medical and professional schools; thus, we include examples from the health-related sciences.

We have extracted most of this text from the more-inclusive second edition of our own Biometry. We believe that the proven pedagogic features of that book, such as its informal style, will be valuable here.

We have modified some of the features from Biometry; for example, in Introduction to Biostatistics we provide detailed outlines for statistical computations but we place less emphasis on the computations themselves. Why? Students in many undergraduate courses are not motivated to and have few opportunities to perform lengthy computations with biological research material; also, such computations can easily be made on electronic calculators and microcomputers. Thus, we rely on the course instructor to advise students on the best computational procedures to follow.

We present material in a sequence that progresses from descriptive statistics to fundamental distributions and the testing of elementary statistical hypotheses; we then proceed immediately to the analysis of variance and the familiar t test (which is treated as a special case of the analysis of variance and relegated to several sections of the book). We do this deliberately for two reasons: (1) since today's biologists all need a thorough foundation in the analysis of variance, students should become acquainted with the subject early in the course; and (2) if analysis of variance is understood early, the need to use the t distribution is reduced. (One would still want to use it for the setting of confidence limits and in a few other special situations.) All t tests can be carried out directly as analyses of variance, and the amount of computation of these analyses of variance is generally equivalent to that of t tests.
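The claim that a t test is a special case of the analysis of variance can be checked numerically. The sketch below assumes the two-sample t test with pooled variance and a single-classification anova with two groups, for which the F ratio equals the square of t. The function names and the data in the usage note are hypothetical.

```python
from statistics import mean


def pooled_t(group_a, group_b):
    """Student's t for two independent samples with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    ss_within = (sum((y - mean_a) ** 2 for y in group_a)
                 + sum((y - mean_b) ** 2 for y in group_b))
    pooled_var = ss_within / (n_a + n_b - 2)
    return (mean_a - mean_b) / (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5


def anova_f(group_a, group_b):
    """F ratio (mean square among groups / mean square within groups)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    grand_mean = mean(list(group_a) + list(group_b))
    ss_among = (n_a * (mean_a - grand_mean) ** 2
                + n_b * (mean_b - grand_mean) ** 2)
    ss_within = (sum((y - mean_a) ** 2 for y in group_a)
                 + sum((y - mean_b) ** 2 for y in group_b))
    ms_among = ss_among / 1                  # two groups: 1 degree of freedom
    ms_within = ss_within / (n_a + n_b - 2)
    return ms_among / ms_within
```

For any two samples, for instance `[4.1, 5.0, 6.2, 5.5]` and `[6.8, 7.4, 8.1, 7.0, 7.7]`, `pooled_t(a, b) ** 2` and `anova_f(a, b)` agree up to floating-point rounding.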

This larger second edition includes the Kolmogorov-Smirnov two-sample test, nonparametric regression, stem-and-leaf diagrams, hanging histograms, and the Bonferroni method of multiple comparisons. We have rewritten the chapter on the analysis of frequencies in terms of the G statistic rather than X², because the former has been shown to have more desirable statistical properties. Also, because of the availability of logarithm functions on calculators, the computation of the G statistic is now easier than that of the earlier chi-square test. Thus, we reorient the chapter to emphasize log-likelihood-ratio tests. We have also added new homework exercises.
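For readers who have not met the two statistics, the sketch below computes both for a simple goodness-of-fit problem. It uses the standard textbook definitions, G = 2 Σ O ln(O/E) and X² = Σ (O − E)²/E; the notation of the chapter itself is not shown in this excerpt and may differ. The counts and the 3:1 expectation are hypothetical.

```python
from math import log


def g_statistic(observed, expected):
    """Log-likelihood-ratio statistic: G = 2 * sum(O * ln(O / E)).

    Classes with an observed count of zero contribute nothing to the sum.
    """
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: X^2 = sum((O - E)^2 / E)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of two phenotypes, tested against a 3:1 expectation.
observed = [84, 16]
expected = [75.0, 25.0]

print(g_statistic(observed, expected))
print(chi_square_statistic(observed, expected))
```

The two values are usually close but not identical; both are compared against the same chi-square distribution.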

We call special, double-numbered tables "boxes." They can be used as convenient guides for computation because they show the computational methods for solving various types of biostatistical problems. They usually contain all the steps necessary to solve a problem, from the initial setup to the final result. Thus, students familiar with material in the book can use them as quick summary reminders of a technique.

We found in teaching this course that we wanted students to be able to refer to the material now in these boxes. We discovered that we could not cover even half as much of our subject if we had to put this material on the blackboard during the lecture, and so we made up and distributed boxes and asked students to refer to them during the lecture. Instructors who use this book may wish to use the boxes in a similar manner.

We emphasize the practical applications of statistics to biology in this book; thus we deliberately keep discussions of statistical theory to a minimum. Derivations are given for some formulas, but these are consigned to Appendix A1, where they should be studied and reworked by the student. Statistical tables to which the reader can refer when working through the methods discussed in this book are found in Appendix A2.

We are grateful to K. R. Gabriel, R. C. Lewontin, and M. Kabay for their [...] E. Russek-Cohen, and M. Singh for comments on an early draft of this book. We also appreciate the work of our secretaries, Resa Chapey and Cheryl Daly, with preparing the manuscripts, and of Donna DiGiovanni, Patricia Rohlf, and Barbara Thomson with proofreading.

Robert R. Sokal

F. James Rohlf

INTRODUCTION TO BIOSTATISTICS


CHAPTER 1

Introduction

This chapter sets the stage for your study of biostatistics. In Section 1.1, we define the field itself. We then cast a necessarily brief glance at its historical development in Section 1.2. Then in Section 1.3 we conclude the chapter with a discussion of the attitudes that the person trained in statistics brings to biological research.

We shall define biostatistics as the application of statistical methods to the solution of biological problems. The biological problems of this definition are those arising in the basic biological sciences as well as in such applied areas as the health-related sciences and the agricultural sciences. Biostatistics is also called biological statistics or biometry.

The definition of biostatistics leaves us somewhat up in the air: "statistics" [...] layman. The number of definitions you can find for it is limited only by the number of books you wish to consult. We might define statistics in its modern sense as the scientific study of numerical data based on natural phenomena. All parts of this definition are important and deserve emphasis:

[...] validity of scientific evidence. We must always be objective in presentation and evaluation of data and adhere to the general ethical code of scientific methodology, or we may find that the old saying that "figures never lie, only statisticians do" applies to us.

[...] hence it deals with quantities of information, not with a single datum. Thus, the measurement of a single animal or the response from a single biochemical test will generally not be of interest.

Numerical: Unless data of a study can be quantified in one way or another, they will not be amenable to statistical analysis. Numerical data can be measurements (the length or width of a structure or the amount of a chemical in a body fluid, for example) or counts (such as the number of bristles or teeth).

[...] those events in animate and inanimate nature that take place outside the control of human beings, but also those evoked by scientists and partly under their control, as in experiments. Different biologists will concern themselves with different levels of natural phenomena; other kinds of scientists, with yet different ones. But all would agree that the chirping of crickets, the number of peas in a pod, and the age of a woman at menopause are natural phenomena. The heartbeat of rats in response to adrenalin, the mutation rate in maize after [...] may still be considered natural, even though scientists have interfered with the phenomenon through their intervention. The average biologist would not consider the number of stereo sets bought by persons in different states in a given year to be a natural phenomenon. Sociologists or human ecologists, however, might so consider it and deem it worthy of study. The qualification "natural phenomena" is included in the definition of statistics mostly to make certain that the phenomena studied are not arbitrary ones that are entirely under the [...] an experiment.

The word "statistics" is also used in another, though related, way. It can be the plural of the noun statistic, which refers to any one of many computed or estimated statistical quantities, such as the mean, the standard deviation, or the correlation coefficient. Each one of these is a statistic.

1.2 The development of biostatistics

Modern statistics appears to have developed from two sources as far back as the seventeenth century. The first source was political science; a form of statistics developed as a quantitative description of the various aspects of the affairs of a government or state (hence the term "statistics"). This subject also became known as political arithmetic. Taxes and insurance caused people to become interested in problems of censuses, longevity, and mortality. Such considerations assumed increasing importance, especially in England as the country prospered during the development of its empire. John Graunt (1620-1674) and William Petty (1623-1687) were early students of vital statistics, and others followed in their footsteps.

At about the same time, the second source of modern statistics developed: the mathematical theory of probability engendered by the interest in games of chance among the leisure classes of the time. Important contributions to this theory were made by Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665), both Frenchmen. Jacques Bernoulli (1654-1705), a Swiss, laid the foundation of modern probability theory in Ars Conjectandi. Abraham de Moivre (1667-1754), a Frenchman living in England, was the first to combine the statistics of his day with probability theory in working out annuity values and to approximate the important normal distribution through the expansion of the binomial.

A later stimulus for the development of statistics came from the science of astronomy, in which many individual observations had to be digested into a coherent theory. Many of the famous astronomers and mathematicians of the eighteenth century, such as Pierre Simon Laplace (1749-1827) in France and Karl Friedrich Gauss (1777-1855) in Germany, were among the leaders in this field. The latter's lasting contribution to statistics is the development of the method of least squares.

Perhaps the earliest important figure in biostatistic thought was Adolphe Quetelet (1796-1874), a Belgian astronomer and mathematician, who in his work combined the theory and practical methods of statistics and applied them to problems of biology, medicine, and sociology. Francis Galton (1822-1911), a cousin of Charles Darwin, has been called the father of biostatistics and eugenics. The inadequacy of Darwin's genetic theories stimulated Galton to try to solve the problems of heredity. Galton's major contribution to biology was his application of statistical methodology to the analysis of biological variation, particularly through the analysis of variability and through his study of regression and correlation in biological measurements. His hope of unraveling the laws of genetics through these procedures was in vain. He started with the most difficult material and with the wrong assumptions. However, his methodology has become the foundation for the application of statistics to biology.

Karl Pearson (1857-1936), at University College, London, became interested in the application of statistical methods to biology, particularly in the demonstration of natural selection. Pearson's interest came about through the influence of W. F. R. Weldon (1860-1906), a zoologist at the same institution. Weldon, incidentally, is credited with coining the term "biometry" for the type of studies he and Pearson pursued. Pearson continued in the tradition of Galton and laid the foundation for much of descriptive and correlational statistics.

The dominant figure in statistics and biometry in the twentieth century has been Ronald A. Fisher (1890-1962). His many contributions to statistical theory will become obvious even to the cursory reader of this book.


Statistics today is a broad and extremely active field whose applications touch almost every science and even the humanities. New applications for statistics are constantly being found, and no one can predict from what branch of statistics new applications to biology will be made.

1.3 The statistical frame of mind

A brief perusal of almost any biological journal reveals how pervasive the use of statistics has become in the biological sciences. Why has there been such a marked increase in the use of statistics in biology? Apparently, because biologists have found that the interplay of biological causal and response variables does not fit the classic mold of nineteenth-century physical science. In that century, biologists such as Robert Mayer, Hermann von Helmholtz, and others tried to demonstrate that biological processes were nothing but physicochemical phenomena. In so doing, they helped create the impression that the experimental methods and natural philosophy that had led to such dramatic progress in the physical sciences should be imitated fully in biology.

Many biologists, even to this day, have retained the tradition of strictly mechanistic and deterministic concepts of thinking (while physicists, interestingly enough, as their science has become more refined, have begun to resort to statistical approaches). In biology, most phenomena are affected by many causal factors, uncontrollable in their variation and often unidentifiable. Statistics is needed to measure such variable phenomena, to determine the error of measurement, and to ascertain the reality of minute but important differences.

A misunderstanding of these principles and relationships has given rise to the attitude of some biologists that if differences induced by an experiment, or observed by nature, are not clear on plain inspection (and therefore are in need of statistical analysis), they are not worth investigating. There are few legitimate fields of inquiry, however, in which, from the nature of the phenomena studied, statistical investigation is unnecessary.

Statistical thinking is not really different from ordinary disciplined scientific thinking, in which we try to quantify our observations. In statistics we express our degree of belief or disbelief as a probability rather than as a vague, general statement. For example, a statement that individuals of species A are larger than those of species B or that women suffer more often from disease X than do men is of a kind commonly made by biological and medical scientists. Such statements can and should be more precisely expressed in quantitative form.

In many ways the human mind is a remarkable statistical machine, absorbing many facts from the outside world, digesting these, and regurgitating them in simple summary form. From our experience we know certain events to occur frequently, others rarely. "Man smoking cigarette" is a frequently observed event, "Man slipping on banana peel," rare. We know from experience that Japanese are on the average shorter than Englishmen and that Egyptians are on the average darker than Swedes. We associate thunder with lightning almost always, flies with garbage cans in the summer frequently, but snow with the southern Californian desert extremely rarely. All such knowledge comes to us as a result of experience, both our own and that of others, which we learn about by direct communication or through reading. All these facts have been processed by that remarkable computer, the human brain, which furnishes an abstract. This abstract is constantly under revision, and though occasionally faulty and biased, it is on the whole astonishingly sound; it is our knowledge of the moment.

Although statistics arose to satisfy the needs of scientific research, the development of its methodology in turn affected the sciences in which statistics is applied. Thus, through positive feedback, statistics, created to serve the needs of natural science, has itself affected the content and methods of the biological sciences. To cite an example: Analysis of variance has had a tremendous effect in influencing the types of experiments researchers carry out. The whole field of quantitative genetics, one of whose problems is the separation of environmental from genetic effects, depends upon the analysis of variance for its realization, and many of the concepts of quantitative genetics have been directly built around the designs inherent in the analysis of variance.

CHAPTER 2

Data in Biostatistics

In Section 2.1 we explain the statistical meaning of the terms "sample" and "population," which we shall be using throughout this book. Then, in Section 2.2, we come to the types of observations that we obtain from biological research material; we shall see how these correspond to the different kinds of variables upon which we perform the various computations in the rest of this book. In Section 2.3 we discuss the degree of accuracy necessary for recording data and the procedure for rounding off figures. We shall then be ready to consider in Section 2.4 certain kinds of derived data frequently used in biological science, among them ratios and indices, and the peculiar problems of accuracy and distribution they present us. Knowing how to arrange data in frequency distributions is important because such arrangements give an overall impression of the general pattern of the variation present in a sample and also facilitate further computational procedures. Frequency distributions, as well as the presentation of numerical data, are discussed in Section 2.5. In Section 2.6 we briefly describe the computational handling of data.

2.1 Samples and populations

We shall now define a number of important terms necessary for an [...] individual observations. They are observations or measurements taken on the smallest sampling unit. These smallest sampling units frequently, but not necessarily, are also individuals in the ordinary biological sense. If we measure weight in 100 rats, then the weight of each rat is an individual observation; the hundred rat weights together represent the sample of observations, defined as a collection of individual observations selected by a specified procedure. In this instance, one [...] sense, that is, one rat. However, if we had studied weight in a single rat over a period of time, the sample of individual observations would be the weights [...] in a study of ant colonies, where each colony is a basic sampling unit, each temperature reading for one colony is an individual observation, and the sample of observations is the temperatures for all the colonies considered. If we consider an estimate of the DNA content of a single mammalian sperm cell to be an individual observation, the sample of observations may be the estimates of DNA content of all the sperm cells studied in one individual mammal.

We have carefully avoided so far specifying what particular variable was being studied, because the terms "individual observation" and "sample of observations" as used above define only the structure but not the nature of the [...] statistics is "variable." However, in biology the word "character" is frequently used synonymously. More than one variable can be measured on each smallest sampling unit. Thus, in a group of 25 mice we might measure the blood pH and the erythrocyte count. Each mouse (a biological individual) is the smallest sampling unit, blood pH and red cell count would be the two variables studied, the pH readings and cell counts are individual observations, and two samples of 25 observations (on pH and on erythrocyte count) would result. Or we might speak of a bivariate sample of 25 observations, each referring to a pH reading paired with an erythrocyte count.
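The distinction between sampling units, variables, and samples in the mouse example can be sketched as a small data structure. The class name, field names, units, and readings below are hypothetical, and only three mice are shown instead of 25.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MouseReading:
    """One smallest sampling unit (a mouse) with two variables measured."""
    blood_ph: float
    erythrocyte_count: float  # hypothetical units, e.g. millions per mm^3


# Hypothetical bivariate sample: each entry pairs a pH reading with a count.
bivariate_sample = [
    MouseReading(7.38, 9.1),
    MouseReading(7.41, 8.7),
    MouseReading(7.35, 9.4),
]

# Taking each variable alone gives two samples of individual observations.
ph_sample = [m.blood_ph for m in bivariate_sample]
count_sample = [m.erythrocyte_count for m in bivariate_sample]
```

Keeping the paired readings together preserves which pH belongs to which count, which is what makes the sample bivariate rather than two unrelated lists.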

Next we define population. The biological definition of this term is well known. It refers to all the individuals of a given species (perhaps of a given life-history stage or sex) found in a circumscribed area at a given time. In statistics, population always means the totality of individual observations about which inferences are to be made, existing anywhere in the world or at least within a definitely specified sampling area limited in space and time. If you take five men and study the number of leucocytes in their peripheral blood and you are prepared to draw conclusions about all men from this sample of five, then the population from which the sample has been drawn represents the leucocyte counts of all extant males of the species Homo sapiens. If, on the other hand, you restrict yourself to a more narrowly specified sample, such as five male Chinese, aged 20, and you are restricting your conclusions to this particular group, then the population from which you are sampling will be leucocyte numbers of all Chinese males of age 20.

A common misuse of statistical methods is to fail to define the statistical population about which inferences can be made. A report on the analysis of a sample from a restricted population should not imply that the results hold in general. The population in this statistical sense is sometimes referred to as [...]

A population may represent variables of a concrete collection of objects or creatures, such as the tail lengths of all the white mice in the world, the leucocyte counts of all the Chinese men in the world of age 20, or the DNA content of all the hamster sperm cells in existence; or it may represent the outcomes of experiments, such as all the heartbeat frequencies produced in guinea pigs by injections of adrenalin. In cases of the first kind the population is generally finite. Although in practice it would be impossible to collect, count, and examine all hamster sperm cells, all Chinese men of age 20, or all white mice in the world, these populations are in fact finite. Certain smaller populations, such as all the whooping cranes in North America or all the recorded cases of a rare but easily diagnosed disease X, may well lie within reach of a total census. By contrast, an experiment can be repeated an infinite number of times (at least in theory). A given experiment, such as the administration of adrenalin to guinea pigs, could be repeated as long as the experimenter could obtain material and his or her health and patience held out. The sample of experiments actually performed is a sample from an infinite number that could be performed.

Some of the statistical methods to be developed later make a distinction between sampling from finite and from infinite populations. However, though populations are theoretically finite in most applications in biology, they are generally so much larger than samples drawn from them that they can be considered de facto infinite-sized populations.

2.2 Variables in biostatistics

Each biological discipline has its own set of variables, which may include conventional morphological measurements; concentrations of chemicals in body fluids; rates of certain biological processes; frequencies of certain events as in genetics, epidemiology, and radiation biology; physical readings of optical or electronic machinery used in biological research; and many more.

We have already referred to biological variables in a general way, but we have not yet defined them. We shall define a variable as a property with respect [...] does not differ within a sample at hand or at least among the samples being studied, it cannot be of statistical interest. Length, height, weight, number of teeth, vitamin C content, and genotypes are examples of variables in ordinary, genetically and phenotypically diverse groups of organisms. Warm-bloodedness in a group of mammals is not, since mammals are all alike in this regard, although body temperature of individual mammals would, of course, be a variable.

We can divide variables as follows:

Variables
    Measurement variables
        Continuous variables
        Discontinuous variables
    Ranked variables
    Attributes

[...] of values between any two fixed points. For example, between the two length measurements 1.5 and 1.6 cm there are an infinite number of lengths that could be measured if one were so inclined and had a precise enough method of calibration. Any given reading of a continuous variable, such as a length of 1.57 mm, is therefore an approximation to the exact reading, which in practice is unknowable. Many of the variables studied in biology are continuous variables. Examples are lengths, areas, volumes, weights, angles, temperatures, periods of time, percentages, concentrations, and rates.

Contrasted with continuous variables are the discontinuous variables, also known as meristic or discrete variables. These are variables that have only certain fixed numerical values, with no intermediate values possible in between. Thus the number of segments in a certain insect appendage may be 4 or 5 or 6 but never 5½ or 4.3. Examples of discontinuous variables are numbers of a given structure (such as segments, bristles, teeth, or glands), numbers of offspring, numbers of colonies of microorganisms or animals, or numbers of plants in a given quadrat.

Some variables cannot be measured but at least can be ordered or ranked by their magnitude. Thus, in an experiment one might record the rank order of emergence of ten pupae without specifying the exact time at which each pupa emerged. In such cases we code the data as a ranked variable, the order of emergence. Special methods for dealing with such variables have been developed, and several are furnished in this book. By expressing a variable as a series of ranks, such as 1, 2, 3, 4, 5, we do not imply that the difference in magnitude between, say, ranks 1 and 2 is identical to or even proportional to the difference between ranks 2 and 3.
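Coding a variable as ranks can be sketched in a few lines. The function name and the emergence times are hypothetical, and the sketch assumes there are no tied values.

```python
def rank_order(values):
    """Return 1-based ranks (smallest value gets rank 1); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical emergence times (hours) for five pupae.
emergence_hours = [30.5, 12.0, 48.25, 12.5, 20.0]
print(rank_order(emergence_hours))  # [4, 1, 5, 2, 3]
```

In these made-up data, ranks 1 and 2 are half an hour apart while ranks 2 and 3 are 7.5 hours apart, which is exactly the information that ranking discards.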

Variables that cannot be measured but must be expressed qualitatively are called attributes, or nominal variables. These are all properties, such as black or white, pregnant or not pregnant, dead or alive, male or female. When such attributes are combined with frequencies, they can be treated statistically. Of 80 mice, we may, for instance, state that four were black, two agouti, and the rest gray. When attributes are combined with frequencies into tables suitable for statistical analysis, they are referred to as enumeration data. Thus the enumeration data on color in mice would be arranged as follows:

Color                   Frequency
Black                       4
Agouti                      2
Gray                       74
Total number of mice       80

In some cases attributes can be changed into measurement variables if this is desired. Thus colors can be changed into wavelengths or color-chart values. Certain other attributes that can be ranked or ordered can be coded to become ranked variables. For example, three attributes referring to a structure as "poorly developed," "well developed," and "hypertrophied" could be coded 1, 2, and 3.

A term that has not yet been explained is variate. In this book we shall use it as a single reading, score, or observation of a given variable. Thus, if we have measurements of the length of the tails of five mice, tail length will be a continuous variable, and each of the five readings of length will be a variate. In this text we identify variables by capital letters, the most common symbol being Y. Thus Y may stand for tail length of mice. A variate will refer to a given length measurement; Y_i is the measurement of tail length of the ith mouse, and Y_4 is the measurement of tail length of the fourth mouse in our sample.

2.3 Accuracy and precision of data

"Accuracy" and "precision" are used synonymously in everyday speech, but in statistics we define them more rigorously. Accuracy is the closeness of a measured or computed value to its true value. Precision is the closeness of repeated measurements. A biased but sensitive scale might yield inaccurate but precise weight. By chance, an insensitive scale might result in an accurate reading, which would, however, be imprecise, since a repeated weighing would be unlikely to yield an equally accurate weight. Unless there is bias in a measuring instrument, precision will lead to accuracy. We need therefore mainly be concerned with the former.

Precise variates are usually, but not necessarily, whole numbers. Thus, when we count four eggs in a nest, there is no doubt about the exact number of eggs in the nest if we have counted correctly; it is 4, not 3 or 5, and clearly it could not be 4 plus or minus a fractional part. Meristic, or discontinuous, variables are generally measured as exact numbers. Seemingly continuous variables derived from meristic ones can under certain conditions also be exact numbers. For instance, ratios between exact numbers are themselves also exact. If in a colony of animals there are 18 females and 12 males, the ratio of females to males (a ...

Most continuous variables, however, are approximate. We mean by this that the exact value of the single measurement, the variate, is unknown and probably unknowable. The last digit of the measurement stated should imply precision; that is, it should indicate the limits on the measurement scale between which we believe the true measurement to lie. Thus, a length measurement of 12.3 mm implies that the true length of the structure lies somewhere between 12.25 and 12.35 mm. Exactly where between these implied limits the real length is we do not know. But where would a true measurement of 12.25 fall? Would it not equally likely fall in either of the two classes 12.2 and 12.3, clearly an unsatisfactory state of affairs? Such an argument is correct, but when we record a number as either 12.2 or 12.3, we imply that the decision whether to put it into the higher or lower class has already been taken. This decision was not taken arbitrarily, but presumably was based on the best available measurement. If the scale of measurement is so precise that a value of 12.25 would clearly have been recognized, then the measurement should have been recorded originally to four significant figures. Implied limits, therefore, always carry one more figure beyond the last significant one measured by the observer.

Hence, it follows that if we record the measurement as 12.32, we are implying that the true value lies between 12.315 and 12.325. Unless this is what we mean, there would be no point in adding the last decimal figure to our original measurements. If we do add another figure, we must imply an increase in precision. We see, therefore, that accuracy and precision in numbers are not absolute concepts, but are relative. Assuming there is no bias, a number becomes increasingly more accurate as we are able to write more significant figures for it (increase its precision). To illustrate this concept of the relativity of accuracy, consider the following three numbers:

Number      Implied limits
193         192.5 to 193.5
192.8       192.75 to 192.85
192.76      192.755 to 192.765

Meristic variates, though ordinarily exact, may be recorded approximately when large numbers are involved. Thus when counts are reported to the nearest thousand, a count of 36,000 insects in a cubic meter of soil, for example, implies that the true number varies somewhere from 35,500 to 36,500 insects.

To how many significant figures should we record measurements? An easy rule to remember is that the number of unit steps from the smallest to the largest measurement in an array should usually be between 30 and 300. Thus, if we are measuring a series of shells to the nearest millimeter and the largest is 8 mm and the smallest is 4 mm wide, there are only four unit steps between the largest and the smallest measurement. Hence, we should measure our shells to one more significant decimal place. Then the two extreme measurements might be 8.2 mm and 4.1 mm, with 41 unit steps between them (counting the last significant digit as the unit); this would be an adequate number of unit steps. The reason for such a rule is that an error of 1 in the last significant digit of a reading of 4 mm would constitute an inadmissible error of 25%, but an error of 1 in the last digit of 4.1 is less than 2.5%. Similarly, if we measured the height of the tallest of a series of plants as 173.2 cm and that of the shortest of these plants as 26.6 cm, the difference between these limits would comprise 1466 unit steps (of 0.1 cm), which are far too many. It would therefore be advisable to record the heights to the nearest centimeter as follows: 173 cm for the tallest and 27 cm for the shortest. This would yield 146 unit steps. Using the rule we have stated for the number of unit steps, we shall record two or three digits for most measurements.
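The rule of 30 to 300 unit steps can be checked with a few lines of code. The following is only a minimal sketch: the function names `unit_steps` and `within_rule` are our own illustrative choices, and the recorded values are passed as strings so that decimal arithmetic stays exact.

```python
from decimal import Decimal


def unit_steps(smallest: str, largest: str, unit: str) -> int:
    """Count unit steps between the smallest and largest recorded values.

    `unit` is the value of one step in the last significant digit,
    for example "0.1" when measuring to the nearest tenth.
    """
    return int((Decimal(largest) - Decimal(smallest)) / Decimal(unit))


def within_rule(steps: int, low: int = 30, high: int = 300) -> bool:
    """True when the number of unit steps lies between 30 and 300."""
    return low <= steps <= high


# The shell and plant examples from the text (hypothetical call layout).
examples = [
    ("4", "8", "1"),           # shells to the nearest mm
    ("4.1", "8.2", "0.1"),     # shells to the nearest 0.1 mm
    ("26.6", "173.2", "0.1"),  # plants to the nearest 0.1 cm
    ("27", "173", "1"),        # plants to the nearest cm
]
for smallest, largest, unit in examples:
    steps = unit_steps(smallest, largest, unit)
    print(smallest, largest, unit, steps, within_rule(steps))
```

Under these assumptions the four cases give 4, 41, 1466, and 146 unit steps, so only the second and fourth satisfy the rule, in agreement with the discussion above.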

The last digit should always be significant; that is, it should imply a range for the true measurement of from half a "unit step" below to half a "unit step" above the recorded score, as illustrated earlier. This applies to all digits, zero included. Zeros should therefore not be written at the end of approximate numbers to the right of the decimal point unless they are meant to be significant digits. Thus 7.80 must imply the limits 7.795 to 7.805. If 7.75 to 7.85 is implied, the measurement should be recorded as 7.8.
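The implied limits of a recorded measurement follow mechanically from its last written digit. The sketch below assumes that the measurement is supplied as a string, so that trailing zeros such as the one in "7.80" are preserved; `implied_limits` is a hypothetical helper name, not a standard function.

```python
from decimal import Decimal


def implied_limits(recorded: str) -> tuple[Decimal, Decimal]:
    """Return the implied lower and upper limits of a recorded value.

    The limits lie half a unit step below and above the recorded score,
    where the unit step is the place value of the last digit written.
    """
    value = Decimal(recorded)
    # Exponent of the last written digit, e.g. -2 for "7.80".
    unit = Decimal(1).scaleb(value.as_tuple().exponent)
    half_step = unit / 2
    return value - half_step, value + half_step


for recorded in ["7.80", "7.8", "12.3", "12.32", "193"]:
    lower, upper = implied_limits(recorded)
    print(f"{recorded}: {lower} to {upper}")
```

A count recorded to the nearest thousand, such as 36,000, cannot be handled this way, because the string "36000" does not show which zeros are significant; that information would have to be supplied separately.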

When the number of significant digits is to be reduced, we carry out the process of rounding off numbers. The rules for rounding off are very simple. A digit to be rounded off is not changed if it is followed by a digit less than 5. If the digit to be rounded off is followed by a digit greater than 5 or by 5 followed by other nonzero digits, it is increased by 1. When the digit to be rounded off is followed by a 5 standing alone or a 5 followed by zeros, it is unchanged if it is even but increased by 1 if it is odd. The reason for this last rule is that when such numbers are summed in a long series, we should have as many digits raised as are being lowered, on the average; these changes should therefore balance out. Practice the above rules by rounding off the following numbers to the indicated number of significant digits:

Number      Significant digits desired

Most pocket calculators or larger computers round off their displays using a different rule: they increase the preceding digit when the following digit is a 5 standing alone or with trailing zeros. However, since most of the machines usable for statistics also retain eight or ten significant figures internally, the accumulation of rounding errors is minimized. Incidentally, if two calculators give answers with slight differences in the final (least significant) digits, suspect a different number of significant digits in memory as a cause of the disagreement.
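The two rounding rules can be compared directly. The sketch below uses the rounding modes of Python's standard `decimal` module: `ROUND_HALF_EVEN` corresponds to the rule given in this section, and `ROUND_HALF_UP` to the calculator rule. The helper name `round_off` is our own, and for simplicity it rounds to a stated unit (such as "0.1") rather than to a number of significant digits.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_off(value: str, unit: str, rounding: str = ROUND_HALF_EVEN) -> Decimal:
    """Round a recorded value to the given unit, e.g. "0.1" or "1"."""
    return Decimal(value).quantize(Decimal(unit), rounding=rounding)


# A 5 standing alone: the preceding digit stays if even, rises if odd.
print(round_off("2.25", "0.1"))   # preceding digit 2 is even
print(round_off("2.35", "0.1"))   # preceding digit 3 is odd
print(round_off("2.251", "0.1"))  # 5 followed by a nonzero digit

# Summing a series of values ending in 5 shows why the rule matters.
halves = [f"{k}.5" for k in range(10)]  # 0.5, 1.5, ..., 9.5
true_total = sum(Decimal(v) for v in halves)
even_total = sum(round_off(v, "1") for v in halves)
up_total = sum(round_off(v, "1", ROUND_HALF_UP) for v in halves)
print(true_total, even_total, up_total)
```

In this small series the sum of the values rounded by the even-odd rule equals the true sum, whereas always rounding the 5 upward overstates the total.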

2.4 Derived variables

The majority of variables in biometric work are observations recorded as direct measurements or counts of biological material or as readings that are the output of various types of instruments. However, there is an important class of variables in biological research that we may call the derived or computed variables. These are generally based on two or more independently measured variables whose relations are expressed in a certain way. We are referring to ratios, percentages, concentrations, indices, rates, and the like.

A ratio expresses as a single value the relation that two variables have, one to the other. In its simplest form, a ratio is expressed as in 64:24, which may represent the number of wild-type versus mutant individuals, the number of males versus females, a count of parasitized individuals versus those not parasitized, and so on. These examples imply ratios based on counts. A ratio based on a continuous variable might be similarly expressed as 1.2:1.8, which may represent the ratio of width to length in a sclerite of an insect or the ratio between the concentrations of two minerals contained in water or soil. Ratios may also be expressed as fractions; thus, the two ratios above could be expressed as 64/24 and 1.2/1.8. However, for computational purposes it is more useful to express the ratio as a quotient. The two ratios cited would therefore be 2.666... and 0.666..., respectively. These are pure numbers, not expressed in measurement units of any kind. It is this form for ratios that we shall consider further.

Percentages are also a type of ratio. Ratios, percentages, and concentrations are basic quantities in much biological research, widely used and generally familiar.

An index is the ratio of the value of one variable to the value of a so-called standard one. A well-known example of an index in this sense is the cephalic index in physical anthropology. Conceived in the wide sense, an index could be the average of two measurements, either simply, such as 1/2(length of A + length of B), or in weighted fashion, such as 1/3[(2 x length of A) + length of B].

Rates are important in many experimental fields of biology. The amount of a substance liberated per unit weight or volume of biological material, weight gain per unit time, reproductive rates per unit population size and time (birth rates), and death rates would fall in this category.

The use of ratios and percentages is deeply ingrained in scientific thought. Often ratios may be the only meaningful way to interpret and understand certain types of biological problems. If the biological process being investigated operates on the ratio of the variables studied, one must examine this ratio to understand the process. Thus, Sinnott and Hammond (1935) found that ... interpreted through a form index based on a length-width ratio, but not through the independent dimensions of shape. By similar methods of investigation, we should be able to find selection affecting body proportions to exist in the evolution of almost any organism.

FIGURE 2.1 Sampling from a population of birth weights of infants (a continuous variable). A. A sample of 25. B. A sample of 100. C. A sample of 500. D. A sample of 2000.

There are several disadvantages to using ratios. First, they are relatively inaccurate. Let us return to the ratio 1.2/1.8 mentioned above and recall from the previous section that a measurement of 1.2 implies a true range of measurement of the variable from 1.15 to 1.25; similarly, a measurement of 1.8 implies a range from 1.75 to 1.85. We realize, therefore, that the true ratio may vary anywhere from 1.15/1.85 to 1.25/1.75, that is, from 0.622 to 0.714. The possible maximal error is 4.2% if 1.2 is an original measurement: (1.25 - 1.2)/1.2; the corresponding maximal error for the ratio is 7.0%: (0.714 - 0.667)/0.667. Furthermore, the best estimate of a ratio is not usually the midpoint between its possible ranges. Thus, in our example the midpoint between the implied limits is 0.668 and the ratio based on 1.2/1.8 is 0.666...; while this is only a slight difference, the discrepancy may be greater in other instances.
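The arithmetic behind the first disadvantage can be reproduced in a few lines. This is a sketch under the assumptions of the text (both 1.2 and 1.8 recorded to one decimal place); the variable names are our own. Note that carrying full precision gives a maximal error for the ratio of about 7.1%, whereas the 7.0% quoted above results from first rounding the ratios to three decimal places.

```python
# Implied limits of the two original measurements (unit step 0.1).
num, den = 1.2, 1.8
num_lo, num_hi = 1.15, 1.25
den_lo, den_hi = 1.75, 1.85

ratio = num / den
# The ratio is smallest when the numerator is at its lower limit and the
# denominator at its upper limit, and largest in the opposite case.
ratio_lo = num_lo / den_hi
ratio_hi = num_hi / den_lo

max_error_measurement = (num_hi - num) / num
max_error_ratio = (ratio_hi - ratio) / ratio
midpoint = (ratio_lo + ratio_hi) / 2

print(f"ratio limits: {ratio_lo:.3f} to {ratio_hi:.3f}")
print(f"maximal error of measurement: {max_error_measurement:.1%}")
print(f"maximal error of ratio: {max_error_ratio:.1%}")
print(f"midpoint {midpoint:.3f} versus ratio {ratio:.3f}")
```

The relative error of the ratio is thus noticeably larger than that of either measurement alone, and the midpoint of the implied limits does not coincide with the computed ratio.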

A second disadvantage to ratios and percentages is that they may not be approximately normally distributed (see Chapter 5) as required by many statistical tests. This difficulty can frequently be overcome by transformation of the variable (as discussed in Chapter 10). A third disadvantage of ratios is that in using them one loses information about the relationships between the two variables except for the information about the ratio itself.

2.5 Frequency distributions

If we were to sample a population of birth weights of infants, we could represent each sampled measurement by a point along an axis denoting magnitude of birth weight. This is illustrated in Figure 2.1A, for a sample of 25 birth weights. If we sample repeatedly from the population and obtain 100 birth weights, we shall probably have to place some of these points on top of other points in order to record them all correctly (Figure 2.1B). As we continue sampling, the assemblage of points will continue to increase in size but will assume a fairly definite shape. The outline of the mound of points approximates the distribution of the variable. Remember that a continuous variable such as birth weight can assume an infinity of values between any two points on the abscissa. The refinement of our measurements will determine how fine the number of recorded divisions between any two points along the axis will be.

The distribution of a variable is of considerable biological interest. If we find that the distribution is asymmetrical and drawn out in one direction, it tells us that there is, perhaps, selection that causes organisms to fall preferentially in one of the tails of the distribution, or possibly that the scale of measurement chosen is such as to bring about a distortion of the distribution. If, in a sample of immature insects, we discover that the measurements are bimodally distributed (with two peaks), this would indicate that the population is dimorphic. This means that different species or races may have become intermingled in our sample. Or the dimorphism could have arisen from the presence of both sexes or of different instars.

There are several characteristic shapes of frequency distributions. The most common is the symmetrical bell shape (approximated by the bottom graph in Figure 2.1), which is the shape of the normal frequency distribution discussed in Chapter 5. There are also skewed distributions (drawn out more at one tail than the other), L-shaped distributions as in Figure 2.2, U-shaped distributions, and others, all of which impart significant information about the relationships they represent. We shall have more to say about the implications of various types of distributions in later chapters and sections.

FIGURE 2.2 Number of plants of the sedge Carex flacca in 500 quadrats (abscissa: number of plants per quadrat). Data from Table 2.2; originally from Archibald (1950).

After researchers have obtained data in a given study, they must arrange the data in a form suitable for computation and interpretation. We may assume that variates are randomly ordered initially or are in the order in which the measurements have been taken. A simple arrangement would be an array of the data by order of magnitude. Thus, for example, the variates 7, 6, 5, 7, 8, 9, 6, 7, 4, 6, 7 could be arrayed in order of decreasing magnitude as follows: 9, 8, 7, 7, 7, 7, 6, 6, 6, 5, 4. Where there are some variates of the same value, such as the 6's and 7's in this fictitious example, a time-saving device might immediately have occurred to you, namely, to list a frequency for each of the recurring variates; thus: 9, 8, 7(4x), 6(3x), 5, 4. Such a shorthand notation is one way to represent a frequency distribution, which is simply an arrangement of the classes of variates with the frequency of each class indicated. Conventionally, a frequency distribution is stated in tabular form; for our example, this is done as follows:

Variable    Frequency
Y           f
9           1
8           1
7           4
6           3
5           1
4           1

The above is an example of a quantitative frequency distribution, since Y is clearly a measurement variable. However, arrays and frequency distributions need not be limited to such variables. We can make frequency distributions of attributes, called qualitative frequency distributions. In these, the various classes are listed in some logical or arbitrary order. For example, in genetics we might have a qualitative frequency distribution as follows:

Phenotype    f
A-           86
aa           32

This tells us that there are two classes of individuals, those identified by the A- phenotype, of which 86 were found, and those comprising the homozygote recessive aa, of which 32 were seen in the sample.

TABLE 2.1 Observed frequency of melanoma over body regions in 4599 men and 4786 women.

An example of a more extensive qualitative frequency distribution is given in Table 2.1, which shows the distribution of melanoma (a type of skin cancer) over body regions in men and women. This table tells us that the trunk and limbs are the most frequent sites for melanomas and that the buccal cavity, the rest of the gastrointestinal tract, and the genital tract are rarely afflicted by this

type of cancer. We often encounter other examples of qualitative frequency distributions in ecology in the form of tables, or species lists, of the inhabitants of a sampled ecological area. Such tables catalog the inhabitants by species or at a higher taxonomic level and record the number of specimens observed for each. The arrangement of such tables is usually alphabetical, or it may follow a special convention, as in some botanical species lists.

A quantitative frequency distribution based on meristic variates is shown in Table 2.2. This is an example from plant ecology: the number of plants per quadrat sampled is listed at the left in the variable column; the observed frequency is shown at the right.

TABLE 2.2
A meristic frequency distribution.
Number of plants of the sedge Carex flacca found in 500 quadrats.

No. of plants per quadrat    Observed frequency
Y                            f
0                            181
1                            118
2                             97
3                             54
4                             32
5                              9
6                              5
7                              3
8                              1
Total                        500

Source: Data from Archibald (1950).

Quantitative frequency distributions based on a continuous variable are the most commonly employed frequency distributions; you should become thoroughly familiar with them. An example is shown in Box 2.1. It is based on 25 femur lengths measured in an aphid population. The 25 readings are shown at the top of Box 2.1 in the order in which they were obtained as measurements. (They could have been arrayed according to their magnitude.) The data are next set up in a frequency distribution. The variates increase in magnitude by unit steps of 0.1. The frequency distribution is prepared by entering each variate in turn on the scale and indicating a count by a conventional tally mark. When all of the items have been tallied in the corresponding class, the tallies are converted into numerals indicating frequencies in the next column. Their sum is indicated by Σf.

What have we achieved in summarizing our data? The original 25 variates are now represented by only 15 classes. We find that variates 3.6, 3.8, and 4.3 have the highest frequencies. However, we also note that there are several classes, such as 3.4 or 3.7, that are not represented by a single aphid. This gives the entire frequency distribution a drawn-out and scattered appearance. The reason for this is that we have only 25 aphids, too few to put into a frequency distribution with 15 classes. To obtain a more cohesive and smooth-looking distribution, we have to condense our data into fewer classes. This process is known as grouping; it is described in the following paragraphs.
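The steps from raw variates to an array and then to a frequency distribution, described earlier in this section for the fictitious variates 7, 6, 5, 7, 8, 9, 6, 7, 4, 6, 7, can be sketched as follows. The tallying is done here with `collections.Counter` from the Python standard library rather than with tally marks; the variable names are our own.

```python
from collections import Counter

# The fictitious variates from the text, in their original order.
variates = [7, 6, 5, 7, 8, 9, 6, 7, 4, 6, 7]

# An array: the variates arranged in order of decreasing magnitude.
array = sorted(variates, reverse=True)
print(array)

# A frequency distribution: each class with its frequency f.
frequencies = Counter(variates)
print("Y   f")
for y in sorted(frequencies, reverse=True):
    print(f"{y}   {frequencies[y]}")
print("sum of f =", sum(frequencies.values()))
```

The same approach applies to attributes (a qualitative frequency distribution) if the list holds class labels such as phenotypes instead of numbers.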

We should realize that grouping individual variates into classes of wider range is only an extension of the same process that took place when we obtained the initial measurement. Thus, as we have seen in Section 2.3, when we measure an aphid and record its femur length as 3.3 units, we imply thereby that the true measurement lies between 3.25 and 3.35 units, but that we were unable to measure to the second decimal place. In recording the measurement initially as 3.3 units, we estimated that it fell within this range. Had we estimated that it exceeded the value of 3.35, for example, we would have given it the next higher score, 3.4. Therefore, all the measurements between 3.25 and 3.35 were in fact grouped into the class identified by the class mark 3.3. Our class interval was 0.1 units. If we now wish to make wider class intervals, we are doing nothing but extending the range within which measurements are placed into one class.

Reference to Box 2.1 will make this process clear. We group the data twice in order to impress upon the reader the flexibility of the process. In the first example of grouping, the class interval has been doubled in width; that is, it has been made to equal 0.2 units. If we start at the lower end, the implied class limits will now be from 3.25 to 3.45, the limits for the next class from 3.45 to 3.65, and so forth.

Our next task is to find the class marks. This was quite simple in the frequency distribution shown at the left side of Box 2.1, in which the original measurements were used as class marks. However, now we are using a class interval twice as wide as before, and the class marks are calculated by taking the midpoint of the new class intervals. Thus, to find the class mark of the first class, we take the midpoint between 3.25 and 3.45, which turns out to be 3.35. We note that the class mark has one more decimal place than the original measurements. We should not now be led to believe that we have suddenly achieved greater precision. Whenever we designate a class interval whose last significant digit is even (0.2 in this case), the class mark will carry one more decimal place than the original measurements. On the right side of the table in Box 2.1 the data are grouped once again, using a class interval of 0.3. Because of the odd last significant digit, the class mark now shows as many decimal places as the original variates, the midpoint between 3.25 and 3.55 being 3.4.
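Implied class limits and class marks can be generated by starting at the lowest implied limit and adding the class interval repeatedly. The following is a minimal sketch; `grouped_classes` is a hypothetical helper of our own, and the limits are passed as strings so that decimal arithmetic is exact.

```python
from decimal import Decimal


def grouped_classes(lower_limit: str, interval: str, n_classes: int):
    """Return (lower limit, upper limit, class mark) for each class.

    Successive classes are obtained by adding the class interval to the
    previous limits; the class mark is the midpoint of the two limits.
    """
    lower = Decimal(lower_limit)
    width = Decimal(interval)
    classes = []
    for _ in range(n_classes):
        upper = lower + width
        class_mark = (lower + upper) / 2
        classes.append((lower, upper, class_mark))
        lower = upper
    return classes


# Class interval 0.2: class marks carry one more decimal place.
for lower, upper, mark in grouped_classes("3.25", "0.2", 3):
    print(f"{lower}-{upper}  class mark {mark}")

# Class interval 0.3: class marks have as many places as the variates.
for lower, upper, mark in grouped_classes("3.25", "0.3", 3):
    print(f"{lower}-{upper}  class mark {mark}")
```

With an interval of 0.2 the first class marks are 3.35, 3.55, and 3.75; with an interval of 0.3 the first class mark is 3.4, as stated in the text.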

BOX 2.1
Preparation of frequency distribution and grouping into fewer classes with wider class intervals.
Twenty-five femur lengths of the aphid Pemphigus. Measurements are in mm x 10^-1.
Source: Data from R. R. Sokal.
Histogram of the original frequency distribution shown above and of the grouped distribution with 5 classes. Line below abscissa shows class marks for the grouped frequency distribution. Shaded bars represent original frequency distribution; hollow bars represent grouped distribution. Abscissa: Y (femur length, in units of 0.1 mm).
For a detailed account of the process of grouping, see Section 2.5.

Once the implied class limits and the class mark for the first class have been correctly found, the others can be written down by inspection without any special computation. Simply add the class interval repeatedly to each of the values. Thus, starting with the lower limit 3.25, by adding 0.2 we obtain 3.45, 3.65, 3.85, and so forth; similarly, for the class marks we obtain 3.35, 3.55, 3.75, and so forth. It should be obvious that the wider the class intervals, the more compact the data become but also the less precise. However, looking at

the frequency distribution of aphid femur lengths in Box 2.1, we notice that the initial rather chaotic structure is being simplified by grouping. When we group the frequency distribution into five classes with a class interval of 0.3 units, it becomes notably bimodal (that is, it possesses two peaks of frequencies).

In setting up frequency distributions, from 12 to 20 classes should be established. This rule need not be slavishly adhered to, but it should be employed with some of the common sense that comes from experience in handling statistical data. The number of classes depends largely on the size of the sample studied. Samples of less than 40 or 50 should rarely be given as many as 12 classes, since that would provide too few frequencies per class. On the other hand, samples of several thousand may profitably be grouped into more than 20 classes. If the aphid data of Box 2.1 need to be grouped, they should probably not be grouped into more than 6 classes.

If the original data provide us with fewer classes than we think we should have, then nothing can be done if the variable is meristic, since this is the nature of the data in question. However, with a continuous variable a scarcity of classes would indicate that we probably had not made our measurements with sufficient precision. If we had followed the rules on number of significant digits for measurements stated in Section 2.3, this could not have happened.

Whenever we come up with more than the desired number of classes, grouping should be undertaken. When the data are meristic, the implied limits of continuous variables are meaningless. Yet with many meristic variables, such as a bristle number varying from a low of 13 to a high of 81, it would probably be wise to group the variates into classes, each containing several counts. This can best be done by using an odd number as a class interval so that the class mark representing the data will be a whole rather than a fractional number. Thus, if we were to group the bristle numbers 13, 14, 15, and 16 into one class, the class mark would have to be 14.5, a meaningless value in terms of bristle number. It would therefore be better to use a class ranging over 3 bristles or 5 bristles, giving the integral value 14 or 15 as a class mark.

Grouping data into frequency distributions was necessary when computations were done by pencil and paper. Nowadays even thousands of variates can be processed efficiently by computer without prior grouping. However, frequency distributions are still extremely useful as a tool for data analysis. This is especially true in an age in which it is all too easy for a researcher to obtain a numerical result from a computer program without ever really examining the data for outliers or for other ways in which the sample may not conform to the assumptions of the statistical methods.

When the shape of a frequency distribution is of particular interest, we may wish to present the distribution in graphic form when discussing the results. This is generally done by means of frequency diagrams, of which there are two common types. For a distribution of meristic data we employ a bar diagram, ...

Rather than using tally marks to set up a frequency distribution, as was done in Box 2.1, we can employ Tukey's stem-and-leaf display. This technique is an improvement, since it not only results in a frequency distribution of the variates of a sample but also permits easy checking of the variates and ordering them into an array (neither of which is possible with tally marks). This technique will therefore be useful in computing the median of a sample (see Section 3.3) and in computing various tests that require ordered arrays of the sample variates.

We then put the next digit of the first variate (a "leaf") at that level of the stem corresponding to its leading digit(s). The first observation in our sample is 4.9. We therefore place a 9 next to the 4. The next variate is 4.6. It is entered by finding the stem level for the leading digit 4 and recording a 6 next to the 9 that is already there. Similarly, for the third variate, 5.5, we record a 5 next to the leading digit 5. We continue in this way until all 15 variates have been entered (as "leaves") in sequence along the appropriate leading digits of the stem. The completed array is the equivalent of a frequency distribution and has the appearance of a histogram or bar diagram (see the illustration). Moreover, it permits the efficient ordering of the variates. Thus, from the completed array it becomes obvious that the appropriate ordering of the 15 variates is 2.3, 3.6, 3.7, 4.4, 4.6, 4.9, 5.5, 6.4, 7.1, 7.3, 9.1, 9.8, 12.7, 16.3, 18.0. The median can easily be read off the stem-and-leaf display. It is clearly 6.4. For very large samples, stem-and-leaf displays may become awkward. In such cases a conventional frequency distribution as in Box 2.1 would be preferable.
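The construction of a stem-and-leaf display lends itself to a short program. The sketch below uses the 15 variates listed in this section. Only the first three (4.9, 4.6, 5.5) are given in their original order in the text; the order of the remaining twelve is an arbitrary assumption made here for illustration, and `stem_and_leaf` is our own hypothetical helper.

```python
from collections import defaultdict
from statistics import median

# First three variates in the order given in the text; the order of the
# rest is assumed for illustration only.
variates = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
            2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]


def stem_and_leaf(values):
    """Map each stem (leading digits) to its leaves (last digit)."""
    stems = defaultdict(list)
    for value in values:
        tenths = round(value * 10)       # work in units of 0.1
        stem, leaf = divmod(tenths, 10)  # e.g. 4.9 -> stem 4, leaf 9
        stems[stem].append(leaf)         # leaves kept in order of entry
    return stems


display = stem_and_leaf(variates)
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")

# Ordering the variates: sort the leaves within each stem.
ordered = [(stem * 10 + leaf) / 10
           for stem in sorted(display)
           for leaf in sorted(display[stem])]
print(ordered)
print("median:", median(ordered))
```

Because the leaves are recorded in order of entry, the printed display mirrors what would be written by hand; sorting within each stem afterward yields the ordered array and hence the median.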

Birth weight (in oz.)

2.6 The handling of data

Data must be handled skillfully and expeditiously so that statistics can be

prac-ticed successfully Readers should therefore acquaint themselves with the

var-the variable (in our case, var-the number of plants per quadrat), and var-the ordinate

represents the frequencies The important point about such a diagram is that

the bars do not touch each other, which indicates that the variable is not

con-tinuous By contrast, continuous variables, such as the frequency distribution

of the femur lengths of aphid stem mothers, are graphed as a histogrum In a

histogram the width of each bar along the abscissa represents a class interval

of the frequency distribution and the bars touch each other to show that the

actual limits of the classes are contiguous The midpoint of the bar corresponds

to the class mark At the bottom of Box 2.1 are shown histograms of the

frc-quency distribution of the aphid data ungrouped and grouped The height of

each bar represents the frequency of the corresponding class

To illustrate that histograms are appropriate approximations to the

con-tinuous distributions found in nature, we may take a histogram and make the

class intervals more narrow, producing more classes The histogram would then

clearly have a closer fit to a continuous distribution We can continue this

pro-cess until the class intervals become infinitesimal in width At this point the

histogram becomes the continuous distribution of the variable

Occasionally the class intervals of a grouped continuous frequency

distri-hution arc unequal For instance, in a frequency distridistri-hution of ages we might

have more detail on the dilTerent ages of young individuals and less accurate

identilication of the ages of old individuals In such cases, the class intervals

I'm the older age groups would be wider, those for the younger age groups

nar-rower In representing such data the bars of the histogram arc drawn with

dilkrent widths

Figure 2.3 shows another graphical mode of representation of a frequency

distribution of a continuous variahle (in this case, birth weight in infants) As

we shall sec later the shapes of distrihutions seen in such frequency polygons

can reveal much about the biological situations alTecting the given variable

25

2.6 / THE HANDLING OF DATA

*I;(lr illhll"lll<ltilHl or t() {)nkr Cillltact 1':Xl'kr S()thvare Wehsitc:htlp:l/\\'\\.'\\".l'Xl'ICl"s()rtwarl'.\."il111 1'~ IIl:lil:

salcs(II'l'Xl'tnso!lwal"l·.l"llili Thl'se progralllS arc cOlllpatible with Willl!ows XI' allli Vista.

In this book we ignore "pencil-and-paper" short-cut methods for tions, found in earlier textbooks of statistics, since we assume that the studenthas access to a calculator or a computer Some statistical methods are veryeasy to use because special tables exist that provide answers for standard sta-tistical problems; thus, almost no computation is involved An example isFinney's table, a 2-by-2 contingency table containing small frequencies that isused for the test of independence (Pearson and Hartley, 1958, Table 38) Forsmall problems, Finney's table can be used in place of Fisher's method of findingexact probabilities, which is very tedious Other statistical techniques are soeasy to carry out that no mechanical aids are needed Some are inherentlysimple, such as the sign test (Section 10.3) Other methods are only approximatebut can often serve the purpose adequately; for example, we may sometimessubstitute an easy-to-evaluate median (defined in Section 3.3) for the mean(described in Sections 3.1 and 3.2) which requires eomputation

We can use many new types of equipment to perform statistical

computa-tions-many more than we eould have when Introduction to Biostutistics was

first published The once-standard electrically driven mechanical desk calculatorhas eompletely disappeared Many new electronic devices, from small pocketealculators to larger desk-top computers, have replaced it Such devices are sodiverse that we will not try to survey the field here Even if we did, the rate ofadvance in this area would be so rapid that whatever we might say would soonbecome obsolete

We cannot really draw the line between the more sophisticated electronic calculators, on the one hand, and digital computers, on the other. There is no abrupt increase in capabilities between the more versatile programmable calculators and the simpler microcomputers, just as there is none as we progress from microcomputers to minicomputers and so on up to the large computers that one associates with the central computation center of a large university or research laboratory. All can perform computations automatically and be controlled by a set of detailed instructions prepared by the user. Most of these devices, including programmable small calculators, are adequate for all of the computations described in this book, even for large sets of data.

The material in this book consists of relatively standard statistical computations that are available in many statistical programs. BIOMstat* is a statistical software package that includes most of the statistical methods covered in this book.

FIGURE 2.3
Frequency polygon. Birth weights of 9465 male infants. Chinese third-class patients in Singapore, 1950 and 1951. Data from Millis and Seng (1954).

The use of modern data processing procedures has one inherent danger. One can all too easily either feed in erroneous data or choose an inappropriate program. Users must select programs carefully to ensure that those programs perform the desired computations, give numerically reliable results, and are as free from error as possible. When using a program for the first time, one should test it using data from textbooks with which one is familiar. Some programs are notorious because the programmer has failed to guard against excessive rounding errors or other problems. Users of a program should carefully check the data being analyzed so that typing errors are not present. In addition, programs should help users identify and remove bad data values and should provide them with transformations so that they can make sure that their data satisfy the assumptions of various analyses.
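The warning about rounding errors can be made concrete with a small sketch. It is not taken from any particular program: the function names and the four test values are hypothetical, chosen only so that the correct answer (a variance of 30 with n - 1 in the divisor) is easy to verify by hand. The first function uses a one-pass shortcut that subtracts two very large, nearly equal numbers; the second computes deviations from the mean first.

```python
def one_pass_variance(values):
    """Shortcut formula: (sum of Y^2 - (sum of Y)^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(v * v for v in values)
    return (total_of_squares - total * total / n) / (n - 1)


def two_pass_variance(values):
    """Deviation formula: sum of (Y - mean)^2 / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


# Hypothetical readings with a large common offset; the deviations from the
# mean are -6, -3, 3, 6, so the variance should be 90 / 3 = 30.
readings = [1_000_000_004.0, 1_000_000_007.0, 1_000_000_013.0, 1_000_000_016.0]

if __name__ == "__main__":
    print("two-pass:", two_pass_variance(readings))
    print("one-pass:", one_pass_variance(readings))
```

With double-precision floats the two-pass version returns 30.0, whereas the one-pass version is far off, because squares near 10^18 cannot be stored exactly. Running a program on a small data set with a known answer, as recommended above, is one way to catch this kind of fault.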

Exercises

7815.01, 2.9149, and 20.1500. What are the implied limits before and after rounding? Round these same numbers to one decimal place.

ANS. For the first value: 107; 106.545-106.555; 106.5-107.5; 106.6

(a) Statistical and biological populations. (b) Variate and individual. (c) Accuracy and precision (repeatability). (d) Class interval and class mark. (e) Bar diagram and histogram. (f) Abscissa and ordinate.

them into a frequency distribution? Give class limits as well as class marks.

domestic pigeons into a frequency distribution and draw its histogram (data from Olson and Miller, 1958). Measurements are in millimeters.

How precisely should you measure the wing length of a species of mosquitoes in a study of geographic variation if the smallest specimen has a length of about

Transform the 40 measurements in Exercise 2.4 into common logarithms (use a table or calculator) and make a frequency distribution of these transformed variates. Comment on the resulting change in the pattern of the frequency distribution from that found before.

For the data of Tables 2.1 and 2.2 identify the individual observations, samples, populations, and variables.

Make a stem-and-leaf display of the data given in Exercise 2.4.

The distribution of ages of striped bass captured by hook and line from the East River and the Hudson River during 1980 was reported as follows (Young, 1981):

Show this distribution in the form of a bar diagram.

An early and fundamental stage in any science is the descriptive stage. Until phenomena can be accurately described, an analysis of their causes is premature. The question "What?" comes before "How?" Unless we know something about

pigs, as well as its fluctuations from day to day and within days, we shall be unable to ascertain the effect of a given dose of a drug upon this variable. In a sizable sample it would be tedious to obtain our knowledge of the material by contemplating each individual observation. We need some form of summary to permit us to deal with the data in manageable form, as well as to be able to share our findings with others in scientific talks and publications. A histogram or bar diagram of the frequency distribution would be one type of summary. However, for most purposes, a numerical summary is needed to describe concisely, yet accurately, the properties of the observed frequency

statistics. This chapter will introduce you to some of them and show how they are computed.

Two kinds of descriptive statistics will be discussed in this chapter: statistics of location and statistics of dispersion. The statistics of location (also known as measures of central tendency) describe the position of a sample along a given dimension representing a variable. For example, after we measure the length of the animals within a sample, we will then want to know whether the animals are closer, say, to 2 cm or to 20 cm. To express a representative value for the sample of observations (for the length of the animals), we use a statistic of location. But statistics of location will not describe the shape of a frequency distribution. The shape may be long or very narrow, may be humped or U-shaped, may contain two humps, or may be markedly asymmetrical. Quantitative measures of such aspects of frequency distributions are required. To this end we need to define and study the statistics of dispersion.

The arithmetic mean, described in Section 3.1, is undoubtedly the most important single statistic of location, but others (the geometric mean, the harmonic mean, the median, and the mode) are briefly mentioned in Sections 3.2, 3.3, and 3.4. A simple statistic of dispersion (the range) is briefly noted in Section 3.5, and the standard deviation, the most common statistic for describing dispersion, is explained in Section 3.6. Our first encounter with contrasts between sample statistics and population parameters occurs in Section 3.7, in connection with statistics of location and dispersion. In Section 3.8 there is a description of practical methods for computing the mean and standard deviation. The coefficient of variation (a statistic that permits us to compare the relative amount of dispersion in different samples) is explained in the last section (Section 3.9).

The techniques that will be at your disposal after you have mastered this chapter will not be very powerful in solving biological problems, but they will be indispensable tools for any further work in biostatistics. Other descriptive statistics, of both location and dispersion, will be taken up in later chapters.

An important note: We shall first encounter the use of logarithms in this chapter. To avoid confusion, common logarithms have been consistently abbreviated as log, and natural logarithms as ln. Thus, $\log x$ means $\log_{10} x$ and $\ln x$ means $\log_e x$.

3.1 The arithmetic mean

The most common statistic of location is familiar to everyone. It is the arithmetic mean, commonly called the mean or average. The mean is calculated by summing all the individual observations or items of a sample and dividing this sum by the number of items in the sample. For instance, as the result of a gas analysis in a respirometer an investigator obtains the following four readings of oxygen percentages and sums them:

is symbolized by $Y_i$, which stands for the ith observation in the sample. Four observations could be written symbolically as follows:

$$Y_1, Y_2, Y_3, Y_4$$

We shall define n, the sample size, as the number of items in a sample. In this particular instance, the sample size n is 4. Thus, in a large sample, we can symbolize the array from the first to the nth item as follows:

$$Y_1, Y_2, \ldots, Y_n$$

When we wish to sum items, we use the following notation:

$$\sum_{i=1}^{i=n} Y_i = Y_1 + Y_2 + \cdots + Y_n$$

The capital Greek sigma, $\sum$, simply means the sum of the items indicated. The i = 1 means that the items should be summed, starting with the first one and ending with the nth one, as indicated by the i = n above the $\sum$. The subscript and superscript are necessary to indicate how many items should be summed. The "i =" in the superscript is usually omitted as superfluous. For instance, if we had wished to sum only the first three items, we would have written $\sum_{i=1}^{3} Y_i$. On the other hand, had we wished to sum all of them except the first one, we would have written $\sum_{i=2}^{n} Y_i$. With some exceptions (which will appear in later chapters), it is desirable to omit subscripts and superscripts, which generally add to the apparent complexity of the formula and, when they are unnecessary, distract the student's attention from the important relations expressed by the formula. Below are seen increasing simplifications of the complete summation notation shown at the extreme left:

formula is written as follows:

$$\bar{Y} = \frac{\sum Y}{n}$$

This formula tells us, "Sum all the (n) items and divide the sum by n."

sheet of cardboard and then cut out the histogram and lay it flat against a blackboard, supporting it with a pencil beneath, chances are that it would be out of balance, toppling to either the left or the right. If you moved the supporting pencil point to a position about which the histogram would exactly balance, this point of balance would correspond to the arithmetic mean.

We often must compute averages of means or of other statistics that may differ in their reliabilities because they are based on different sample sizes. At other times we may wish the individual items to be averaged to have different

average. A general formula for calculating the weighted average of a set of

of $Y_i$ in such cases are unlikely to represent variates. They are more likely to be sample means $\bar{Y}$ or some other statistics of different reliabilities.

variates but are means. Thus, if the following three means are based on differing sample sizes, as shown,

their weighted average will be

Note that in this example, computation of the weighted mean is exactly equivalent to adding up all the original measurements and dividing the sum by the

having the highest mean, will influence the weighted average in proportion to

3.2 Other means

of the logarithms of variable Y. Since addition of logarithms is equivalent to multiplication of their antilogarithms, there is another way of representing this quantity: it is

$$\sqrt[n]{Y_1 Y_2 Y_3 \cdots Y_n} \qquad (3.4)$$

The geometric mean permits us to become familiar with another operator symbol: capital pi, $\prod$, which may be read as "product." Just as $\sum$ symbolizes

the items that follow it. The subscripts and superscripts have exactly the same meaning as in the summation case. Thus, Expression (3.4) for the geometric mean can be rewritten more compactly as follows:

$$\sqrt[n]{\prod_{i=1}^{n} Y_i} \qquad (3.4a)$$

The computation of the geometric mean by Expression (3.4a) is quite tedious. In practice, the geometric mean has to be computed by transforming the variates into logarithms.

The reciprocal of the arithmetic mean of reciprocals is called the harmonic mean. If we symbolize it by $H_Y$, the formula for the harmonic mean can be written in concise form (without subscripts and superscripts) as

$$\frac{1}{H_Y} = \frac{1}{n} \sum \frac{1}{Y}$$

You may wish to convince yourself that the geometric mean and the harmonic mean of the four oxygen percentages are 14.65% and 14.09%, respectively. Unless the individual items do not vary, the geometric mean is always less than the arithmetic mean, and the harmonic mean is always less than the geometric mean.
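The relations among these means can be checked numerically. The sketch below is only illustrative. The four oxygen readings are not reproduced in this text, so the values used here are an assumption, chosen because they agree with the mean of 15.325%, the extremes of 10.8 and 23.3, and the geometric and harmonic means of 14.65% and 14.09% quoted in this chapter. The three group means and sample sizes used for the weighted average are hypothetical as well, as are the function names.

```python
import math


def arithmetic_mean(values):
    """Sum of the items divided by the number of items."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Antilogarithm of the mean of the logarithms (all values must be > 0)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Assumed oxygen percentages (not listed in this text; see the note above).
oxygen = [14.9, 10.8, 12.3, 23.3]

# Hypothetical sample means and the sample sizes they are based on.
group_means = [3.0, 5.0, 4.0]
group_sizes = [10, 30, 20]

if __name__ == "__main__":
    print(round(arithmetic_mean(oxygen), 3))
    print(round(geometric_mean(oxygen), 2))
    print(round(harmonic_mean(oxygen), 2))
    print(weighted_mean(group_means, group_sizes))
```

Weighting each mean by its sample size gives the same result as pooling all the original measurements, which is why the largest sample has the greatest influence on the weighted average.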

Some beginners in statistics have difficulty in accepting the fact that measures of location or central tendency other than the arithmetic mean are permissible or even desirable. They feel that the arithmetic mean is the "logical" average and that any other mean would be a distortion. This whole problem relates to the proper scale of measurement for representing data; this scale is not always the linear scale familiar to everyone, but is sometimes by preference a logarithmic or reciprocal scale. If you have doubts about this question, we shall try to allay them in Chapter 10, where we discuss the reasons for transforming variables.

3.3 The median

The median M is a statistic of location occasionally useful in biological research. It is defined as that value of the variable (in an ordered array) that has an equal number of items on either side of it. Thus, the median divides a frequency distribution into two halves. In the following sample of five measurements,

14, 15, 16, 19, 23

M = 16, since the third observation has an equal number of observations on both sides of it. We can visualize the median easily if we think of an array from largest to smallest (for example, a row of men lined up by their heights). The median individual will then be that man having an equal number of men on his right and left sides. His height will be the median height of the sample considered. This quantity is easily evaluated from a sample array with an odd number of individuals. When the number in the sample is even, the median is conventionally calculated as the midpoint between the (n/2)th and the [(n/2) + 1]th variate. Thus, for the sample of four measurements

14, 15, 16, 19

the median would be the midpoint between the second and third items, or 15.5.
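The rule for odd and even sample sizes can be written out as a short function. This is a minimal sketch with a hypothetical function name; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Median of a sample: the middle item, or the midpoint of the two middle items."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the middle item has equally many items on either side.
        return ordered[middle]
    # Even n: midpoint between the (n/2)th and the [(n/2) + 1]th variate.
    return (ordered[middle - 1] + ordered[middle]) / 2


if __name__ == "__main__":
    print(median([14, 15, 16, 19, 23]))
    print(median([14, 15, 16, 19]))
```

Applied to the two samples in the text, the function gives 16 and 15.5.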

Whenever any one value of a variate occurs more than once, problems may develop in locating the median. Computation of the median item becomes more involved because all the members of a given class in which the median item is located will have the same class mark. The median then is the (n/2)th variate in the frequency distribution. It is usually computed as that point between the class limits of the median class where the median individual would be located (assuming the individuals in the class were evenly distributed).
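The interpolation within the median class can be sketched as follows. The class limits and frequencies are hypothetical, and the function (also a hypothetical name) assumes classes of equal width listed in increasing order.

```python
def grouped_median(lower_limits, class_width, frequencies):
    """Interpolated median of a frequency distribution.

    Assumes the individuals in the median class are evenly spread
    between its class limits.
    """
    n = sum(frequencies)
    if n == 0:
        raise ValueError("empty frequency distribution")
    half = n / 2
    cumulative = 0
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= half:
            # Fraction of the class traversed to reach the (n/2)th variate.
            return lower + (half - cumulative) / f * class_width
        cumulative += f
    raise ValueError("frequencies do not reach n/2")


# Hypothetical distribution: classes 0-10, 10-20, 20-30 with 2, 6, and 2 items.
example = grouped_median([0, 10, 20], 10, [2, 6, 2])

if __name__ == "__main__":
    print(example)
```

In this example the (n/2)th variate is the fifth of ten. Two items lie below the median class, so the median falls three-sixths of the way through the class that starts at 10.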

The median is just one of a family of statistics dividing a frequency distribution into equal areas. It divides the distribution into two halves. The three quartiles cut the distribution at the 25, 50, and 75% points, that is, at points dividing the distribution into first, second, third, and fourth quarters by area (and frequencies). The second quartile is, of course, the median. (There are also quintiles, deciles, and percentiles, dividing the distribution into 5, 10, and 100 equal portions, respectively.)

Medians are most often used for distributions that do not conform to the standard probability models, so that nonparametric methods (see Chapter 10) must be used. Sometimes the median is a more representative measure of location than the arithmetic mean. Such instances almost always involve asymmetric distributions. An often quoted example from economics would be a suitable measure of location for the "typical" salary of an employee of a corporation. The very high salaries of the few senior executives would shift the arithmetic mean, the center of gravity, toward a completely unrepresentative value. The median, on the other hand, would be little affected by a few high salaries; it would give the particular point on the salary scale above which lie 50% of the salaries in the corporation, the other half being lower than this figure.

In biology an example of the preferred application of a median over the arithmetic mean may be in populations showing skewed distribution, such as weights. Thus a median weight of American males 50 years old may be a more meaningful statistic than the average weight. The median is also of importance in cases where it may be difficult or impossible to obtain and measure all the items of a sample. For example, suppose an animal behaviorist is studying the time it takes for a sample of animals to perform a certain behavioral step. The variable he is measuring is the time from the beginning of the experiment until each individual has performed. What he wants to obtain is an average time of performance. Such an average time, however, can be calculated only after records have been obtained on all the individuals. It may take a long time for the slowest animals to complete their performance, longer than the observer wishes to spend. (Some of them may never respond appropriately, making the computation of a mean impossible.) Therefore, a convenient statistic of location to describe these animals may be the median time of performance. Thus, so long as the observer knows what the total sample size is, he need not have measurements for the right-hand tail of his distribution. Similar examples would be the responses to a drug or poison in a group of individuals (the median lethal or effective dose, $LD_{50}$ or $ED_{50}$) or the median time for a mutation to appear in a number of lines of a species.
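The point that the right-hand tail need not be measured can be illustrated with a short sketch. The function name and the response times are hypothetical; the only inputs are the times recorded so far and the total sample size.

```python
def median_from_fastest(observed_times, total_n):
    """Median response time when only the fastest animals have been timed.

    observed_times: times of the animals that have responded so far;
    every animal not yet timed is assumed to be slower than all of these.
    Returns None if too few animals have responded to locate the median.
    """
    ordered = sorted(observed_times)
    if total_n < 1 or len(ordered) > total_n:
        raise ValueError("total_n must be at least the number of observed times")
    middle = total_n // 2
    needed = middle + 1  # ordered items required for either odd or even n
    if len(ordered) < needed:
        return None
    if total_n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


if __name__ == "__main__":
    # Four of seven animals have responded; the other three are still being timed.
    print(median_from_fastest([3, 5, 6, 8], total_n=7))
```

A mean could not be computed here until all seven animals had responded, but the median is already fixed by the fourth-fastest time.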

3.4 The mode

The mode refers to the value represented by the greatest number of individuals. When seen on a frequency distribution, the mode is the value of the variable at which the curve peaks. In grouped frequency distributions the mode as a point has little meaning. It usually suffices to identify the modal class. In biology, the mode does not have many applications.

Distributions having two peaks (equal or unequal in height) are called bimodal; those with more than two peaks are multimodal. In those rare distributions that are U-shaped, we refer to the low point at the middle of the distribution as an antimode.

In evaluating the relative merits of the arithmetic mean, the median, and the mode, a number of considerations have to be kept in mind. The mean is generally preferred in statistics, since it has a smaller standard error than other statistics of location (see Section 6.2), it is easier to work with mathematically, and it has an additional desirable property (explained in Section 6.1): it will tend to be normally distributed even if the original data are not. The mean is markedly affected by outlying observations; the median and mode are not. The mean is generally more sensitive to changes in the shape of a frequency distribution, and if it is desired to have a statistic reflecting such changes, the mean may be preferred.

In symmetrical, unimodal distributions the mean, the median, and the mode are all identical. A prime example of this is the well-known normal distribution of Chapter 5. In a typical asymmetrical distribution, such as the one shown in Figure 3.1, the relative positions of the mode, median, and mean are generally these: the mean is closest to the drawn-out tail of the distribution, the mode is farthest, and the median is between these. An easy way to remember this sequence is to recall that they occur in alphabetical order from the longer tail of the distribution.

FIGURE 3.1
An asymmetrical frequency distribution (skewed to the right) showing location of the mean, median, and mode. Percent butterfat in 120 samples of milk (from a Canadian cattle breeders' record book).

3.5 The range

We now turn to measures of dispersion. Figure 3.2 demonstrates that radically different-looking distributions may possess the identical arithmetic mean. It is therefore obvious that other ways of characterizing distributions must be found.

One simple measure of dispersion is the range, which is defined as the difference between the largest and the smallest items in a sample. Thus, the range of the four oxygen percentages listed earlier (Section 3.1) is

Range = 23.3 - 10.8 = 12.5%

and the range of the aphid femur lengths (Box 2.1) is

Range = 4.7 - 3.3 = 1.4 units of 0.1 mm

Since the range is a measure of the span of the variates along the scale of the variable, it is in the same units as the original measurements. The range is clearly affected by even a single outlying value and for this reason is only a

3.6 The standard deviation

We desire that a measure of dispersion take all items of a distribution into consideration, weighting each item by its distance from the center of the distribution. We shall now try to construct such a statistic. In Table 3.1 we show a

shows the variates in the order in which they were reported. The computation of the mean is shown below the table. The mean neutrophil count turns out to be 7.713.

The distance of each variate from the mean is computed as the following deviation:

$$y = Y - \bar{Y}$$

Each individual deviation, or deviate, is by convention computed as the

Deviates are symbolized by lowercase letters corresponding to the capital letters

manner.

We now wish to calculate an average deviation that will sum all the deviates and divide them by the number of deviates in the sample. But note that when we sum our deviates, negative and positive deviates cancel out, as is shown by the sum at the bottom of column (2); this sum appears to be unequal to zero only because of a rounding error. Deviations from the arithmetic mean always sum to zero because the mean is the center of gravity. Consequently, an average based on the sum of deviations would also always equal zero. You are urged to study Appendix A1.1, which demonstrates that the sum of deviations around the mean of a sample is equal to zero.

TABLE 3.1
The standard deviation. Long method, not recommended for hand or calculator computations but shown here to illustrate the meaning of the standard deviation. The data are blood neutrophil counts (divided by 1000) per microliter, in 15 patients with nonhematological tumors.

Squaring the deviates gives us column (3) of Table 3.1 and enables us to reach a result other than zero. (Squaring the deviates also holds other mathematical advantages, which we shall take up in Sections 7.5 and 11.3.) The sum of the squared deviates (in this case, 308.7770) is a very important quantity in statistics. It is called the sum of squares and is identified symbolically as $\sum y^2$. Another common symbol for the sum of squares is SS.

resulting quantity is known as the variance, or the mean square:

$$\text{Variance} = \frac{\sum y^2}{n} = \frac{308.7770}{15} = 20.5851$$

The variance is a measure of fundamental importance in statistics, and we shall employ it throughout this book. At the moment, we need only remember that because of the squaring of the deviations, the variance is expressed in squared units. To undo the effect of the squaring, we now take the positive square root of the variance and obtain the standard deviation:

$$\text{Standard deviation} = +\sqrt{\frac{\sum y^2}{n}} = 4.537$$

Thus, standard deviation is again expressed in the original units of measurement, since it is a square root of the squared units of the variance.

An important note: The technique just learned and illustrated in Table 3.1 is not the simplest for direct computation of a variance and standard deviation. However, it is often used in computer programs, where accuracy of computations is an important consideration. Alternative and simpler computational methods are given in Section 3.8.

The observant reader may have noticed that we have avoided assigning any symbol to either the variance or the standard deviation. We shall explain why in the next section.
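The long method can be traced step by step in code. The 15 counts below are an assumption: Table 3.1 itself is not reproduced in this text, and these values were chosen because they reproduce the mean of 7.713, the smallest and largest counts of 2.3 and 18.0, and a sum of squares of about 308.777 as quoted in this chapter. Division is by n, as in this section.

```python
import math

# Assumed neutrophil counts (divided by 1000) per microliter; see the note above.
counts = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1, 2.3, 3.6,
          18.0, 3.7, 7.3, 4.4, 9.8]

n = len(counts)
mean = sum(counts) / n

# Column (2): deviates y = Y - mean. Their sum is zero apart from rounding.
deviates = [value - mean for value in counts]

# Column (3): squared deviates; their total is the sum of squares.
sum_of_squares = sum(d * d for d in deviates)

variance = sum_of_squares / n               # mean square, in squared units
standard_deviation = math.sqrt(variance)    # back in the original units

if __name__ == "__main__":
    print(f"mean = {mean:.3f}")
    print(f"sum of deviates = {sum(deviates):.2e}")
    print(f"sum of squares = {sum_of_squares:.4f}")
    print(f"standard deviation = {standard_deviation:.3f}")
```

Because the mean is kept at full precision here rather than rounded to 7.713, the sum of the deviates is zero to within floating-point error and the sum of squares differs slightly in the fourth decimal place from the tabled 308.7770.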

3.7 Sample statistics and parameters

Up to now we have calculated statistics from samples without giving too much thought to what these statistics represent. When correctly calculated, a mean and standard deviation will always be absolutely true measures of location and dispersion for the samples on which they are based. Thus, the true mean of the four oxygen percentage readings in Section 3.1 is 15.325%. The standard deviation of the 15 neutrophil counts is 4.537. However, only rarely in biology (or statistics in general) are we interested in measures of location and dispersion only as descriptive summaries of the samples we have studied. Almost always we are interested in the populations from which the samples have been taken. What we want to know is not the mean of the particular four oxygen percentages, but rather the true oxygen percentage of the universe of readings from which the four readings have been sampled. Similarly, we would like to know the true mean neutrophil count of the population of patients with nonhematological tumors, not merely the mean of the 15 individuals measured. When studying dispersion we generally wish to learn the true standard deviations of the populations and not those of the samples. These population statistics, however, are unknown and (generally speaking) are unknowable. Who would be able to collect all the patients with this particular disease and measure their neutrophil counts? Thus we need to use sample statistics as estimators of population statistics or parameters.

It is conventional in statistics to use Greek letters for population parameters

the parametric mean of the population. Similarly, a sample variance, symbolized by $s^2$, estimates a parametric variance, symbolized by $\sigma^2$. Such estimators should

from a population with a known parameter should give sample statistics that, when averaged, will give the parametric value. An estimator that does not do so is called biased.

However, the sample variance as computed in Section 3.6 is not unbiased. On the average, it will underestimate the magnitude of the population variance $\sigma^2$. To overcome this bias, mathematical statisticians have shown that when sums of squares are divided by n - 1 rather than by n the resulting sample variances will be unbiased estimators of the population variance. For this reason, it is

formula for the standard deviation is therefore customarily given as follows:

$$s = \sqrt{\frac{\sum y^2}{n - 1}} \qquad (3.6)$$

$$s = \sqrt{\frac{308.7770}{14}} = 4.696$$

We note that this value is slightly larger than our previous estimate of 4.537. Of course, the greater the sample size, the less difference there will be between

it refers to a variance obtained by division of the sum of squares by the degrees

Division of the sum of squares by n is appropriate only when the interest of the investigator is limited to the sample at hand and to its variance and

contrast to using these as estimates of the population parameters. There are also the rare cases in which the investigator possesses data on the entire population; in such cases division by n is perfectly justified, because then the investigator is not estimating a parameter but is in fact evaluating it. Thus the variance of the wing lengths of all adult whooping cranes would be a parametric value; similarly, if the heights of all winners of the Nobel Prize in physics had been measured, their variance would be a parameter since it would be based on the entire population.

3.8 Practical methods for computing mean and standard deviation

Three steps are necessary for computing the standard deviation: (1) find $\sum y^2$, the sum of squares; (2) divide by n - 1 to give the variance; and (3) take the square root of the variance to obtain the standard deviation. The procedure used to compute the sum of squares in Section 3.6 can be expressed by the following formula:

$$\sum y^2 = \sum (Y - \bar{Y})^2 \qquad (3.7)$$

This formulation explains most clearly the meaning of the sum of squares, although it may be inconvenient for computation by hand or calculator, since one must first compute the mean before one can square and sum the deviations. A quicker computational formula for this quantity is

$$\sum y^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} \qquad (3.8)$$

Let us see exactly what this formula represents. The first term on the right side of the equation, $\sum Y^2$, is the sum of all individual Y's, each squared, as follows:

Y's; that is, all the Y's are first summed and this sum is then squared. In general, this quantity is different from $\sum Y^2$, which first squares the Y's and then sums them. These two terms are identical only if all the Y's are equal. If you are not certain about this, you can convince yourself of this fact by calculating these two quantities for a few numbers.

The disadvantage of Expression (3.8) is that the quantities $\sum Y^2$ and $(\sum Y)^2/n$ may both be quite large, so that accuracy may be lost in computing their difference unless one takes the precaution of carrying sufficient significant figures.

Why is Expression (3.8) identical with Expression (3.7)? The proof of this identity is very simple and is given in Appendix A1.2. You are urged to work

Trang 28

40 CHAPTER 3 / DESCRIPTIVE STATISTICS

3.8 / PRACTICAL METHODS FOR COMPUTING MEAN AND STANDARD DEVIATION 41

The resulting class marks are values slll:h as 0, 8, 16, 24, 32, and so on Theyare then divided by 8 which changes them to 0, I, 2 3, 4, and so on, which isthe desired formal The details of the computation can be learned from the box.When checking the results of calculations, it is frequently useful to have

an approximate method for estimating statistics so that gross errors in tation can be detected A simple method for estimating the mean is to averagethe largest and smallest observation to obtain the so-called midranye. For theneutrophil counts of Table 3.1, this value is (2.3+ 18.0)/2 = 10.15 (not a verygood estimate) Standard deviations can be estimated from ranges by appro-priate division of the range as follows:

compu-BOX 3.1

Cakulation orYandsfromunordereddata

Neutrophilconots, unordered as shown in Table 3.1

through it to build up your confidence in handling statistical symbols and

formulas

It is sometimes possible to simplify computations by recoding variates into

simpler form We shall use the term additive coding for the addition or

sub-traction of a constant (since subsub-traction is only addition of a negative number)

We shall similarly use multiplicative coding to refer to the multiplication or

division by a constant (since division is multiplication by the reciprocal of the

divisor) We shall use the term combination coding to mean the application of

both additive and multiplicative coding to the same set of data In Appendix

A 1.3 we examine the consequences of the three types of coding in the

com-putation of means, variances, and standard deviations

For the case ofmeans,the formula for combination coding and decoding is

the most generally applicable one Ifthe coded variable is ~=D(Y+C). then

Y= "- C D

where C is an additive code and D is a multiplicative code

On considering the effects of coding variates on the values ofvariances and

squares, variances, or standard deviations The mathematical proof is given in

Appendix A 1.3 but we can see this intuitively, because an additive code has

no effect on the distance of an item from its mean The distance from an item

of 15 to its mean of 10 would be 5 Ifwe were to code the variates by

sub-tracting a constant of 10, the item would now be 5 and the mean zero The

difference between them would still be 5 Thus if only additive coding is

em-ployed, the only statistic in need of decoding is the mean But multiplicative

coding docs have an effect on sums of squares varianccs and standard

devia-tions The standard deviations have to be divided by the multiplicative code

just as had to be done for the mean However, the sums of squares or variances

have to be divided by the multiplicative codes squared, because they are squared

terms, and the multiplicative factor becomes squared during the operations In

combination coding the additive code can be ignored

When the data are unordered, the computation of the mean and standard deviation proceeds as in Box 3.1, which is based on the unordered neutrophil-count data shown in Table 3.1. We chose not to apply coding to these data, since it would not have simplified the computations appreciably.

When the data are arrayed in a frequency distribution, the computations can be made much simpler. When computing the statistics, you can often avoid the need for manual entry of large numbers of individual variates if you first set up a frequency distribution. Sometimes the data will come to you already in the form of a frequency distribution, having been grouped by the researcher. The computation of $\bar{Y}$ and s from a frequency distribution is illustrated in Box 3.2. The data are the birth weights of male Chinese children, first encountered in Figure 2.3. The calculation is simplified by coding to remove the awkward class marks. This is done by subtracting 59.5, the lowest class mark of the array.

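[Table: estimating the standard deviation from the range. Under the heading "Divide the range by", the legible divisors are 3, 4, 5, 6, ...; the matching sample sizes are not legible.]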


42 CHAPTER 3 / DESCRIPTIVE STATISTICS 3.9 / THE COEFFICIENT OF VARIATION 43

BOX 3.2

Calculation of $\bar{Y}$, s, and V from a frequency distribution.

Birth weights of male Chinese, in ounces.

The range of the neutrophil counts is 15.7. When this value is divided by 4, we get an estimate for the standard deviation of 3.925, which compares with the calculated value of 4.696 in Box 3.1. However, when we estimate mean and standard deviation of the aphid femur lengths of Box 2.1 in this manner, we obtain 4.0 and 0.35, respectively. These are good estimates of the actual values of 4.004 and 0.3656, the sample mean and standard deviation.

3.9 The coefficient of variation

Having obtained the standard deviation as a measure of the amount of variation in the data, you may be led to ask, "Now what?" At this stage in our comprehension of statistical theory, nothing really useful comes of the computations we have carried out. However, the skills just learned are basic to all later statistical work. So far, the only use that we might have for the standard deviation is as an estimate of the amount of variation in a population. Thus we may wish to compare the magnitudes of the standard deviations of similar populations and see whether population A is more or less variable than population B.

When populations differ appreciably in their means, the direct comparison of their variances or standard deviations is less useful, since larger organisms usually vary more than smaller ones. For instance, the standard deviation of the tail lengths of elephants is obviously much greater than the entire tail length of a mouse. To compare the relative amounts of variation in populations having different means, the coefficient of variation, symbolized by V (or occasionally CV), has been developed. This is simply the standard deviation expressed as a percentage of the mean. Its formula is

$$V = \frac{s \times 100}{\bar{Y}}$$

For example, the coefficient of variation of the birth weights in Box 3.2 is 12.37%, as shown at the bottom of that box. The coefficient of variation is independent of the unit of measurement and is expressed as a percentage.

Coefficients of variation are used when one wishes to compare the variation of two populations without considering the magnitude of their means. (It is probably of little interest to discover whether the birth weights of the Chinese children are more or less variable than the femur lengths of the aphid stem mothers. However, we can calculate V for the latter as (0.3656 × 100)/4.004 = 9.13%, which would suggest that the birth weights are more variable.) Often, we shall wish to test whether a given biological sample is more variable for one character than for another. Thus, for a sample of rats, is body weight more variable than blood sugar content? A second, frequent type of comparison, especially in systematics, is among different populations for the same character. Thus, we may have measured wing length in samples of birds from several localities. We wish to know whether any one of these populations is more variable than the others. An answer to this question can be obtained by examining the coefficients of variation of wing length in these samples.
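The coefficient of variation is a one-line computation. The sketch below uses the aphid femur-length statistics quoted in the text (s = 0.3656, mean 4.004); the function name and the factor of 10 used for the unit change are illustrative choices only.

```python
def coefficient_of_variation(s, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return s * 100.0 / mean

# Aphid femur lengths: s and mean as quoted in the text.
v_aphid = coefficient_of_variation(0.3656, 4.004)
print(f"{v_aphid:.2f}%")   # prints 9.13%

# Changing the unit of measurement rescales s and the mean by the
# same factor, so V is unchanged (up to floating-point rounding).
v_rescaled = coefficient_of_variation(0.3656 * 10, 4.004 * 10)
print(abs(v_aphid - v_rescaled) < 1e-9)
```

Because the scale factor cancels in the ratio, the second check shows why V can be compared across characters measured in different units.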

Exercises

3.1 Find $\bar{Y}$, s, V, and the median for the following data (mg of glycine per mg of creatinine in the urine of 37 chimpanzees; from Gartler, Firschein, and Dobzhansky, 1956). ANS. $\bar{Y}$ = 0.115, s = 0.10404.

.055 .100 .050 .019
.135 .120 .080 .100
.052 .110 .110 .100
.077 .100 .110 .116
.026 .350 .120
.440 .100 .133
.300 .300 .100

3.2 Find the mean, standard deviation, and coefficient of variation for the pigeon data given in Exercise 2.4. Group the data into ten classes, recompute $\bar{Y}$ and s, and compare them with the results obtained from ungrouped data. Compute the median for the grouped data.

3.3 The following are percentages of butterfat from 120 registered three-year-old Ayrshire cows selected at random from a Canadian stock record book.
(a) Calculate $\bar{Y}$, s, and V directly from the data.
(b) Group the data in a frequency distribution and again calculate $\bar{Y}$, s, and V. Compare the results with those of (a). How much precision has been lost by grouping? Also calculate the median.

3.4 What effect would adding a constant 5.2 to all observations have upon the numerical values of the following statistics: $\bar{Y}$, s, V, average deviation, median, [...] by 8.0 first and then added 5.2?

3.5 Estimate $\mu$ and $\sigma$ using the midrange and the range (see Section 3.8) for the data in Exercises 3.1, 3.2, and 3.3. How well do these estimates agree with the estimates given by $\bar{Y}$ and s? ANS. Estimates of $\mu$ and $\sigma$ for Exercise 3.2 are 0.224 and 0.1014.

3.6 Show that the equation for the variance can also be written as

$$s^2 = \frac{\sum Y^2 - n\bar{Y}^2}{n - 1}$$

3.7 Using the striped bass age distribution given in Exercise 2.9, compute the following statistics: $\bar{Y}$, $s^2$, s, V, median, and mode. ANS. $\bar{Y}$ = 3.043, $s^2$ = 1.2661, s = 1.125, V = 36.98%, median = 2.948, mode = 3.

3.8 Use a calculator and compare the results of using Equations 3.7 and 3.8 to compute $s^2$ for the following artificial data sets:
(a) 1, 2, 3, 4, 5
(b) 9001, 9002, 9003, 9004, 9005
(c) 90001, 90002, 90003, 90004, 90005
(d) 900001, 900002, 900003, 900004, 900005
Compare your results with those of one or more computer programs. What is the correct answer? Explain your results.
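The calculator exercise on the artificial data sets can be explored in software. The following is only a sketch under two assumptions that this excerpt does not confirm: that the two equations being compared are the definitional form (sum of squared deviations over n - 1) and the shortcut form $(\sum Y^2 - n\bar{Y}^2)/(n - 1)$, and that the calculator keeps 8 significant digits, which is imitated here with Python's `decimal` module.

```python
from decimal import Context, Decimal

def _sum(values, ctx):
    """Add values one at a time, rounding after every step as a calculator would."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def definitional_variance(data, ctx):
    """Sum of squared deviations from the mean, divided by n - 1."""
    ys = [Decimal(y) for y in data]
    n = Decimal(len(ys))
    mean = ctx.divide(_sum(ys, ctx), n)
    deviations = [ctx.subtract(y, mean) for y in ys]
    squares = [ctx.multiply(d, d) for d in deviations]
    return ctx.divide(_sum(squares, ctx), ctx.subtract(n, Decimal(1)))

def shortcut_variance(data, ctx):
    """(sum of Y^2 - n * mean^2) / (n - 1), evaluated at limited precision."""
    ys = [Decimal(y) for y in data]
    n = Decimal(len(ys))
    mean = ctx.divide(_sum(ys, ctx), n)
    sum_sq = _sum([ctx.multiply(y, y) for y in ys], ctx)
    correction = ctx.multiply(n, ctx.multiply(mean, mean))
    return ctx.divide(ctx.subtract(sum_sq, correction), ctx.subtract(n, Decimal(1)))

calculator = Context(prec=8)   # assumed 8-significant-digit calculator
for start in (1, 9001, 90001, 900001):
    data = range(start, start + 5)
    print(start,
          definitional_variance(data, calculator),
          shortcut_variance(data, calculator))
```

The design point is that the shortcut form subtracts two very large, nearly equal numbers, so any digits lost when squaring large variates dominate the result, whereas the deviations from the mean stay small.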


Introduction to Probability

Distributions: The Binomial and

Poisson Distributions

In Section 2.5 we first encountered frequency distributions. For example, Table 2.2 shows a distribution for a meristic, or discrete (discontinuous), variable, the number of sedge plants per quadrat. Examples of distributions for continuous variables are the femur lengths of aphids in Box 2.1 and the human birth weights in Box 3.2. Each of these distributions informs us about the absolute frequency of any given class and permits us to compute the relative frequencies of any class of variable. Thus, most of the quadrats contained either no sedges or one or two plants. In the 139.5-oz class of birth weights, we find only 201 out of the total of 9465 babies recorded; that is, approximately only 2.1% of the infants are in that birth weight class.

We realize, of course, that these frequency distributions are only samples from given populations. The birth weights, for example, represent a population of male Chinese infants from a given geographical area. But if we knew our sample to be representative of that population, we could make all sorts of predictions based upon the sample frequency distribution. For instance, we could say that the probability that the weight at birth of any one baby in this population will be in the 139.5-oz birth class is quite low. If all of the 9465 weights were mixed up in a hat and a single one pulled out, the probability that we would pull out one of the 201 in the 139.5-oz class would be very low. It would be much more probable that we would sample an infant of 107.5 or 115.5 oz, since the infants in these classes are represented by frequencies 2240 and 2007, respectively. Finally, if we were to sample from an unknown population of babies and find that the very first individual sampled had a birth weight of 170 oz, we would probably reject any hypothesis that the unknown population was the same as that sampled in Box 3.2. We would arrive at this conclusion because [...] had a birth weight that high. Though it is possible that we could have sampled from the population of male Chinese babies and obtained a birth weight of 170 oz, the probability that the first individual sampled would have such a value is very low indeed. It seems much more reasonable to suppose that the unknown population from which we are sampling has a larger mean than the one sampled in Box 3.2.
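The probabilities discussed above are simply relative frequencies. A minimal sketch, using only the three class frequencies and the total of 9465 quoted in the text (the variable names are illustrative):

```python
TOTAL_BABIES = 9465

# Class mark (oz) -> absolute frequency, as quoted in the text.
class_frequencies = {139.5: 201, 107.5: 2240, 115.5: 2007}

# Relative frequency = empirical probability of drawing a weight
# from that class when one record is pulled out at random.
relative = {mark: f / TOTAL_BABIES for mark, f in class_frequencies.items()}

for mark, p in relative.items():
    print(f"{mark} oz: {p:.3f}")
```

Drawing a baby from the 107.5-oz class is thus roughly eleven times as probable as drawing one from the 139.5-oz class.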

We have used this empirical frequency distribution to make certain predictions (with what frequency a given event will occur) or to make judgments and decisions (is it likely that an infant of a given birth weight belongs to this population?). In many cases in biology, however, we shall make such predictions not from empirical distributions, but on the basis of theoretical considerations that in our judgment are pertinent. We may feel that the data should be distributed in a certain way because of basic assumptions about the nature of the [...]. If the observed data do not conform sufficiently to the values expected on the basis of these assumptions, we shall have serious doubts about our assumptions. This is a common use of frequency distributions in biology. The assumptions being tested generally lead to a theoretical frequency distribution known also as a probability distribution. This may be a simple two-valued distribution, such as the 3:1 ratio in a Mendelian cross; or it may be a more complicated function, as it would be if [...]. If the observed data do not fit the expectations on the basis of theory, we are often led to the discovery of some biological mechanism causing this deviation from expectation. The phenomena of linkage in genetics, of preferential mating between different phenotypes in animal behavior, of congregation of animals at certain favored places or, conversely, their territorial dispersion are cases in point. We shall thus make use of probability theory to test our assumptions about the laws of occurrence of certain biological phenomena. We should point out to the reader, however, that probability theory underlies the entire structure of statistics, since, owing to the nonmathematical orientation of this book, this may not be entirely obvious.

In this chapter we shall first discuss probability, in Section 4.1, but only to the extent necessary for comprehension of the sections that follow at the intended level of mathematical sophistication. Next, in Section 4.2, we shall take up the binomial frequency distribution, which is not only important in certain types of studies, such as genetics, but also fundamental to an understanding of the various kinds of probability distributions to be discussed in this book.

The Poisson distribution, which follows in Section 4.3, is of wide applicability in biology, especially for tests of randomness of occurrence of certain events. Both the binomial and Poisson distributions are discrete probability distributions. The most common continuous probability distribution is the normal frequency distribution, discussed in Chapter 5.

4.1 Probability, random sampling, and hypothesis testing

We shall start this discussion with an example that is not biometrical or biological in the strict sense. We have often found it pedagogically effective to introduce new concepts through situations thoroughly familiar to the student, even if the example is not relevant to the general subject matter of biostatistics.

Let us betake ourselves to Matchless University, a state institution somewhere between the Appalachians and the Rockies. Looking at its enrollment figures, we notice the following breakdown of the student body: 70% of the students are American undergraduates (AU) and 26% are American graduate students (AG); the remaining 4% are from abroad. Of these, 1% are foreign undergraduates (FU) and 3% are foreign graduate students (FG). In much of our work we shall use proportions rather than percentages as a useful convention. Thus the enrollment consists of 0.70 AU's, 0.26 AG's, 0.01 FU's, and 0.03 FG's. The total student body, corresponding to 100%, is therefore represented by the figure 1.0.

If we were to sample 100 students at random, we would intuitively expect that, on the average, 3 would be foreign graduate students. The actual outcome might vary. There might not be a single FG student among the 100 sampled, or there might be quite a few more than 3. The ratio of the number of foreign graduate students sampled divided by the total number of students sampled might therefore vary from zero to considerably greater than 0.03. If we increased our sample size to 500 or 1000, it is less likely that the ratio would fluctuate widely around 0.03. The greater the sample taken, the closer the ratio of FG students sampled to the total students sampled will approach 0.03. In fact, the probability of sampling a foreign student can be defined as the limit, as sample size keeps increasing, of the ratio of foreign students to the total number of students sampled. Thus we may formally summarize the situation by stating that the probability that a student at Matchless University will be a foreign graduate student is P[FG] = 0.03.
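The idea of probability as the limit of a sample ratio can be illustrated by simulation. This is a minimal sketch, assuming each sampled student is independently an FG with probability 0.03; the function name, sample sizes, and seed are illustrative choices.

```python
import random

def fg_ratio(sample_size, p_fg=0.03, rng=None):
    """Proportion of FG students in one simulated random sample."""
    if rng is None:
        rng = random.Random()
    hits = sum(1 for _ in range(sample_size) if rng.random() < p_fg)
    return hits / sample_size

rng = random.Random(42)   # fixed seed so the run is repeatable
for n in (100, 500, 1000, 10_000, 100_000):
    print(n, fg_ratio(n, rng=rng))
```

With small samples the ratio may be well above or below 0.03, or even zero; as the sample size grows, the printed ratios should settle ever closer to 0.03.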

Now let us imagine the following experiment: We try to sample a student at random from among the student body at Matchless University. This is not as easy a task as might be imagined. If we wanted to do this operation physically, we would have to set up a collection or trapping station somewhere on campus. And to make certain that the sample was truly random with respect to the entire student population, we would have to know the ecology of students on campus very thoroughly. We should try to locate our trap at some station where each student had an equal probability of passing. Few, if any, such places [...] frequented more by independent and foreign students, less by those living in organized houses and dormitories. Fewer foreign and graduate students might be found along fraternity row. Clearly, we would not wish to place our trap near the International Club or House, because our probability of sampling a foreign student would be greatly enhanced. In front of the bursar's window we might sample students paying tuition. But those on scholarships might not be found there. We do not know whether the proportion of scholarships among foreign or graduate students is the same as or different from that among the American or undergraduate students. Athletic events, political rallies, dances, and the like would all draw a differential spectrum of the student body; indeed, no easy solution seems in sight. The time of sampling is equally important, in the seasonal as well as the diurnal cycle.

Those among the readers who are interested in sampling organisms from nature will already have perceived parallel problems in their work. If we were to sample only students wearing turbans or saris, their probability of being foreign students would be almost 1. We could no longer speak of a random sample. In the familiar ecosystem of the university these violations of proper sampling procedure are obvious to all of us, but they are not nearly so obvious in real biological instances where we are unfamiliar with the true nature of the environment. How should we proceed to obtain a random sample of leaves from a tree, of insects from a field, or of mutations in a culture? In sampling at random, we are attempting to permit the frequencies of various events occurring in nature to be reproduced unalteredly in our records; that is, we hope that on the average the frequencies of these events in our sample will be the same as they are in the natural situation. Another way of saying this is that in a random sample every individual in the population being sampled has an equal probability of being included in the sample.

We might go about obtaining a random sample by using records representing the student body, such as the student directory, selecting a page from it at random and then a name at random from that page. Alternatively, we could assign an arbitrary number to each student, write each on a chip or disk, put these in a large container, stir well, and then pull out a number.
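The chip-drawing procedure has a direct software analogue. The sketch below is only an illustration: the directory of numbered students, its size, and the function name are hypothetical, and `random.sample` draws without replacement, giving every listed student the same chance of inclusion.

```python
import random

def draw_students(directory, k, seed=None):
    """Draw k distinct entries, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(directory, k)

# Hypothetical directory: one arbitrary number per student.
directory = [f"student-{i:05d}" for i in range(1, 10_001)]

sample = draw_students(directory, 5, seed=7)
print(sample)
```

The quality of such a sample still depends on the directory being a complete list of the population, which is the software counterpart of placing the trap well.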

Imagine now that we sample a single student physically by the trapping method, after carefully planning the placement of the trap in such a way as to make sampling random. What are the possible outcomes? Clearly, the student could be either an AU, AG, FU, or FG. The set of these four possible outcomes exhausts the possibilities of this experiment. This set, which we can represent as {AU, AG, FU, FG}, is called the sample space. Any single trial of the experiment described above would result in only one of the four possible outcomes (elements) in the set. A single element in a sample space is called a simple event. It is distinguished from an event, which is any subset of the sample space. Thus, in the sample space defined above, {AU}, {AG}, {FU}, and {FG} are each simple events. The following sampling results are some of the possible events: {AU, AG, FU}, {AU, AG, FG}, {AG, FG}, {AU, FG}, ... By the definition of "event," simple events as well as the entire sample space are also events. The meaning of these events should be clarified. Thus {AU, AG, FU} implies being either an American or an undergraduate, or both.

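Given the sample space described above, the event A = {AU, AG} encompasses all possible outcomes in the space yielding an American student. Similarly, the event B = {AG, FG} encompasses all possible outcomes yielding a graduate student. The intersection of events A and B, written $A \cap B$, describes only those events that are shared by A and B. Clearly only AG qualifies, as can be seen below:

A = {AU, AG}
B = {AG, FG}

Thus, $A \cap B$ is that event in the sample space giving rise to the sampling of an American graduate student. When the intersection of two events is empty, as [...], there is no common element in these two events in the sampling space.

We may also define events that are unions of two other events in the sample space. Thus $A \cup B$ indicates that A or B or both occur; in our example, $A \cup B$ = {AU, AG, FG}, comprising American undergraduates, foreign graduate students, or American graduate students.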
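Events, intersections, and unions map directly onto set operations. A minimal sketch, taking A as the American students and B as the graduate students; the event named `undergraduates` is a hypothetical extra, added only to show an empty intersection.

```python
sample_space = {"AU", "AG", "FU", "FG"}

A = {"AU", "AG"}   # American students
B = {"AG", "FG"}   # graduate students

print(A & B)       # intersection: American graduate students
print(A | B)       # union: American or graduate (or both)

# Hypothetical third event: undergraduates share no outcome with B.
undergraduates = {"AU", "FU"}
print(B & undergraduates == set())   # True: mutually exclusive events

# Every event is a subset of the sample space.
print(all(event <= sample_space for event in (A, B, A & B, A | B)))
```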

Why are we concerned with defining sample spaces and events? Because these concepts lead us to useful definitions and operations regarding the probability of various outcomes. If we can assign a number p, where $0 \le p \le 1$, to each simple event in a sample space such that the sum of these p's over all simple events in the space equals unity, then the space becomes a (finite) probability space. In our example above, the following numbers were associated with the appropriate simple events in the sample space:

{AU, AG, FU, FG}
{0.70, 0.26, 0.01, 0.03}

Given this probability space, we are now able to make statements regarding the probability of given events. For example, what is the probability that a student sampled at random will be an American graduate student? Clearly, it is P[{AG}] = 0.26. What is the probability that a student is either American or a graduate student? In terms of the events defined earlier, this is

$$P[A \cup B] = P[\{AU, AG\}] + P[\{AG, FG\}] - P[\{AG\}] = 0.96 + 0.29 - 0.26 = 0.99$$

We subtract P[{AG}] from the sum because if we did not do so it would be included twice, once in P[A] and once in P[B], and would lead to the absurd result of a probability greater than 1.

Now let us assume that we have sampled our single student from the student body of Matchless University and that student turns out to be a foreign graduate student. What can we conclude from this? By chance alone, this result would occur with probability P[{FG}] = 0.03. The assumption that we have sampled at random should probably be rejected, since if we accept the hypothesis of random sampling, the outcome of the experiment is improbable. Please note that we said improbable, not impossible. It is obvious that we could have chanced upon an FG as the very first one to be sampled. However, it is not very likely. The probability is 0.97 that a single student sampled will be a non-FG. If we could be certain that our sampling method was random (as when drawing student numbers out of a container), we would have to decide that an improbable event has occurred. The decisions of this paragraph are all based on our definite knowledge that the proportion of students at Matchless University is indeed as specified by the probability space. If we were uncertain about this, we would be led to assume a higher proportion of foreign graduate students as a consequence of the outcome of our sampling experiment.
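The addition rule for the probability of a union can be verified against the Matchless University probability space. A minimal sketch; the helper name `prob` is illustrative, and `math.isclose` is used because decimal fractions are not represented exactly in floating point.

```python
import math

# Probability space from the text.
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

def prob(event):
    """Probability of an event = sum over its simple events."""
    return sum(P[outcome] for outcome in event)

A = {"AU", "AG"}   # American students
B = {"AG", "FG"}   # graduate students

by_addition_rule = prob(A) + prob(B) - prob(A & B)
directly = prob(A | B)

print(by_addition_rule, directly)          # both approximately 0.99
print(math.isclose(1 - prob({"FG"}), 0.97))  # probability of a non-FG
```

Omitting the subtraction would give 0.96 + 0.29 = 1.25, the impossible value the text warns about.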

We shall now extend our experiment and sample two students rather than just one. What are the possible outcomes of this sampling experiment? The new sampling space can best be depicted by a diagram (Figure 4.1) that shows the set of the 16 possible simple events as points in a lattice. The 16 simple events are ordered pairs. Ignoring which student was sampled first, they reduce to the following 10 combinations: (AU, AU), (AU, AG), (AU, FU), (AU, FG), (AG, AG), (AG, FU), (AG, FG), (FU, FU), (FU, FG), and (FG, FG).

[Figure 4.1: sample space for sampling two students, drawn as a lattice of 16 points. Legible row, FG (0.03): 0.0210, 0.0078, 0.0003, 0.0009.]

What are the expected probabilities of these outcomes? We know the expected outcomes for sampling one student from the former probability space, but what will be the probability space corresponding to the new sampling space of 16 elements? Now the nature of the sampling procedure becomes quite important. We may sample with or without replacement: we may return the first student sampled to the population (that is, replace the first student), or we may keep him or her out of the pool of the individuals to be sampled. If we do not replace the first individual sampled, the probability of sampling a foreign graduate student will no longer be exactly 0.03. This is easily seen. Let us assume that Matchless University has 10,000 students. Then, since 3% are foreign graduate students, there must be 300 FG students at the university. After sampling a foreign graduate student first, this number is reduced to 299 out of 9999 students. Consequently, the probability of sampling an FG student now becomes 299/9999 = 0.0299, slightly lower than 0.03. If, on the other hand, we return the original foreign student to the student population and make certain that the population is thoroughly randomized before being sampled again (that is, give the student a chance to lose him- or herself among the campus crowd or, in drawing student numbers out of a container, mix up the disks with the numbers on them), the probability of sampling a second FG student is the same as before: 0.03. In fact, if we keep on replacing the sampled individuals in the original population, we can sample from it as though it were an infinite-sized population.

Biological populations are, of course, finite, but they are frequently so large that for purposes of sampling experiments we can consider them effectively infinite whether we replace sampled individuals or not. After all, even in this relatively small population of 10,000 students, the probability of sampling a second foreign graduate student (without replacement) is only minutely different from 0.03. For the rest of this section we shall consider sampling to be with replacement, so that the probability level of obtaining a foreign student does not change.
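The effect of replacement can be made concrete with exact fractions. A minimal sketch using the population of 10,000 students and 300 FG students assumed in the text; the variable names are illustrative.

```python
from fractions import Fraction

population = 10_000
fg_students = 300

p_first = Fraction(fg_students, population)

# Second draw after an FG student has already been sampled.
p_second_with_replacement = Fraction(fg_students, population)
p_second_without_replacement = Fraction(fg_students - 1, population - 1)

print(float(p_first))                        # 0.03
print(float(p_second_with_replacement))      # 0.03
print(float(p_second_without_replacement))   # about 0.0299
print(float(p_first - p_second_without_replacement))   # the tiny difference
```

The difference is below one part in ten thousand, which is why a large finite population can be treated as effectively infinite.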

There is a second potential source of difficulty in this design. We have to assume not only that the probability of sampling a second foreign student is equal to that of the first, but also that it is independent of it. By independence of events we mean that the probability of one event is not affected by whether or not another event has occurred. In the case of the students, if we have sampled one foreign student, is it more or less likely that a second student sampled in the same manner will also be a foreign student? Independence of the events may depend on where we sample the students or on the method of sampling. If we have sampled students on campus, it is quite likely that the events are not independent; that is, if one foreign student has been sampled, the probability that the second student will be foreign is increased, since foreign students tend to congregate. Thus, at Matchless University the probability that a student walking with a foreign graduate student is also an FG will be greater than 0.03.

Events D and E in a sample space will be defined as independent whenever $P[D \cap E] = P[D]P[E]$. The probability values assigned to the sixteen points in the sample-space lattice of Figure 4.1 have been computed to satisfy the above condition. Thus, letting P[D] equal the probability that the first student will be an AU, that is, $P[\{AU_1AU_2, AU_1AG_2, AU_1FU_2, AU_1FG_2\}]$, and letting [...] imposed upon all points in the probability space. Therefore, if the sampling probabilities for the second student are independent of the type of student sampled first, we can compute the probabilities of the outcomes simply as the product of the independent probabilities. Thus the probability of obtaining two FG students is P[{FG}]P[{FG}] = 0.03 × 0.03 = 0.0009.

The probability of obtaining one AU and one FG student in the sample should be the product 0.70 × 0.03. However, it is in fact twice that probability. There is only one way of obtaining two FG students, namely, by sampling first one FG and then again another FG. Similarly, there is only one way to sample two AU students. However, sampling one of each type of student can be done by sampling first an AU and then an FG or by sampling first an FG and then an AU. Thus the probability is

2P[{AU}]P[{FG}] = 2 × 0.70 × 0.03 = 0.0420
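The lattice of two-student outcomes and the factor of 2 for mixed pairs can be reproduced by enumeration. A minimal sketch, assuming independent draws with replacement as in the text; the names `lattice` and `unordered` are illustrative.

```python
import math
from itertools import combinations_with_replacement, product

P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

# 16 ordered simple events (first student, second student); under
# independence each probability is the product of the two single draws.
lattice = {(a, b): P[a] * P[b] for a, b in product(P, repeat=2)}

def unordered(x, y):
    """Probability of the pair {x, y}, ignoring which student came first."""
    if x == y:
        return lattice[(x, y)]
    return lattice[(x, y)] + lattice[(y, x)]

print(len(lattice))                                       # 16 ordered events
print(len(list(combinations_with_replacement(P, 2))))     # 10 combinations
print(unordered("FG", "FG"))   # approximately 0.0009
print(unordered("AU", "FG"))   # approximately 0.0420
```

Summing the 16 lattice values gives 1 (up to rounding), confirming that the products form a probability space.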

If we conducted such an experiment and obtained a sample of two FG students, we would be led to the following conclusions. Only 0.0009 of the samples (9/100 of 1%, or 9 out of 10,000 cases) would be expected to consist of two foreign graduate students by chance alone. Given P[{FG}] = 0.03 as a fact, we would therefore suspect that sampling was not random or that the events were not independent (or that both assumptions, random sampling and independence of events, were incorrect).

Random sampling is sometimes confused with randomness in nature. The former is the faithful representation in the sample of the distribution of the events in nature; the latter is the independence of the events in nature. The first of these generally is or should be under the control of the experimenter and is related to the strategy of good sampling. The second generally describes an innate property of the objects being sampled and thus is of greater biological interest. The confusion between random sampling and independence of events arises because lack of either can yield observed frequencies of events differing from expectation. We have already seen how lack of independence in samples of foreign students can be interpreted from both points of view in our illustrative example from Matchless University.

The above account of probability is adequate for our present purposes but far too sketchy to convey an understanding of the field. Readers interested in extending their knowledge of the subject are referred to Mosimann (1968) for a simple introduction.

4.2 The binomial distribution

For purposes of the discussion to follow we shall simplify our sample space to consist of only two elements, foreign and American students, and ignore whether the students are undergraduates or graduates; we shall represent the sample space by the set {F, A}. Let us symbolize the probability space by {p, q}, where p = P[F], the probability that the student is foreign, and q = P[A], the probability that the student is American. As before, we can compute the probability space of samples of two students as follows:

{FF, FA, AA}
{$p^2$, $2pq$, $q^2$}

If we were to sample three students independently, the probability space of samples of three students would be as follows:

{FFF, FFA, FAA, AAA}
{$p^3$, $3p^2q$, $3pq^2$, $q^3$}

Samples of three foreign or three American students can again be obtained in only one way, and their probabilities are $p^3$ and $q^3$, respectively. However, in samples of three there are three ways of obtaining two students of one kind and one student of the other. As before, if A stands for American and F stands for foreign, then the sampling sequence can be AFF, FAF, FFA for two foreign students and one American. Thus the probability of this outcome will be $3p^2q$. Similarly, the probability for two Americans and one foreign student is $3pq^2$.

A convenient way to summarize these results is by means of the binomial expansion, which is applicable to samples of any size from populations in which objects occur independently in only two classes: students who may be foreign or American, or individuals who may be dead or alive, male or female, black or white, rough or smooth, and so forth. This is accomplished by expanding the binomial term $(p + q)^k$, where k equals sample size, p equals the probability of occurrence of the first class, and q equals the probability of occurrence of the second class. By definition, p + q = 1; hence q is a function of p: q = 1 - p.

For samples of 1, $(p + q)^1 = p + q$
For samples of 2, $(p + q)^2 = p^2 + 2pq + q^2$
For samples of 3, $(p + q)^3 = p^3 + 3p^2q + 3pq^2 + q^3$

These expressions yield the same probability spaces discussed previously. The coefficients (the numbers before the powers of p and q) express the number of ways a particular outcome is obtained. An easy method for evaluating the coefficients of the expanded terms of the binomial expression is through the use of Pascal's triangle:

k
1    1 1
2    1 2 1
3    1 3 3 1
4    1 4 6 4 1
5    1 5 10 10 5 1

Pascal's triangle provides the coefficients of the binomial expression, that is, the number of possible outcomes of the various combinations of events. For k = 1 the coefficients are 1 and 1. For the second line (k = 2), write 1 at the left-hand margin of the line. The 2 in the middle of this line is the sum of the values to the left and right of it in the line above. The line is concluded with a 1. Similarly, the values at the beginning and end of the third line are 1, and the other numbers are sums of the values to their left and right in the line above; thus 3 is the sum of 1 and 2. This principle continues for every line. You can work out the coefficients for any size sample in this manner. The line for k = 6 would consist of the following coefficients: 1, 6, 15, 20, 15, 6, 1. The powers of p and q follow a consistent pattern, which is easy to imitate for any value of k. We give it here for k = 4:

$$p^4q^0 + p^3q^1 + p^2q^2 + p^1q^3 + p^0q^4$$

The power of p decreases from 4 to 0 (k to 0 in the general case) as the power of q increases from 0 to 4 (0 to k in the general case). Since any value to the power 0 is 1 and any term to the power 1 is simply itself, we can simplify this expression as shown below and at the same time provide it with the coefficients from Pascal's triangle for the case k = 4:

$$p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4$$

Thus we are able to write down almost by inspection the expansion of the binomial to any reasonable power. Let us now practice our newly learned ability to expand the binomial.
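The rule for building each line of Pascal's triangle from the line above can be sketched in a few lines of code. This is an illustrative implementation, assuming k is at least 1; the function name is a placeholder.

```python
from math import comb

def pascal_row(k):
    """Coefficients of (p + q)^k, built line by line as described in the text."""
    row = [1, 1]                      # the line for k = 1
    for _ in range(k - 1):
        # Each inner value is the sum of the two values above it;
        # every line begins and ends with 1.
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

for k in range(1, 7):
    print(k, pascal_row(k))

# The same numbers are given by the binomial coefficients.
print(pascal_row(6) == [comb(6, i) for i in range(7)])
```

The last line for k = 6 reproduces the coefficients 1, 6, 15, 20, 15, 6, 1 quoted in the text.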

Suppose we have a population of insects, exactly 40% of which are infected with a given virus X. If we take samples of k = 5 insects each and examine each insect separately for presence of the virus, what distribution of samples could we expect if the probability of infection of each insect in a sample were independent of that of the other insects in the sample? In this case p = 0.4, the proportion infected, and q = 0.6, the proportion not infected. It is assumed that the population is so large that the question of whether sampling is with or without replacement is irrelevant for practical purposes. The expected proportions would be the expansion of the binomial:

$$(p + q)^k = (0.4 + 0.6)^5$$

With the aid of Pascal's triangle this expansion is

$$p^5 + 5p^4q + 10p^3q^2 + 10p^2q^3 + 5pq^4 + q^5$$

or

$$(0.4)^5 + 5(0.4)^4(0.6) + 10(0.4)^3(0.6)^2 + 10(0.4)^2(0.6)^3 + 5(0.4)(0.6)^4 + (0.6)^5$$

representing the expected proportions of samples of five infected insects, four infected and one noninfected insects, three infected and two noninfected insects, and so on.
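The terms of this expansion can be evaluated numerically. A minimal sketch for p = 0.4 and k = 5, listing the terms in the same order as the expansion (five infected insects first); the function name is illustrative.

```python
from math import comb

def binomial_terms(k, p):
    """Terms of (p + q)^k, from k members of the first class down to none."""
    q = 1 - p
    return [comb(k, y) * p ** y * q ** (k - y) for y in range(k, -1, -1)]

terms = binomial_terms(5, 0.4)
for infected, term in zip(range(5, -1, -1), terms):
    print(f"{infected} infected: {term:.5f}")

print(f"sum = {sum(terms):.5f}")
```

The first term is (0.4)^5 = 0.01024, and the six terms add up to 1 because they cover every possible composition of a sample of five.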

The reader has probably realized by now that the terms of the binomial expansion actually yield a type of frequency distribution for these different outcomes. Associated with each outcome, such as "five infected insects," there is a probability of occurrence, in this case $(0.4)^5 = 0.01024$. This is a theoretical frequency distribution or probability distribution of events that can occur in two classes. The probability distribution described here is known as the binomial distribution, and the binomial expansion yields the expected frequencies of the classes of the binomial distribution.

A convenient layout for presentation and computation of a binomial distribution is shown in Table 4.1. The first column lists the number of infected insects per sample, the second column shows decreasing powers of p, and the third column shows increasing powers of q. The binomial coefficients from Pascal's triangle are shown in column (4).

TABLE 4.1
Expected frequencies of infected insects in samples of 5 insects sampled from an infinitely large population with an assumed infection rate of 40%.

(1) Number of infected insects per sample
(2) Powers of p = 0.4
(3) Powers of q = 0.6
(4) Binomial coefficients
(5) Relative expected frequencies
(6) Absolute expected frequencies
(7) Observed frequencies

(1)   (2)       (3)       (4)   (5)       (6)      (7)
5     0.01024   1.00000    1    0.01024    24.8    ...
4     0.02560   0.60000    5    0.07680   186.1    ...
3     0.06400   0.36000   10    0.23040   558.3    ...
2     0.16000   0.21600   10    0.34560   837.4    ...
1     0.40000   0.12960    5    0.25920   628.0    ...
0     1.00000   0.07776    1    0.07776   188.4    ...
Sum                             1.00000  2423.0    ...

The relative expected frequencies, which are the probabilities of the various outcomes, are shown in column (5). We label such expected frequencies $\hat{f}_{rel}$. They are simply the product of columns (2), (3), and (4). Their sum is equal to 1.0, since the events listed in column (1) exhaust the possible outcomes. We see from column (5) that only about 1% of samples are expected to consist of 5 infected insects, and 25.9% are expected to contain 1 infected and 4 noninfected insects. We shall test whether these predictions hold in an actual experiment.

Experiment 4.1. Simulate the sampling of infected insects by using a table of random numbers such as Table I in Appendix Ai. These are randomly chosen one-digit numbers in which each digit 0 through 9 has an equal probability of appearing. The numbers are grouped in blocks of 25 for convenience. Such numbers can also be obtained from random number keys on some pocket calculators and by means of pseudorandom number-generating algorithms in computer programs. (In fact, this entire experiment can be programmed and performed automatically even on a small computer.) Since there is an equal probability for any one digit to appear, you can let any four digits (say, 0, 1, 2, 3) stand for the infected insects and the remaining digits (4, 5, 6, 7, 8, 9) stand for the noninfected insects. The probability that any one digit selected from the table will represent an infected insect (that is, will be a 0, 1, 2, or 3) is therefore 40%, or 0.4, since these are four of the ten possible digits. Also, successive digits are assumed to be independent of the values of previous digits. Thus the assumptions of the binomial distribution should be met in this experiment. Enter the table of random numbers at an arbitrary point (not always at the beginning!) and look at successive groups of five digits, noting in each group how many of the digits are 0, 1, 2, or 3. Take as many groups of five as you can find time to do, but no fewer than 100 groups.
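As the experiment itself notes, the procedure can be programmed. The following is a minimal sketch in Python that substitutes a pseudorandom number generator for the printed table of random digits; the function name `simulate_experiment`, its parameters, and the seed value are illustrative assumptions, not part of the original text.

```python
import random
from collections import Counter

def simulate_experiment(n_samples, k=5, n_infected_digits=4, seed=None):
    """Tally how many of k random digits per group stand for infected insects.

    Digits 0 .. n_infected_digits - 1 represent infected insects,
    so the default corresponds to p = 0.4.
    """
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_samples):
        group = [rng.randrange(10) for _ in range(k)]
        infected = sum(1 for d in group if d < n_infected_digits)
        tally[infected] += 1
    # Observed frequencies for 0, 1, ..., k infected insects per sample.
    return [tally[y] for y in range(k + 1)]

observed = simulate_experiment(2423, seed=42)
for y, f in enumerate(observed):
    print(f"{y} infected: {f} samples")
```

The sample count of 2423 is borrowed from the class experiment described in the text; the frequencies printed here will differ from the class results because the digits are drawn afresh.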

[...] one year by a biostatistics class. A total of 2423 samples of five numbers were obtained from the table of random numbers; the distribution of the four digits simulating the percentage of infection is shown in this column. The observed frequencies are labeled $f$. To calculate the expected frequencies for this actual example we multiplied the relative frequencies $\hat{f}_{rel}$ of column (5) times $n = 2423$ [...] column (7) with the expected frequencies in column (6) we note general agreement between the two columns of figures. The two distributions are also illustrated in Figure 4.2. If the observed frequencies did not fit expected frequencies, we might believe that the lack of fit was due to chance alone. Or we might be led to reject one or more of the following hypotheses: (1) that the true proportion of digits 0, 1, 2, and 3 is 0.4 (rejection of this hypothesis would normally not be reasonable, for we may rely on the fact that the proportion of digits 0, 1, 2, and 3 in a table of random numbers is 0.4 or very close to it); (2) that sampling was at random; and (3) that events were independent.

These statements can be reinterpreted in terms of the original infection model with which we started this discussion. If, instead of a sampling experiment of digits by a biostatistics class, this had been a real sampling experiment of insects, we would conclude that the insects had indeed been randomly sampled and that we had no evidence to reject the hypothesis that the proportion of infected insects was 40%. If the observed frequencies had not fitted the expected frequencies, the lack of fit might be attributed to chance, or to the conclusion that the true proportion of infection was not 0.4; or we would have had to reject one or both of the following assumptions: (1) that sampling was at random and (2) that the occurrence of infected insects in these samples was independent.

FIGURE 4.2
[Graphic not recovered. Legend: observed frequencies; expected frequencies.]

TABLE 4.2
Artificial distributions to illustrate clumping and repulsion. Expected frequencies from Table 4.1.
[Table body not recovered. Column headings: number of infected insects per sample; absolute expected frequencies; clumped (contagious) frequencies; deviation from expectation; repulsed frequencies; deviation from expectation.]

[...] events. How could we simulate a sampling procedure in which the occurrences of the digits 0, 1, 2, and 3 were not independent? We could, for example, instruct the sampler to sample as indicated previously, but, every time a 3 was found among the first four digits of a sample, to replace the following digit with another one of the four digits standing for infected individuals. Thus, once a 3 was found, the probability would be 1.0 that another one of the indicated digits would be included in the sample. After repeated samples, this would result in higher frequencies of samples of two or more indicated digits and in lower frequencies than expected (on the basis of the binomial distribution) of samples of one such digit. A variety of such different sampling schemes could be devised. It should be quite clear to the reader that the probability of the second event's occurring would be different from that of the first and dependent on it.
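The dependent sampling scheme just described can be sketched in Python and compared with independent sampling. This is an illustration only: the names `count_infected` and `relative_frequencies`, the number of samples, and the seed are assumptions, and the replacement digit is drawn with equal probability from 0, 1, 2, 3, which the text does not specify.

```python
import random

INFECTED = (0, 1, 2, 3)  # digits standing for infected insects

def count_infected(rng, k=5, dependent=False):
    """Draw k digits; optionally force an infected digit after every 3."""
    digits = [rng.randrange(10) for _ in range(k)]
    if dependent:
        # A 3 among the first k - 1 digits replaces the following digit.
        for i in range(k - 1):
            if digits[i] == 3:
                digits[i + 1] = rng.choice(INFECTED)
    return sum(1 for d in digits if d in INFECTED)

def relative_frequencies(n_samples, dependent, seed, k=5):
    """Relative frequency of 0, 1, ..., k infected digits per sample."""
    rng = random.Random(seed)
    counts = [0] * (k + 1)
    for _ in range(n_samples):
        counts[count_infected(rng, k, dependent)] += 1
    return [c / n_samples for c in counts]

independent = relative_frequencies(20000, dependent=False, seed=1)
clumped = relative_frequencies(20000, dependent=True, seed=1)
for y in range(6):
    print(f"{y} infected: independent {independent[y]:.3f}, dependent {clumped[y]:.3f}")
```

Under this scheme the share of samples with exactly one indicated digit should fall and the share with two or more should rise, as the paragraph above predicts.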

How would we interpret a large departure of the observed frequencies from expectation? We have not as yet learned techniques for testing whether observed frequencies differ from those expected by more than can be attributed to chance alone. This will be taken up in Chapter 13. Assume that such a test has been carried out and that it has shown us that our observed frequencies are significantly different from expectation. Two main types of departure from expectation can be characterized: (1) clumping and (2) repulsion, shown in fictitious examples in Table 4.2. In actual examples we would have no a priori notions about the magnitude of $p$, the probability of one of the two possible outcomes. Therefore, the hypotheses tested are whether the samples are random and the events independent.

The clumped frequencies in Table 4.2 have an excess of observations at the tails of the frequency distribution and consequently a shortage of observations at the center. Such a distribution is also said to be contagious. (Remember that the total number of items must be the same in both observed and expected frequencies in order to make them comparable.) In the repulsed frequency distribution there are more observations than expected at the center of the distribution and fewer at the tails. These discrepancies are most easily seen in columns (4) and (6) of Table 4.2, where the deviations of observed from expected frequencies are shown as plus or minus signs.

What do these phenomena imply? In the clumped frequencies, more samples were entirely infected (or largely infected), and similarly, more samples were entirely noninfected (or largely noninfected) than you would expect if probabilities of infection were independent. This could be due to poor sampling design. If, for example, the investigator in collecting samples of five insects always [...] such a result would likely appear. But if the sampling design is sound, the results become more interesting. Clumping would then mean that the samples of five are in some way related so that if one insect is infected, others in the same sample are more likely to be infected. This could be true if they come from adjacent locations in a situation in which neighbors are easily infected. Or they could be siblings jointly exposed to a source of infection. Or possibly the infection might spread among members of a sample between the time that the insects are sampled and the time they are examined.

The opposite phenomenon, repulsion, is more difficult to interpret biologically. There are fewer homogeneous groups and more mixed groups in such a distribution. This involves the idea of a compensatory phenomenon: if some of the insects in a sample are infected, the others in the sample are less likely [...] immunity to their associates in the sample, such a situation could arise logically, but it is biologically improbable. A more reasonable interpretation of such a finding is that for each sampling unit, there were only a limited number of pathogens available; then once several of the insects have become infected, the others go free of infection, simply because there is no more infectious agent. This is an unlikely situation in microbial infections, but in situations in which a limited number of parasites enter the body of the host, repulsion may be more reasonable.

From the expected and observed frequencies in Table 4.1, we may calculate the mean and standard deviation of the number of infected insects per sample. These values are given at the bottom of columns (5), (6), and (7) in Table 4.1. We note that the means and standard deviations in columns (5) and (6) are almost identical and differ only trivially because of rounding errors. Column (7), being a sample from a population whose parameters are the same as those of the expected frequency distribution in column (5) or (6), differs somewhat. The mean is slightly smaller and the standard deviation is slightly greater than in the expected frequencies. If we wish to know the mean and standard deviation of expected binomial frequency distributions, we need not go through the computations shown in Table 4.1. The mean and standard deviation of a binomial frequency distribution are, respectively,

$$\mu = kp \qquad \sigma = \sqrt{kpq}$$

Substituting the values $k = 5$, $p = 0.4$, and $q = 0.6$ of the above example, we obtain $\mu = 2.0$ and $\sigma = 1.095{,}45$, which agree with the values computed from column (5) in Table 4.1. Note that we use the Greek parametric notation here because $\mu$ and $\sigma$ are parameters of an expected frequency distribution, not sample statistics, as are the mean and standard deviation in column (7). The [...] should be distinguished from sample proportions. In fact, in later chapters we resort to $\hat{p}$ and $\hat{q}$ for parametric proportions (rather than $\pi$, which conventionally is used as the ratio of the circumference to the diameter of a circle). Here, however, we prefer to keep our notation simple. If we wish to express our variable as a proportion rather than as a count, that is, to indicate mean incidence of infection in the insects as 0.4, rather than as 2 per sample of 5, we can use other formulas for the mean and standard deviation in a binomial distribution:

$$\mu = p \qquad \sigma = \sqrt{pq/k}$$
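The shortcut formulas can be compared with a direct computation from the expected probabilities. The sketch below is an illustration in Python, not taken from the original text, and all variable names are assumptions. It computes the mean and standard deviation term by term for k = 5 and p = 0.4, then evaluates the count and proportion formulas.

```python
from math import comb, sqrt

k, p = 5, 0.4
q = 1 - p

# Direct computation from the relative expected frequencies.
probs = [comb(k, y) * p**y * q**(k - y) for y in range(k + 1)]
mean_direct = sum(y * pr for y, pr in enumerate(probs))
sd_direct = sqrt(sum((y - mean_direct) ** 2 * pr for y, pr in enumerate(probs)))

# Shortcut formulas for counts and for proportions.
mean_count, sd_count = k * p, sqrt(k * p * q)
mean_prop, sd_prop = p, sqrt(p * q / k)

print(f"direct:      mean = {mean_direct:.5f}, sd = {sd_direct:.5f}")
print(f"counts:      mean = {mean_count:.5f}, sd = {sd_count:.5f}")
print(f"proportions: mean = {mean_prop:.5f}, sd = {sd_prop:.5f}")
```

The proportion-scale values are simply the count-scale values divided by k, which is why the two pairs of formulas describe the same distribution.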

[...] repulsed frequency distributions of Table 4.2. We note that the clumped distribution has a standard deviation greater than expected, and that of the repulsed one is less than expected. Comparison of sample standard deviations with their expected values is a useful measure of dispersion in such instances.

We shall now employ the binomial distribution to solve a biological problem. On the basis of our knowledge of the cytology and biology of species A, we expect the sex ratio among its offspring to be 1:1. The study of a litter in nature reveals that of 17 offspring 14 were females and 3 were males. What [...] of being a female offspring) = 0.5 and that this probability is independent among the members of the sample, the pertinent probability distribution is the binomial for sample size $k = 17$. Expanding the binomial to the power 17 is a formidable task, which, as we shall see, fortunately need not be done in its entirety. However, we must have the binomial coefficients, which can be obtained either from an expansion of Pascal's triangle (fairly tedious unless once obtained and stored for future use) or by working out the expected frequencies for any given class of $Y$ from the general formula for any term of the binomial distribution

$$C(k, Y)\,p^Y q^{k-Y}$$

The expression $C(k, Y)$ stands for the number of combinations that can be formed from $k$ items taken $Y$ at a time. This can be evaluated as $k!/[Y!(k - Y)!]$, where ! means "factorial." In mathematics $k$ factorial is the product of all the integers from 1 up to and including $k$. By convention, $0! = 1$. In working out fractions containing factorials, note that any factorial will always cancel against a higher factorial. Thus $5!/3! = (5 \times 4 \times 3!)/3! = 5 \times 4$. For example, $C(5, 2) = 5!/(2!\,3!) = (5 \times 4)/2 = 10$.

[...] power 4). Since we require the probability of 14 females, we note that for the [...] and 4 males. Calculating the relative expected frequencies in column (6), we note that the probability of 14 females and 3 males is 0.005,188,40, a very small value. If we add to this value all "worse" outcomes, that is, all outcomes that are even more unlikely than 14 females and 3 males on the assumption of a 1:1 hypothesis, we obtain a probability of 0.006,363,42, still a very small value. (In statistics, we often need to calculate the probability of observing a deviation as large as or larger than a given value.)
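The general term and the sum over "worse" outcomes can be evaluated directly. The following is a minimal sketch in Python rather than the tabular computation used in the text; the function name `binomial_term` is an illustrative assumption. Exact arithmetic may differ from the printed figures in the last few decimal places, presumably because the printed table was built from rounded intermediate values.

```python
from math import comb

def binomial_term(k, y, p):
    """General term C(k, Y) p^Y q^(k - Y) of the binomial distribution."""
    return comb(k, y) * p**y * (1 - p) ** (k - y)

k, p = 17, 0.5

# Probability of exactly 14 females among 17 offspring.
p_14 = binomial_term(k, 14, p)

# Probability of 14 or more females: the outcome itself plus all "worse" ones.
p_tail = sum(binomial_term(k, y, p) for y in range(14, k + 1))

print(f"C(17, 14) = {comb(17, 14)}")
print(f"P(14 females)         = {p_14:.8f}")
print(f"P(14 or more females) = {p_tail:.8f}")
```

Only four terms of the expansion are needed here, which is the sense in which the expansion "need not be done in its entirety."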


TABLE 4.3
Some expected frequencies of males and females for samples of 17 offspring on the assumption that the sex ratio is 1:1 [$p_{♀} = 0.5$, $q_{♂} = 0.5$; $(p_{♀} + q_{♂})^k = (0.5 + 0.5)^{17}$].
[Table body not recovered. Column (6) gives the relative expected frequencies.]

On the basis of these findings one or more of the following assumptions is unlikely: (1) that the true sex ratio in species A is 1:1, (2) that we have sampled at random in the sense of obtaining an unbiased sample, or (3) that the sexes of the offspring are independent of one another. Lack of independence of events may mean that although the average sex ratio is 1:1, the individual sibships, or litters, are largely unisexual, so that the offspring from a given mating would tend to be all (or largely) females or all (or largely) males. To confirm this hypothesis, we would need to have more samples and then examine the distribution of samples for clumping, which would indicate a tendency for unisexual sibships.

We must be very precise about the questions we ask of our data. There are really two questions we could ask about the sex ratio. First, are the sexes unequal in frequency so that females will appear more often than males? Second, are the sexes unequal in frequency? It may be that we know from past experience that in this particular group of organisms the males are never more frequent than females; in that case, we need be concerned only with the first of these two questions, and the reasoning followed above is appropriate. However, if we know very little about this group of organisms, and if our question is simply whether the sexes among the offspring are unequal in frequency, then we have to consider both tails of the binomial frequency distribution; departures from the 1:1 ratio could occur in either direction. We should then consider not only the probability of samples with 14 females and 3 males (and all worse cases) but also the probability of samples of 14 males and 3 females (and all worse cases in that direction). Since this probability distribution is symmetrical (because $p = q = 0.5$), we simply double the probability of 0.006,363,42 obtained previously, which results in 0.012,726,84. This new value is still very small, making it quite unlikely that the true sex ratio is 1:1.
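The doubling argument relies on the symmetry of the distribution when p = q = 0.5. A short Python sketch, offered as an illustration with assumed variable names, sums each tail separately and confirms that the two tails are equal.

```python
from math import comb

k = 17
total = 2 ** k  # with p = q = 0.5 every term is C(k, Y) / 2^k
weights = [comb(k, y) for y in range(k + 1)]

upper = sum(weights[14:]) / total  # 14 or more females
lower = sum(weights[:4]) / total   # 3 or fewer females, i.e. 14 or more males
two_tailed = upper + lower

print(f"upper tail = {upper:.8f}")
print(f"lower tail = {lower:.8f}")
print(f"two-tailed = {two_tailed:.8f}")
```

If p differed from 0.5 the two tails would no longer match, and each would have to be summed on its own rather than doubled.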

This is your first experience with one of the most important applications of statistics: hypothesis testing. A formal introduction to this field will be deferred until Section 6.8. We may simply point out here that the two approaches followed above are known appropriately as one-tailed tests and two-tailed tests, respectively. Students sometimes have difficulty knowing which of the two tests to apply. In future examples we shall try to point out in each case why a one-tailed or a two-tailed test is being used.

We have said that a tendency for unisexual sibships would result in a clumped distribution of observed frequencies. An actual case of this nature is a classic in the literature, the sex ratio data obtained by Geissler (1889) from hospital records in Saxony. Table 4.4 reproduces sex ratios of 6115 sibships of 12 children each from the more extensive study by Geissler. All columns of the table should by now be familiar. The expected frequencies were not calculated on the basis of a 1:1 hypothesis, since it is known that in human populations the sex ratio at birth is not 1:1. As the sex ratio varies in different human populations, the best estimate of it for the population in Saxony was simply obtained using the mean proportion of males in these data. This can be obtained by calculating the average number of males per sibship ($\bar{Y} = 6.230{,}58$) for the 6115 sibships and converting this into a proportion. This value turns out to be 0.519,215 [...]. In the deviations of the observed frequencies from the absolute expected frequencies shown in column (9) of Table 4.4, we notice considerable clumping. There are many more instances of families with all male or all female children (or nearly so) than independent probabilities would indicate. The genetic basis for this is not clear, but it is evident that there are some families which "run to girls" and similarly those which "run to boys." Evidence of clumping can also be seen from the fact that $s^2$ is much larger than we would expect on the basis of the binomial distribution ($\sigma^2 = kpq = 12(0.519{,}215)(0.480{,}785) = 2.995{,}57$).
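The arithmetic behind the empirical proportion and the expected binomial variance can be retraced in a few lines. This Python sketch is an illustration only, with assumed variable names; it uses the mean number of males per sibship quoted in the text and does not reproduce the observed variance, which is not given in this passage.

```python
k = 12                # children per sibship
mean_males = 6.23058  # average number of males per sibship, from the text

p = mean_males / k    # empirical proportion of males
q = 1 - p             # proportion of females
expected_variance = k * p * q

print(f"p = {p:.6f}, q = {q:.6f}")
print(f"expected binomial variance kpq = {expected_variance:.5f}")
```

A sample variance clearly above this expected value would point to clumping, in line with the comparison made in the text.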

There is a distinct contrast between the data in Table 4.1 and those in Table 4.4. In the insect infection data of Table 4.1 we had a hypothetical proportion of infection based on outside knowledge. In the sex ratio data of Table 4.4 we had no such knowledge; we used an empirical value of $p$ obtained from the [...] whose importance will become apparent later. In the sex ratio data of Table 4.3, as in much work in Mendelian genetics, a hypothetical value of $p$ is used.

4.3 The Poisson distribution

In the typical application of the binomial we had relatively small samples (2 students, 5 insects, 17 offspring, 12 siblings) in which two alternative states occurred at varying frequencies (American and foreign, infected and noninfected, male and female). Quite frequently, however, we study cases in which sample size $k$ is very large and one of the events (represented by probability $q$) is very much more frequent than the other (represented by probability $p$). We have [...] cases we are generally interested in one tail of the distribution only. This is the tail represented by the terms

$$p^0q^k,\quad C(k, 1)p^1q^{k-1},\quad C(k, 2)p^2q^{k-2},\quad C(k, 3)p^3q^{k-3},\quad \ldots$$

The first term represents no rare events and $k$ frequent events in a sample of $k$ events. The second term represents one rare event and $k - 1$ frequent events. The third term represents two rare events and $k - 2$ frequent events, and so forth. The expressions of the form $C(k, i)$ are the binomial coefficients, represented by the combinatorial terms discussed in the previous section. Although the desired tail of the curve could be computed by this expression, as long as sufficient decimal accuracy is maintained, it is customary in such cases to compute another distribution, the Poisson distribution, which closely approximates the desired results. As a rule of thumb, we may use the Poisson distribution to approximate the binomial distribution when the probability of the rare event $p$ is less than 0.1 and the product $kp$ (sample size × probability) is less than 5.
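To see how close the approximation can be, the sketch below compares the first few binomial terms with the corresponding Poisson terms. It is an illustration in Python, not taken from the text: the Poisson term $e^{-\mu}\mu^Y/Y!$ with $\mu = kp$ is the standard formula, which this passage has not yet introduced, and the values k = 1000 and p = 0.001 as well as all function names are assumptions chosen for the example.

```python
from math import comb, exp, factorial

def binomial_pmf(y, k, p):
    """Binomial term C(k, Y) p^Y q^(k - Y)."""
    return comb(k, y) * p**y * (1 - p) ** (k - y)

def poisson_pmf(y, mean):
    """Standard Poisson term e^(-mean) mean^Y / Y!."""
    return exp(-mean) * mean**y / factorial(y)

def poisson_rule_of_thumb(k, p):
    """Rule of thumb from the text: p < 0.1 and kp < 5."""
    return p < 0.1 and k * p < 5

k, p = 1000, 0.001  # hypothetical large sample with a rare event
mean = k * p

for y in range(6):
    b = binomial_pmf(y, k, p)
    po = poisson_pmf(y, mean)
    print(f"Y = {y}: binomial {b:.6f}, Poisson {po:.6f}")

max_diff = max(abs(binomial_pmf(y, k, p) - poisson_pmf(y, mean)) for y in range(11))
print("rule of thumb satisfied:", poisson_rule_of_thumb(k, p))
```

For the insect example with k = 5 and p = 0.4 the rule of thumb is not met, so the binomial terms should be computed directly in that case.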

The Poisson distribution is also a discrete frequency distribution of the number of times a rare event occurs. But, in contrast to the binomial distribution, the Poisson distribution applies to cases where the number of times that an event does not occur is infinitely large. For purposes of our treatment here, a Poisson variable will be studied in samples taken over space or time. An example of the first would be the number of moss plants in a sampling quadrat on a hillside or the number of parasites on an individual host; an example of a temporal sample is the number of mutations occurring in a genetic strain in the time interval of one month or the reported cases of influenza in one town during one week. The Poisson variable $Y$ will be the number of events per sample. It can assume discrete values from 0 on up. To be distributed in Poisson fashion the variable must have two properties: (1) Its mean must be small relative to the maximum possible number of events per sampling unit. Thus the event should be "rare." But this means that our sampling unit of space or time must be large enough to accommodate a potentially substantial number of events. For example, a quadrat in which moss plants are counted must be large enough that a substantial number of moss plants could occur there physically if the biological conditions were such as to favor the development of numerous moss plants in the quadrat. A quadrat consisting of a 1-cm square would be far too small for mosses to be distributed in Poisson fashion. Similarly, a time span of 1 minute would be unrealistic for reporting new influenza cases in a town, but within 1 week a great many such cases could occur. (2) An occurrence of the event must be independent of prior occurrences within the sampling unit. Thus, the presence of one moss plant in a quadrat must not enhance or diminish the probability that other moss plants are developing in the quadrat. Similarly, the fact that one influenza case has been reported must not affect the probability of reporting subsequent influenza cases. Events that meet these conditions (rare and random events) should be distributed in Poisson fashion.

The purpose of fitting a Poisson distribution to numbers of rare events in nature is to test whether the events occur independently with respect to each

Date posted: 14/05/2019, 11:04
