APPLIED STATISTICS FOR CIVIL AND ENVIRONMENTAL ENGINEERS
This edition first published 2008
© 2008 by Blackwell Publishing Ltd and 1997 by The McGraw-Hill Companies, Inc.
Blackwell Publishing was acquired by John Wiley & Sons in February 2007. Blackwell’s publishing programme has been merged with Wiley’s global Scientific, Technical, and Medical business to form Wiley-Blackwell.
www.wiley.com/wiley-blackwell.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Includes bibliographical references and index.
ISBN-13: 978-1-4051-7917-1 (hardback : alk. paper)
ISBN-10: 1-4051-7917-1 (hardback : alk. paper)
1. Civil engineering–Statistical methods. 2. Environmental engineering–Statistical methods. 3. Probabilities. I. Rosso, Renzo. II. Kottegoda, N. T. Statistics, probability, and reliability for civil and environmental engineers. III. Title.
TA340.K67 2008 519.5024624–dc22 2007047496
A catalogue record for this book is available from the British Library.
Set in 10/12pt Times by Aptara Inc., New Delhi, India
Printed in Singapore by Utopia Press Pte Ltd
1 2008
Contents
2.2 Measures of Probability
2.2.5 Conditional probability and multiplication rule
3.1.3 Cumulative distribution function of a discrete random
3.1.5 Cumulative distribution function of a continuous random
3.3.1 Joint probability distributions of discrete variables
3.3.2 Joint probability distributions of continuous variables
5.3.1 Confidence interval estimation of the mean when the
5.3.2 Confidence interval estimation of the mean when the
5.3.4 Sampling distribution of differences and sums of statistics
5.3.5 Interval estimation for the variance: chi-squared distribution
5.4.2 Probabilities of Type I and Type II errors and the
5.5.2 Wilcoxon signed-rank test for association of paired
5.5.3 Kruskal-Wallis test for paired observations in k samples
5.6.5 Other methods for testing the goodness-of-fit to a
5.8.3 Probability plotting for Gumbel or EV1 distribution
5.9.4 Estimation of probabilities of extreme events when outliers
6 Methods of Regression and Multivariate Analysis
6.1.3 Tests of significance and confidence intervals
6.2.2 Linear least squares solutions using the matrix method
6.2.3 Properties of least squares estimators and error variance
6.4.3 Some semivariogram models and physical aspects
7.1.3 Expected value and variance of order statistics
7.2.4 Weibull distribution as an extreme value model
7.2.7 Use of other distributions as extreme value models
8.1.3 Sample size and accuracy of Monte Carlo experiments
8.2.1 Random outcomes from standard uniform variates
8.2.4 Random outcomes from jointly distributed variates
10 Bayesian Decision Methods and Parameter Uncertainty
10.2.4 Inference with conditional binomial and prior beta
A.9 Wilcoxon signed-rank test: mean and variance of the test statistic
To my parents. To estimate the debt I owe them requires a lifespan of nibbanic extent. To
A mamma Aria, a Donatella, ai due Riccardi della mia vita e al nostro indimenticabile
Preface to the First Edition
Statistics, probability, and reliability are subject areas that are not commonly easy for students of civil and environmental engineering. Such difficulties notwithstanding, a greater emphasis is currently being made on the teaching of these methods throughout institutions of higher learning. Many professors with whom we have spoken have expressed the need for a single textbook of sufficient breadth and clarity to cover these topics.
One might ask why it is necessary to write a new book specifically for civil and environmental engineers. Firstly, we see a particular importance of statistical and associated methods in our disciplines. For example, some modes of failure, interactions, probability distributions, outliers, and spatial relationships that one encounters are unique and require different approaches. Secondly, colleagues have said that existing books are either old and outdated or omit particularly important engineering problems, emphasizing instead areas that may not be directly relevant to the practitioner.
We set ourselves several objectives in writing this book. First, it was necessary to update much of the older material, which has rightly stood for decades, even centuries. Indeed
Second, we had to look at the engineer’s structures, waterways, and the like and bring in as much material as possible for the tasks at hand. We felt an urgent need to modernize, incorporate new concepts throughout, and reduce or eliminate the impact of some topics. We aimed to order the material in a logical sequence. In particular we tried to adopt a writing style and method of presentation that are lively and without overrigorous drudgery. These had to be accomplished without compromising a deep and thorough treatment of fundamentals.
The layout of the book is straightforward, so it can be used to suit one’s personal needs. We apologize to any readers who think we have strayed from the path of simplicity in certain parts, such as the associated variables and contagious distributions of Chapter 3 and the order statistics of Chapter 7. One might wish to omit these sections on a first reading. The introductions to the chapters will be helpful for this purpose.
The explanation of the theory is accompanied by the assumptions made. Definitions are separately highlighted. In many places we point out the limitations and pitfalls or violations. There are warnings of possible misuses, misunderstandings, and misinterpretations. We provide guidance to the proper interpretation of statistical results.
The numerous examples, for which we have for the most part used recorded observations, will be helpful to beginners as well as to mature students who will consult the text as a reference. We hope these examples will lead to a better understanding of the material and design variabilities, a prelude to the making of sound decisions.
Each chapter concludes with extensive homework problems. In many instances, as in Chapter 1, they are based on real data not used elsewhere in the text. We have not used cards or dice or coins or black and red balls in any of the problems and examples. Answers to selected problems are summarized in Appendix D. A detailed manual of solutions is available.
Computers are continuously becoming cheaper and more powerful. Newer ways of handling data are being devised. At the inception, we seriously considered the use of commercial software packages to enhance the scope of the book. However, the problem of choosing one from the many suitable packages acted as a deterrent. Our concern was the serious limitations imposed by utilizing a source that necessitates corresponding purchase by an adopting school or by individual engineers. Besides, the calculations illustrated in the book can be made using worksheets available as standard software for personal computers. As an aid, the data in Appendix E will be placed on the Internet.
We have utilized the space saved (from jargon and notation of a particular software, output, graphs, and tables) to widen the scope, make our explanations more thorough, and insert additional illustrations and problems. Readers also have an almost all-inclusive index, a comprehensive glossary of notation, additional mathematical explanations, and other material in the appendixes. Furthermore, we hope that the extensive, annotated bibliographies at the end of each chapter, numerous citations and tables, will make this a useful reference source.
The book is written for use by students, practicing engineers, teachers, and researchers in civil and environmental engineering and applied statistics; female readers will find no hint of male chauvinism here. It is designed for a one- or two-semester course and is suitable for final-year undergraduate and first-year graduate students. The text is self-contained for study by engineers. A background of elementary calculus and matrix algebra is assumed.
ACKNOWLEDGMENTS
We acknowledge with thanks the work of the staff at Publication Services, Inc., in Champaign, IL. Gianfausto Salvadori gave his time generously in reviewing the manuscript and providing solutions to some homework problems. Thanks are due again to Adri Buishand for his elaborate and painstaking reviews. Our publisher solicited other reviewers whose reports were useful. Howard Tillotson and colleagues at the University of Birmingham, England, provided data and some student problems. Discussions with Tony Lawrance at lunch in the University Staff House and the example problem he solved at Helsinki Airport are appreciated. Valuable assistance was provided by Giovanni Solari and Giulio Ballio in wind and steel engineering, respectively. In addition, Giovanni Vannuchi was consulted on geotechnical engineering. Research staff and doctoral students at the Politecnico di Milano helped with the homework problems and the preparation of the index. Dora Tartaglia worked diligently on revisions to the manuscript. We thank the publishers, companies, and individuals who gave us permission to use their material, data, and tables; some of the tables were obtained through our own resources. We shall be pleased to have any omissions brought to our notice. The support and hospitality provided at the Università degli Studi di Pavia by Luigi Natale and others are acknowledged with thanks. Most importantly, without the patience and tolerance of our families this book could not have been completed.
N T Kottegoda
R Rosso
Milano, Italy
1 July 1996
Preface to the Second Edition
Last year a senior European professor, who uses our book, was visiting us in Milano. When told of the revisions underway he expressed some surprise. “There is nothing to revise,” he said. But all books need revision sooner or later, especially a multidimensional one. The equations, examples, problems, figures, tables, references, and footnotes are all subject to inevitable human fallibilities: typographical errors and errors of fact. Our first objective was to bring the text as close to the ideal state as possible. The second priority was to modernize.
In Chapter 10, a new section is added on Markov chain Monte Carlo modeling; this has popularized Bayesian methods in recent years; there is a full description and case study on Gibbs sampling. In Chapter 8 on simulation, we include a new section on sensitivity analysis and uncertainty analysis; a clear and detailed distinction is made between epistemic and aleatory uncertainties; their implications in decision-making are discussed. In Chapter 7 on Frequency Analysis of Extreme Events, natural hazards and flood hydrology are updated. In Chapter 6 on regression analysis, further considerations have been made on the diagnostics of regression; there are new discussions on general and generalized linear models. In Chapter 5 on Model Estimation and Testing we give special importance to the Anderson-Darling goodness-of-fit test because of its sensitivity to departures in the tail areas of a probability density function; we make applications to nonnormal distributions using the same data as in the estimation of parameters. In Chapter 3 a section is added on the novel method of copulas with particular emphasis on bivariate distributions. We have revised the problems following Basic Probability Concepts in Chapter 2. Other chapters are also revised and modernized and the annotated references are updated.
As before, we have kept in mind the scientific method of Claude Bernard, the French medical researcher of the nineteenth century. This had three essential parts: observation of phenomena in nature (seen in Appendix E, and in the examples and problems), observation of experiments (as reported in each chapter), and the theoretical part (clear enough for the audience in mind, but without over-simplification).
“Nobody trusts a model except the one who originated it; everybody trusts data except those who record it.” Models and data are subject to uncertainty. There is still a gap between models and data. We attempt to bridge this gap.
The title of the book has been abridged from Statistics, Probability, and Reliability for Civil and Environmental Engineers to Applied Statistics for Civil and Environmental Engineers. The applications and problems pertain almost equally to both disciplines and all areas are included.
Another aspect we emphasized before was that the calculations illustrated in the book can be made using worksheets available as standard software for personal computers. Alternatively, R, which is now commonplace, can be downloaded free of charge and adopted to run some of the homework problems, if one so prefers. Our decision not to recommend the use of particular commercial software packages, by giving details of jargon, notation, and so on, seems to be justified. We find that a specific version soon becomes obsolete with the advent of a new version.
A limited access solutions manual is available with the data from Appendix E on the Wiley-Blackwell website [www.blackwellpublishing.com/kottegoda].
We are grateful for the encouragement given by many users of the first edition, and to the few who pointed out some discrepancies. We thank the anonymous reviewers for their useful comments. Gianfausto Salvadori, Carlo De Michele, Adri Buishand, and Tony Lawrance assisted us again in the revisions. Julia Burden and Lucy Alexander of Blackwell Publishing supported us throughout the project. Università degli Studi di Pavia is thanked for continued hospitality. The help provided by Fabrizio Borsa and Enrico Raiteri in the preparation of some figures is acknowledged.
N T Kottegoda
R Rosso
Milano, Italy
14 September 2007
As a wide-ranging discipline, statistics concerns numerous procedures for deriving information from data that have been affected by chance variations. On the basis of scientific experiments, one may record and make summaries of observations, quantify variations, or other changes of significance, and compare data sequences by means of some numbers or characteristics. The use of statistics in this way is for descriptive purposes. At a more sophisticated level of analysis and interpretation, one can, for instance, test hypotheses using the inferential approach developed during the twentieth century. Thus it may be ascertained, for instance, whether the change of an ingredient affects the properties of a concrete or whether a particular method of surfacing produces a longer-lasting road; this approach often includes the estimation by means of observations of the parameters of a statistical model. Then inferences can be drawn from data and predictions made or decisions taken. When faced with uncertainty, this last phase is the principal aim of a civil or environmental engineer acting as an applied statistician.
In all activities, engineers have to cope with possible uncertainties. Observations of soil pressures, tensile strengths of concrete, yield strengths of steel, traffic densities, rainfalls, river flows, and pollution loads in streams vary from one case to the next for apparently unknown reasons or on account of factors that cannot be assessed to any degree of accuracy. However, designs need to be completed and structures, highways, water supply, and sewerage schemes constructed. Sound engineering judgment, in fact, springs from physical and mathematical theories, but it goes far beyond that. Randomness in nature must be taken into account. Thus the onus of dealing with the uncertainties lies with the engineer.
The appropriate methods of tackling the uncertainty vary with different circumstances. The key is often the dispersion that is commonly evidenced in available data sets. Some phenomena may have negligible or low variability. In such a case, the mean of past observations may be used as a descriptor, for example, the elastic constant of a steel. Nevertheless, the consequences of a possible change in the mean should also be considered. Frequently, the variability in observations is found to be quite substantial. In such situations, an engineer sometimes uses, rather conservatively, a design value such as the peak storm runoff or the compressive strength of a concrete. Alternatively, it has been the practice to express the ability of a component in a structure to withstand a specified loading without failure or a permissible deflection by a so-called factor of safety; this is in effect a blanket to cover all possible contingencies. However, we envisage some problems here in following a purely deterministic approach because there are doubts concerning the consistency of specified strengths, flows, loads, or factors from one case to another. These cannot be lightly dismissed or easily compounded when the consequences of ignoring variability are detrimental or, in general, if the decision is sensitive to a particular uncertainty. (Often there are crucial economic considerations in these matters.) This obstacle strongly suggests that the way forward is by treating statistics and probability as necessary aids in decision making, thus coping with uncertainty through the engineering process.
Note that statistical methods are in no way intended to replace the physical knowledge and experience of the engineer and his or her skills in experimentation. The engineer should know how the measurements are made and recorded and how errors may arise from possible limitations in the equipment. There should be readiness to make changes and improvements so that the data-gathering process is as reliable and representative as possible. On this basis, statistics can be a complementary and a valuable aid to technology. In prudent hands it can lead to the best practical assessment of what is partially known or uncertain.
The quantification of uncertainty and the assessment of its effects on design and implementation must include concepts and methods of probability, because statistics is built on the foundation of probability theory. In addition, decision making under risk involves the use of applied probability. Historically, probability theory arose as a branch of mathematics concerned with the analysis of certain games of chance; it consequently found applications in the measurement and understanding of uncertainty in innumerable natural phenomena and human activities. The fundamental interrelationship between statistics and probability is clearly evident in practice. As seen in past decades, there has been an irreversible change in emphasis from descriptive to inferential statistics. In this respect we must note that statistical inferences and the risk and reliability of decision making under uncertainty are evaluated through applied probability, using frequentist or Bayesian estimation. This applies to the most widely used methods. Alternatives that come under generalized information theory are now available.
The reliability of a system, structure, or component is the complement of its probability of failure. Risk and reliability analysis, however, entail many activities. The survival probability of a system is usually stated in terms of the reliabilities of its components. The modeling process is an essential part of the analysis, and time can be an important factor. Also, the risk factor that one computes may be inherent, additional, or composite. All these points show that reliability design deserves special emphasis.
Methods of reducing data, reviewed in Chapter 1, begin with tabulation and graphical representation, which are necessary first steps in understanding the uncertainty in data and the inherent variability. Numerical summaries provide descriptions for further analysis. Exploratory methods are followed by relationships between data observed in pairs. Thus the investigation begins. The route is long and diverse, because statistics is the science and art of experimenting, collecting, analyzing, and making inferences from data. This opening chapter provides a route map of what is to follow so that one can gain insight into the numerous tools statistics offers and realize the variety of problems that can be tackled. In Chapters 2 and 3, we develop a background in probability theory for coping with uncertainty in engineering. Using basic concepts, we then discuss the total probability and Bayes’ theorems and define statistical properties of distributions used for estimation purposes. Chapter 4 examines various mathematical models of random processes. There is a wide-ranging discussion of discrete and continuous distributions; joint and derived types are also given in Chapters 3 and 4; we introduce copulas that can effectively model joint distributions. Model estimation and testing methods, such as confidence intervals, hypothesis testing, analysis of variance, probability plotting, and identification of outliers, are treated in Chapter 5. The estimation and testing are based on the principle that all suppositions need to be carefully examined in light of experimentation and observation. Details of regression and multivariate statistical methods are provided in Chapter 6, along with principal component analysis and associated methods and spatial correlation. Extreme value analysis applied to floods, droughts, winds, earthquakes, and other natural hazards is found in Chapter 7; some special types of models are included. Simulation is the subject of Chapter 8, which comprises the use of simulation in design and for other practical purposes; also, we discuss sensitivity analysis and uncertainty analysis of the aleatory and epistemic types. In Chapter 9, risk and reliability analysis and reliability design are developed in detail. Chapter 10 is devoted to Bayesian and other types of economic decision making, used when the engineer faces uncertainty; we include here Markov chain Monte Carlo methods that have recently popularized the Bayesian approach.
Chapter 1
Preliminary Data Analysis
All natural processes, as well as those devised by humans, are subject to variability. Civil engineers are aware, for example, that crushing strengths of concrete, soil pressures, strengths of welds, traffic flow, floods, and pollution loads in streams have wide variations. These may arise on account of natural changes in properties, differences in interactions between the ingredients of a material, environmental factors, or other causes. To cope with uncertainty, the engineer must first obtain and investigate a sample of data, such as a set of flow data or triaxial test results. The sample is used in applying statistics and probability at the descriptive stage. For inferential purposes, however, one needs to make decisions regarding the population from which the sample is drawn. By this we mean the total or aggregate, which, for most physical processes, is the virtually unlimited universe of all possible measurements. The main interest of the statistician is in the aggregation; the individual items provide the hints, clues, and evidence.
A data set comprises a number of measurements of a phenomenon such as the failure load of a structural component. The quantities measured are termed variables, each of which may take any one of a specified set of values. Because of its inherent randomness and hence unpredictability, a phenomenon that an engineer or scientist usually encounters is referred to as a random variable, a name given to any quantity whose value depends on chance.¹ Random variables are usually denoted by capital letters. These are classified by the form that their values can possibly take (or are assumed to take). The pattern of variability is called a distribution. A continuous variable can have any value on a continuous scale between two limits, such as the volume of water flowing in a river per second or the amount of daily rainfall measured in some city. A discrete variable, on the contrary, can only assume countable isolated numbers like integers, such as the number of vehicles turning left at an intersection, or other distinct values.
Having obtained a sample of data, the first step is its presentation. Consider, for example, the modulus of rupture data for a certain type of timber shown in Table E.1.1, in Appendix E. The initial problem facing the civil engineer is that such an array of data by itself does not give a clear idea of the underlying characteristics of the stress values in this natural type of construction material. To extract the salient features and the particular types of information one needs, one must summarize the data and present them in some readily comprehensible forms. There are several methods of presentation, organization, and reduction of data. Graphical methods constitute the first approach.
1.1 GRAPHICAL REPRESENTATION
If “a picture is worth a thousand words,” then graphical techniques provide an excellent method to visualize the variability and other properties of a set of data. To the powerful interactive system of one’s brain and eyes, graphical displays provide insight into the form and shape of the data and lead to a preliminary concept of the generating process. We proceed by assembling the data into graphs, scanning the details, and noting the important characteristics. There are numerous types of graphs. Line and dot diagrams, histograms, relative frequency polygons, and cumulative frequency curves are given in this section. Subsequently, exploratory methods, such as stem-and-leaf plots and box diagrams and graphs depicting a possible association between two variables, are presented in Sections 1.3 and 1.4. We begin with the simple task of counting.
¹ The term will be formally defined in Section 3.1.
1.1.1 Line diagram or bar chart
The occurrences of a discrete variable can be classified on a line diagram or bar chart. In this type of graph, the horizontal axis gives the values of the discrete variable and the occurrences are represented by the heights of vertical lines. The horizontal spread of these lines and their relative heights indicate the variability and other characteristics of the data.
Example 1.1 Flood occurrences. Consider the annual number of floods of the Magra River at Calamazza, situated between Pisa and Genoa in northwestern Italy, over a 34-year period, as shown in Table 1.1.1.
A flood in the river at the point of measurement means the river has risen above a specified level, beyond which the river poses a threat to lives and property. The data are plotted in Fig. 1.1.1 as a line diagram.
The data suggest a symmetrical distribution with a midlocation of four floods per year. In some other river basins, there is a nonlinear decrease in the occurrences for increasing numbers of floods in a year commencing at zero, showing a negative exponential type of variation.
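The counting that underlies a line diagram can be sketched in a few lines of Python. The flood counts below are made-up values used purely for illustration; they are not the Magra River record of Table 1.1.1, and the function names are our own.

```python
from collections import Counter

# Hypothetical annual flood counts, for illustration only.
# These are NOT the Magra River values of Table 1.1.1.
floods_per_year = [4, 2, 5, 4, 3, 6, 4, 1, 5, 3, 4, 7, 2, 4, 5]


def occurrence_table(values):
    """Return {value: occurrences} for every integer from min to max."""
    counts = Counter(values)
    return {k: counts.get(k, 0) for k in range(min(values), max(values) + 1)}


def text_line_diagram(table):
    """One row per value; the row of stars stands in for the vertical line."""
    return "\n".join(
        f"{value:>2} | {'*' * count} ({count})" for value, count in table.items()
    )


table = occurrence_table(floods_per_year)
print(text_line_diagram(table))
```

Values that never occur are kept in the table with a count of zero, so that gaps in the horizontal spread remain visible.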
[Fig. 1.1.1 Line diagram of the annual number of floods; horizontal axis: number of floods in a year, 0 to 9]
the first 15 items of data in Table E.1.1 (which shows the modulus of rupture in N/mm² for 50 mm × 150 mm Swedish redwood and whitewood) are available. The abridged data are ranked in ascending order and are given in Table 1.1.2 and plotted in Fig. 1.1.2. The reader can see that the midlocation is close to 40 N/mm² but the wide spread makes this location difficult to discern. A larger sample should certainly be helpful.
1.1.3 Histogram
If there are at least, say, 25 observations, one of the most common graphical forms is a block diagram called the histogram. For this purpose, the data are divided into groups according to their magnitudes. The horizontal axis of the graph gives the magnitudes. Blocks are drawn to represent the groups, each of which has a distinct upper and lower limit. The area of a block is proportional to the number of occurrences in the group. The variability of the data is shown by the horizontal spread of the blocks, and the most common values are found in blocks with the largest areas. Other features such as the symmetry of the data or lack of it are also shown.
The first step is to take into account the range r of the observations, that is, the difference between the largest and smallest values.
Example 1.2 Timber strength. We go back to the timber strength data given in Table E.1.1. They are arranged in order of magnitude in Table 1.1.3.
There are n = 165 observations with somewhat high variability, as expected, because timber is a naturally variable material. Here the range r = 70.22 − 0.00 = 70.22 N/mm².
To draw a histogram, one divides the range into a number of classes or cells, n_c. The number of occurrences in each class is counted and tabulated. These are called frequencies.
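As a minimal sketch of this tabulation, the Python function below counts the frequencies in equal-width classes. The sample values are hypothetical (they are not taken from Table E.1.1), and the convention that each class includes its lower limit and excludes its upper limit is our own assumption.

```python
import math


def class_frequencies(data, lower, width, n_classes):
    """Count observations in n_classes equal-width classes.

    Class i covers [lower + i*width, lower + (i+1)*width): the lower limit
    belongs to the class and the upper limit does not (an assumed convention).
    """
    freqs = [0] * n_classes
    for x in data:
        i = math.floor((x - lower) / width)
        if not 0 <= i < n_classes:
            raise ValueError(f"{x} lies outside the class limits")
        freqs[i] += 1
    return freqs


# Hypothetical strengths in N/mm^2, for illustration only.
sample = [3.2, 18.4, 27.9, 31.5, 36.1, 37.8, 39.0, 42.3, 43.7, 48.2, 55.0, 69.9]

sample_range = max(sample) - min(sample)  # the range r of this small sample
freqs = class_frequencies(sample, lower=0.0, width=5.0, n_classes=15)
print(freqs)
```

Raising an error for values outside the class limits, rather than silently dropping them, keeps the sum of the frequencies equal to the number of observations.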
Table 1.1.2 The first 15 items of modulus of rupture data measuring timber strengths in N/mm², from Table E.1.1 (commencing with the top row), ranked in increasing order
Fig. 1.1.2 Dot diagram for a short sample of timber strengths from Table 1.1.3 (horizontal axis from 25 to 70 N/mm²)
The width of the classes is usually made equal to facilitate interpretation. For some work such as the fitting of a theoretical function to observed frequencies, however, unequal class widths are used. Care should be exercised in the choice of the number of classes, $n_c$. Too few will cause an omission of some important features of the data; too many will not give a clear overall picture because there may be high fluctuations in the frequencies. A rule of thumb is to make $n_c = \sqrt{n}$ or an integer close to this, but it should be at least 5 and not greater than 25. Thus, histograms based on fewer than 25 items may not be meaningful.
Sturges (1926) suggested the approximation
$$n_c = 1 + 3.3 \log_{10} n. \tag{1.1.1}$$
A more theoretically based alternative follows the work of Freedman and Diaconis (1981):2
$$n_c = \frac{r\, n^{1/3}}{2\,\mathrm{iqr}}. \tag{1.1.2}$$
Here iqr is the interquartile range. To clarify this term, we must define Q2, or the median. This denotes the middle term of a set of data when the values are arranged in ascending order, or the average of the two middle terms if n is an even number. The first or lower quartile, Q1, is the median of the lower half of the data, and likewise the third or upper quartile, Q3, is the median of the upper half of the data. This definition will be used throughout.3 Thus,
$$\mathrm{iqr} = Q_3 - Q_1.$$
2 See also Scott (1979).
Table 1.1.3 Ranked modulus of rupture data for timber strengths in N/mm2, in ascending order a
a The original data set is given in Table E.1.1; n = 165. The median is underlined.
Table 1.1.4 Frequency computations for the modulus of rupture data ranked in Table 1.1.3 a
a The width of each class is 5 N/mm2 in this example.
Example 1.3 Timber strength. For the timber strength data of Table E.1.1, the median, that is, Q2, is 39.05 N/mm2. Also Q3 and Q1 are 44.57 and 32.91 N/mm2, respectively, and hence iqr = 11.66 N/mm2. From the simple square-root rule, the number of classes, $n_c$ = 12.84. However, by using Eqs. (1.1.1) and (1.1.2), the numbers of classes are 8.32 and 16.52, respectively. If these are rounded to 9 and 15 and the range is extended to 72 and 75 N/mm2 for graphical purposes, the equal class widths become 8 and 5 N/mm2, respectively. Let us use these widths. It is important to specify the class boundaries without ambiguity for the counting of frequencies; for example, in the first case, these should be from 0 to 7.99, 8.00 to 15.99, and so on. As already mentioned, the vertical axis of a histogram is made to represent the frequency and the horizontal axis is used as a measurement scale on which the class boundaries are marked. For each of these class widths, 8 and 5 N/mm2, class boundaries are made and counting of frequencies is completed using Table 1.1.3; the lowest boundary is at 0 and the highest boundaries are at 72 and 75 N/mm2, respectively. Table 1.1.4 gives the absolute and relative frequencies for class widths of 5 N/mm2.
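The quartile definition and the three class-count rules can be checked with a short script. This is only a sketch under stated assumptions: Sturges' rule is taken as 1 + 3.3 log10 n and the Freedman and Diaconis rule as r n^(1/3) / (2 iqr), forms that reproduce the 8.32 and 16.52 quoted in Example 1.3. For an odd number of observations the middle value is assumed to belong to neither half. The function names and the short sample are hypothetical.

```python
import math
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3), with Q1 and Q3 the medians of the lower and upper halves."""
    x = sorted(values)
    n = len(x)
    half = n // 2
    lower = x[:half]
    # Assumption: for odd n the middle observation belongs to neither half.
    upper = x[half + 1:] if n % 2 else x[half:]
    return median(lower), median(x), median(upper)


def class_counts(n, data_range, iqr):
    """Number of classes by the square-root rule, Eq. (1.1.1) and Eq. (1.1.2)."""
    square_root = math.sqrt(n)
    sturges = 1 + 3.3 * math.log10(n)
    freedman_diaconis = data_range * n ** (1 / 3) / (2 * iqr)
    return square_root, sturges, freedman_diaconis


# Quartiles of a short hypothetical sample (not the timber data).
q1, q2, q3 = quartiles([28.0, 31.5, 33.2, 36.9, 39.1, 41.4, 44.6, 47.3, 52.8])

# Class counts from the summary values quoted in Example 1.3.
square_root, sturges, freedman_diaconis = class_counts(n=165, data_range=70.22, iqr=11.66)
print(f"{square_root:.2f} {sturges:.2f} {freedman_diaconis:.2f}")
```

The unrounded results are then turned into convenient integers, as in the example, so that the extended range divides into equal class widths.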
Rectangles are then erected over each of the classes, proportional in area to the class frequencies. When equal class widths are used, as shown here, the heights of the rectangles represent the frequencies. Thus, Figs. 1.1.3 and 1.1.4 are obtained.
The information conveyed by the two histograms seems to be similar. The diagrams are almost symmetrical with a peak in the class below 40 N/mm2 and a steady decrease on either side. This type of diagram usually brings out any possible imperfections in the data, such as the gaps at the ends. Further investigations are required to understand the true nature of the population. More on these aspects will follow in this and subsequent chapters.
3 There are alternatives, such as rounding (n + 1)/4 and (n + 1) × (3/4) to the nearest integers to calculate the locations of Q1 and Q3, respectively. The rounding is upward or downward, respectively, when the numbers fall exactly between two integers.
Fig. 1.1.3 Histogram for timber strength data with class width of 8 N/mm2
1.1.4 Frequency polygon
A frequency polygon is a useful diagnostic tool to determine the distribution of a variable. It can be drawn by joining the midpoints of the tops of the rectangles of a histogram after extending the diagram by one class on both sides. We assume that equal class widths are used. If the ordinates of a histogram are divided by the total number of observations, then a relative frequency histogram is obtained. Thus, the ordinates for each class denote the probabilities bounded by 0 and 1, by which we simply mean the chances of occurrence. The resulting diagram is called the relative frequency polygon.
Example 1.4 Timber strength. Corresponding to the histogram of Fig. 1.1.4, the values of class center are computed and a relative frequency polygon is obtained; this is shown in Fig. 1.1.5.
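The steps from class frequencies to a relative frequency polygon can be sketched as follows. The data are hypothetical (the timber values of Table 1.1.3 are not reproduced here), the function names are illustrative, and each class is assumed to include its lower boundary and exclude its upper one.

```python
def frequency_table(values, width, lower):
    """Return (class centre, absolute frequency, relative frequency) for equal classes.

    Class k covers lower + k*width <= value < lower + (k + 1)*width.
    """
    if min(values) < lower:
        raise ValueError("lower must not exceed the smallest observation")
    n_classes = int((max(values) - lower) // width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - lower) // width)] += 1
    n = len(values)
    return [(lower + (k + 0.5) * width, c, c / n) for k, c in enumerate(counts)]


def polygon_points(table, width):
    """Relative frequency polygon: class centres, extended by one empty class on each side."""
    first_centre, last_centre = table[0][0], table[-1][0]
    inner = [(centre, rel) for centre, _, rel in table]
    return [(first_centre - width, 0.0)] + inner + [(last_centre + width, 0.0)]


# Hypothetical strengths in N/mm2, grouped in classes of width 5 starting at 30.
strengths = [31.2, 33.8, 36.1, 37.4, 38.9, 39.0, 41.7, 42.5, 46.3, 52.0]
table = frequency_table(strengths, width=5.0, lower=30.0)
points = polygon_points(table, width=5.0)
for centre, rel in points:
    print(f"{centre:5.1f}  {rel:.2f}")
```

Joining the printed points in order gives the polygon; the two added end points, with zero frequency, bring it down to the horizontal axis.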
Fig. 1.1.4 Histogram for timber strength data with class width of 5 N/mm2
Fig. 1.1.5 Relative frequency polygon for timber strength data with class width of 5 N/mm2
As the number of observations becomes large, the class widths theoretically tend to decrease and, in the limiting case of an infinite sample, a relative frequency polygon becomes a frequency curve. This is in fact a probability curve, which represents a mathematical probability density function, abbreviated as pdf, of the population.4
1.1.5 Cumulative relative frequency diagram
If a cumulative sum is taken of the relative frequencies step by step from the smallest class to the largest, then the line joining the ordinates (cumulative relative frequencies) at the ends of the class boundaries forms a cumulative relative frequency or probability diagram. On the vertical axis of the graph, this line gives the probabilities of nonexceedance of values shown on the horizontal axis. In practice, this plot is made by utilizing and displaying every item of data distinctly, without the necessity of proceeding via a histogram and the restrictive categories that it entails. For this purpose, one may simply determine (e.g., from the ranked data of Table 1.1.3) the number of observations less than or equal to each value and divide these numbers by the total number of observations. This procedure is adopted here.5
Thus, the probability diagram, as represented by the cumulative relative frequency diagram, becomes an important practical tool. This diagram yields the median and other quartiles directly. Also, one can find the 9 values that divide the total frequency into 10 equal parts, called deciles, and the so-called percentiles, where the pth percentile is the value that is greater than p percent of the observations. In general, it is possible to obtain the (n − 1) values that divide the total frequency into n equal parts, called the quantiles. Hence a cumulative frequency polygon is also called a quantile or Q-plot; a Q-plot, though, has quantiles on the vertical axis, unlike a cumulative frequency diagram.
Example 1.5 Timber strength. Figure 1.1.6 is the cumulative frequency diagram obtained from the ranked timber strength data of Table 1.1.3 using each item of data as just described.
4 This function is discussed in Chapter 3. One of the first tasks in applying inferential statistics, as presented in Chapters 4 and 5, will be to estimate the mathematical function from a finite sample and examine its closeness to the histogram.
5 Further aspects of this subject, as related to probability plots, are described in Chapter 5.
Fig 1.1.6 Cumulative relative frequency diagram for timber strength data
The deciles and percentiles can be abstracted. By convention a vertical probability or proportionality scale is used rather than one giving percentages (except in duration curves, discussed shortly). The 90th percentile, for instance, is 51 N/mm2 approximately and the value 40 N/mm2 has a probability of nonexceedance of approximately 0.56.
If the sample size increases indefinitely, the cumulative relative frequency diagram will become a distribution curve in the limit. This represents the population by means of a (mathematical) distribution function, usually called a cumulative distribution function, abbreviated to cdf, just as a relative frequency polygon leads to a probability density function.
As a graphical method of ascertaining the distribution of the population, the quantile plot can be drawn using a modified nonlinear scale for the probabilities, which represents one of several types of theoretical distributions.6 Also, as shown in Section 1.4, two distributions can be compared using a Q-Q plot.
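The counting procedure used for the cumulative relative frequency diagram (the number of observations less than or equal to each value, divided by the total) can be sketched in a few lines. The sample is hypothetical, and `value_at` is an illustrative helper that reads a quantile as the smallest observation whose cumulative relative frequency reaches p; that reading is an assumption, since other conventions exist.

```python
from bisect import bisect_right


def nonexceedance(values):
    """Return (value, estimated probability of nonexceedance) for each ranked observation."""
    x = sorted(values)
    n = len(x)
    # bisect_right counts the observations less than or equal to v.
    return [(v, bisect_right(x, v) / n) for v in x]


def value_at(values, p):
    """Smallest observed value whose cumulative relative frequency is at least p."""
    for v, prob in nonexceedance(values):
        if prob >= p:
            return v
    return None


# Hypothetical sample of ten strengths in N/mm2.
sample = [28, 31, 33, 36, 39, 39, 42, 45, 48, 53]
diagram = nonexceedance(sample)
print(diagram)
print(value_at(sample, 0.5), value_at(sample, 0.9))
```

Tied observations, such as the two values of 39 here, share the same ordinate, so the plotted diagram steps up by more than 1/n at that value.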
1.1.6 Duration curves
For the assessment of water resources and for associated design and planning purposes, engineers find it useful to draw duration curves. When dealing with flows in rivers, this type of graph is known as a flow duration curve. It is in effect a cumulative frequency diagram with specific time scales. The vertical axis can represent, for example, the percentage of the time a flow is exceeded; and in addition, the number of days per year or season during which the flow is exceeded (or not) may be given. The volume of flow per day is given on the horizontal axis. For some purposes, the vertical and horizontal axes are interchanged as in a Q-plot. One example of a practical use is the scaled area enclosed by the curve, a horizontal line representing 100% of the time, and a vertical line drawn at a minimum value of flow, which is desirable to be maintained in the river. This area represents the estimated supplementary volume of water that should be diverted to the river on an annual basis to meet such an objective.
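A minimal sketch of these two calculations is given below, assuming a record of mean daily flows in m3/s. The five flows are hypothetical and far shorter than any record used in practice. The percentage is computed by rank as "equalled or exceeded", and the supplementary volume is taken as the sum of daily shortfalls below the minimum flow; both are illustrative conventions rather than the procedure of the text.

```python
SECONDS_PER_DAY = 86_400


def flow_duration(daily_flows):
    """Return (flow, percentage of time the flow is equalled or exceeded), largest flow first."""
    q = sorted(daily_flows, reverse=True)
    n = len(q)
    return [(flow, 100.0 * (rank + 1) / n) for rank, flow in enumerate(q)]


def supplementary_volume(daily_flows, q_min):
    """Volume in m3 needed to raise every daily flow below q_min up to q_min."""
    shortfall = sum(max(q_min - q, 0.0) for q in daily_flows)  # (m3/s) summed over days
    return shortfall * SECONDS_PER_DAY


# Hypothetical mean daily flows in m3/s.
flows = [4.0, 6.5, 9.0, 12.0, 18.5]
curve = flow_duration(flows)
volume = supplementary_volume(flows, q_min=7.0)
print(curve)
print(volume)
```

The shortfall sum corresponds to the area between the curve and the vertical line at the minimum flow, scaled by the number of seconds in a day.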
Example 1.6 Streamflow duration. Figure 1.1.7 gives the flow duration curve of the Dora Riparia River in the Alpine region of northern Italy, calculated over a period of 47 years from the records at Salbertrand gauging station. This figure is drawn using the same procedure adopted for a cumulative relative frequency diagram, such as Fig. 1.1.6. For instance, suppose it is decided to divert a proportion of the discharges above 10 m3/s and below 20 m3/s from the river. Then the area bounded by the curve and the vertical lines drawn at these discharges, using the vertical scale on the left-hand side, will give the estimated maximum amount available for diversion during the year in m3 after multiplication by the number of seconds in a day. This area is hatched in Fig. 1.1.7. If such a decision were to be implemented over a long-term basis, it would be essential to use a long series of data and to estimate the distribution function.
Fig. 1.1.7 Flow duration curve of Dora Riparia River at Salbertrand in the Alpine region of Italy
6 This method is demonstrated in Section 5.8.
1.1.7 Summary of Section 1.1
In this section we have introduced some of the basic graphical methods. Other procedures such as stem-and-leaf plots and scatter diagrams are presented in Sections 1.3 and 1.4, respectively. More advanced plots are introduced in Chapters 5 and 6. In the next section we discuss associated numerical methods.
1.2 NUMERICAL SUMMARIES OF DATA
Useful graphical procedures for presenting data and extracting knowledge on variability and other properties were shown in Section 1.1. There is a complementary method through which much of the information contained in a data set can be represented economically and conveyed or transmitted with greater precision. This method utilizes a set of characteristic numbers to summarize the data and highlight their main features. These numerical summaries represent several important properties of the histogram and the relative frequency polygon. The most important purpose of these descriptive measures is for statistical inference, a role that graphs cannot fulfill. Basically, there are three distinctive types: measures of central tendency, of dispersion, and of asymmetry, all of which can be visualized through the histogram as discussed in Section 1.1. The additional measure of "peakedness," that is, the relative height of the peak, requires a large sample for its estimation and is mainly relevant in the case of symmetric distributions.
1.2.1 Measures of central tendency
Generally data from many natural systems, as well as those devised by humans, tend to cluster around some values of variables. A particular value, known as the central value, can be taken as a representative of the sample. This feature is called central tendency because the spread seems to take place about a center. The definition of the central value is flexible, and its magnitude is obtained through one of the measures of its location. There are three such well-known measures: the mean, the mode, and the median. The choice depends on the use or application of the central value.
The sample arithmetic mean is estimated from a sample of observations, $x_1, x_2, \ldots, x_n$, as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{1.2.1}$$
The population value of the mean is denoted by μ. We reiterate our definition of population with reference to a phenomenon such as that represented by the timber strength data of Table E.1.1. A population is the aggregate of observations that might result by making an experiment in a particular manner.
The sample mean has a disadvantage because it may sometimes be affected by unexpectedly high or low values, called outliers. Such values do not seem to conform to the distribution of the rest of the data. There may be physical reasons for outliers. Their presence may be attributed to conditions that have perhaps changed from what were assumed, or because the data are generated by more than one process. On the other hand, they may arise on account of errors of faulty instrumentation, measurement, observation, or recording. The engineer must examine any visible outliers and ascertain whether they are erroneous or whether their inclusion is justifiable. The occurrence of any improbable value requires careful scrutiny in practice, and this should be followed by rectification or elimination if there are valid reasons for doing so.
Example 1.7 Timber strength. A case in point is the value of zero in the timber strength data of Table E.1.1. This value is retained here for comparative purposes. The mean of the 165 items, which is 39.09 N/mm2, becomes 39.33 N/mm2 without the value of zero.
Example 1.8 Concrete test. Table E.1.2 is a list of the densities and compressive strengths at 28 days from the results of 40 concrete cube test records conducted in Barton-on-Trent, England, during the period 8 July 1991 to 21 September 1992, and arranged in reverse chronological order.
These have sample means of 2445 kg/m3 and 60.14 N/mm2, respectively. The two numbers are measures of location representing the density and compressive strength of concrete.
With many discordant values at the extremes, a trimmed mean, such as a 5% trimmed mean, may be calculated. For this purpose, the data are ranked and the mean is obtained after ignoring 5% of the observations from each of the two extremities (see Problem 1.16).
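A trimmed mean can be sketched as follows. The 20 values are hypothetical, and the number of observations dropped at each end is rounded down to an integer, which is one possible convention rather than one stated in the text.

```python
def trimmed_mean(values, fraction=0.05):
    """Mean after dropping the given fraction of ranked observations from each end."""
    x = sorted(values)
    k = int(len(x) * fraction)  # observations dropped at each end (rounded down; an assumption)
    kept = x[k:len(x) - k]
    return sum(kept) / len(kept)


# Hypothetical sample of 20 strengths with one suspiciously low and one high value.
data = [0.0, 31.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 38.0, 39.0,
        39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 47.0, 49.0, 70.0]
plain = sum(data) / len(data)
trimmed = trimmed_mean(data, fraction=0.05)
print(plain, trimmed)
```

With 20 observations and a 5% trim, exactly one value is discarded from each end, so the zero and the largest value no longer influence the result.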
The technique of coding is sometimes used to facilitate calculations when the data are given to several significant figures but the digits are constant except for the last few. For example, the densities in Table E.1.2 are higher than 2400 kg/m3 and less than 2500 kg/m3, so that the number 2400 can be subtracted from the densities. The remainders will retain the essential characteristics of the original set (apart from the enforced shift in the mean), thus simplifying the arithmetic.
In considering the entire data set, a weighted mean is obtained if the variables of a sample are multiplied by numbers called weights and then divided by the sum of the weights. It is used if some variables should contribute more (or less) to the average than others.
The median is the central value in an ordered set or the average of the two central values if the number of values, n, is even, as specified in Section 1.1.
Example 1.9 Concrete test. The calculation of the median and other measures of location will be greatly facilitated if the data are arranged in order of magnitude. For example, the compressive strengths of concrete given in Table E.1.2 are rewritten in ascending order in Table 1.2.1.
The median of these data is 60.1 N/mm2, which is the average of 60.0 and 60.2 N/mm2.
The median of the timber strength data of Table 1.1.3 is 39.05 N/mm2, as noted in the table. The median has an advantage over the mean. It is relatively unaffected by outliers and is thus often referred to as a resistant measure. For instance, the exclusion of the zero value in Table 1.1.3 results only in a minor change of the median from 39.05 to 39.10 N/mm2.
One of the countless practical uses of the median is the application of a disinfectant to many samples of bacteria. Here, one seeks an association between the proportion of bacteria destroyed and the strength of the disinfectant. The concentration that kills 50% of the bacteria is the median dose. This is termed LD50 (lethal dose for 50%) and provides an excellent measure.
The mode is the value that occurs most frequently. Quite often the mode is not unique because two or more sets of values have equal status. For this reason and for convenience, the mode is often taken from the histogram or frequency polygon.
Example 1.10 Concrete test. For the ranked compressive strengths of concrete in Table 1.2.1, the mode is 60.5 N/mm2.
Example 1.11 Timber strength. From Fig. 1.1.4, for example, the mode of the timber strength data is 37.5 N/mm2, which corresponds to the midpoint of the class with the highest frequency. However, there is ambiguity in the choice of the class widths as already noted. On the other hand, in Table 1.1.3 there are nine values in the range 38.64 to 39.34 N/mm2, and thus 39 N/mm2 seems a more representative value, but this problem can only be resolved theoretically.
Table 1.2.1 Ordered data of density and compressive strength of concrete a
a The original data sets are given in Table E.1.2.
As the sample size becomes indefinitely large, the modal value will correspond to the peak of the relative frequency curve on a theoretical basis. The mode may often have greater practical significance than the mean and the median. It becomes more useful as the asymmetry of the distribution increases. For instance, if an engineer were to ask a person who sits habitually on the banks of a river fishing to indicate the mean level of the river, he or she is inclined to point out the modal level. It is the value most likely to occur and it is not affected by exceptionally high or low values. Clearly, the deletion of the zero value from Table 1.1.3 does not alter the mode, as we have also seen in the case of the median.
These positive attributes of the mode and median notwithstanding, the mean is indispensable for many theoretical purposes. Also in the same class as the sample arithmetic mean, there are two other measures of location that are used in special situations. These are the harmonic and geometric means.
The harmonic mean is the reciprocal of the mean of the reciprocals. Thus the harmonic mean for a sample of observations, $x_1, x_2, \ldots, x_n$, is defined as
$$\bar{x}_h = \frac{1}{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n} \dfrac{1}{x_i}}.$$
It is applied in situations where the reciprocal of a variable is averaged.
Example 1.12 Stream flow velocity. A practical example of the harmonic mean is the determination of the mean velocity of a stream based on measurements of travel times over a given reach of the stream using a floating device. For instance, if three velocities are calculated as 0.20, 0.24, and 0.16 m/s, then the sample harmonic mean is
$$\bar{x}_h = \frac{1}{(1/3)\left[(1/0.20) + (1/0.24) + (1/0.16)\right]} = 0.19 \text{ m/s}.$$
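The arithmetic of Example 1.12 can be checked directly. The sketch computes the reciprocal of the mean of the reciprocals and compares it with `statistics.harmonic_mean` from the Python standard library; the variable names are illustrative.

```python
from statistics import harmonic_mean

# Velocities in m/s from Example 1.12.
velocities = [0.20, 0.24, 0.16]

# Reciprocal of the mean of the reciprocals.
by_definition = 1.0 / (sum(1.0 / v for v in velocities) / len(velocities))
by_library = harmonic_mean(velocities)
arithmetic = sum(velocities) / len(velocities)

print(f"{by_definition:.2f} {by_library:.2f} {arithmetic:.2f}")
```

Averaging the reciprocals is appropriate here because the travel times over a fixed reach, not the velocities themselves, are the measured quantities.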
The geometric mean is used in averaging values that represent a rate of change. Here the variable follows an exponential, that is, a logarithmic law. For a sample of observations, $x_1, x_2, \ldots, x_n$, the geometric mean is the positive nth root of the product of the n values. This is the same as the antilog of the mean of the logarithms:
$$\bar{x}_g = (x_1 x_2 \cdots x_n)^{1/n} = \operatorname{antilog}\left[\frac{1}{n}\sum_{i=1}^{n} \log x_i\right].$$
Example 1.13 Population growth. Consider the case of populations of towns and cities that increase geometrically, which means that a future increase is expected that is proportional to the current population. Such information is invaluable for planning and designing urban water supplies and sewerage systems. Suppose, for example, that according to a census conducted in 1970 and again in 1990 the population of a city had increased from 230,000 to 310,000. An engineer needs to verify, for purposes of design, the per capita consumption of water in the intermediate period and hence tries to estimate the population in 1980. The central value to use in this situation is the geometric mean of the two numbers, which is
$$\bar{x}_g = (230{,}000 \times 310{,}000)^{1/2} = 267{,}021.$$
(Note that the sample arithmetic mean $\bar{x}$ = 270,000.)
As we see in Example 1.13, the geometric mean is less than the arithmetic mean.7
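The two equivalent forms of the geometric mean can be verified with the census figures of Example 1.13. This is a small check only; the variable names are illustrative.

```python
import math

# Census populations from Example 1.13.
p1970, p1990 = 230_000, 310_000

by_root = math.sqrt(p1970 * p1990)                              # nth root of the product (n = 2)
by_logs = math.exp((math.log(p1970) + math.log(p1990)) / 2)     # antilog of the mean logarithm
arithmetic = (p1970 + p1990) / 2

print(round(by_root), round(by_logs), arithmetic)
```

The logarithmic form is usually preferred for long series, because the running product of many large values can overflow or lose precision.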
1.2.2 Measures of dispersion
Whereas a measure of central tendency is obtained by locating a central or representative value, a measure of dispersion represents the degree of scatter shown by observations or the inherent variability in a phenomenon under observation. Dispersion also indicates the precision of the data. One method of quantification is through an order statistic, that is, one of ranked data.8 The simplest in the category is the range, which is the difference between the largest and smallest values, as defined in Section 1.1.
7 This theoretical property is demonstrated in Example 3.10.
8 We shall discuss order statistics formally in Chapter 7; see also Chapter 5.
Example 1.14 Timber strength. As noted before, the range of the timber strength data of Table 1.1.3 is 70.22 − 0.00 = 70.22 N/mm2.
Example 1.15 Concrete test. For the compressive strengths of concrete given in Table E.1.2 and ranked in Table 1.2.1, the range is r = 69.5 − 49.9 = 19.6 N/mm2; the range of the concrete densities is 2488 − 2411 = 77 kg/m3. These numbers provide a measure of the spread of the data in each case.
The range, however, is a nondecreasing function of the sample size and thus characterizes the population poorly. Moreover, the range is unduly affected by high and low values that may be somewhat incompatible with the rest of the data even though they may not always be classified as outliers. For this reason, the interquartile range, iqr, which is a relatively resistant measure, is preferable. As defined in Section 1.1, in a ranked set of data this is the difference between the median of the top half and the median of the bottom half.
Example 1.16 Concrete test. For the compressive strengths of concrete, the iqr is 6.55 N/mm2.
Example 1.17 Timber strength. The timber strength data in Table 1.1.3 have an iqr of 11.66 and 11.47 N/mm2, respectively, with or without the zero value. A similar and more general measure is given by the interval between two symmetrical percentiles. For example, the 90−10 percentile range for the timber strength data is approximately 52 − 28 = 24 N/mm2 from Fig. 1.1.6.
The aforementioned measures of dispersion can be easily obtained. However, their shortcoming is that, apart from two values or numbers equivalent to them, the vast information usually found in a sample of data is ignored. This criticism is not applicable if one determines the average deviation about some central value, thus including all the observations. For example, the mean absolute deviation, denoted by d, measures the average absolute deviation from the sample mean. For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined by
$$d = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|.$$
Example 1.18 Annual rainfall. If the annual rainfalls in a city are 50, 56, 42, 53, and 49 cm over a 5-year period, the absolute deviation with respect to the sample mean of 50 cm is
$$d = \frac{0 + 6 + 8 + 3 + 1}{5} = 3.6 \text{ cm}.$$
The sample standard deviation, s, is the root mean square deviation about the mean. Indeed, this is the principal measure of dispersion (although the interquartile range is meaningful and expedient). For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined by
$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{1.2.6}$$
by the largest and smallest values. The standard deviation of the population is denoted by σ. It is common practice to replace the divisor n of Eq. (1.2.6) by (n − 1) and denote the left-hand side by $\hat{s}$. Consequently, the estimate of the standard deviation is, on average, closer to the population value because it is said to have smaller bias. Therefore, Eq. (1.2.6) will, on average, give an underestimate of σ except in the rare case in which μ is known.9
The required modification to Eq. (1.2.6) is as follows:
$$\hat{s} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{1.2.7}$$
This reduction in n can be justified by means of the concept of degrees of freedom. It is a consequence of the fact that the sum of the n deviations $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$ is zero, which follows from Eq. (1.2.1) for the mean. Hence, regardless of the arrangement of the data, if any (n − 1) terms are specified the remaining term is fixed or known, because
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$
Example 1.19 Annual rainfall. From the annual rainfall data in Example 1.18 (50, 56, 42, 53, and 49 cm), one can estimate the standard deviation σ by using Eq. (1.2.7), as follows:
$$\hat{s} = \sqrt{\frac{1}{5-1}\left[(50-50)^2 + (56-50)^2 + (42-50)^2 + (53-50)^2 + (49-50)^2\right]} = \sqrt{\frac{1}{4}\left(0^2 + 6^2 + 8^2 + 3^2 + 1^2\right)} = \sqrt{\frac{110}{4}} = 5.24 \text{ cm}.$$
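The rainfall calculations can be reproduced in a few lines, which also shows how the divisor affects the result. The sketch computes the mean absolute deviation and the standard deviation with divisors n and (n − 1); the variable names are illustrative.

```python
import math

# Annual rainfalls in cm from Examples 1.18 and 1.19.
rain = [50, 56, 42, 53, 49]
n = len(rain)
mean = sum(rain) / n

mad = sum(abs(x - mean) for x in rain) / n        # mean absolute deviation
sum_sq = sum((x - mean) ** 2 for x in rain)       # sum of squared deviations
s = math.sqrt(sum_sq / n)                         # divisor n
s_hat = math.sqrt(sum_sq / (n - 1))               # divisor n - 1

print(f"mean={mean:.1f} d={mad:.1f} s={s:.2f} s_hat={s_hat:.2f}")
```

The standard library functions `statistics.pstdev` and `statistics.stdev` correspond to the divisor-n and divisor-(n − 1) forms, respectively.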
9 Terms such as bias are discussed formally in Section 5.2. It is shown in Example 5.1 that $\hat{s}^2$ is unbiased; however, $\hat{s}$ is known to have bias, though less than s on average.
Example 1.20 Timber strength. By using Eq. (1.2.7), the sample standard deviation of the timber strength data of Table E.1.1 is 9.92 N/mm2 (or 9.46 N/mm2 if the zero value is excluded).
Example 1.21 Concrete test. By using Eq. (1.2.7), the sample standard deviations for the density and compressive strength of concrete in Table E.1.2 are 15.99 kg/m3 and 5.02 N/mm2, respectively.
Dividing the standard deviation by the mean gives the dimensionless measure of dispersion called the sample coefficient of variation, v:
$$v = \frac{s}{\bar{x}}.$$
Example 1.22 Comparison of timber and concrete strength data. From the values of mean and standard deviation in Examples 1.7 and 1.20, the sample coefficient of variation of the timber strength data is 25.3% (or 24.0% without the value of zero). Similarly, from Examples 1.8 and 1.21 the density and compressive strength of concrete data have sample coefficients of 0.65 and 8.24%, respectively. The higher variation in the timber strength data is a reflection of the variability of the natural material, whereas the low variation in the density of the concrete is evidence of a uniform quality in the constituents and a high standard of workmanship, including care taken in mixing. The variation in the compressive strength of concrete is higher than that of its density. This can be attributed to random factors that influence strength, such as some subtle changes in the effectiveness of the concrete that do not alter its density.
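The percentages in Example 1.22 can be checked from the means and standard deviations quoted in Examples 1.7, 1.8, 1.20, and 1.21. The sketch assumes that the quoted standard deviations use the divisor (n − 1) and that the coefficient is formed from the divisor-n standard deviation. The second assumption is an inference from the rounded figures, not something stated in the text, and `cv_percent` is a hypothetical helper.

```python
import math


def cv_percent(mean, s_hat, n):
    """Coefficient of variation in percent, from the (n - 1)-divisor standard deviation."""
    s = s_hat * math.sqrt((n - 1) / n)  # convert to the divisor-n standard deviation (assumption)
    return 100.0 * s / mean


timber = cv_percent(mean=39.09, s_hat=9.92, n=165)
timber_without_zero = cv_percent(mean=39.33, s_hat=9.46, n=164)
concrete_density = cv_percent(mean=2445.0, s_hat=15.99, n=40)
concrete_strength = cv_percent(mean=60.14, s_hat=5.02, n=40)

print(f"{timber:.1f} {timber_without_zero:.1f} {concrete_density:.2f} {concrete_strength:.2f}")
```

Because the inputs are themselves rounded, agreement in the last digit should be read as a consistency check rather than a proof of which divisor was used.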
From the square of the sample standard deviation one obtains the sample variance, $\hat{s}^2$, which is the mean of the squared deviations from the mean. The population variance is denoted by $\sigma^2$. The variance, like the mean, is important in theoretical distributions.
By squaring Eqs. (1.2.6) and (1.2.7), two estimators of the population variance are found. Here estimator refers to a method of estimating a constant in a parent population. As in all the foregoing equations, this term means the random variable of which the estimate is a realization. An unbiased estimator is obtained from Eq. (1.2.7) because on average (that is, by repeated sampling) the estimator tends to the population variance $\sigma^2$. In other words, the expectation E, which is in effect the average from an infinite number of observations, of the square of the right-hand side of Eq. (1.2.7) is equal to $\sigma^2$.
There are also measures of dispersion pertaining to the mean of the deviations between the observations. Gini's mean difference, for example, is a long-standing method.10 This is given by
$$g = \frac{2}{n(n-1)}\sum_{i=1}^{n} (2i - n - 1)\,x_i,$$
in which the observations $x_1, x_2, \ldots, x_n$ are arranged in ascending order.
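Gini's mean difference is commonly defined as the average absolute difference over all pairs of observations, and for ranked data it can be written as a single weighted sum. The sketch below computes both forms with the divisor n(n − 1), which is an assumed convention, and applies them to the rainfall values of Example 1.18 purely as an illustration. The function names are hypothetical.

```python
from itertools import combinations


def gini_pairwise(values):
    """Mean absolute difference over all distinct pairs of observations."""
    n = len(values)
    total = sum(abs(a - b) for a, b in combinations(values, 2))
    return 2 * total / (n * (n - 1))


def gini_ordered(values):
    """Same quantity from the ranked data, using the weights (2i - n - 1)."""
    x = sorted(values)
    n = len(x)
    weighted = sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1))
    return 2 * weighted / (n * (n - 1))


rain = [50, 56, 42, 53, 49]  # rainfall values of Example 1.18, reused for illustration
g_pairs = gini_pairwise(rain)
g_ranked = gini_ordered(rain)
print(g_pairs, g_ranked)
```

The ranked form needs only one pass over the sorted data, whereas the pairwise form examines n(n − 1)/2 pairs, which matters for large samples.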
10 See, for example, Stuart and Ord (1994, p. 58) for more details of this method originated by the Italian mathematician, Gini. See also Problem 1.7.
1.2.3 Measure of asymmetry
Another important property of the histogram or frequency polygon is its shape with respect to symmetry (on either side of the mode). The sample coefficient of skewness measures the asymmetry of a set of data about its mean. For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined by
$$g_1 = \frac{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}.$$
Division by the cube of the sample standard deviation gives a dimensionless measure.
A histogram is said to have positive skewness if it has a longer tail on the right, which is toward increasing values, than on the left. In this case the number of values less than the mean is greater than the number that exceeds the mean. Many natural phenomena tend to have this property. For a positively skewed histogram,
mode < median < mean.
This inequality is reversed if skewness is negative. A symmetrical histogram suggests zero skewness.
Example 1.23 Comparison of timber and concrete strength data. The coefficients of skewness of the timber strength data of Table E.1.1 and the compressive strength data of Table E.1.2 are 0.15 (or 0.53 after excluding the zero value) and 0.03, respectively. These indicate a small skewness in the first case and a symmetrical distribution in the second case.
The example indicates that this measure of skewness is sensitive to the tails of the distribution.
1.2.4 Measure of peakedness
The extent of the relative steepness of ascent in the vicinity and on either side of the mode in a histogram or frequency polygon is said to be a measure of its peakedness or tail weight. This is quantified by the dimensionless sample coefficient of kurtosis, which is defined for a sample of observations, $x_1, x_2, \ldots, x_n$, by
$$g_2 = \frac{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^4}{s^4}. \tag{1.2.11}$$
Example 1.24 Comparison of timber and concrete strength data. The kurtosis of the timber strength data of Table E.1.1 is 4.46 (or 3.57 without the zero value) and that of the compressive strengths of Table E.1.2 is 2.33. One can easily see from Eq. (1.2.11) that even a small variation in one of the items of data may influence the kurtosis significantly. This observation warrants a large sample size, perhaps 200 or greater, for the estimation of the kurtosis. Small sample sizes, particularly in the second case with n = 40, preclude the attachment of any special significance to these estimates.
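The sensitivity of the moment-based coefficients to a single observation can be illustrated with a small sketch. It assumes that both coefficients are formed from central moments with divisor n, divided by the matching power of the divisor-n standard deviation; other conventions apply small-sample corrections. The data and the function name are hypothetical.

```python
import math


def moment_coefficients(values):
    """Return (skewness, kurtosis) from central moments with divisor n (an assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    s = math.sqrt(m2)
    return m3 / s ** 3, m4 / s ** 4


symmetric = [1, 2, 3, 4, 5]
one_changed = [1, 2, 3, 4, 9]   # same data with the largest value altered

g1_sym, g2_sym = moment_coefficients(symmetric)
g1_chg, g2_chg = moment_coefficients(one_changed)
print(f"{g1_sym:.2f} {g2_sym:.2f}")
print(f"{g1_chg:.2f} {g2_chg:.2f}")
```

Altering one value turns a symmetric sample into a positively skewed one and shifts the kurtosis as well, which is why estimates from small samples deserve caution.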
1.2.5 Summary of Section 1.2
Of the numerical summaries listed here, the mean, standard deviation, and coefficient of skewness are the best representative measures of the histogram or frequency polygon, from both visual and theoretical aspects. These provide economical measures for summarizing the information in a data set. Sample estimates for the data we have been discussing here, including the coefficients of variation and kurtosis, are given in Table 1.2.2.
Table 1.2.2 Sample estimates of numerical summaries of the timber strength data of Table 1.1.3 and the concrete strength and density data of Table 1.2.1
Sample Standard Coefficient of Coefficient Coefficient
Timber strength (full sample)
1.3.1 Stem-and-leaf plot
The histogram is a highly effective graphical procedure for showing various characteristics of data, as seen in Section 1.1. However, for smaller samples, less than, say, 40 in size, it may not give a clear indication of the variability and other properties of the data. The stem-and-leaf plot, which resembles a histogram turned through a right angle, is a useful procedure in such cases. Its advantage is that the data are grouped without loss of information because the magnitudes of all the values are presented. Furthermore, its intrinsic tabular form highlights extreme values and other characteristics that a histogram may obscure. As in a histogram, the data are initially ranked in ascending order, but a different approach is adopted in finding the number of classes. The class widths are almost invariably equal. For the increments or class intervals (and hence class widths) one uses 0.5, 1, or 2 multiplied by a power of 10, which means that the intervals are in units such as 0.1 or 200 or 10,000, which are more tractable than, say, 0.13 or 140 or 12,000.
The terminology is best explained through the following worked example.
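As a rough illustration of the interval rule, the following Python sketch counts the classes that result from several candidate widths of the form 0.5, 1, or 2 multiplied by a power of 10. The function name `class_count` is hypothetical, and the sketch assumes that the lower class boundaries fall on multiples of the width. The extreme values 49.9 and 69.5 are those quoted for the concrete strength data in the example that follows.

```python
import math
from fractions import Fraction


def class_count(lo, hi, width):
    """Number of equal-width classes needed to cover the range lo..hi.

    Assumes lower class boundaries at multiples of `width`.
    Exact fractions avoid floating-point trouble with widths such as 0.1.
    """
    lo, hi, width = (Fraction(str(v)) for v in (lo, hi, width))
    return math.floor(hi / width) - math.floor(lo / width) + 1


# Candidate widths: 0.5, 1, or 2 multiplied by a power of 10.
for width in (0.5, 1, 2, 5):
    print(f"width {width}: {class_count(49.9, 69.5, width)} classes")
```

A width of 1 gives 21 classes with lower boundaries at 49, 50, ..., 69, while widths of 2 or 5 give a coarser plot with 11 or 5 classes.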
Example 1.25 Concrete test. For the concrete strength data of Table E.1.2, the maximum and minimum values are 69.5 and 49.9 N/mm², respectively. As a first choice, the data can be divided into 21 classes in intervals of 1 N/mm² with lower boundaries at 49, 50, 51 N/mm², and so on, up to 69 N/mm². For the ordered stem-and-leaf plot of Fig. 1.3.1, a vertical line is drawn with the class boundaries marked in increasing order immediately to its left.
The boundary values are called the leading digits and, together with the vertical line, constitute the stem. The trailing digits on the right represent the items of data in increasing order when read jointly with the leading digits using the indicated units. They are termed leaves, and their counts are the class frequencies. Thus the digits 49 (stem) and 9 (leaf) constitute 49.9. It is useful to provide an additional column at the extreme left, as shown here, giving the cumulative frequencies, called depths, up to each class. This is completed
The diagram gives all the information in the data, which is its main advantage. Furthermore, the range, median, symmetry, or gaps in the data, frequently occurring values, and any possible outliers can be highlighted. In this example, a symmetrical distribution is indicated. The plot may be redrawn with a smaller number of classes, perhaps for greater clarity, using the guidelines for choosing the intervals stipulated previously. The units of data in a plot can be rounded to any number of significant figures as necessary. Also, the number of stems in a plot can be doubled by dividing each stem into two lines. When 1 multiplied by a power of 10 is used as an interval, for example, the first line, which is denoted by an asterisk (∗), will thus have leaves 0 to 4, and the leaves of the second, represented by a period (.), will be from 5 to 9. Likewise, one may divide a stem into five lines. The stem-and-leaf plot is best suited for small to moderate sample sizes, say, less than 200.
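The construction described in this example can be sketched in Python as follows. The sample values are hypothetical and are not the data of Table E.1.2. The function name `stem_and_leaf` and its `split` option are illustrative. The sketch assumes non-negative data recorded to one decimal place, and it accumulates the depths simply from the lowest class upward, which is an assumption about the layout of the depth column.

```python
def stem_and_leaf(data, split=False):
    """Return the rows of an ordered stem-and-leaf plot as strings.

    Assumes non-negative values recorded to one decimal place, so the
    stem is the integer part and the leaf is the first decimal digit.
    With split=True each stem is divided into two lines: '*' holds
    leaves 0 to 4 and '.' holds leaves 5 to 9.
    """
    tenths = sorted(round(x * 10) for x in data)   # e.g. 49.9 -> 499
    leaves = {}
    for t in tenths:
        stem, leaf = divmod(t, 10)                 # 499 -> stem 49, leaf 9
        line = leaf // 5 if split else 0           # 0: first line, 1: second line
        leaves.setdefault((stem, line), []).append(leaf)

    rows, depth = [], 0
    for stem in range(tenths[0] // 10, tenths[-1] // 10 + 1):
        for line in range(2 if split else 1):
            row = leaves.get((stem, line), [])     # empty stems are kept to show gaps
            depth += len(row)                      # cumulative frequency up to this class
            mark = "*."[line] if split else ""
            rows.append(f"{depth:3d}  {stem}{mark} | {''.join(map(str, row))}")
    return rows


# Hypothetical strengths in N/mm², for illustration only.
sample = [49.9, 51.2, 51.8, 52.0, 52.4, 52.4, 53.7, 55.1]
rows = stem_and_leaf(sample)
rows_split = stem_and_leaf(sample, split=True)
print("\n".join(rows))
```

Each row shows the depth, the stem, and the leaves in increasing order. Stems with no leaves are retained so that gaps in the data remain visible.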