Živorad R. Lazić

Design of Experiments in Chemical Engineering

A Practical Guide
Library of Congress Card No.: applied for.

British Library Cataloguing-in-Publication Data:
A catalogue record for this book is available from the British Library.

Bibliographic information published by Die Deutsche Bibliothek
Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet at <http://dnb.ddb.de>.

© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Printed on acid-free paper.

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form, by photoprinting, microfilm, or any other means, nor transmitted or translated into machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.

Composition: Kühn & Weyh, Satz und Medien, Freiburg
Printing: Strauss GmbH, Mörlenbach
Bookbinding: Litges & Dopf Buchbinderei GmbH, Heppenheim

Printed in the Federal Republic of Germany. ISBN 3-527-31142-4
To Anica, Neda and Jelena
Contents

Preface

I Introduction to Statistics for Engineers
1.1 The Simplest Discrete and Continuous Distributions
1.7.1 Correlation in Linear Regression
1.7.2 Correlation in Multiple Linear Regression

II Design and Analysis of Experiments
2.0 Introduction to Design of Experiments (DOE)
2.1 Preliminary Examination of Subject of Research
2.1.1 Defining Research Problem
2.1.2 Selection of the Responses
2.1.3 Selection of Factors, Levels and Basic Level
2.1.4 Measuring Errors of Factors and Responses
2.2 Screening Experiments
2.2.1 Preliminary Ranking of the Factors
2.2.2 Active Screening Experiment-Method of Random Balance
2.2.3 Active Screening Experiment Plackett-Burman Designs
2.2.4 Completely Randomized Block Design
2.2.5 Latin Squares
2.2.6 Graeco-Latin Square
2.2.7 Youden's Squares
2.3 Basic Experiment-Mathematical Modeling
2.3.1 Full Factorial Experiments and Fractional Factorial Experiments
2.3.2 Second-order Rotatable Design (Box-Wilson Design)
2.3.3 Orthogonal Second-order Design (Box-Behnken Design)
2.3.4 D-optimality, Bk-designs and Hartley's Second-order Designs
2.3.5 Conclusion after Obtaining Second-order Model
2.4 Statistical Analysis
2.4.1 Determination of Experimental Error
2.4.2 Significance of the Regression Coefficients
2.4.3 Lack of Fit of Regression Models
2.5 Experimental Optimization of Research Subject
2.5.1 Problem of Optimization
2.5.2 Gradient Optimization Methods
2.5.3 Nongradient Methods of Optimization
2.5.4 Simplex Sum Rotatable Design
2.6 Canonical Analysis of the Response Surface
2.7 Examples of Complex Optimizations

III Mixture Design "Composition-Property"
3.1 Screening Design "Composition-Property"
3.1.1 Simplex Lattice Screening Designs
3.1.2 Extreme Vertices Screening Designs
3.2 Simplex Lattice Design
3.3 Scheffé Simplex Lattice Design
3.4 Simplex Centroid Design
3.5 Extreme Vertices Designs
3.6 D-optimal Designs
3.7 Draper-Lawrence Design
3.8 Factorial Experiments with Mixture
3.9 Full Factorial Combined with Mixture Design-Crossed Design

Appendix
A.1 Answers to Selected Problems
A.2 Tables of Statistical Functions

Index
Preface

The last twenty years of the last millennium were characterized by the complex automatization of industrial plants. Complex automatization of industrial plants means a switch to factories, automatons, robots and self-adaptive optimization systems. The mentioned processes can be intensified by introducing mathematical methods into all physical and chemical processes. By knowing the mathematical model of a process it is possible to control it, maintain it at an optimal level, provide maximal yield of the product, and obtain the product at a minimal cost. Statistical methods in the mathematical modeling of a process should not be opposed to traditional theoretical methods of complete theoretical study of a phenomenon. The higher the theoretical level of knowledge, the more efficient is the application of statistical methods like design of experiments (DOE).
To design an experiment means to choose the optimal experiment design to be used for varying all the analyzed factors simultaneously. By designing an experiment one gets more precise data and more complete information on the studied phenomenon with a minimal number of experiments and the lowest possible material costs. The development of statistical methods for data analysis, combined with the development of computers, has revolutionized research and development work in all domains of human activity.
Because statistical methods are abstract and insufficiently known to all researchers, the first chapter offers the basics of statistical analysis with actual examples, physical interpretations and solutions to problems. Basic probability distributions with statistical estimations and with tests of null hypotheses are demonstrated. A detailed analysis of variance (ANOVA) is given for the screening of factors according to the significance of their effects on system responses. For the statistical modeling of significant factors by linear and nonlinear regression, sufficient attention has been dedicated to regression analysis.
The introduction to design of experiments (DOE) offers an original comparison between so-called classical experimental design (one factor at a time, OFAT) and statistically designed experiments (DOE). Depending on the research objective and subject, screening experiments (preliminary ranking of the factors, method of random balance, completely randomized block design, Latin squares, Graeco-Latin squares, Youden's squares) and then basic experiments (full factorial experiments, fractional factorial experiments, ...) are treated.
The third section of the book is dedicated to studies in the mixture design field. The same methodological approach has been kept in this field too. One begins with screening experiments (simplex lattice screening designs, extreme vertices designs of mixture experiments as screening designs), proceeds through simplex lattice design, Scheffé's simplex lattice design, simplex centroid design, extreme vertices design, D-optimal design, Draper-Lawrence design and full factorial mixture design, and ends with factorial designs of process factors combined with mixture designs, the so-called "crossed" designs.
The significance of mixture design for developing new materials should be particularly stressed. The book is meant for all experts who are engaged in research, development and process control.
Apart from the theoretical basics, the book contains a large number of practical examples and problems with solutions. This book has come into being as a product of many years of research activities in the Military Technical Institute in Belgrade. The author is especially pleased to offer his gratitude to Prof. Dragoljub V. Vuković, Ph.D., Branislav Djukić, M.Sc. and Paratha Sarathy, B.Sc. For technical editing of the manuscript I express my special gratitude to Predrag Jovanić, Ph.D., Drago Jauković, B.Sc., Vesna Lazarević, B.Sc., Stevan Raković, machine technician, Dušanka Glavač, chemical technician and Ljiljana Borković.
I
Introduction to Statistics for Engineers

Natural processes and phenomena are conditioned by the interaction of various factors. By studying cause-factor and phenomenon-response relationships, science has, to varying degrees, succeeded in penetrating the essence of phenomena and processes. Exact sciences can, by the quality of their knowledge, be ranked into three levels. The top level is the one where all factors that are part of an observed phenomenon are known, as well as the natural law by which they interact to realize the observed phenomenon. The relationship of all factors in the natural law of the phenomenon is given by a formula, a mathematical model. As examples, generally known natural laws can be cited, such as the system of differential equations used to define the flow of an ideal fluid. The middle level of knowledge is the one where the factors of a phenomenon are known but the form of the law is not, so that experiments and empirical correlation of the data are needed to reach the noticed natural law.
As an example of this level of knowledge about a phenomenon we can cite the following empirical dependencies. The Darcy-Weisbach law for the pressure drop of a fluid flowing through a pipe [1]:

\Delta p = \lambda \, \frac{L}{D} \, \frac{\rho W^2}{2}

Ergun's equation for the pressure drop of a fluid flowing through a bed of solid particles [1]:

\frac{\Delta p}{L} = 150\,\frac{\mu W (1-\varepsilon)^2}{\varepsilon^3 d_p^2} + 1.75\,\frac{\rho W^2 (1-\varepsilon)}{\varepsilon^3 d_p}
The lowest level in observing a phenomenon is when we are faced with a totally new phenomenon where both the factors and the law of changes are unknown to us, i.e. the outcomes-responses of the observed phenomenon are random values for us. This randomness is objectively a consequence of our inability to observe simultaneously all relations and influences of all factors on system responses. Through its development, science continually discovers new connections, relationships and factors, which shifts the limits between randomness and lawfulness. Based on this analysis one can conclude that stochastic processes are phenomena that are neither completely random nor strictly determined, i.e. random and deterministic phenomena are the left and right limits of stochastic phenomena.
In order to find stochastic relationships, present-day engineering practice uses, among other things, experiment and statistical analysis of the obtained results.
Statistics, the science of the description and interpretation of numerical data, began in its most rudimentary form in the census and taxation of ancient Egypt and Babylon. Statistics progressed little beyond this simple tabulation of data until the theoretical developments of the eighteenth and nineteenth centuries. As experimental science developed, the need grew for improved methods of presentation and analysis of numerical data.
The pioneers in mathematical statistics, such as Bernoulli, Poisson, and Laplace, had developed statistical and probability theory by the middle of the nineteenth century. Probably the first instance of applied statistics came in the application of probability theory to games of chance. Even today, probability theorists frequently choose
a coin or a deck of cards as their experimental model. Application of statistics in biology developed in England in the latter half of the nineteenth century. The first important application of statistics in the chemical industry occurred in a factory in Dublin, Ireland, at the turn of the century. Out of the need to approach the solution of some technological problems scientifically, several graduate mathematicians from Oxford and Cambridge, including W. S. Gosset, were engaged. Having accepted the job in 1899, Gosset applied his knowledge of mathematics and chemistry to control the quality of finished products. His method of small samples was later applied in all fields of human activity. He published his method in 1907 under the pseudonym "Student", by which it is known even today. This method had been applied to a limited extent in industry up to 1920; wider applications were registered during World War Two in the military industries. Since then, statistics and probability theory have been applied in all fields of engineering.
With the development of electronic computers, statistical methods began to thrive and to take an ever more important role in empirical research and system optimization. Statistical methods of researching phenomena can be divided into two basic groups. The first one includes methods of recording and processing-description of the variables of observed phenomena and belongs to descriptive statistics. As a result of applying descriptive statistics we obtain numerical information on observed phenomena, i.e. statistical data that can be presented in tables and graphs. The second group is represented by the methods of statistical analysis, the task of which is to clarify the observed variability by means of classification and correlation indicators of statistical series. This is the field of inferential statistics, which, however, cannot be strictly set apart from descriptive statistics.
The subjects of statistical research are the population (universe, statistical mass, basic universe, completeness) and samples taken from a population. The population must be representative of the output of, for example, a continual chemical process with respect to some features, i.e. properties of the given products. If we are to find a property of a product, we have to take a sample from the population, which, by the theory of mathematical statistics, is usually an infinite collection of elements-units.
For example, we can take each hundredth sample from a steady process and expose it to chemical analysis or some other treatment in order to establish a certain property (taking a sample from a chemical reactor with the idea of establishing the yield of the chemical reaction, taking a sample out of a rocket propellant with the idea of establishing mechanical properties such as tensile strength, elongation at break, etc.). After taking a sample and obtaining its properties we can apply descriptive statistics to characterize the sample. However, if we wish to draw conclusions about the population from the sample, we must use the methods of statistical inference.
What can we infer about the population from our sample? Obviously the sample must be a representative selection of values taken from the population, or else we can infer nothing. Hence, we must select a random sample.
A random sample is a collection of values selected from a population of values in such a way that each value in the population had an equal chance of being selected. Often the underlying population is completely hypothetical. Suppose we make five runs of a new chemical reaction in a batch reactor at constant conditions, and
then analyze the product. Our sample is the data from the five runs; but where is the population? We can postulate a hypothetical population of "all runs made at these conditions now and in the future". We take a sample and conclude that it will be representative of a population consisting of possible future runs, so the population may well be infinite.
If our inferences about the population are to be valid, we must make certain that future operating conditions are identical with those of the sample.
For a sample to be representative of the population, it must contain data over the whole range of values of the measured variables; we cannot extrapolate conclusions to other ranges of the variables. A single value computed from a series of observations (a sample) is called a "statistic".
Mean, median and mode as measures of location
By the sample mean X̄ we understand the value that is the arithmetic average of the property values X1, X2, X3, ..., Xn. When we say average, we are frequently referring to the sample mean, which is defined as the sum of all the values in the sample divided by the number of values in the sample. The sample mean-average is the simplest and most important of all measures of location:

\bar{X} = \frac{\sum X_i}{n}

where:
X̄ is the sample mean-average of the n values,
Xi is any given value from the sample.
The symbol X̄ is the symbol used for the sample mean. It is an estimate of the mean of the underlying population, which is designated μ. We can never determine μ exactly from the sample, except in the trivial case where the sample includes the entire population, but we can estimate it quite closely from the sample mean. Another average that is frequently used as a measure of location is the median. The median is defined as that observation in the sample that has the same number of observations below it as above it, i.e. the central observation of a sample whose values are arranged in order of size.
A third measure of location is the mode, which is defined as that value of the measured variable for which there are the most observations. The mode is the most probable value of a discrete random variable, while for a continuous random variable it is the value at which the probability density function reaches its maximum. Practically speaking, it is the value of the measured response, i.e. of the property, that occurs most frequently in the sample. The mean is the most widely used, particularly in statistical analysis. The median is occasionally more appropriate than the mean as a measure of location. The mode is rarely used. For symmetrical distributions, such as the normal distribution, the three values are identical.
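As a quick illustration, the short Python sketch below computes the three measures of location with numpy and the standard library; the sample values are purely hypothetical.

```python
import numpy as np
from statistics import mode

# Hypothetical sample of a measured property (e.g. seven yield measurements)
sample = [4.8, 5.1, 5.1, 5.3, 4.9, 5.0, 5.1]

mean = np.mean(sample)        # arithmetic average, estimate of the population mean
median = np.median(sample)    # central value of the ordered sample
most_frequent = mode(sample)  # most frequently occurring value

print(f"mean = {mean:.3f}, median = {median}, mode = {most_frequent}")
```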
Measures of variability: the range, the mean deviation and the variance

As we can see, the mean or average, the median and the mode are measures of location. Having determined the location of our data, we might next ask how the data are spread out about the mean. The simplest measure of variability is the range or interval. The range is defined as the difference between the largest and smallest values in the sample.
This measure can be calculated easily, but it offers only an approximate measure of the variability of the data, as it is influenced only by the extreme values of the observed property, which can be quite different from the other values. For a more precise measure of variability we have to include all property-response values, i.e. all their deviations from the sample mean. As the mean of the deviations from the sample mean is equal to zero, we can take as a measure of variability the mean deviation. The mean deviation is defined as the mean of the absolute values of the deviations from the sample mean:

d = \frac{1}{n}\sum_{i=1}^{n}\left|X_i - \bar{X}\right|
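A minimal Python sketch of these measures of variability, again on hypothetical values; ddof=1 gives the sample variance with (n − 1) in the denominator that is used later in the text.

```python
import numpy as np

# Hypothetical sample of a measured response
x = np.array([4.8, 5.1, 5.1, 5.3, 4.9, 5.0, 5.1])

data_range = x.max() - x.min()             # simplest measure of variability
mean_dev = np.mean(np.abs(x - x.mean()))   # mean of absolute deviations from the mean
sample_var = np.var(x, ddof=1)             # sample variance, (n - 1) in the denominator
sample_std = np.sqrt(sample_var)

print(f"range = {data_range:.2f}")
print(f"mean deviation = {mean_dev:.3f}")
print(f"sample variance = {sample_var:.4f}, standard deviation = {sample_std:.4f}")
```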
A useful calculation formula for the sample variance is:

S_X^2 = \frac{\sum X_i^2 - \left(\sum X_i\right)^2 / n}{n-1}

Accordingly, the mean of the sample means will be nearly equal to the population mean μ. Strictly speaking, our ten groups will not give us exact values for σ²X and μ. To obtain these, we would have to take an infinite number of groups, and hence our sample would include the entire infinite population; in statistics this is known as Glivenko's theorem [3].
To illustrate the difference between the values of sample estimates and population parameters, consider the ten groups of five numbers each shown in the table. The sample means and sample standard deviations have been calculated from the appropriate formulas and tabulated. Usually we could say no more than that these values are estimates of the population parameters μ and σ²X, respectively. However, in this case the numbers in the table were selected from a table of random numbers ranging from 0 to 9 (Table A). In such a table of random numbers, even of infinite size, the proportion of each number is equal to 1/10. This equal proportion permits us to evaluate the population parameters exactly:

μ = Σ xi pi = (0 + 1 + ... + 9)/10 = 4.5 ;  σ²X = Σ (xi − μ)² pi = 8.25

What we have done in the table is to take ten random samples from the infinite population of numbers from 0 to 9. In this case, we know the population parameters, so that we can get an idea of the accuracy of our sample estimates.
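The sampling experiment behind the table is easy to imitate numerically. The following sketch, assuming nothing beyond numpy's random-number generator, draws ten samples of five digits from 0 to 9 and compares the resulting statistics with the exact parameters μ = 4.5 and σ² = 8.25.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population: the digits 0..9, each with probability 1/10 (mu = 4.5, sigma^2 = 8.25)
groups = rng.integers(0, 10, size=(10, 5))   # ten random samples of five digits each

sample_means = groups.mean(axis=1)
sample_vars = groups.var(axis=1, ddof=1)

print("sample means:", sample_means)
print("grand mean of the sample means:", sample_means.mean())   # close to 4.5
print("average sample variance:", sample_vars.mean())           # scattered around 8.25
```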
& Problem 1.1
From the table of random numbers take 20 different samples of 10 random numbers each. Determine the sample mean and sample variance for each sample. Calculate the averages of the obtained "statistics" and compare them to the population parameters.
1.1
The Simplest Discrete and Continuous Distributions
In analyzing an engineering problem, we frequently set up a mathematical model that we believe will describe the system accurately. Such a model may be based on past experience, on intuition, or on a theory of the physical behavior of the system. Once the mathematical model is established, data are taken to verify or reject it. For example, the perfect gas law (PV = nRT) is a mathematical model that has been found to describe the behavior of a few real gases at moderate conditions. It is a "law" that is frequently violated, because our intuitive picture of the system is too simple.
In many engineering problems, the physical mechanism of the system is too complex and not sufficiently understood to permit the formulation of even an approximately accurate model such as the perfect gas law. However, when such complex systems are in question, it is recommended to use statistical models, which describe the behavior of a system with a greater or lesser, but always well-known, accuracy.
In this chapter, we will consider probability theory, which provides the simple statistical models needed to describe the drawing of samples from a population; i.e. simple probability models are useful in describing the presumed population underlying a random sample of data. Among the most important concepts of probability theory is the notion of a random variable. As the realization of each random event can be numerically characterized, the various values that take those numbers with definite probabilities are called random variables. A random variable is often defined as a function that assigns a number to each elementary event. Thus, influenced by random circumstances, a random variable can take various numerical values. One cannot tell in advance which of those values the random variable will take, for its values differ with different experiments, but one can know in advance all the values it can take. To characterize a random variable completely one should know not only what values it can take but also how frequently, i.e. what the probability is of taking those values. The number of different values a random variable takes in a given experiment can be finite. If a random variable takes a finite number of values with corresponding probabilities, it is called a discrete random variable. The number of defective products produced during a working day, or the number of heads one gets when tossing two coins, are discrete random variables. A random variable is continuous if, with corresponding probability, it can take any numerical value in a definite range. Examples of continuous random variables are the waiting time for a bus and the time between emissions of particles in radioactive decay.
The simplest probability model
Probability theory was originally developed to predict the outcomes of games of chance. Hence we might start with the simplest game of chance: a single coin. We intuitively conclude that the coin is equally likely to come up heads or tails; that is, we assign a probability of 0.5 to either event. Generally, the probabilities of all possible events are chosen to total 1.0.
If we toss two coins, we note that the fall of each coin is independent of the other. The probability of either coin landing heads is thus still 0.5. The probability of both coins falling heads is the product of the probabilities of the single events, since the single events are independent:

P(two heads) = 0.5 · 0.5 = 0.25

The simplest distribution of this kind is the Bernoulli distribution. The Bernoulli distribution applies wherever there are just two possible outcomes for a single experiment: it applies when a manufactured product is acceptable or defective, when a heater is on or off, or when an inspection reveals a defect or does not. The Bernoulli distribution is often represented by 1 and 0 as the two possible
outcomes, where 1 might represent heads or product acceptance, and 0 would represent tails or product rejection.
Mean and variance
The tossing of a coin is an experiment whose outcome is a random variable. Intuitively we assume that all coin tosses occur from an underlying population where the probability of heads is exactly 0.5. However, if we toss a coin 100 times, we may get 54 heads and 46 tails. We can never verify our intuitive estimate exactly, although with a large sample we may come very close.
How are the experimental outcomes related to the population mean and variance? A useful concept is that of the "expected value". The expected value is the sum of all possible values of the outcome of an experiment, with each value weighted by the probability of obtaining that outcome; the expected value is thus a weighted average. The "mean" of the population underlying a random variable X is defined as the expected value of X:
A useful concept is that of the “expected value” The expected value is the sum of allpossible values of the outcome of an experiment, with each value weighted with aprobability of obtaining that outcome The expected value is a weighted average.The “mean” of the population underlying a random variable X is defined as theexpected value of X:
l¼ E Xð Þ ¼P
where:
lis the population mean;
E(X) is the expected value of X;
By appropriate manipulation, it is possible to determine the expected value of ious functions of X, which is the subject of probability theory For example, theexpected value of X is simply the sum of squares of the values, each weighted by theprobability of obtaining the value
var-The population variance of the random variable X is defined as the expected value
of the square of the difference between a value of X and the mean:
1.1.1
Discrete Distributions
A discrete distribution function assigns probabilities to several separate outcomes of an experiment. By this law, the total probability, equal to one, is distributed over the individual values of the random variable. A random variable is fully defined when its probability distribution is given. The probability distribution of a discrete random variable shows the probabilities of obtaining the discrete (interrupted) values of the random variable. It is a step function in which the probability changes only at discrete values of the random variable. The Bernoulli distribution assigns probability to two discrete outcomes (heads or tails; on or off; 1 or 0; etc.); hence it is a discrete distribution. Drawing a playing card at random from a deck is another example of an experiment with an underlying discrete distribution, with equal probability (1/52) assigned to each card. For a discrete distribution, the definition of the expected value is:

E(X) = \sum_i x_i p_i

where:
xi is the value of an outcome, and
pi is the probability that the outcome will occur.
The population mean and variance defined here may be related to the sample mean and variance. For the sample mean,

E(\bar{X}) = E\left(\frac{\sum X_i}{n}\right) = \frac{1}{n}\sum E(X_i) = \mu

and, by a similar argument for the sample variance,

E(S_X^2) = \sigma_X^2

The definition of the sample variance with (n − 1) in the denominator thus leads to an unbiased estimate of the population variance, as shown above. Sometimes the sample variance is defined as the biased variance:

\tilde{S}_X^2 = \frac{1}{n}\sum\left(X_i - \bar{X}\right)^2
Suppose that 10 % of the items produced by a process are defective, i.e. that p = 0.1 is the probability of a defective item occurring in any one (Bernoulli) trial. What is the probability that a random sample of ten items will contain exactly two defective ones? The binomial distribution applies:

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}

Here, n = 10; k = 2; p = 0.1, so that:

P(X = 2) = \binom{10}{2}(0.1)^2 (0.9)^8 = 0.1938

The chances are about 19 out of 100 that two out of the ten items in the sample are defective. On the other hand, the chances are only about one out of ten billion that all ten would be found defective. Values of P(X = k) for the other values of k may be calculated and plotted to give a graphic representation of the probability distribution, Fig. 1.1.
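These binomial probabilities can be checked numerically. The sketch below does the k = 2 case by hand and, assuming SciPy is available, reproduces the whole distribution plotted in Fig. 1.1.

```python
from math import comb
from scipy import stats

n, p = 10, 0.1

# Probability of exactly two defective items among ten, computed two ways
p2_manual = comb(10, 2) * 0.1**2 * 0.9**8
p2_scipy = stats.binom.pmf(2, n, p)
print(f"P(X = 2) = {p2_manual:.4f} (by hand) = {p2_scipy:.4f} (scipy)")   # about 0.1938

# Whole distribution, as plotted in Fig. 1.1
for k in range(n + 1):
    print(k, round(stats.binom.pmf(k, n, p), 4))
```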
Figure 1.1 Binomial distribution for p = 0.1 and n = 10.
Table 1.1 Discrete distributions.

Bernoulli: P(X = 1) = p, P(X = 0) = 1 − p; mean p; variance p(1 − p). Model: a single trial with two possible outcomes. Example: heads or tails with a coin.

Binomial: P(X = k) = C(n, k) p^k (1 − p)^(n−k); mean np; variance np(1 − p). Model: number of "successes" in n independent Bernoulli trials. Example: number of defective items in a sample drawn from a very large (effectively infinite) population.

Hypergeometric: P(X = k) = C(M, k) C(N − M, n − k) / C(N, n); mean nM/N; variance (nM/N)(1 − M/N)(N − n)/(N − 1). Model: M objects of one kind and N − M objects of another kind; k objects of kind M found in a drawing of n objects, the n objects being drawn from the population without replacement after each drawing. Example: number of defective items in a sample drawn without replacement from a finite population.

Geometric: P(X = k) = p(1 − p)^k; mean (1 − p)/p; variance (1 − p)/p². Model: number of trials before the first "success" in repeated Bernoulli trials. Example: number of demands on a piece of equipment before it breaks down.

Poisson: P(X = k) = λ^k e^(−λ)/k!, with λ a constant parameter; mean λ; variance λ. Model: number of events in a fixed interval when events occur independently at a constant average rate. Example: number of particles emitted in radioactive decay in a fixed time.
One defective item in a sample is seen to be the most probable outcome; yet this sample proportion occurs less than four times out of ten, even though it is the same as the population proportion. In the previous example, we would expect about one out of ten sampled items to be defective. We have intuitively taken the population proportion (p = 0.1) to be the expected value of the proportion in the random sample. This proves to be correct. It can be shown that for the binomial distribution:

\mu = np \,; \qquad \sigma^2 = np(1-p)

Thus for the previous example:

μ = 10 · 0.1 = 1 ;  σ² = 10 · 0.1 · 0.9 = 0.9
Example 1.4 [4]
The probability that a compression ring fitting will fail to seal properly is 0.1. What is the expected number of faulty rings, and what is their variance, in a sample of 200 rings?
Assuming that we have a binomial distribution, we have:

μ = np = 200 · 0.1 = 20 ;  σ² = np(1 − p) = 200 · 0.1 · 0.9 = 18
A number of other discrete distributions are listed in Table 1.1, along with the model on which each is based. Apart from the discrete distributions mentioned, the hypergeometric distribution is also used. The hypergeometric distribution is equivalent to the binomial distribution when sampling from infinite populations. For finite populations, the binomial distribution presumes replacement of an item before the next is drawn, whereas the hypergeometric distribution presumes no replacement.
1.1.2
Continuous Distribution
A continuous distribution function assigns probability to a continuous range of values of a random variable; any single value has zero probability assigned to it. The continuous distribution may be contrasted with the discrete distribution, where probability was assigned to single values of the random variable. Consequently, a continuous random variable cannot be characterized by the values it takes with corresponding probabilities. Instead, for a continuous random variable we observe the probability P(x ≤ X ≤ x + Δx) that it takes values in the range (x, x + Δx), where Δx can be an arbitrarily small number. The deficiency of this probability is that it depends on Δx and tends to zero when Δx → 0. In order to overcome this deficiency, let us observe the function:

f(x) = \lim_{\Delta x \to 0} \frac{P(x \le X \le x + \Delta x)}{\Delta x}
which does not depend on Δx and is called the probability density function of the continuous random variable X. The probability that the random variable lies between any two specific values a and b in a continuous distribution is:

P(a \le X \le b) = \int_a^b f(x)\,dx

and the mean or expected value of a continuous random variable is:

E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx \qquad (1.28)
The simplest continuous distribution is the uniform distribution, which assigns a constant density function over a region of values from a to b and assigns zero probability to all other values of the random variable (Figure 1.2).
The constant value of the probability density function for the uniform distribution is obtained from the requirement that integrating f(x) over all values of x, with f(x) constant between a and b and zero outside that region, must give a total probability of one:

\int_a^b f(x)\,dx = 1

After integrating this relation, we get:

f(x) = \frac{1}{b-a}, \quad a \le x \le b; \qquad f(x) = 0 \ \text{otherwise}

Figure 1.2 Uniform distribution

Next follow the mean and variance of the uniform distribution:

E(X) = \frac{a+b}{2} \,; \qquad \sigma^2 = \frac{(b-a)^2}{12}
As an example, suppose that buses arrive at a stop every 15 minutes and that we arrive at the stop at some random moment. The random variable in this example is the time T until the next bus. Assuming no knowledge of the bus schedule, T is uniformly distributed from 0 to 15 min; here we are saying that all waiting times until the next bus are equally probable. Then:

f(t) = \frac{1}{15-0} = \frac{1}{15}

The average wait is:

E(T) = \frac{0+15}{2} = 7.5 \ \text{min}

Waiting at least 10 min implies that T lies between 10 and 15, so that:

P(10 \le T \le 15) = \int_{10}^{15} \frac{1}{15}\,dt = \frac{5}{15} = \frac{1}{3}

In only one case out of three will we need to wait 10 min or more. The probability that we will have to wait exactly 10 min is zero, since no probability is assigned to specific values in continuous distributions. Characteristics of several continuous distributions are given in Table 1.2.
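Before turning to Table 1.2, the waiting-time example can be verified in a few lines, assuming SciPy is available.

```python
from scipy import stats

# Waiting time T uniformly distributed between 0 and 15 minutes
T = stats.uniform(loc=0, scale=15)

print("density f(t) =", T.pdf(5))     # 1/15 everywhere on (0, 15)
print("average wait =", T.mean())     # (0 + 15)/2 = 7.5 min
print("P(T >= 10) =", T.sf(10))       # 5/15 = 1/3
# P(T = exactly 10) is zero for a continuous distribution
```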
Table 1.2 Continuous distributions

Exponential: f(x) = k e^(−kx), x > 0; mean 1/k; variance 1/k². Model: describes the distribution of the time between successive events in a Poisson process. Example: time between emissions of particles in radioactive decay.

Normal: f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)); mean μ; variance σ². Model: sum of a large number of small, independent disturbances. Example: many experimental situations.

Standard normal: f(y) = (1/√(2π)) exp(−y²/2); mean 0; variance 1. Model: a special case of the normal distribution with μ = 0 and σ = 1. Example: many experimental situations.

Chi-square (χ²): mean k; variance 2k, where k is referred to as the "degrees of freedom". Example: statistical tests on an assumed normal distribution.
1.1.3
Normal Distributions
The normal distribution was proposed by the German mathematician Gauss. This distribution is applied when analyzing experimental data and when estimating random errors, and it is also known as the Gaussian distribution. The normal distribution is the most widely used of all continuous distributions, for the following reasons:
. many random variables that appear during an experiment have normal distributions;
. large numbers of random variables have approximately normal distributions;
. if a random variable does not have a normal distribution, not even approximately, it can then be transformed into a normal random variable by relatively simple mathematical transformations;
. certain complex distributions can be approximated by the normal distribution (e.g. the binomial distribution);
. certain random variables that serve for the verification of statistical tests have normal distributions.
Gauss assumed that any errors in experimental observations were due to a large number of independent causes, each of which produced a small disturbance. Under this assumption the well-known bell-shaped curve is obtained. Although it adequately describes many real situations involving experimental observations, there is no reason to assume that actual experimental observations necessarily conform to the Gaussian model. For example, Maxwell used a related model in deriving a distribution function for molecular velocities in a gas, but the result is only a very rough approximation to the behavior of real gases. Error in experimental measurement due to the combined effect of a large number of small, independent disturbances is the primary assumption involved in the model on which the normal distribution is based. This assumption leads to the exponential form of the normal distribution.
The assumption of a normal distribution is frequently, and often indiscriminately, made in experimental work, because it is a convenient distribution on which many statistical procedures are based. Many experimental situations, subject to random error, yield data that can be adequately described by the normal distribution, but this is not always the case.
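Gauss' assumption can be illustrated by a small simulation: summing many small, independent disturbances yields an approximately bell-shaped result even when each individual disturbance is far from normal. The numbers of disturbances and simulated measurements below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each simulated "measurement error" is the sum of 100 small, independent
# disturbances, each uniformly distributed on (-0.01, 0.01)
errors = rng.uniform(-0.01, 0.01, size=(100_000, 100)).sum(axis=1)

print("mean of simulated errors:", errors.mean())   # close to 0
print("standard deviation:", errors.std())          # close to sqrt(100 * 0.02**2 / 12)
# A histogram of `errors` is bell-shaped even though each disturbance is uniform.
```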
The terms μ and σ² are initially defined simply as parameters in the normal distribution function. The term μ determines the value on which the bell-shaped curve is centered and σ² determines the "spread" in the curve (Fig. 1.3). A large variance gives a broad, flat curve, while a small variance yields a tall, narrow curve with most of the probability concentrated on values near μ.

Figure 1.3 How the variance affects the normal distribution curve

The mean or expected value of the normal distribution is obtained by applying Eq. (1.28):
E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx = \mu

that is, the mean of the normal distribution is equal to the parameter μ. Any normal random variable X can be reduced to a standard normal variable by the substitution

z = \frac{X-\mu}{\sigma} \qquad (1.41)

which has mean 0 and variance 1; probabilities for z are evaluated from the standard normal table, Table 1.3. For example, the probability that a value lies within one standard deviation of the mean is:

P(-1 \le z \le 1) = P(z \le 1) - P(z \le -1) = 0.8413 - 0.1587 = 0.6826
P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{y^2}{2}\right) dy
Table 1.3 Abbreviated table of the standard normal distribution, P(Z ≤ z)

  z      P(Z ≤ −z)   P(Z ≤ +z)
 0.2     0.4207      0.5793
 0.3     0.3821      0.6179
 0.4     0.3446      0.6554
 0.5     0.3085      0.6915
 0.9     0.1841      0.8159
 1.2     0.1151      0.8849
 1.3     0.0968      0.9032
 1.4     0.0808      0.9192
 1.5     0.0668      0.9332
 1.9     0.0287      0.9713
 2.2     0.0139      0.9861
 2.3     0.0107      0.9893
 2.4     0.0082      0.9918
 2.5     0.0062      0.9938
 2.9     0.0019      0.9981
Example 1.6
For example, suppose we have a normal population with μ = 6 and σ² = 4, and we want to know what percentage of the population has values greater than 9.
By Eq. (1.41), z = (9 − 6)/2 = 1.5. From Table 1.3, P(z < 1.5) = 0.9332, and P(z > 1.5) = 1 − P(z < 1.5) = 1 − 0.9332 = 0.0668. Hence about 6.68 % of the population has values greater than 9.
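The same calculation can be done directly with the standard normal distribution functions. The sketch below, assuming SciPy, reproduces both the z-transformation of Eq. (1.41) and the Table 1.3 entries used above.

```python
from scipy import stats

mu, sigma = 6, 2          # population N(6, 4), so sigma = 2

z = (9 - mu) / sigma                 # standardized value, Eq. (1.41)
p_greater = stats.norm.sf(z)         # 1 - P(z < 1.5)
print(f"z = {z}, P(X > 9) = {p_greater:.4f}")   # about 0.0668

# The table values can be reproduced directly:
print(stats.norm.cdf(1.5))           # 0.9332
print(stats.norm.cdf(-1.5))          # 0.0668
```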
& Problem 1.2 [4]
Using Table 1.3, i.e. Table B for the standard normal distribution, determine the probabilities that correspond to the following Z intervals:
a) 0 ≤ Z ≤ 1.4; b) −0.78 ≤ Z ≤ 0; c) −0.24 ≤ Z ≤ 1.9;
d) 0.75 ≤ Z ≤ 1.96; e) −∞ < Z ≤ 0.44; f) −∞ < Z ≤ 1.2;
g) −1 ≤ Z ≤ 1; h) −1.96 ≤ Z ≤ 1.96; i) −2.58 ≤ Z ≤ 2.58
Approximations to discrete distributions
It has already been mentioned that certain distributions can be approximated by the normal distribution. As the size of the sample increases, the binomial distribution asymptotically approaches a normal distribution; this is a useful approximation for large samples.
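A quick numerical check of this approximation, assuming SciPy; the sample size and proportion are chosen to match the tire-inspection numbers used later in Example 1.9.

```python
import numpy as np
from scipy import stats

n, p = 1000, 0.048
mu = n * p
sigma = np.sqrt(n * p * (1 - p))

# Exact binomial tail versus the normal approximation (with continuity correction)
exact = stats.binom.sf(59, n, p)                   # P(X >= 60), exact
approx = stats.norm.sf((60 - 0.5 - mu) / sigma)    # normal approximation
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```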
A sample of 36 observations was drawn from a normally distributed population having a mean of 20 and a variance of 9. What portion of the population can be expected to have values greater than 26?
& Problem 1.5 [4]
The average particulate concentration in micrograms per cubic meter for a station in a petrochemical complex was measured every 6 hours for 30 days. The resulting data are given in the table.
A new air pollution regulation requires that the total particulate concentration be kept below 70 ± 5 µg/m³.
a) What is the probability that the particulate concentration on any day will fall within this allowed range?
b) What is the probability of exceeding the upper limit?
c) What is the probability of operating in the absolutely safe region below 65 µg/m³?
Let us suppose that the body weights of 800 students have a normal distribution with mean μ = 66 kg and standard deviation σ = 5 kg. Find the number of students whose weight is:
a) between 65 and 75 kg;
b) over 72 kg.
& Problem 1.8
A machine is set up for the production of metal rods 24 cm long, with a tolerance of e = 0.05 cm. Based on long-term observation it has been established that σ = 0.03. On the assumption that the lengths X of the metal rods have a normal distribution, calculate the percentage of metal rods that will fall within the tolerance range. How large should the tolerance be so that 95 per cent of the produced metal rods fall within the tolerance limits?
& Problem 1.9
In a laboratory mixer 20 batches of a composite rocket propellant were mixed under identical conditions, all of them having the same composition. Test strands were taken from the obtained propellant and their burning rates were measured at a pressure of 70 bar. The average burning rate is X̄ = 8.5 mm/s, and the calculated variance is σ² = 0.30. What number of strands has a burning rate:
1.2
Statistical Inference

On the basis of sample data, for example measured yields from several runs of a chemical process, we may wish to draw conclusions such as the following:
. decide whether the average yield from several runs at constant operating conditions equals that required by economic factors;
. determine whether one set of operating conditions gives a significantly higher yield than another;
. estimate the average yield to be expected in further runs at specified operating conditions;
. find a quantitative equation that can be used to predict the yield at various operating conditions.
We can use various methods of statistical inference to arrive at these conclusions. Because the original data are subject to experimental error and may not exactly fit our presumed model, we can draw a conclusion only within specified limits of certainty. We can never make completely unequivocal inferences about a population by using statistical procedures.
Statistical inference may be divided into two broad categories:
. hypothesis testing;
. statistical estimation.
In the first case, we set up a hypothesis about the population and then either accept or reject it by a test using sample data. The first two examples in the list above involve hypothesis testing; in the first of them, the hypothesis would be that the average yield equals the required yield.
Estimation involves the calculation of numerical values for the various population parameters (mean, variance, and so on). These numerical values are only estimates of the actual parameters, but statistical procedures permit us to establish the accuracy of the estimate.

1.2.1
Statistical Hypotheses
A statistical hypothesis is simply a statement concerning the probability distribution of a random variable. Once the hypothesis is stated, statistical procedures are used to test it, so that it may be accepted or rejected. Before the hypothesis is formulated, it is almost always necessary to choose a model that we assume adequately describes the underlying population. The choice of a model requires the specification of the probability distribution of the population parameters of interest to us. When a statistical hypothesis is set up, the corresponding statistical procedure is used to establish whether the proposed hypothesis should be accepted or rejected. Generally speaking, we are not able to answer the question of whether a statistical hypothesis is right or wrong: if the information from the sample supports the hypothesis, we do not reject it; however, if the data do not support the statistical hypothesis that was set up, we reject it.
In principle, two hypotheses are set up:
. the primary or null hypothesis H0;
. the alternative hypothesis H1.
Types of errors
When testing statistical hypotheses, two types of error may be defined, together with their probabilities of occurrence.
. Type I error: rejecting H0 when it is true. Let α equal the probability of rejecting H0 when it is true. This probability is also referred to as the "level of significance" of the test.
. Type II error: accepting H0 when it is false (that is, when H1 is true). Let β equal the probability of accepting H0 when it is false.
One can generally say that α and β are the risks of accepting false hypotheses. Ideally we would prefer a test that minimized both types of error. Unfortunately, as α decreases, β tends to increase, and vice versa. Apart from the terms mentioned, we should introduce a new term, the power of a test. The power of a test is defined as the probability of rejecting H0 when it is false; symbolically, power of a test = 1 − β, the probability of making a correct decision.
The test statistic
To make a test of the hypothesis, sample data are used to calculate a test statistic. Depending upon the value of the test statistic, the primary hypothesis H0 is accepted or rejected. The critical region is defined as the range of values of the test statistic that requires rejection of H0. The test statistic is determined by the specific probability distribution and by the parameter selected for testing.
Procedure for testing a hypothesis
The general procedure for testing a statistical hypothesis is:
1. Choose a probability model and a random variable associated with it. This choice may be based on previous experience or intuition.
2. Formulate H0 and H1. These must be carefully formulated to permit a meaningful conclusion.
3. Specify the test statistic.
4. Choose a level of significance α for the test.
5. Determine the distribution of the test statistic and the critical region for the test statistic.
6. Calculate the value of the test statistic from a random sample of data.
7. Accept or reject H0 by comparing the calculated value of the test statistic with the critical region.
The following examples illustrate the procedure for a statistical test [7]. In the first, we consider a very simple test on a single observation. The second applies the seven-step procedure to a test on the mean of a binomial population using a normal approximation. Here, and in the third example, we introduce the idea of one-sided and two-sided tests, while in the fourth example we illustrate the calculation of the Type II error and the power function of a test.
Example 1.8
A single observation is taken on a population that is believed to be normally distributed with a mean of 10 and a variance of 9. The observation is X = 16. Can we conclude that the observation is from the presumed population?
To answer this question, we follow the seven-step procedure:
1. The probability model is a normal distribution: μ = 10; σ² = 9. The random variable is the value of X.
2. The primary hypothesis H0: X = 16 is from a population that is normally distributed as N(10; 9). The alternative hypothesis H1: X = 16 is not from the presumed population.
3. Since we have only one observation of X, the test statistic is simply its standardized value, Z = (X − μ)/σX. The standard normal tables give the distribution of this statistic.
4. The choice of the level of significance α is arbitrary. We will use both α = 0.01 and α = 0.05, because one or the other of these values is commonly used.
5. The test statistic is distributed normally with μ = 10 and σ²X = 9 if H0 is true. A value of X that is too far above or below the mean should be rejected, so we select a critical region at each end of the normal distribution. As illustrated in Fig. 1.5, a fraction 0.025 of the total area under the curve is cut off at each end for α = 0.05. From the tables (Table B), we determine that the limits corresponding to these areas are Z = −1.96 and Z = 1.96, so that if our single observation falls between these values, we accept H0. The corresponding values for α = 0.01 are Z = ±2.58.
6. Find the value of the test statistic:

Z = (16 − 10)/3 = 2.0

7. Since 2.0 lies outside ±1.96 but inside ±2.58, H0 is rejected at α = 0.05 and accepted at α = 0.01.

Figure 1.5 Critical region for two-sided test

If we are willing to risk rejecting H0 when it is true five times out of 100, we can reject it here; but if we wish to reduce the risk of rejecting a true hypothesis to one chance in 100, then we must accept H0 for this example. Normally, we do not carry along two values of α: a single desired level of significance is chosen initially, and the decision is based on it. This almost trivial example illustrates the test procedure and demonstrates the meaning of the level of significance of a test.
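The seven-step test of Example 1.8 reduces to a few lines of code. The sketch below, assuming SciPy for the critical values, reproduces the decision at both levels of significance.

```python
from scipy import stats

mu0, var0 = 10, 9
x = 16

z = (x - mu0) / var0 ** 0.5
print("test statistic Z =", z)                      # 2.0

for alpha in (0.05, 0.01):
    z_crit = stats.norm.ppf(1 - alpha / 2)          # two-sided critical value
    decision = "reject H0" if abs(z) > z_crit else "accept H0"
    print(f"alpha = {alpha}: critical value = {z_crit:.2f} -> {decision}")
```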
Example 1.9
Experience in tire manufacture at a given plant shows that an average of 4.8 % of the tires are rejected as imperfect. In a recent day of operation, 60 out of 1000 tires were rejected. Is there any reason to believe that the manufacturing process is functioning improperly?
This problem can be reduced to a test on the sample value. We may use the data on long-range operation to estimate the population parameter (p = 0.048), since we have no other way of knowing it. We again follow the seven-step method:
1. This is a binomial distribution with two outcomes: acceptance or rejection. However, since the sample size is large, we may use the normal approximation. Since p = 0.048, the population mean is μ = np = 1000 · 0.048 = 48 and the population variance is σ² = np(1 − p) = 1000 · 0.048 · (1 − 0.048) = 45.7.
2. As the problem is stated, the primary hypothesis is not clearly shown. We choose to compare the mean of the population presumed to underlie the day's production with the long-range population mean:

H0: μ ≤ 48; H1: μ > 48

The question then is whether the day's rejection count of 60 is sufficiently high to reject the hypothesis that the rejection rate is 48 or less.
3. Since we are using a normal approximation, the test statistic is simply the standard normal variable:

Z = \frac{X - 0.5 - \mu_0}{\sigma}

The subtraction of 0.5 from the sample value improves the normal approximation; it is called the "continuity correction". The numerator of the equation is the deviation of the sample value from the population mean, and the denominator is simply the standard deviation of the presumed population. Thus, Z is the number of standard deviations away from the mean at which we find the sample value X. Here we have used the sample value X as an estimate of the mean of the population presumed to underlie the day's production.
4. Let α = 0.05. That is, we will accept one chance in 20 of rejecting the hypothesis μ = 48 when it is true.
5. To determine the critical region, we must know the distribution of the test statistic. In this case, Z is distributed as the standard normal distribution. With H0: μ ≤ 48 and α = 0.05, the critical region includes 5 % of the area at the high end of the standard normal curve (Fig. 1.6). The Z-value that cuts off 5 % of the curve is found to be 1.645, from a table of the standard normal distribution.
6. The value of the test statistic is:

Z = (60 − 0.5 − 48)/√45.7 = 1.70

7. Since 1.70 > 1.645, we reject H0 and conclude that a rejection count of 60 out of 1000 tires is significantly higher than the long-range 4.8 % rejection rate. We might now proceed to seek the cause of this change by checking the manufacturing process.
Figure 1.6 Critical region for one-sided test
Suppose that instead of finding 60 imperfect tires in 1000, we had found 6 in a sample of 100. Then our calculation gives, with μ = 4.8 and σ² = 4.57:

Z = (6 − 0.5 − 4.8)/√4.57 = 0.327

Therefore, with α = 0.05 as before, we would accept H0. Rejection of 6 tires out of 100 is not significantly different from the population proportion of 0.048, but 60 out of 1000 is significantly different. This illustrates the effect of sample size on statistical tests: a smaller sample is more influenced by random fluctuations than a larger one, so that the same proportionate difference is statistically more significant in a larger sample than in a smaller one.
This example illustrates a one-sided test: the critical region is on one side of the probability distribution because of the way the hypotheses are stated.
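The one-sided test of Example 1.9, including the continuity correction, can be wrapped in a small helper function; the sketch below assumes SciPy, and the function name is an arbitrary choice.

```python
from scipy import stats

def one_sided_test(rejected, n, p0=0.048, alpha=0.05):
    """One-sided test of H0: mu <= n*p0 using the normal approximation
    with a continuity correction, as in Example 1.9."""
    mu = n * p0
    sigma = (n * p0 * (1 - p0)) ** 0.5
    z = (rejected - 0.5 - mu) / sigma
    z_crit = stats.norm.ppf(1 - alpha)
    return z, z_crit, z > z_crit          # True means "reject H0"

print(one_sided_test(60, 1000))   # z = 1.70 > 1.645 -> reject H0
print(one_sided_test(6, 100))     # z = 0.33 < 1.645 -> accept H0
```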
Example 1.10
Suppose we had chosen to test whether the daily rejection count of 60 out of 1000 was significantly different from the population proportion of 0.048, rather than significantly higher than it, as in the previous example. In this case, the hypotheses would be:

H0: μ = 48; H1: μ < 48 or μ > 48

This is a two-sided test. The critical region is split into two parts, one rejecting values that are too low and the other rejecting values that are too high. With α = 0.05, each critical region has an area of α/2 = 0.025. As shown in Example 1.8 (Fig. 1.5), the corresponding Z-values are Z = ±1.96; we will reject H0 if the calculated Z is less than −1.96 or greater than 1.96. With a two-sided test, H0 is accepted both for 60/1000 and for 6/100. For this problem, we would probably be more concerned with a rejection rate that was too high, so the one-sided test is more appropriate than the two-sided one.
If the hypothesis H1: μ = 50 were true, what would be the probability of nonetheless accepting H0: μ = 48? Using the two-sided acceptance region of Example 1.10, with σ = √45.7 = 6.76, the limit 48 + 1.96 · 6.76 = 61.3 is the lower limit of the upper critical region and 48 − 1.96 · 6.76 = 34.7 is the upper limit of the lower critical region. Therefore, if X lies between 34.7 and 61.3, H0 is accepted regardless of whether it is true or not.
Here we have omitted the continuity correction; this is permissible for large sample sizes. In addition, the method used here is only approximate, because it assumes that the variance is constant regardless of H1. If H1: μ = 50 were true and H0 thereby false, the curve for H1 in Fig. 1.7 would be the correct one, but the limits 34.7 and 61.3 would still define the region of acceptance of H0. Because H0 would be accepted, when H1 is true, whenever X falls between 34.7 and 61.3, the area under the curve of H1 between these limits is β, the probability of a Type II error. The area is labeled in Fig. 1.7. To determine this area, we use the original limits with the curve for H1 to determine the standardized limits of the β region.
Figure 1.7 Evaluation of type II error
From the normal tables (Table B) we have, with σ = 6.76:

z = (34.7 − 50)/6.76 = −2.26 and z = (61.3 − 50)/6.76 = 1.67

so that β = P(−2.26 ≤ z ≤ 1.67) = 0.9525 − 0.0119 ≈ 0.94, and the power of the test against H1: μ = 50 is 1 − β ≈ 0.06.
Suppose now that we repeat the calculation of the power for other specific values of H1 and plot them as shown in Fig. 1.8. Inspection of Fig. 1.8 shows that the power varies from a value of α where H1: μ = 48 to a value of 1.0 where H1: μ → −∞. Such calculations yield the power function curve shown in Fig. 1.8. As expected, the further μ1 is removed from μ0 = 48, the higher is the probability of rejecting the false hypothesis H0. Inspection of Fig. 1.7 also shows that β decreases as α increases, so that we could obtain a higher power at the sacrifice of the level of significance. A higher power at the same α is possible if a larger sample size is used.
In this part, we have considered the fundamentals of statistical tests and have seen that no test is free from possible error. We can reduce the probability of rejecting a true hypothesis only by running a greater risk of accepting a false hypothesis. We also note that a larger sample size reduces the probability of error.
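The Type II error and the power function sketched in Figs. 1.7 and 1.8 can be computed directly. The following sketch, assuming SciPy and the constant-variance approximation used above, reproduces β ≈ 0.94 at μ1 = 50 and shows how the power rises as μ1 moves away from 48.

```python
from scipy import stats

mu0, sigma = 48, 45.7 ** 0.5      # null mean and (approximate) standard deviation
alpha = 0.05

# Acceptance region of the two-sided test of Example 1.10
z_crit = stats.norm.ppf(1 - alpha / 2)
low = mu0 - z_crit * sigma        # about 34.7
high = mu0 + z_crit * sigma       # about 61.3

def power(mu1):
    """Probability of rejecting H0 when the true mean is mu1."""
    beta = stats.norm.cdf(high, loc=mu1, scale=sigma) - stats.norm.cdf(low, loc=mu1, scale=sigma)
    return 1 - beta

print("beta at mu1 = 50:", 1 - power(50))     # about 0.94
for mu1 in (48, 50, 55, 60, 70, 80):
    print(mu1, round(power(mu1), 3))           # power rises as mu1 moves away from 48
```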
A demand for an especially high level of significance, of at least 0.9999, is present in rocket technology and the spacecraft industry. In order to reach this level of significance, β has to be vanishingly small, so that there is practically no chance of mounting a defective part in such craft.
Figure 1.8 Power function for Example 1.10
1.3
Statistical Estimation
Engineers are often faced with the problem of using a set of data to calculate quantities that they hope will describe the behavior of the process from which the data were taken. Because the measured process variable may be subject to random fluctuations as well as to random errors of measurement, the engineer's calculated estimate is subject to error. But how much? Here is where the method of statistical estimation can help.
Statistical estimation uses sample data to obtain the best possible estimate of population parameters. The p value of the binomial distribution, the λ value in Poisson's distribution, or the μ and σ values in the normal distribution are called parameters. Accordingly, to stress it once again, the part of mathematical statistics dealing with the estimation of the distribution parameters of a population, based on sample statistics, is called estimation theory. In addition, estimation furnishes a quantitative measure of the probable error involved in the estimate. As a result, the engineer not only has made the best use of the data, but he also has a numerical estimate of the accuracy of the results.
pop-Estimates of two kinds can be made, point estimate and interval estimate
Point estimate uses the sample data to calculate a single best value, which mates a population parameter The point estimate is one number, a point on anumeric axis, calculated from the sample and serving as approximation of theunknown population distribution parameter value from which the sample was tak-
esti-en Such a point estimate alone gives no idea of the error involved in the estimation
If parameter estimates are expressed in ranges then they are called interval estimates
30
Trang 401.3 Statistical Estimation
J Neuman calls these intervals confidence intervals, for as parameter-interval mates, ranges with known confidence level are chosen An interval estimate gives arange of values that can be expected to include the correct value with a certain speci-fied percentage of the time This provides a measure of the error involved in the esti-mate The wider the range of the interval estimate, the poorer the point estimate
1.3.1
Point Estimates
The best point estimate depends upon the criteria by which we judge the estimate. Statistics provides many possible ways to estimate a given population parameter, and several properties of estimates have been defined to help us choose which is best for our purposes.
We might use the median (60) as the best estimate of future yields, since the median does not weight the lowest value unduly. On the other hand, we were able to obtain a 63 % yield 4 times out of 11; perhaps this value (the mode) is the best estimate of future operation at carefully controlled conditions. Obviously, statistics cannot make the judgments required in this example. We can say that the sample mean (58.4) is the best estimate of the population mean under certain conditions, provided the data come from a random sample of the population. If the 32 % yield is as likely to occur as any other value, then the mean is the best estimate.
Several properties of estimates help us to determine which estimate is best for our purposes. We will consider three here: