or other nonideal sources with much less desirable statistical properties. But even poorly designed (or nondesigned) experiments usually contain recoverable information. On rarer occasions, we may not be able to draw firm conclusions, but even this is preferable to concluding falsehoods unawares.
We begin our analysis with plant data. With the advent of the distributed control systems (DCSs), plant data are ubiquitous. However, they almost certainly suffer from maladies that lead to correlated rather than independent errors. Also, bias due to an improper experimental design or model can lead to nonrandom errors. In such cases, a mechanical application of ANOVA and statistical tests will mislead; F ratios will be incorrect; coefficients will be biased. Since furnaces behave as integrators, we look briefly at some features of moving average processes and lag plots for serial correlation, as well as other residuals plots. The chapter shows how to orthogonalize certain kinds of data sets using source and target matrices and, more importantly, eigenvalues and eigenvectors. Additionally, we discuss canonical forms for interpreting multidimensional data and overview a variety of helpful statistics to flag troubles. Such statistics include the coefficient of determination (r2), the adjusted coefficient of determination (rA2), the prediction sum of squares (PRESS) statistic and a derivative, rP2, and variance inflation factors (VIFs) for multicollinear data. We also introduce the hat matrix for detecting hidden extrapolation.

In other cases, the phenomena are so complex or theory so lacking that we simply cannot formulate a credible theoretical or even semiempirical model. In such a case, it is preferable to produce some kind of model. For this purpose, we shall use purely empirical models, and we show how to derive them beginning with a Taylor series approximation to the true but unknown function.
This chapter also examines categorical factors and shows how to analyze designs with restricted randomization such as nested and split-plot designs. This requires rules for deriving expected mean squares, and we provide them. On occasion, the reader may need to fit parameters for categorical responses, and we touch on this subject as well.
The last part of the chapter concerns mixture designs for fuel blends and how to simulate complex fuels with many fewer components. This requires a brief overview of fuel chemistry, which we present. We conclude by showing how to combine mixture and factorial designs and fractionate them.
Plant data typically exhibit serial correlation, often strongly. Serial correlation indicates errors that correlate with run order rather than the random errors we subsume in our statistical tests. Consider a NOx analyzer attached to a municipal solid waste (MSW) boiler, for example. Suppose it takes 45 minutes for the MSW to go from trash to ash, after which the ash leaves the boiler (Figure 4.1). Then the natural burning cycle of the unit is roughly 45 minutes or so. If we pull an independent NOx sample every 4 hours, it is unlikely that there will be any correlation among the data. Except in the case of an obvious malfunction, the history of the boiler 4 hours earlier will have no measurable effect on the latest sample. However, let us investigate what will happen by merely increasing the sampling frequency.
4.1.1 Problem 1: Events Too Close in Time
DCS units provide a steady stream of continual (and correlated) information. Suppose we analyze NOx with a snapshot every hour. Will one reading be correlated with the next? How about every minute? What about every second? Surely, if the previous second's analysis shows high NOx, we would expect the subsequent second to be high as well. In other words, data that are very close in time exhibit positive serial correlation. Negative serial correlation is possible, but rarer in plant environments. However, it can occur in the plant when one effect inhibits another. Nor is this the only cause of serial correlation.
4.1.2 Problem 2: Lurking Factors
Lurking factors are an important cause of serial correlation. For example, O2 concentration affects both NOx and CO emissions. If we were so naïve as to neglect to measure the O2 level, we could easily induce a serial correlation. For example, air temperature correlates inversely to airflow, and the former relates to a diurnal cycle. Therefore, we can also expect airflow with fixed damper positions, e.g., most refinery burners, to also show a diurnal cycle. Every effect must have a cause. If we account for all the major sources of fixed variation, then the multiple minor and unknown sources should distribute normally according to the central limit theorem and collect in our error term. Therefore, it behooves us to find every major cause for our response because major fixed effects in the errors can result in correlated rather than normally distributed errors.
4.1.3 Problem 3: Moving Average Processes
FIGURE 4.1
A municipal solid waste boiler. It takes roughly 45 minutes for the trash-to-ash cycle. This particular unit is equipped with ammonia injection to reduce NOx. (From Baukal, C.E., Jr., Ed., The John Zink Combustion Handbook, CRC Press, Boca Raton, FL, 2001.)

If we consider the boiler furnace as an integrator, then flue gas emissions and components comprise a moving average process — and moving averages are highly and positively correlated. To see this, consider a random distribution — a plot of xk against the next data point in time, xk+1 (Figure 4.2a).
The first plot shows 100 nearest neighbors from a uniform random distribution plotted one against the other. The data were generated with the Excel™ function RAND()-0.5, representing a uniform distribution with zero mean between –0.5 and 0.5. The nearest-neighbor plot shows no correlation to speak of (r2 = 0.009), the mean is essentially zero (x̄ = 0.04), and the standard deviation is s = 0.28. These are very close to the expected values for these statistics, and it is not so surprising that random data show no trend when plotted against nearest neighbors.

But Figure 4.2b tells a different story. To create the second plot, we formed a moving average using the 10 nearest neighbors:

ξk = (1/10) Σj=k..k+9 xj

where k indexes each point sequentially. Note that the correlation of ξk with ξk+1 in Figure 4.2b has an r2 of 84.0% despite being drawn from an originally uniform random population with zero mean. Also note that the standard deviation of the process has fallen by a factor of three (from 0.28 to 0.094). The deflation of the standard deviation by a factor of three is not a coincidence, for the denominator in the calculation of standard deviation is √(n – 1), or √(10 – 1) = 3. However, the mean values for both data sets are virtually identical at ~0.0.

FIGURE 4.2
A moving average with random data. Figure 4.2a shows data from 100 points generated by a uniform random number generator, –0.5 < x < 0.5. The graph plots each data point against its nearest neighbor (xk+1 vs. xk). The correlation is, as expected, nearly zero (r2 = 0.009). Figure 4.2b shows the same data as 10-point moving averages. Plotting the moving average data in the same fashion gives noticeably less dispersion (s = 0.09 vs. 0.28) and high correlation, despite the fact that the moving averages comprise uniform random data. In the same way, integrating processes such as combustion furnaces can have emissions with serially correlated errors. (a) Nearest Neighbor Plot, Uniform Random Distribution. (b) Nearest Neighbor Plot, Moving Average.

Since the mean values are unaffected, we may perform regressions and generate accurate values for the coefficients. However, as the moving average
process deflates s, our F test will errantly lead us to score insignificant effects as significant ones. That is, failure to account for serial correlation in the data set before analysis will result in inflated F tests. An analysis showing many factors to be statistically significant is a red flag for the deflation of variance from whatever cause.
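The effect is easy to reproduce. The following sketch is my own illustration, not from the text; it assumes NumPy and only loosely mirrors the Figure 4.2 experiment. It draws 100 uniform random values, forms a 10-point moving average, and compares the nearest-neighbor correlation and standard deviation of the raw and averaged series.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed for repeatability
x = rng.uniform(-0.5, 0.5, size=100)     # analog of Excel's RAND() - 0.5

# 10-point moving average: xi_k = mean(x_k ... x_k+9)
xi = np.convolve(x, np.ones(10) / 10, mode="valid")

def lag1_r2(series):
    """r2 between each point and its nearest neighbor in time."""
    r = np.corrcoef(series[:-1], series[1:])[0, 1]
    return r ** 2

print("raw data:       s = %.3f, lag-1 r2 = %.3f" % (x.std(ddof=1), lag1_r2(x)))
print("moving average: s = %.3f, lag-1 r2 = %.3f" % (xi.std(ddof=1), lag1_r2(xi)))
# Typical output: the moving average shows s smaller by roughly 3x and a
# lag-1 r2 near 0.8-0.9, even though the underlying data are purely random.
```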
4.1.4 Some Diagnostics and Remedies
Here are a few things we can do to warn of serial correlation and remedy it:
1. Always check for serial correlation as revealed by an xk vs. xk+1 plot and time-ordered residuals (see the sketch after this list).
2. Make sure that the data are sufficiently separate in time and each run condition sufficiently long to ensure that the samples are independent.
3. Carefully consider the process, not just the data. Since the serially correlated data have both fixed and random components, the problem becomes assessing which are which. One could make an a priori estimate for a moving average process using a well-stirred model of the furnace per the transient mass balance for the boiler in Chapter 2. Using such results, we could adjust the sampling period to be sufficiently large.
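As a rough sketch of diagnostic 1 (an illustration with hypothetical residual values, not part of the original text), one can score the lag-1 correlation of run-ordered residuals; values far from zero warn of serial correlation before any F test is trusted.

```python
import numpy as np

def serial_correlation_check(residuals):
    """Return the lag-1 correlation of run-ordered residuals.

    A value near zero is consistent with independent errors; a strongly
    positive value is the signature of serially correlated (e.g., moving
    average) behavior.
    """
    e = np.asarray(residuals, dtype=float)
    return np.corrcoef(e[:-1], e[1:])[0, 1]

# Example: residuals listed in run order (hypothetical values)
e = [0.4, 0.3, 0.35, 0.1, -0.2, -0.25, -0.3, -0.1, 0.05, 0.2]
print("lag-1 correlation of residuals:", round(serial_correlation_check(e), 2))
```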
4.1.5 Historical Data and Serial Correlation
For historical data, we do not have the privilege of changing how the data were collected. Therefore, we must do our best to note serial correlation and deal with it after the fact. Once we recognize serial correlation, the problem becomes recovering independent errors from correlated ones and using only the former in our F tests. As we have noted, most serial correlation will evaporate if we can identify lurking factors or the actual cause for the correlation. We then put that cause into a fixed effect in the model.
If there are cyclical trends, an analysis of batch cycles within the plant may lead to the discovery of a lurking factor. Failing this, one may be able to use time series analysis to extract the actual random error term from the correlated one.1,2 This is not so easy. Such models fall into some subset of an autoregressive integrated moving average (ARIMA) model, with the moving average (MA) model being the most likely. Time series analysis is a dedicated discipline in its own right. Often one will have to do supplemental experiments to arrive at reasonable estimates and models.
The main subject of this text is semiempirical models, i.e., theoretically derived models with some adjustable parameters. These are always preferable to purely empirical models for a variety of reasons, including a greater range of prediction, a closer relation to the underlying physics, and a requirement for the modeler to think about the system being modeled. But in some cases, we know so little about the system that we are at a loss to know how to begin. In such cases, we shall use a purely empirical model.
For the time being, let us presume that we have no preferred form for the model. That is, we have sufficient theoretical knowledge to suspect certain factors, but not their exact relationships to the response. For example, suppose we know that oxygen (ξ1), air preheat temperature (ξ2), and furnace temperature (ξ3) affect NOx. We may write the following implicit relation:

y = φ(ξ1, ξ2, ξ3)    (4.1)

where ξ represents the factors in their original metric and φ is the functional notation. Although we do not know the explicit form of the model, we can use a Taylor series to approximate the true but unknown model. Equation 4.2 represents a general Taylor series:

y = φ(a1, a2, …, ap) + Σk ∂φ/∂ξk|a (ξk – ak) + (1/2!) Σj Σk ∂²φ/∂ξj∂ξk|a (ξj – aj)(ξk – ak) + ⋯    (4.2)
Here ξ refers to the factors, subscripted to distinguish among them. We reference the Taylor series to some coordinate center in factor space (a1, a2, …, ap), where each coordinate is subscripted per its associated factor. The farther we move from the coordinate center, the more Taylor series terms we require to maintain accuracy. For Equation 4.1, the Taylor series of Equation 4.2, truncated to second order, gives the following equation:

y ≈ φ(a1, a2, a3) + Σk=1..3 ∂φ/∂ξk|a (ξk – ak) + (1/2!) Σj=1..3 Σk=1..3 ∂²φ/∂ξj∂ξk|a (ξj – aj)(ξk – ak)
Now if we code the factors to ±1 with the transforms given earlier, the Taylor series becomes the simpler Maclaurin series, which by definition is centered at zero (Equations 4.3 and 4.4).
For nonlinear models, when n < ∞, the series is no longer exact but approximate. In such a case we replace the equality (=) by an approximate equality (≈). We illustrate the use of Equations 4.3 and 4.4 with an example.
Example 4.1 The Maclaurin and Taylor Series for Two Factors

Problem statement: Use Equations 4.3 and 4.4 to derive the Maclaurin and Taylor series for y = φ(x1, x2), truncated to third order. What would the corresponding fitted equation look like?
Solution: For f = 2 and n = 3, Equation 4.3 becomes

y = φ(0, 0) + Σk=1..2 ∂φ/∂xk|0 xk + (1/2!) Σj Σk ∂²φ/∂xj∂xk|0 xj xk + (1/3!) Σi Σj Σk ∂³φ/∂xi∂xj∂xk|0 xi xj xk

Proceeding step by step, we have the following:

y = φ(0, 0) + ∂φ/∂x1|0 x1 + ∂φ/∂x2|0 x2
  + (1/2!)[∂²φ/∂x1²|0 x1² + 2 ∂²φ/∂x1∂x2|0 x1x2 + ∂²φ/∂x2²|0 x2²]
  + (1/3!)[∂³φ/∂x1³|0 x1³ + 3 ∂³φ/∂x1²∂x2|0 x1²x2 + 3 ∂³φ/∂x1∂x2²|0 x1x2² + ∂³φ/∂x2³|0 x2³]

If we were to evaluate the above equation numerically from a data set, we could fit the third-order model

ŷ = a0 + (a1x1 + a2x2) + (a11x1² + a12x1x2 + a22x2²) + (a111x1³ + a112x1²x2 + a122x1x2² + a222x2³)

Here, we have grouped the terms in parentheses by overall order. We may derive the Taylor series in the same manner, replacing xk by ξk – ak and 0 by ak.
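To connect the symbolic expansion to a regression, here is a minimal sketch (my own illustration with made-up data, not from the text) that builds the ten-column model matrix for the third-order model in x1 and x2 and fits the coefficients by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(-1, 1, n)          # coded factors
x2 = rng.uniform(-1, 1, n)
y = 1.0 + 0.5*x1 - 0.3*x2 + 0.2*x1*x2 + rng.normal(0, 0.05, n)   # hypothetical response

# Columns grouped by overall order: constant | first | second | third
X = np.column_stack([
    np.ones(n),
    x1, x2,
    x1**2, x1*x2, x2**2,
    x1**3, x1**2*x2, x1*x2**2, x2**3,
])

a, *_ = np.linalg.lstsq(X, y, rcond=None)   # a0, a1, a2, a11, a12, a22, a111, ...
print(np.round(a, 3))
```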
4.2.1 Model Bias from an Incorrect Model Specification
In the previous section, we constructed a model comprising a finite number of terms by truncating an infinite Taylor series; therefore, if higher-order derivatives exist, then they will bias the coefficients. We introduced the reader to this concept in Chapter 3 beginning with Section 3.4. Here we explore additional considerations. For example, let us suppose that Equation 4.5 gives the true model for NOx:

(4.5)

where y is the NOx, A and b are constants, and T is the furnace temperature. Further, suppose that due to our ignorance or out of convenience or whatever, we fit the following (wrong) model:

ŷ = a0 + a1x    (4.6)

So long as the series remains infinite, there is a one-to-one correspondence between the coefficients and the evaluated derivatives. However, once we truncate the model, this is no longer strictly true: higher-order derivatives
Trang 10will bias the lower-order coefficients Yet, near zero, higher-order terms will
vanish more quickly than lower-order ones So, if x is close to zero then the
model has little bias We refer to the error caused by using an incorrect
mathematical expression as model bias.
At x = 1 each term is weighted by its Maclaurin series coefficient As x
grows beyond 1, then the higher-order terms exert larger and larger ence; so mild extrapolation leads quickly to erroneous results This wouldnot be the case if the model were correct Notwithstanding, even for theincorrect empirical model, this bias may be nil so long as we are within thebounds of our original data set (coded to ±1)
influ-For x >> 1, we need to add many additional terms for the empirical model
to adequately approximate the true model As x grows larger and larger, we
need more and more empirical terms This is so, despite the fact that thetrue model comprises only two terms This is why it is much more preferable
to generate a theoretical or semiempirical form rather than a wholly ical one Nonetheless, an empirical model of second order at most (andusually less) is sufficient for interpolation In other words, empirical modelsare very good interpolators and very poor extrapolators This is true for allmodels in the sense that we may never have exactly the right model form,but it is especially so for empirical models
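The interpolation/extrapolation point is easy to see numerically. The sketch below is my own illustration, not the book's worked example: it assumes an exponential "true" model with arbitrary constants, fits the wrong first-order empirical model over the coded interval [-1, 1], and compares predictions inside and outside that interval.

```python
import numpy as np

A, b = 2.0, 1.2                      # arbitrary constants for the assumed "true" model
true = lambda x: A * np.exp(b * x)

x = np.linspace(-1, 1, 9)            # design limited to coded +/-1
y = true(x)

a1, a0 = np.polyfit(x, y, 1)         # the wrong, first-order empirical model
fit = lambda x: a0 + a1 * x

for xp in (0.0, 0.5, 1.0, 2.0, 3.0):
    err = 100 * (fit(xp) - true(xp)) / true(xp)
    print(f"x = {xp:3.1f}: true = {true(xp):7.2f}, fitted = {fit(xp):7.2f}, error = {err:6.1f}%")
# Inside the coded range the empirical model interpolates tolerably;
# beyond x = 1 the bias grows rapidly.
```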
Suppose that we could expand our model to comprise an infinite number of terms (which would require an infinite data set to fit). Then we could evaluate the coefficients for Equation 4.7, generating the following normal equations:

Σy = a0 n + a1 Σx + a2 Σx² + a3 Σx³ + ⋯
Σxy = a0 Σx + a1 Σx² + a2 Σx³ + a3 Σx⁴ + ⋯    (4.8)

Because we centered x, the sum of the odd powers is zero, but the sum of the even powers is not. Since our approximate model comprises only two terms — a0 and a1 of Equation 4.6 — the higher-order terms will bias them.
A careful examination of Equation 4.8 shows that the even terms bias a0 and the odd terms bias a1. We are actually fitting an equation something like

y ≈ (a0 + b2a2 + b4a4 + ⋯) + (a1 + c3a3 + c5a5 + ⋯)x

where bk and ck are constants accounting for the contributions of the higher-order derivatives. Again, for 0 < x < 1 the sum of the higher powers will likely be negligible, and for this reason, empirical models are excellent interpolators. Nonetheless, a good theoretical model would eliminate this bias and would require fewer terms for an adequate fit to the data.

4.2.2 Design Bias
We have seen from the previous section that an improper model specification is a problem if we extrapolate beyond the bounds of the experimental design. The proper model derived from theoretical considerations ameliorates this problem. We have also seen that a purely empirical model will do a very good job within the design boundaries even if it is wrong. However, even with the proper model, an improper experimental design may still bias the coefficients. We refer to errors introduced by a less than ideal X matrix as design bias. Conversely, proper experimental design can eliminate this bias.

Consider a classical one-factor-at-a-time design given in Table 4.1. Here, x1 is the excess oxygen concentration in the furnace, x2 is the air preheat temperature (APH) of the combustion air, and x3 is the furnace temperature, measured at the bridgewall of the furnace (BWT). Let us represent this factor space by S.
This is not a promising start, as STS contains not a single zero value; everything mutually biases everything else. Coding will zero some of the off-diagonal values. Using the coding transforms, we have
(4.11a)
(4.11b)
(We show a merely to give the coefficient references.) These coded data are better. At least a0 is unbiased, but a1 to a3 still bias one another. Figure 4.3a depicts the classical design. It forms a right-angled tetrahedron in factor space. Since it is neither scaled nor centered, the edges are not equal lengths, nor does the design center (centroid of the tetrahedron) coincide with the center of the factor space (centroid of the cubic region).

Figure 4.3b is the same design scaled to 0/1 coordinates, but not centered. Since it is not centered, the design center is not coincident with the center of the factor space. Figure 4.3c shows the design in ±1 coordinates. The design and coordinate centers are now coincident. However, the design is still not orthogonal because it is not balanced about the coordinate center. Figure 4.3d is an example of a fractional factorial design. It is centered and scaled, and since it is balanced about the origin, it also gives an orthogonal matrix. Let us represent this factor space by T.

If we could transform our design to these coordinates, we would have an orthogonal design. In fact, we can.
We have two basic remedies to make designs orthogonal: we can either change the design or morph the factor space. Changing the design means that before we begin the experiment, we think about what factors are important and how we can arrange the test matrix to be orthogonal. This generates a balanced design having an equal number of high and low values for each factor equidistant from zero in each factor direction, e.g., factorial designs. The advantage of using orthogonal designs is that one can examine independent factors with clear meaning and perform a number of statistical tests, etc. The only "disadvantage" is that it requires up-front thinking. Remember Westheimer's discovery: "a couple of months in the laboratory will save you a couple of hours at the library."
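A quick way to see design bias is to compute XTX for a candidate design: nonzero off-diagonal terms couple the coefficient estimates. The sketch below uses illustrative coded design points (not the values of Table 4.1) to contrast a one-factor-at-a-time design with a 2^(3-1) fractional factorial.

```python
import numpy as np

def cross_product(design):
    """Return X'X for a design matrix with an added intercept column."""
    X = np.column_stack([np.ones(len(design)), design])
    return X.T @ X

# One-factor-at-a-time: start at the center, move each factor alone (coded units)
ofat = np.array([[0, 0, 0],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]], dtype=float)

# 2^(3-1) fractional factorial with x3 = x1*x2
frac = np.array([[-1, -1,  1],
                 [ 1, -1, -1],
                 [-1,  1, -1],
                 [ 1,  1,  1]], dtype=float)

print("one-factor-at-a-time X'X:\n", cross_product(ofat))   # off-diagonals nonzero
print("fractional factorial X'X:\n", cross_product(frac))   # off-diagonals all zero
```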
FIGURE 4.3
Graphical representation of various experimental designs. (a) The classical design in the original coordinates (a distorted right-angle tetrahedron); the coordinate center does not coincide with the center of the design. (b) The design coded to 0/1 coordinates. This conformally shrinks the factor directions to uniform dimension. (c) The design in ±1 coordinates. The design and coordinate center are now coincident (X). (d) A design that is orthogonal and centered in the new coordinates.
4.3.1 Source and Target Matrices: Morphing Factor Space
Suppose we want to convert a source matrix (S) that is nonorthogonal but full rank and square, such as Matrix 4.10a, into an orthogonal target matrix (T), such as Matrix 4.12a. We could postmultiply by some transformation matrix (F):

SF = T    (4.15)

Again, we could just as easily have used the original matrix, the above 0/1 coding, or the traditional ±1 coding. But as this is a classical design, one-factor-at-a-time investigations usually proceed from some origin, which is more conveniently coded as the coordinate center.
We would like to transform S in Matrix 4.16a into T of Matrix 4.12a. We will do this with a transformation matrix:
Before the transformation, we have something like y = b0 + b1s1 + b2s2 + b3s3 in s1·s2·s3 factor space. After the transformation, we have y = a0 + a1t1 + a2t2 + a3t3 in t1·t2·t3 factor space. This latter function is orthogonal in t1, t2, and t3. In other words, if y = Ta = Sb and SF = T (where F maps s1, s2, and s3 onto t1, t2, and t3), then SFa = Sb. So, on the one hand, we have gained independent coefficients. On the other hand, we are not sure what they mean. In other words, we are trading a nonorthogonal design in orthogonal s1·s2·s3 factor space for an orthogonal design in distorted t1·t2·t3 space. If the distorted space has no physical meaning, we have gained little.

We see that after the fact, it may be possible to find combinations of the original factors that represent an orthogonal design. However, this is a much weaker approach than conducting a proper design in the first place, because the factor combinations often have no real meaning.

On the other hand, sometimes a linear combination of factors does have meaning and the linear combination may actually be the penultimate factor.
For example, kinetic expressions (those determining the rate of appearance or disappearance of a species like NOx or CO) are really a function of collision frequency (Z). But it is not possible to directly observe molecular collisions and hence Z. However, Z is related to the temperature (T), pressure (P), and concentration (C) — all increase the collision frequency. Suppose, for the sake of argument, that the actual production rate of an important species, y = f(ζ), were actually a function of the log of the collision frequency, ζ = ln(Z), and that Z is given by Equation 4.20:

(4.20)

Then

where a0 = ln(b0), x1 = ln(P), …, and x3 = ln(C). So for y = φ(ζ), the most parsimonious model would actually be a linear combination of x1, x2, and x3. In such a case, orthogonal components may be useful to spot such relations in the data. However, we do not want to distort the original factors. We seek only to rotate the axes to expose these relations. Eigenvectors and eigenvalues can do this for us.
4.3.2 Eigenvalues and Eigenvectors
One may use eigenvalues and eigenvectors to decompose a matrix into orthogonal components, and they are the best alternative for that purpose because they do not distort the factor space as the source–target method may do. Eigenvalues (Λ) and eigenvectors (K) are defined for a square matrix (M) of full rank as follows:

MK = KΛ    (4.21)

where Λ = diag(λ1, λ2, …, λn) is the diagonal matrix of eigenvalues.
Theoretically, eigenvalues are solutions of an nth-order polynomial (characteristic) equation, where n is the number of rows in the starting matrix, presuming it is nonsingular and square. Matrix algebra texts give the procedure.3 However, the mechanics can become unwieldy and dedicated software is really a must for this procedure. Regrettably, Excel does not have a standard function for this, but software such as MathCAD™ does. Dedicated statistical software is the best option. The procedure can be done in a spreadsheet, but it is tedious, as we show now.
We may make use of the trace of the matrix to find the eigenvalues. The trace of a matrix is the sum of the diagonal elements. We may also define traces for higher-order square matrices:

tn = Σk mnkk (the sum of the diagonal elements of Mn)

In the above equation, we are relying on context to obviate any equivocation for n (for Mn the superscript is an authentic exponent). Thus, M2 = MM. However, for tn and mnkk, the superscript n is mere nomenclature. Once we have tn, the characteristic equation and its solutions follow:
Example 4.2 The Characteristic Equation Using the Trace Operator

Problem statement: Given Matrix 4.16b, find the characteristic equation and the eigenvalues.
Solution: Matrix 4.16b is a full-rank (nonsingular) matrix having four rows (n = 4). We solve for tn in the following manner: Let
Solving for the coefficients of the characteristic matrix according to Equation 4.24, we have (the nth coefficient of the characteristic equation is always 1) and the characteristic equation is 1 – 7λ + 12λ² – 7λ³ + λ⁴ = 0. Fortunately, this equation factors as (λ² – 5λ + 1)(λ – 1)² = 0 with the solutions

λ = (5 ± √21)/2 ≈ {4.791, 0.209} and λ = 1 (a double root).
Since these are solutions for a single variable, one may also use numerical procedures such as the goal seek algorithm in Excel to solve for them. Also, the rational roots (if they exist) will always be factors of the constant. In our case, the constant is 1, so we would try ±1, finding 1 to be a double root, as shown above. This rational roots procedure can often help to factor the equation and reduce the order of the remainder, simplifying the final solution.
Analytically, one can always find the solutions for polynomials up to fourth order using various procedures.*
Each eigenvalue has an associated eigenvector such that

(M – λI)k = 0    (4.25)

where k is an eigenvector. The eigenvectors are not unique in the sense that any scalar multiple of an eigenvector will itself be an eigenvector. To resolve this problem, we shall reduce the eigenvectors to unit magnitude, i.e., kTk = Σj kj² = 1. For real, symmetric matrices (the only kind we need to consider in this text), the eigenvectors are always orthogonal. That is, kjTkk = 0, where j and k are any two different vectors in the K matrix.
For the case at hand, Equation 4.25 reduces to
(4.27)
We illustrate the procedure for one of the eigenvalues in the next example.
Problem statement: For Matrix 4.27, we have shown that the characteristic equation is (λ² – 5λ + 1)(λ – 1)² = 0, having solutions λ = (5 ± √21)/2 and λ = 1 (a double root). Find the eigenvector associated with the eigenvalue λ = 0.2097.
* Any standard mathematical text will have solutions for up to fourth-order polynomials. See, for example, Gellert, W. et al., Eds., The VNR Concise Encyclopedia of Mathematics, American Edition, Van Nostrand Reinhold Company, New York, 1977, pp. 80–101. General equations of fifth order and higher have been proven impossible to solve by radicals, though many special equations of arbitrary order are solvable; e.g., the triquadratic equation ax⁶ + bx³ + c = 0 may be reduced to a quadratic equation with the substitution u = x³.
Solution: We can find the eigenvector numerically using a spreadsheet.

Step 1: First, we substitute a selected eigenvalue, e.g., λ = 0.2097:

Step 2: Now we arbitrarily set k3 = 1, and reduce the matrix by one column and the eigenvector and solution vector by one row, so that the system becomes soluble.

Step 3: Premultiplying by the inverse of the matrix, we obtain the reduced eigenvector. But we had arbitrarily set k3 = 1, so the full vector is an eigenvector associated with λ = 0.2097.
Step 4: Normalizing this by the root of the sum of squares, we obtain the unit eigenvector associated with λ = 0.2097, where the subscript denotes the column of the column vector in the eigenvectors' matrix K.
So long as the eigenvectors are distinct, this method will lead to the associated eigenvectors. The major advantage of this method is that spreadsheets can do all the calculations. However, if the eigenvectors are not distinct (e.g., multiple roots), we will end up with a problem — two different eigenvectors associated with two identically valued eigenvalues. We can continue without problem to obtain an eigenvector associated with the other distinct eigenvalue, λ = (5 + √21)/2 ≈ 4.791.
But we run into trouble almost immediately, solving for the eigenvectors associated with the double root, λ = {1, 1}, generating the matrix

It reduces to the following equations: 3k0 + k1 + k2 = –1 and k0 = 0. Substituting one into the other, we obtain k1 + k2 = –1, from which we may evaluate the remaining two eigenvectors.
Here a and b are undetermined coefficients. We use a and b because the remaining two eigenvectors cannot be the same; eigenvectors for real symmetric matrices are always mutually orthogonal. Note that we have not yet normalized the first two vectors in K to unit magnitude, so for now we label the eigenvector matrix as K′ rather than K. Now if the first two column vectors in K′ (let us call them k0 and k1) are mutually orthogonal, then k0Tk1 = 0, giving ab + (a + 1)(b + 1) + 1 = 0, which reduces to 2ab + a + b + 2 = 0. Arbitrarily choosing b = 1 gives a = –1. Substituting into the matrix gives the remaining eigenvectors, and normalizing the first two vectors to unit magnitude gives K.

MathCAD gives the following solution, which the reader may verify is equally correct, yielding the relations given in Equations 4.21, 4.22, and 4.25. (Multiple roots do not have unique associated eigenvectors.)
(4.28)
At any rate, once we obtain the eigenvalues and eigenvectors, we can move on to making real symmetric matrices orthogonal. Least squares solutions always generate real symmetric matrices; thus, they are amenable to this treatment.
Recall that for real symmetric matrices, eigenvectors are orthogonal in the strictest sense. And so it follows for our example that KTK = KKT = I.
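Today, dedicated software makes the spreadsheet iteration unnecessary. The following sketch (using a hypothetical symmetric matrix, since Matrix 4.16b is not reproduced here) uses NumPy to verify the properties used above: the eigenvalues sum to the trace, K is orthonormal, and KTMK is the diagonal eigenvalue matrix.

```python
import numpy as np

M = np.array([[4.0, 1.0, 0.5],       # a hypothetical real, symmetric matrix
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])

lam, K = np.linalg.eigh(M)           # eigenvalues (Lambda) and eigenvectors (columns of K)

print("eigenvalues:", np.round(lam, 4))
print("sum of eigenvalues vs trace:", round(lam.sum(), 4), round(np.trace(M), 4))
print("K'K = I ?      ", np.allclose(K.T @ K, np.eye(3)))
print("K'MK diagonal ?", np.allclose(K.T @ M @ K, np.diag(lam)))
```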
4.3.3 Using Eigenvectors to Make Matrices Orthogonal
Premultiplying Equation 4.21 by KT gives
KTKΛ ≡ Λ ≡ KTMK    (4.29)
Given y = Xa, we seek another system of factors giving linear combinations of X such that y = Ub, and also where UTU = D, a diagonal matrix. Here is the procedure.
Step 3: The above equations complete the transformation: substitution of these relations into Equation 4.30 gives Ub = XKKTa = Xa because KKT = I; therefore, y = Ub represents an alternate system of factors and coefficients for y = Xa.
To see this, consider the necessary properties of UTU. Premultiplying y = Ub by UT gives UTy = UTUb. Substituting in terms of X gives UTU = (XK)TXK = KTXTXK. But M = XTX, and in light of Equation 4.29, this substitution gives UTU = KTMK = Λ. Therefore, UTU is a diagonal matrix — the eigenvalue matrix of XTX, to be exact. Collecting these equations:
u4 = 0.910 + 0.240 (x1 + x2 + x3)
Thus, the uk represent linear combinations of xk, and either system will give identical values for y: y = a0 + a1x1 + a2x2 + a3x3 = b1u1 + b2u2 + b3u3 + b4u4. The advantage of the eigenvalue procedure over the source–target matrix procedure is that:
1. We have merely rotated axes, not distorted factors.
2. Both the original and new coordinate axes are orthogonal.
3. We may apply the procedure to any nonsingular matrix, even when X is nonsquare, because M = XTX will always be square.
4. The new system, y = Ub, is orthogonal. Therefore, we estimate b without design bias.
5. If the linear combinations of factors have meaning, they may represent a more parsimonious model and help to identify an important underlying relationship.
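Here is a minimal sketch of the whole procedure with an arbitrary (nonorthogonal) design matrix X, assumed only for illustration: form M = XTX, rotate to U = XK, and confirm that UTU is diagonal and that both factor systems give identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(8, 3))           # an arbitrary, nonorthogonal design
y = rng.uniform(0, 1, size=8)                # hypothetical responses

M = X.T @ X
lam, K = np.linalg.eigh(M)                   # K is orthonormal: K K' = I

U = X @ K                                    # rotated factors (linear combinations of the x's)
print("U'U diagonal?", np.allclose(U.T @ U, np.diag(lam)))

a, *_ = np.linalg.lstsq(X, y, rcond=None)    # coefficients in the original factors
b, *_ = np.linalg.lstsq(U, y, rcond=None)    # coefficients in the rotated factors
print("same fitted values?", np.allclose(X @ a, U @ b))
print("b = K'a ?          ", np.allclose(b, K.T @ a))
```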
4.3.4 Canonical Forms
Consider a second-order surface given by the general second-order equation in summation form for f factors:

y = a0 + Σk ak xk + Σj Σk≥j ajk xj xk    (4.38)

In this formulation, the inequality in the index of the second-order terms (j ≤ k) includes the pure quadratics. Now this surface may take several possible forms, depending on the coefficients. However, by rotating and/or translating axes, we can always simplify the equation to either of two forms:

y = a0 + Σk θk uk + Σk λkk uk²    A canonical form (4.39a)

y = a0 + Σk λkk uk²    B canonical form (4.40a)
Box and Draper4 call the first the A canonical form and the second the B canonical form. The A canonical form represents a rotation of axes. The B canonical form represents both a rotation and a translation to a new design center. If we are far from the design center, the A canonical form will be more useful. If we are close to the design center, we shall prefer the B canonical form.
4.3.4.1 Derivation of A Canonical Form
We may rewrite Equation 4.38 in matrix form as

y = a0 + xTa + xTAx    (4.41)

where x is the vector of factors, a is the vector of first-order coefficients, and A is a matrix containing all the terms that are second-order overall. That is, A carries the pure quadratic coefficients akk on its diagonal and half the interaction coefficients, ajk/2, in its off-diagonal positions. Letting u = KTx, where K is the eigenvector matrix of A, rotates the axes so that all second-order coefficients vanish except the pure quadratics.
4.3.4.2 Derivation of B Canonical Form
We may also write Equation 4.41 as

y = xTBx

where xT = (1 x1 x2 … xf) and B is the symmetric matrix formed by bordering A with the constant a0 and half the first-order coefficients (a/2). Then, letting uT = xTK, where K is the eigenvector matrix of B, and Λ = KTBK, we have
y = a0 + uTΛu    B canonical form (4.40b)
Equation 4.40b corresponds to a rotation of axes and a translation to a new coordinate center. By setting first derivatives to zero, we see that Λ represents a set of coordinates in u1·u2···un space that correspond to the extremum (maximum, minimum, or saddle point, collectively referred to as the stationary point). If the stationary point is close to our design center, we will prefer the A canonical form. If not, we shall prefer the B canonical form. If we can assign a clear meaning to the linear combinations of the factors derived from the canonical analysis, we may well prefer to keep our regression in the canonical space rather than the original space.
4.3.4.3 Canonical Form and Function Shape
One may use these canonical forms to simplify second-order equations of the type given in Equation 4.41 by either rotating (A form) or rotating and translating (B form) the axes. By examining the second-order coefficients of either form, one may determine what kind of surface one is dealing with. If all λkk are positive or negative, then one is dealing with a minimum or maximum surface, respectively. If they are of differing signs, then the surface has a min-max or saddle shape. If some of the λkk are close to zero and the associated θk are positive, then the surface is a rising ridge. If the associated θk are negative, the surface is a falling ridge.

FIGURE 4.4
Various response surfaces. Values of λ and θ allow the investigator to quickly determine the shape of the response surface in any number of dimensions. When all λs are of the same sign, then the response surface is a maximum or minimum; when the signs differ, the response surface is a saddle (min-max or col) shape. When λ is close to zero, then the design is first order in that response factor.
Example 4.4 Canonical Forms
Problem statement: Consider the second-order equation

y = 1.029 + 0.326x1 – 0.085x2 + 0.282x1² – 0.031x1x2 – 0.127x2²

(a) Explicitly declare the equation according to the forms of Equation 4.39a. Use an eigenvector analysis to reduce it to the A canonical form and give the explicit equations. What can you tell about the surfaces by inspection of the coefficients? (b) Repeat the analysis using the B canonical form beginning with Equation 4.40a.
Solution: (a) From the problem statement we find
According to Equation 4.39a, we have

y = a0 + (x1 x2)(a1, a2)T + (x1 x2) [a11  a12/2; a12/2  a22] (x1, x2)T

or, explicitly,

y = 1.029 + (x1 x2)(0.326, –0.085)T + (x1 x2) [0.282  –0.0155; –0.0155  –0.127] (x1, x2)T

Making use of y = a0 + (xTK)(KTa) + (xTK)(KTAK)(KTx), or equivalently, y = a0 + uTθ + uTΛu, we find K = eigenvectors(A).
The reader may verify that this equation gives exactly the same
values as our starting equation The u vector represents a rotation
of axes that zeroes out the non-diagonal elements for Λ By ining the coefficients of Λ, we see that they are of opposite sign,
exam-indicating a saddle shape oriented along the axes of u1 and u2
The saddle is steeper in the u1 direction as indicated by a
coeffi-cient about double that of u2
(b) Likewise, from Equation 4.40a, we may write our equation as y = xTBx, with xT = (1 x1 x2) and

B = [ 1.029    0.163   –0.043
      0.163    0.282   –0.016
     –0.043   –0.016   –0.127 ]

Then we may write y = (xTK)(KTBK)(KTx), or equivalently, y = uTΛu, from which the explicit B canonical form follows.
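For the equation of Example 4.4, the eigenanalysis takes only a few lines of NumPy. This is my sketch rather than the book's tabulated solution, so rounding and eigenvector signs may differ from the hand calculation; it builds A and B from the stated coefficients and reports Λ and θ.

```python
import numpy as np

# Coefficients of y = 1.029 + 0.326 x1 - 0.085 x2 + 0.282 x1^2 - 0.031 x1 x2 - 0.127 x2^2
a0 = 1.029
a = np.array([0.326, -0.085])
A = np.array([[ 0.282,     -0.031 / 2],
              [-0.031 / 2, -0.127    ]])

lam, K = np.linalg.eigh(A)       # Lambda: quadratic coefficients along the rotated axes
theta = K.T @ a                  # first-order coefficients along the rotated axes
print("lambda:", np.round(lam, 3))   # opposite signs -> saddle (min-max) surface
print("theta: ", np.round(theta, 3))

# B canonical form: border A with a0 and a/2, then eigen-decompose
B = np.block([[np.array([[a0]]), a[None, :] / 2],
              [a[:, None] / 2,   A            ]])
lamB, KB = np.linalg.eigh(B)
print("eigenvalues of B:", np.round(lamB, 3))
```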
We may derive other statistics related to the ANOVA. Some of the more important ones are the coefficient of determination (r2), the adjusted coefficient of determination (rA2), the prediction sum of squares (PRESS) statistic, and a coefficient we shall call the coefficient of prediction (rP2). We can also define the variance inflation factor (VIF) and leverage statistics. We shall speak to each in turn.
4.4.1 The Coefficient of Determination, r2
Chapter 3 (Equation 3.43) defined an important ratio — the best known (and most misused) statistic for goodness of fit: the coefficient of determination, r2:

r2 = SSM/SST = 1 – SSR/SST

The statistic r2 gives the fraction of the total variance accounted for by the model. An r2 of 0.9 (or 90%) means that the model accounts for 90% of the total variation of the data. For the purposes of combustion modeling, we desire r2 > 0.8, according to the following scale:
• 0.9 < r2 < 1.0 — very strong correlation
• 0.8 < r2 < 0.9 — strong correlation
• 0.7 < r2 < 0.8 — good correlation
• 0.6 < r2 < 0.7 — fair correlation
• 0.5 < r2 < 0.6 — weak correlation
If r2 = 1, then the model accounts for all of the variation and fits the data perfectly. A related measure is r, the coefficient of correlation, r = √r2. Since 0 < r2 < 1, then r ≥ r2.
4.4.2 Overfit
Now r or r2 will always increase as the number of adjustable parameters in the model increases. Continuing to add adjustable model parameters eventually results in a condition known as overfit. Overfit is the unjustified addition of model parameters resulting in the fitting of random error to false factor effects. This is a statistical no-no, because random effects cannot be related to nonrandom variables. An example will help make this clear.
Suppose we have the hypothetical data given by Table 4.2. Most spreadsheets and calculators have random number (uniform distribution) generators. The Excel command RAND() simulates a uniform random distribution between 0 and 1. Therefore, the underlying model to these data is y = 0.5 + e. Suppose we fit the following models to the first four data points:

Model 0: ŷ = a0
Model 1: ŷ = a0 + a1x
Model 2: ŷ = a0 + a1x + a2x²
Model 3: ŷ = a0 + a1x + a2x² + a3x³

All of these models except model 0 are nonsense. Notwithstanding, here are the least squares solutions and the associated r2 values:
If we were to judge based on r2 only, we would prefer model 3. Clearly, we have gotten too carried away and fit random behavior as if it were determined by the factor x. Our job, using least squares, is to find the underlying model. Here the underlying model is y = 0.5 + e. But by continuing in a least squares frenzy, we have gone far beyond finding the true model and have force-fit random behavior as a function of x. Random behavior is not a function of any known factor (else it would not be random).

Now the expected value of the fifth observation is 0.5. Model 0 comes the closest to predicting the true value; model 0: 0.57. The rest of the models are way off — model 1: 0.11, model 2: –1.07, and model 3: –0.66. With each adjustable parameter added to the model, the r2 has gone up, but the predictive power is lower than for model 0.

FIGURE 4.5
On overfitting data. The data were generated by the Excel function RAND(), which gives a uniform distribution between 0 and 1. Therefore, the true model is y = 0.5 + e; i.e., model 0 is the only sensible model. Model 1 is a first-order fit to the first four data points (diamonds); likewise, model 2 is a second-order fit and model 3 a third-order fit. Data point 5 (square) is next in the random sequence. All models but model 0 are biased and give absurd results for point 5; they represent overfit — the ascription of random data to nonrandom effects. Despite this, all models have higher r2 than model 0. Therefore, r2 is not a useful statistic for revealing overfit. In this case, the model with the lowest r2 is actually the best.
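The overfit experiment is easy to repeat. The sketch below uses my own random draw, so the numbers will not match Table 4.2; it fits polynomial models of order 0 through 3 to the first four uniform random points and predicts the fifth.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.uniform(0, 1, 5)        # five draws from RAND(); the true model is y = 0.5 + e
x = np.arange(1, 6, dtype=float)

for order in range(4):          # models 0 through 3
    coef = np.polyfit(x[:4], y[:4], order)      # fit to the first four points only
    pred = np.polyval(coef, x[4])               # predict the fifth point
    resid = y[:4] - np.polyval(coef, x[:4])
    sst = np.sum((y[:4] - y[:4].mean()) ** 2)
    r2 = 1 - np.sum(resid ** 2) / sst
    print(f"model {order}: r2 = {r2:5.3f}, prediction of point 5 = {pred:6.2f} (actual {y[4]:.2f})")
# r2 climbs toward 1.0 as terms are added, while the point-5 prediction worsens.
```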
4.4.3 Parsing Data into Model and Validation Sets
If we fit random data to a factor, we overfit the model and generate nonsense. This is a great danger in judging models with only r2. One way to catch this behavior (presuming we have enough data points) is to do the following:
Trang 351 Parse the data set into two sets, a model set (comprising about twothirds of the points) and a validation set (comprising one third ofthe points).
2 Randomly assign the points to the data sets
3 Fit coefficients to the model set
4 Gauge r2 using the validation set
Because a different set validates the data, the lower limit for r2 is no longer zero and the difference between r2 coefficients for the whole and parsed data sets is an indicator for overfit. We also hope for stable model coefficients. However, other data maladies (e.g., influential and errant response values) can cause large differences in r2. In other words, this is a severe test for regressed data. Nonetheless, if the data pass this test, one can be reasonably sure that the regression is trustworthy.
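A minimal sketch of the parsing procedure, with illustrative data rather than the data of Table 4.2: fit on a randomly chosen two-thirds of the points and score r2 on the held-out third.

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(-1, 1, 15)
y = 1.0 + 0.8 * x + rng.normal(0, 0.1, x.size)   # hypothetical data

idx = rng.permutation(x.size)                    # random assignment to the two sets
model_i, valid_i = idx[:10], idx[10:]            # ~2/3 model set, ~1/3 validation set

coef = np.polyfit(x[model_i], y[model_i], 1)     # fit coefficients on the model set

pred = np.polyval(coef, x[valid_i])              # gauge r2 on the validation set
ss_res = np.sum((y[valid_i] - pred) ** 2)
ss_tot = np.sum((y[valid_i] - y[valid_i].mean()) ** 2)
print("validation r2:", round(1 - ss_res / ss_tot, 3))
```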
For illustration purposes, and as we have already begun with an 80/20 split of the data, let us extend this analysis for all five possible model/validation sets. We shall use four points in the model set and one in the validation set. Table 4.3 shows the results; the statistic rk2 is the r2 statistic based on the kth point for validation.
The largest entries are bolded. From examining the table, one sees that in four out of five instances, no model is superior to model 0. Therefore, the majority report from this analysis is that model 0 is the best model. As well, we could have parsed the data into a 60/40 model/validation split. This would have generated 10 rjk2 statistics: r122, r132, r142, r152, r232, r242, r252, r342, r352, and r452, and resulted in the same conclusion. (In general, there are 2^n possible r2 statistics for all possible parsings into two groups. This means that a data set comprising only 20 values could prepare and analyze over a million such statistics. Obviously, this would be overkill.)
4.4.4 The Adjusted Coefficient of Determination, rA2
The foregoing discussion prods us to search for better goodness-of-fit statistics. Practitioners have developed several statistics to gauge more fairly the trade-off between increased fit and reduced degrees of freedom and to alert the investigator to overfit. One such statistic is the adjusted coefficient of determination, rA2:

rA2 = 1 – [SSR/(n – p)] / [SST/(n – 1)]    (4.44)

Thus, r2 makes use of the sums of squares while rA2 makes use of the mean squares. This compensates partially for reducing the degrees of freedom as we add more terms to the model. However, rA2 still overstates the case, though not as badly as r2.
4.4.5 The PRESS Statistic
A better statistic for gauging overfit is the PRESS statistic. When we overfit data, we increase goodness of fit by correlating random error. What we really have is noise, but we add a coefficient to the model and treat it as if it were information. In other words, overfit adds bias by equivocating noise with information. This bias may have great influence, especially at the outskirts of the model, or beyond. What we would really like to know is what the residual would be if we were to encounter points that were not in our original data set. We have already described one method for detecting this — splitting the data set into model and validation portions. Though a milder test, the PRESS statistic requires less effort and gives good results. Many statistical programs perform it. Effectively, we do the following:

1. Delete the first point from the model and calculate the variance between the deleted point and the model prediction.
2. Do this for all n points.
3. Cumulate the variance.

We shall call this the sum of squares residual, predicted, or SSRp, but the common term in the statistical literature is the PRESS statistic. PRESS is a composite acronym and stands for prediction sum of squares.
SSRp = PRESS = Σk (yk – ŷ(k))²    (4.45)

In Equation 4.45, ŷ(k) is the predicted value for yk but regressed from the data with the kth response deleted. Likewise, we may define the sum of squares model, predicted, SSMp, as

SSMp = Σk (ŷ(k) – ȳ)²    (4.46)
For a well-behaved model, we would expect that SSRp ≈ SSR and SSMp ≈ SSM. Calculating n – 1 regressions does not appear to be a less tedious procedure than parsing the data set. However, we can make use of an identity to calculate the PRESS statistic in a single regression run using the hat matrix.
4.4.6 The Hat Matrix
H = X(XTX)–1XT

The matrix has some remarkable properties:
• It is square symmetric (and singular, except for saturated designs where it becomes an n-row identity matrix).
• It has as many rows and columns as there are y values in the data set — that is, n rows and n columns.
• It codes for ŷ (y hat) in terms of linear combinations of y (Equation 1.80): ŷ = Hy.
• It is idempotent.
• Since it is symmetrical, H = HTH = HHT = H2 = Hn. This last point is so because if H2 = H, then we may also write H3 = H(H2) = HH = H.
• The diagonal elements of H are always between 0 and 1:

0 ≤ hk,k ≤ 1    (4.48)

where hk,k are the diagonal elements of H and k indexes them.
• The sum of the diagonal elements of H is equal to the number of parameters, p, in the model:

Σk hk,k = p    (4.49)
One may use the diagonal hat matrix elements to transform the normal residual to the deleted residual used in Equation 4.45:

yk – ŷ(k) = (yk – ŷk) / (1 – hk,k)    (4.51)

This last property is the one that allows for direct calculation of SSRp in a single regression, thus avoiding n regressions:

SSRp = Σk [(yk – ŷk) / (1 – hk,k)]²    (4.52)
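A sketch of the single-pass PRESS computation using the hat matrix (illustrative data of my own construction): it computes H, checks that the leverages sum to p, and forms PRESS from the ordinary residuals and leverages per Equations 4.51 and 4.52.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, (n, p - 1))])   # intercept + 2 factors
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 0.1, n)          # hypothetical response

H = X @ np.linalg.inv(X.T @ X) @ X.T        # the hat matrix
h = np.diag(H)                              # leverages, each between 0 and 1
print("trace(H) = p ?", round(h.sum(), 6), "=", p)

e = y - H @ y                               # ordinary residuals (y - y_hat)
press = np.sum((e / (1 - h)) ** 2)          # Eq. 4.52: deleted residuals without refitting
print("SSR   =", round(np.sum(e ** 2), 4))
print("PRESS =", round(press, 4))           # PRESS exceeds SSR, as expected
```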
We can use the PRESS statistic to derive a modified coefficient of determination that we shall call the coefficient of determination, predicted.

4.4.7 The Coefficient of Determination, Predicted, rP2
We may use SSRp to estimate a goodness of fit for the predicted values. We define the coefficient of determination, predicted (rP2), as

rP2 = 1 – SSRp/SST

Note that rP2 ≤ r2; the greater the overfit, the lower rP2.

For example, let us compare r2, rA2, and rP2 for the hypothetical data of Table 4.2. This is best done with dedicated statistical software. Table 4.4 shows them. We notice several things:

• First, we note that r2 continually increases as we add more parameters to the model. This will always be the case for any data set. This is why we cannot look at r2 alone to decide on an appropriate model.
• The cubic model (model 3) is particularly good at capturing the variation with an r2 of 92.5%. However, the PRESS statistic alerts us to a problem.
• Comparing PRESS (SSEp) to SSE shows that in all cases, the PRESS statistic exceeds the SSE. This is typical. What is more important is that the magnitude of this difference is quite large in some cases and increases with model order. This should alert us to a potential problem regarding the predictive ability of our model.
• Examining rA2 shows that models 1 and 2 are not preferred over model 0. However, rA2 for model 3 is quite respectable at 70.1%. It is less than the r2 of 92.5%, but not sufficiently. Were we to choose a model based on rA2, it would be model 3.
• In contrast, the rP2 statistic clearly shows the worsening condition with increasing model order. Were we to gauge the predictive ability of the model using this statistic, we would conclude that no model is better than model 0.
Now, suppose the data were not random and that model 3 was actually a valid model. Our statistics would be no different, rP2 would still be very negative, and this would lead us to question the validity of model 3. But such skepticism would be appropriate. If we were actually trying to fit a model with four parameters (model 3), it would be best to have more than five data points. Five data points leave only one degree of freedom to assess random error. The rP2 statistic would prod us to gather more data and let us know that deletion of points makes a big change in the model form. As such, we could say that rP2 is a measure of robustness of fit. Although no single statistic is a panacea, by looking at a variety of statistics we get a good idea of a model's fitness for purpose. So then, let us continue to build our arsenal of appropriate statistics to help us understand how well behaved our model and data are.
even these have empirical parameters. In all cases, we would prefer to avoid extrapolation, or at least know when it is occurring.
Extrapolation beyond individual factor ranges is easy to detect. For example, if –1 < x < 1, then x = 2 is clearly an extrapolation; if the new data point lies outside any of the individual factor boundaries, we can be sure that it is an extrapolation no matter what the shape of the data cloud in factor space. Now refer to Table 4.5.

Before reading on, hazard a guess. Does the last data point comprise an extrapolation compared to the previous ones?
The data point is clearly within the range of each individual factor. However, Figure 4.6 shows that the data point actually lies outside the joint region defined by both factors considered simultaneously.

Hidden extrapolation refers to data that are outside the joint region of factors but not outside the range of any individual factor. For two factors, one may easily detect hidden extrapolation by plotting data as we have done in Figure 4.6. For three or more factors, one may plot every possible two-factor projection. However, even this may not detect all possible extrapolations. Ideally, we prefer a statistic that is easy to calculate, generates a single score, and reliably flags all extrapolation, hidden or otherwise. A variant of the hat matrix will do the trick.
It turns out that H defines an ellipsoid or hyperellipsoid in p-dimensional factor space that bounds the cloud of data. If our new data point lies outside this ellipsoid, then we have an extrapolation. The diagonal elements of H measure the distance between the kth value of X and the mean value of all the X values. We can think of them as the distances in p-factor space from the point to the center of the data cloud. We know from Equation 4.49 that the diagonal elements of H sum to p, so the mean value of the diagonal elements of H must be p/n. Thus, if we have a new value and want to test for it being an outlier, it seems sensible to calculate xT(XTX)–1x and compare it to p/n. If xT(XTX)–1x ≈ p/n, then the point represents an interpolation — it