7 Change detection based on filter banks
7.1 Basics
7.2 Problem setup
7.2.1 The changing regression model
7.2.2 Notation
7.3 Statistical criteria
7.3.1 The MML estimator
7.3.2 The a posteriori probabilities
7.3.3 On the choice of priors
7.4 Information based criteria
7.4.1 The MGL estimator
7.4.2 MGL with penalty term
7.4.3 Relation to MML
7.5 On-line local search for optimum
7.5.1 Local tree search
7.5.2 Design parameters
7.6 Off-line global search for optimum
7.7 Applications
7.7.1 Storing EKG signals
7.7.2 Speech segmentation
7.7.3 Segmentation of a car's driven path
7.A Two inequalities for likelihoods
7.A.1 The first inequality
7.A.2 The second inequality
7.A.3 The exact pruning algorithm
7.B The posterior probabilities of a jump sequence
7.B.1 Main theorems
7.1 Basics
Let us start by considering change detection in linear regressions as an off-line problem, which will be referred to as segmentation. The goal is to find a
sequence of time indices k^n = (k_1, k_2, ..., k_n), where both the number n and the locations k_i are unknown, such that a linear regression model with piecewise constant parameters,

y_t = φ_t^T θ(i) + e_t,   Cov(e_t) = λ(i) R_t,   k_{i−1} < t ≤ k_i,   (7.1)

is a good description of the observed signal y_t. In this chapter, the measurements may be vector valued, the nominal covariance matrix of the noise is R_t, and λ(i) is a possibly unknown scaling, which is piecewise constant. One way to guarantee that the best possible solution is found is to consider all possible segmentations k^n, estimate one linear regression model in each segment, and then choose the particular k^n, over n ≥ 1 and 0 < k_1 < ··· < k_n = N, that minimizes an optimality criterion.
The procedure, and, as it turns out, sufficient statistics as defined in (7.6)-(7.8), are illustrated below. What is needed from each data segment is the sufficient statistics V (the sum of squared residuals), D (minus the log determinant of the parameter covariance matrix) and the number of data N in each segment, as defined in equations (7.6), (7.7) and (7.8). The segmentation k^n has n − 1 degrees of freedom.
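To make the brute force formulation concrete, the following sketch enumerates all segmentations of a short scalar signal (changing mean, φ_t = 1, known noise variance) and scores each with the sum of squared residuals plus a simple per-segment penalty. The penalty of 2 log N per change time and the function name are illustrative choices standing in for the statistical and information based criteria discussed below.

```python
import numpy as np
from itertools import combinations

def best_segmentation(y, penalty):
    """Exhaustive search over all segmentations k^n of a scalar signal."""
    N = len(y)
    best_crit, best_k = np.inf, None
    # choose the interior change times k_1 < ... < k_{n-1}; k_n = N is fixed
    for n_changes in range(N):
        for interior in combinations(range(1, N), n_changes):
            k = list(interior) + [N]
            bounds = [0] + k
            V = sum(np.sum((y[a:b] - np.mean(y[a:b])) ** 2)   # V(i) per segment
                    for a, b in zip(bounds[:-1], bounds[1:]))
            crit = V + penalty * n_changes
            if crit < best_crit:
                best_crit, best_k = crit, k
    return best_k

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(6), 2 * np.ones(6)]) + 0.3 * rng.normal(size=12)
print(best_segmentation(y, penalty=2 * np.log(len(y))))   # expect roughly [6, 12]
```

Even for this toy signal the search visits 2^(N−1) segmentations, which is exactly the complexity problem addressed later in the chapter.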
Two types of optimality criteria have been proposed:
• Statistical criterion: the maximum likelihood or maximum a posteriori estimate of k^n is studied.
• Information based criterion: the information of data in each segment is V(i) (the sum of squared residuals), and the total information is the sum of these. Since the total information is minimized for the degenerate solution k^N = 1, 2, 3, ..., N, giving V(i) = 0, a penalty term is needed.
Similar problems have been studied in the context of model structure selection, and from this literature Akaike's AIC and BIC criteria have been proposed for segmentation.
The real challenge in segmentation is to cope with the curse of dimensionality. The number of segmentations k^n is 2^N (there can be either a change or no change at each time instant). Here, several strategies have been proposed:
• Numerical searches based on dynamic programming or MCMC techniques.
• Recursive local search schemes.

The main part of this chapter is devoted to the second approach, which provides a solution to adaptive filtering, which is an on-line problem.
7.2 Problem setup
7.2.1 The changing regression model

The segmentation model is based on a linear regression with piecewise constant parameters,

y_t = φ_t^T θ(i) + e_t,   Cov(e_t) = λ(i) R_t,   k_{i−1} < t ≤ k_i.   (7.2)

Here θ(i) is the d-dimensional parameter vector in segment i, φ_t is the regressor and k_i denotes the change times. The measurement vector is assumed to have dimension p. The noise e_t in (7.2) is assumed to be Gaussian with variance λ(i)R_t, where λ(i) is a possibly segment dependent scaling of the noise. We will assume R_t to be known and treat the scaling as a possibly unknown parameter. The problem is now to estimate the number of segments n and the sequence of change times, denoted k^n = (k_1, k_2, ..., k_n). Note that both the number n and the positions of the change times k_i are considered unknown.
Two important special cases of (7.2) are a changing mean model, where φ_t = 1, and an auto-regression, where φ_t = (−y_{t−1}, ..., −y_{t−d})^T.
For the analysis in Section 7.A, and for defining the prior on each segmentation, the following equivalent state space model turns out to be more convenient:

θ_{t+1} = (1 − δ_t) θ_t + δ_t u_t,
y_t = φ_t^T θ_t + e_t.   (7.3)

Here δ_t is a binary variable, which equals one when the parameter vector changes and is zero otherwise, and u_t is a sequence of unknown parameter vectors. Putting δ_t = 0 into (7.3) gives a standard regression model with constant parameters, but when δ_t = 1 the model is assigned a completely new parameter vector u_t taken at random. Thus, models (7.3) and (7.2) are equivalent. For convenience, it is assumed that k_0 = 0 and δ_0 = 1, so the first segment begins at time 1. The segmentation problem can be formulated as estimating the number of jumps n and the jump instants k^n, or alternatively the jump parameter sequence δ^N = (δ_1, ..., δ_N).
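As a concrete illustration, the following Python sketch simulates data from the changing regression model for the changing mean special case (φ_t = 1). The jump probability q, the noise level and the distribution used for the new parameter vectors are arbitrary choices made for this example, not values prescribed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200        # number of samples
q = 0.02       # jump probability (illustrative choice)
sigma = 0.5    # measurement noise standard deviation (illustrative)

theta = rng.normal(0.0, 3.0)          # initial parameter (new segment at t = 1)
y = np.zeros(N)
jump_times = []

for t in range(N):
    if t > 0 and rng.random() < q:    # delta_t = 1: draw a completely new parameter
        theta = rng.normal(0.0, 3.0)
        jump_times.append(t)
    # changing mean model: phi_t = 1, so y_t = theta + e_t
    y[t] = theta + sigma * rng.normal()

print("true change times:", jump_times)
```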
The models (7.3) and (7.2) will be referred to as changing regressions, because they change between different regression models. The most important feature of the changing regression model is that the jumps divide the measurements into a number of independent segments. This follows since the parameter vectors in the different segments are independent; they are two different samples of the stochastic process {u_t}.

A related model, studied in Andersson (1985), is a jumping regression model. The difference to the approach herein is that the changes are added to the parameter vector. In (7.3), this would mean that the parameter variation model is θ_{t+1} = θ_t + δ_t u_t. We then lose the property of independent segments, and the optimal algorithms proposed here are only sub-optimal.
7.2.2 Notation
Given a segmentation k^n, it will be useful to introduce the compact notation Y(i) for the measurements in the ith segment, that is, Y(i) = (y_{k_{i−1}+1}, ..., y_{k_i}). The least squares estimate and its covariance matrix for the ith segment are denoted θ̂(i) and P(i):
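For known R_t, one consistent choice of these definitions is the weighted least squares form below, written as a sketch that agrees with how V(i), D(i) and N(i) are used later in the chapter; the exact statement of (7.4)-(7.8) is assumed rather than quoted.

θ̂(i) = ( Σ_{t=k_{i−1}+1}^{k_i} φ_t R_t^{−1} φ_t^T )^{−1} Σ_{t=k_{i−1}+1}^{k_i} φ_t R_t^{−1} y_t,
P(i) = ( Σ_{t=k_{i−1}+1}^{k_i} φ_t R_t^{−1} φ_t^T )^{−1},

with the per-segment statistics

V(i) = Σ_{t=k_{i−1}+1}^{k_i} (y_t − φ_t^T θ̂(i))^T R_t^{−1} (y_t − φ_t^T θ̂(i)),
D(i) = −log det P(i),
N(i) = k_i − k_{i−1}.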
7.3 Statistical criteria

7.3.1 The MML estimator
The likelihood of the data y^N given all parameters is denoted p(y^N | k^n, θ^n, λ^n). We will assume independent Gaussian noise distributions, so

p(e^N) = ∏_{i=1}^{n} ∏_{t=k_{i−1}+1}^{k_i} (2πλ(i))^{−p/2} (det R_t)^{−1/2} exp(−e_t^T R_t^{−1} e_t / (2λ(i))).

Then we have

−2 log p(y^N | k^n, θ^n, λ^n) = Np log(2π) + Σ_{t=1}^{N} log det R_t + Σ_{i=1}^{n} N(i) p log λ(i) + Σ_{i=1}^{n} (1/λ(i)) Σ_{t=k_{i−1}+1}^{k_i} e_t^T R_t^{−1} e_t,   (7.9)

where e_t = y_t − φ_t^T θ(i) for t in segment i.
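A minimal numerical sketch of (7.9) for the scalar changing mean case (φ_t = 1, R_t = 1, p = 1): the function name and the simplifications are this example's own, and the segmentation k, parameters theta and scalings lam are supplied by the caller.

```python
import numpy as np

def neg2_loglik(y, k, theta, lam):
    """-2 log p(y | k^n, theta^n, lambda^n) as in (7.9), scalar case with R_t = 1."""
    N = len(y)
    val = N * np.log(2 * np.pi)          # Np log(2*pi) with p = 1
    bounds = [0] + list(k)               # k_0 = 0, k_n = N
    for i in range(len(k)):
        seg = y[bounds[i]:bounds[i + 1]]
        Ni = len(seg)
        resid = seg - theta[i]           # e_t = y_t - phi_t^T theta(i) with phi_t = 1
        val += Ni * np.log(lam[i]) + np.sum(resid ** 2) / lam[i]
        # the sum over log det R_t vanishes since R_t = 1
    return val

# usage: two segments of a length-10 signal
y = np.concatenate([np.zeros(5), 2 * np.ones(5)]) + 0.1 * np.random.default_rng(1).normal(size=10)
print(neg2_loglik(y, k=[5, 10], theta=[0.0, 2.0], lam=[0.01, 0.01]))
```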
Here and in the sequel, p is the dimension of the measurement vector y_t.
There are two ways of eliminating the nuisance parameters θ^n, λ^n, leading to the marginalized and generalized likelihoods, respectively. The latter is the standard approach, where the nuisance parameters are removed by minimization of (7.9). A relation between the two is given in Section 7.4. See Wald (1947) for a discussion on generalized and marginalized (or weighted) likelihoods.
We next investigate the use of the marginalized likelihood, where (7.9) is integrated with respect to a prior distribution of the nuisance parameters. The likelihood given only k^n is then given by

p(y^N | k^n) = ∫∫ p(y^N | k^n, θ^n, λ^n) p(θ^n | λ^n) p(λ^n) dθ^n dλ^n.   (7.10)

In this expression, the prior for θ^n, p(θ^n | λ^n), is technically a function of the noise variance scaling λ^n, but is usually chosen as an independent function. The maximum likelihood estimator is given by maximization of p(y^N | k^n). Finally, the a posteriori probabilities can be computed from Bayes' law,

p(k^n | y^N) = p(y^N | k^n) p(k^n) / p(y^N).   (7.11)
The prior p(k^n) = p(k^n | n) p(n) or, equivalently, p(δ^N) on the segmentation is a user's choice (in fact the only one). A natural and powerful possibility is to use p(δ^N) and assume a fixed probability q of a jump at each new time instant. That is, consider the jump sequence δ^N as independent Bernoulli variables δ_t ∈ Be(q), which means

δ_t = 1 with probability q,
δ_t = 0 with probability 1 − q.
It might be useful in some applications to tune the jump probability q above, because it controls the number of jumps estimated. Since there is a one-to-one correspondence between k^n and δ^N, both priors are given by

p(k^n) = p(δ^N) = q^n (1 − q)^{N−n}.   (7.12)

A q less than 0.5 penalizes a large number of segments. A non-informative prior p(k^n) = 0.5^N is obtained with q = 0.5. In this case, the MAP estimator equals the Maximum Likelihood (ML) estimator, which follows from (7.11).
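The effect of q on the prior (7.12) is easy to see numerically. The short sketch below compares the log prior of a one-segment and a five-segment hypothesis for a signal of length N = 100; the values are illustrative.

```python
import numpy as np

def log_prior(n, N, q):
    """log p(k^n) = n log q + (N - n) log(1 - q), from (7.12)."""
    return n * np.log(q) + (N - n) * np.log(1 - q)

N = 100
for q in (0.5, 0.05):
    print(f"q={q}: 1 segment -> {log_prior(1, N, q):.1f},  5 segments -> {log_prior(5, N, q):.1f}")
# For q = 0.5 the prior is flat (MAP = ML); for q < 0.5 each extra segment
# costs log((1-q)/q) in log probability.
```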
7.3.2 The a posteriori probabilities
In Appendix 7.B, the a posteriori probabilities are derived in three theorems for the three different cases of treating the measurement covariance: completely known, known except for a constant scaling, and finally known with an unknown changing scaling. The case of a completely unknown covariance matrix is not solved in the literature. These are generalizations and extensions of results for changing mean models (φ_t = 1) presented in Chapter 3; see also Smith (1975) and Lee and Heghinian (1978). Appendix 7.B also contains a discussion and motivation of the particular prior distributions used in marginalization. The different steps in the MAP estimator can be summarized as follows; see also (7.16).
Filter bank segmentation

• Examine every possible segmentation, parameterized in the number of jumps n and the jump times k^n, separately.
• For each segmentation, compute the best model in each segment, parameterized in the least squares estimates θ̂(i) and their covariance matrices P(i).
• Compute the sum of squared prediction errors V(i) and D(i) = −log det P(i) in each segment.
• The MAP estimate of the model structure for the three different assumptions on the noise scaling (known λ(i) = λ_0, unknown but constant λ(i) = λ, and finally unknown and changing λ(i)) is given in equations (7.13), (7.14) and (7.15), respectively.
Data           y_1, ..., y_{k_1}    y_{k_1+1}, ..., y_{k_2}    ...    y_{k_{n−1}+1}, ..., y_{k_n}
Segmentation   Segment 1            Segment 2                  ...    Segment n
LS estimates   θ̂(1), P(1)           θ̂(2), P(2)                 ...    θ̂(n), P(n)
Statistics     V(1), D(1)           V(2), D(2)                 ...    V(n), D(n)                    (7.16)
The required steps in computing the MAP estimated segmentation are as follows. First, every possible segmentation of the data is examined separately. For each segmentation, one model for every segment is estimated and the test statistics are computed. Finally, one of equations (7.13)-(7.15) is evaluated. In all cases, constants in the a posteriori probabilities are omitted. The difference between the three approaches is thus basically only how the sum of squared prediction errors is treated. A prior probability q causes a penalty term increasing linearly in n for q < 0.5. As noted before, q = 0.5 corresponds to ML estimation.
The derivations of (7.13) to (7.15) are valid only if all terms are well defined. The condition is that P(i) has full rank for all i, and that the denominator under V(i) is positive. That is, Np − nd − 4 > 0 in (7.14) and N(i)p − d − 4 > 0 in (7.15). The segments must therefore be forced to be long enough.
7.3.3 On the choice of priors

The Gaussian assumption on the noise is a standard one, partly because it gives analytical expressions and partly because it has proven to work well in practice. Other alternatives are rarely seen. The Laplacian distribution is shown in Wu and Fitzgerald (1995) to also give an analytical solution in the case of unknown mean models, where it was found to be less sensitive to large measurement errors.
The standard approach used here for marginalization is to consider both a Gaussian and a non-informative prior in parallel. We often give priority to a non-informative prior on θ, using a flat density function, in our aim to have as few non-intuitive design parameters as possible. That is, p(θ^n | λ^n) = C is an arbitrary constant in (7.10). The use of non-informative priors, and especially improper ones, is sometimes criticized; see Aitkin (1991) for an interesting discussion. Specifically, here the flat prior introduces an arbitrary term n log C in the log likelihood. The idea of using a flat prior, or non-informative prior, in marginalization is perhaps best explained by an example.
Example 7.1 Marginalized likelihood for variance estimation
Suppose we have t observations from a Gaussian distribution, y_t ∈ N(μ, λ), so the likelihood p(y^t | μ, λ) is Gaussian. We want to compute the likelihood conditioned on just λ using marginalization: p(y^t | λ) = ∫ p(y^t | μ, λ) p(μ) dμ. Two alternative priors are a Gaussian, μ ∈ N(μ_0, P_0), and a flat prior, p(μ) = C. In both cases, we end up with an inverse Wishart density function (3.54). For the flat prior, its maximum is

λ̂ = (1/(t − 1)) Σ_{k=1}^{t} (y_k − ȳ)^2,

where ȳ is the sample average. Note the scaling factor 1/(t − 1), which makes the estimate unbiased. The joint likelihood estimate of both mean and variance gives a variance estimator with scaling factor 1/t; the prior thus induces a bias in the estimate.

Thus, a flat prior eliminates the bias induced by the prior. We remark that the likelihood, interpreted as a conditional density function, is proper, and it does not depend upon the constant C.
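A quick numerical check of the bias statement, as a sketch with arbitrary simulation settings:

```python
import numpy as np

rng = np.random.default_rng(0)
t, lam_true, runs = 5, 4.0, 200_000

est_ml, est_marg = [], []
for _ in range(runs):
    y = rng.normal(0.0, np.sqrt(lam_true), size=t)
    s = np.sum((y - y.mean()) ** 2)
    est_ml.append(s / t)          # joint ML of (mu, lambda): biased low
    est_marg.append(s / (t - 1))  # flat-prior marginalized ML: unbiased

print(np.mean(est_ml), np.mean(est_marg))  # roughly 3.2 vs 4.0 for these settings
```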
The use of a flat prior can be motivated as follows:

• The data dependent terms in the log likelihood increase like log N. That is, whatever the choice of C, the prior dependent term will be insignificant for a large amount of data.
• The choice C ≈ 1 can be shown to give approximately the same likelihood as a proper informative Gaussian prior would give if the true parameters were known and used in the prior; see Gustafsson (1996), where an example is given.
More precisely, with the prior N(θ_0, P_0), where θ_0 is the true value of θ(i), the constant should be chosen as C = det P_0. The uncertainty about θ_0 reflected in P_0 should be much larger than the data information in P(i) if one wants the data to speak for themselves. Still, the choice of P_0 is ambiguous: the larger its value, the higher the penalty on a large number of segments. Since the true value θ_0 is not known, this discussion seems to validate the use of a flat prior with the choice C = 1, which has also been confirmed to work well by simulations. An unknown noise variance is assigned a flat prior as well, with the same pragmatic motivation.
Example 7.2 Lindley's paradox
Consider the hypothesis test

H_0 : y ∈ N(0, 1),
H_1 : y ∈ N(θ, 1),

and assume that the prior on θ is N(θ_0, P_0). Equation (5.98) gives, for scalar measurements, the likelihood ratio. Here we have N = 1, P_1 = (P_0^{−1} + 1)^{−1} and θ̂_1 = P_1(P_0^{−1}θ_0 + y). The likelihood ratio p(y | H_1)/p(y | H_0) then tends to zero as P_0 grows, since the whole expression behaves like 1/√P_0. This fact is not influenced by the number of data, the true mean, or θ_0. That is, the more non-informative the prior, the more H_0 is favored!
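The effect is easy to reproduce numerically. The sketch below uses the marginal density of y under H_1, y ∈ N(θ_0, P_0 + 1), which follows from integrating θ out against its prior; the particular values of y and θ_0 are arbitrary.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

y, theta0 = 1.5, 0.0
for P0 in (1.0, 1e2, 1e4, 1e6):
    p_h0 = normal_pdf(y, 0.0, 1.0)          # H0: y ~ N(0, 1)
    p_h1 = normal_pdf(y, theta0, P0 + 1.0)  # H1 with theta integrated out: y ~ N(theta0, P0 + 1)
    print(f"P0 = {P0:>9.0f}:  p(y|H1)/p(y|H0) = {p_h1 / p_h0:.4f}")
# The ratio decays like 1/sqrt(P0): the flatter the prior, the more H0 is favored.
```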
7.4 Information based criteria
The information based approach of this section can be called a penalized Maximum Generalized Likelihood (MGL) approach.
7.4.1 The MGL estimator

It is straightforward to show that the minimum of (7.9) with respect to θ^n, assuming a known λ(i), is attained at the least squares estimates θ̂(i); this gives the generalized likelihood (7.17), and the corresponding criterion for an unknown but constant noise scaling is (7.18). Finally, for a changing noise scaling,

MGL(k^n) = min_{θ^n, λ^n} −2 log p(y^N | k^n, θ^n, λ^n).   (7.19)
In summary, the counterparts to the MML estimates (7.13)-(7.15) are given by the generalized likelihoods (7.17)-(7.19).

7.4.2 MGL with penalty term

The MGL estimator in itself provides no trade-off between model complexity and data fit.
An attempt to satisfy the parsimony principle is to add a penalty term to the generalized likelihoods (7.17)-(7.19). A general form of suggested penalty terms is n(d + 1)γ(N), which is proportional to the number of parameters used to describe the signal (here the change time itself is counted as one parameter). Penalty terms occurring in model order selection problems can be used in this application as well, like Akaike's AIC (Akaike, 1969) or the equivalent criteria: Akaike's BIC (Akaike, 1977), Rissanen's Minimum Description Length (MDL) approach (Rissanen, 1989) and Schwartz' criterion (Schwartz, 1978). The penalty term in AIC is 2n(d + 1) and in BIC n(d + 1) log N.
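As a sketch of how such penalty terms enter, the following hypothetical helper combines a generalized likelihood value with the AIC or BIC penalty; mgl is assumed to be the value of one of (7.17)-(7.19) computed elsewhere, and the function itself is not part of the text.

```python
import numpy as np

def penalized_mgl(mgl, n, d, N, rule="BIC"):
    """Add a model complexity penalty to a generalized likelihood value.

    mgl : value of the generalized likelihood criterion for one segmentation
    n   : number of segments, d : parameters per segment, N : number of data
    """
    if rule == "AIC":
        penalty = 2 * n * (d + 1)
    elif rule == "BIC":
        penalty = n * (d + 1) * np.log(N)
    else:
        raise ValueError("rule must be 'AIC' or 'BIC'")
    return mgl + penalty
```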
AIC is proposed in Kitagawa and Akaike (1978) for auto-regressive models with a changing noise variance (one more parameter per segment), leading to the criterion (7.23),
and BIC is suggested in Yao (1988) for a changing mean model (φ_t = 1) and unknown constant noise variance, giving the criterion (7.24). The choice between these penalty terms is not commented upon in Kitagawa and Akaike (1978).
The MDL theory provides a nice interpretation of the segmentation problem: choose the segments such that the fewest possible data bits are used to describe the signal up to a certain accuracy, given that both the parameter vectors and the prediction errors are stored with finite accuracy.
Both AIC and BIC are based on an assumption of a large number of data, and their use in segmentation, where each segment could be quite short, is questioned in Kitagawa and Akaike (1978). Simulations in Djuric (1994) indicate that AIC and BIC tend to over-segment data in a simple example where marginalized ML works fine.
7.4.3 Relation to MML

A comparison of the generalized likelihoods (7.17)-(7.19) with the marginalized likelihoods (7.13)-(7.15) (assuming q = 1/2) shows that the penalty term introduced by marginalization is Σ_i D(i) in all cases. It is therefore interesting to study this term in more detail.
Asymptotically, this term behaves like the BIC penalty, and BIC gives a weakly consistent estimate of the number of change times in segmentation of changing mean models (Yao, 1988). The asymptotic link with BIC supports the use of marginalized likelihoods.
7.5 On-line local search for optimum
Computing the exact likelihood or information based estimate is computationally intractable because of the exponential complexity. This section reviews local search techniques, while the next section comments on numerical methods.

Figure 7.1 The tree of jump sequences. A path marked 0 corresponds to no jump, while 1 in the δ-parameterization of the jump sequence corresponds to a jump.
7.5.1 Local tree search
In Section 7.A, an exact pruning possibility with quadratic in time complexity is described. Here, a natural recursive (linear in time) approximate algorithm will be given. The complexity of the problem can be compared to the growing tree in Figure 7.1. The algorithm will use terminology from this analogy, like cutting, pruning and merging branches. Generally, the global maximum can be found only by searching through the whole tree. However, the following arguments indicate heuristically how the complexity can be decreased dramatically.

At time t, every branch splits into two branches, where one corresponds to a jump. Past data contain no information about what happens after a jump. Therefore, only one sequence among all those with a jump at a given time instant has to be considered, i.e. the most likely one. This is the point in the first step, after which only one new branch in the tree is started at each time instant. That is, there are only N branches left. This exploitation of a finite memory property has much in common with the famous Viterbi algorithm in equalization; see Algorithm 5.5 or the articles Viterbi (1967) and Forney (1973).
It seems to be a waste of computational power to keep updating probabilities for sequences which have been unlikely for a long time. However, one still cannot be sure that one of them will not start to grow and become the MAP estimate. The solution offered in Section 7.A is to compute a common upper bound on the a posteriori probabilities. If this bound does not exceed the MAP estimate's probability, which is normally the case, one can be sure that the true MAP estimate is found. The approximation in the following algorithm is to simply reject these sequences.

The following algorithm is a straightforward extension of Algorithm 4.1.
Algorithm 7.1 Recursive parameter segmentation
1. Choose an optimality criterion. The options are the a posteriori probabilities as in Theorem 7.3, 7.4 or 7.5, or the information criteria AIC (7.23) or BIC (7.24).

2. Compute recursively the optimality criterion using a bank of least squares estimators, each one matched to a particular segmentation.

3. Use the following rules for maintaining the hypotheses and keeping the number of considered sequences (M) fixed:
   a) Let only the most probable sequence split.
   b) Cut off the least probable sequence, so only M are left.
   c) Assume a minimum segment length: let the most probable sequence split only if it is not too young. A suitable default value is 0.
   d) Assure that sequences are not cut off immediately after they are born: cut off the least probable sequences among those that are older than a certain minimum lifelength, until only M are left. This should mostly be chosen as large as possible.
The last two restrictions are important for performance. A useful rule when tuning the local search parameters is to simulate the signal without noise.
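A minimal sketch of the recursive search for the scalar changing mean case with known noise variance follows. The criterion used is the sum of squared residuals plus the jump penalty 2 log((1−q)/q) per jump, consistent with the prior (7.12); the function name, the pruning constants and the omission of rule d) (the minimum lifelength) are illustrative simplifications, not the book's exact equations (7.13)-(7.15).

```python
import numpy as np

def segment_online(y, lam=1.0, q=0.1, M=10, min_seg_len=3):
    """Recursive segmentation of a scalar signal with piecewise constant mean.

    Each hypothesis tracks its jump times, the accumulated criterion of its
    closed segments, and running sums for its current (open) segment.
    """
    jump_penalty = 2.0 * np.log((1.0 - q) / q)
    hyps = [dict(jumps=[], crit=0.0, n=0, s=0.0, s2=0.0)]

    def seg_cost(h):
        # sum of squared residuals of the open segment around its LS mean
        if h["n"] == 0:
            return 0.0
        return (h["s2"] - h["s"] ** 2 / h["n"]) / lam

    for t, yt in enumerate(y):
        # rule a) and c): only the most probable, sufficiently old sequence splits
        hyps.sort(key=lambda h: h["crit"] + seg_cost(h))
        best = hyps[0]
        new_hyps = []
        if best["n"] >= min_seg_len:
            new_hyps.append(dict(jumps=best["jumps"] + [t],
                                 crit=best["crit"] + seg_cost(best) + jump_penalty,
                                 n=0, s=0.0, s2=0.0))
        # measurement update of every hypothesis' open segment
        for h in hyps + new_hyps:
            h["n"] += 1
            h["s"] += yt
            h["s2"] += yt ** 2
        hyps = hyps + new_hyps
        # rule b): cut off the least probable sequences, keeping only M
        hyps.sort(key=lambda h: h["crit"] + seg_cost(h))
        hyps = hyps[:M]

    return hyps[0]["jumps"]

# usage on a signal with a single mean change at t = 50
rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(50), 3 * np.ones(50)]) + rng.normal(scale=0.5, size=100)
print(segment_online(y, lam=0.25, q=0.05))   # expected output close to [50]
```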
The output of the algorithm at time t is the parameter estimate of the most probable sequence, or possibly a weighted sum of all estimates. However, it should be pointed out that the fixed interval smoothing estimate is readily available by back-tracking the history of the most probable sequence, as can be realized from (7.16). Algorithm 7.1 is similar to the one proposed in Andersson (1985); that algorithm is, however, ad hoc, and works only for the case of known noise.
Section 4.3 contains some illustrative examples, while Section 7.7 uses the algorithm in a number of applications.