7 Change detection based on filter banks
7.1 Basics
7.2 Problem setup
7.2.1 The changing regression model
7.2.2 Notation
7.3 Statistical criteria
7.3.1 The MML estimator
7.3.2 The a posteriori probabilities
7.3.3 On the choice of priors
7.4 Information based criteria
7.4.1 The MGL estimator
7.4.2 MGL with penalty term
7.4.3 Relation to MML
7.5 On-line local search for optimum
7.5.1 Local tree search
7.5.2 Design parameters
7.6 Off-line global search for optimum
7.7 Applications
7.7.1 Storing EKG signals
7.7.2 Speech segmentation
7.7.3 Segmentation of a car's driven path
7.A Two inequalities for likelihoods
7.A.1 The first inequality
7.A.2 The second inequality
7.A.3 The exact pruning algorithm
7.B The posterior probabilities of a jump sequence
7.B.1 Main theorems
7.1 Basics
Let us start by considering change detection in linear regressions as an off-line problem, which will be referred to as segmentation. The goal is to find a
sequence of time indices k^n = (k_1, k_2, ..., k_n), where both the number n and the locations k_i are unknown, such that a linear regression model with piecewise constant parameters,

y_t = φ_t^T θ(i) + e_t,   Cov(e_t) = λ(i) R_t,   k_{i−1} < t ≤ k_i,   (7.1)

is a good description of the observed signal y_t. In this chapter, the measurements may be vector valued, the nominal covariance matrix of the noise is R_t, and λ(i) is a possibly unknown scaling, which is piecewise constant. One way to guarantee that the best possible solution is found is to consider all possible segmentations k^n, estimate one linear regression model in each segment, and then choose the particular k^n, over n ≥ 1 and 0 < k_1 < ··· < k_n = N, that minimizes an optimality criterion.
The procedure, and, as it turns out, sufficient statistics as defined in (7.6)-(7.8), are illustrated below. What is needed from each data segment is the sufficient statistics V (the sum of squared residuals), D (minus the log determinant of the parameter covariance matrix) and the number of data N in each segment, as defined in equations (7.6), (7.7) and (7.8). The segmentation k^n has n − 1 degrees of freedom.
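To make the brute force formulation concrete, the following sketch enumerates all segmentations of a short scalar signal (changing mean, φ_t = 1, known noise variance) and scores each with the sum of squared residuals plus a simple per-segment penalty. The penalty of 2 log N per change time and the function name are illustrative choices standing in for the statistical and information based criteria discussed below.

```python
import numpy as np
from itertools import combinations

def best_segmentation(y, penalty):
    """Exhaustive search over all segmentations k^n of a scalar signal."""
    N = len(y)
    best_crit, best_k = np.inf, None
    # choose the interior change times k_1 < ... < k_{n-1}; k_n = N is fixed
    for n_changes in range(N):
        for interior in combinations(range(1, N), n_changes):
            k = list(interior) + [N]
            bounds = [0] + k
            V = sum(np.sum((y[a:b] - np.mean(y[a:b])) ** 2)   # V(i) per segment
                    for a, b in zip(bounds[:-1], bounds[1:]))
            crit = V + penalty * n_changes
            if crit < best_crit:
                best_crit, best_k = crit, k
    return best_k

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(6), 2 * np.ones(6)]) + 0.3 * rng.normal(size=12)
print(best_segmentation(y, penalty=2 * np.log(len(y))))   # expect roughly [6, 12]
```

Even for this toy signal the search visits 2^(N−1) segmentations, which is exactly the complexity problem addressed later in the chapter.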
Two types of optimality criteria have been proposed:
• Statistical criterion: the maximum likelihood or maximum a posteriori estimate of k^n is studied.
• Information based criterion: the information of data in each segment is V(i) (the sum of squared residuals), and the total information is the sum of these. Since the total information is minimized for the degenerate solution k^N = 1, 2, 3, ..., N, giving V(i) = 0, a penalty term is needed.
Similar problems have been studied in the context of model structure selection, and from this literature Akaike's AIC and BIC criteria have been proposed for segmentation.
The real challenge in segmentation is to cope with the curse of dimensionality. The number of segmentations k^n is 2^N (there can be either a change or no change at each time instant). Here, several strategies have been proposed:
• Numerical searches based on dynamic programming or MCMC techniques.
• Recursive local search schemes.

The main part of this chapter is devoted to the second approach, which provides a solution to adaptive filtering, which is an on-line problem.
7.2 Problem setup
7.2.1 The changing regression model

The segmentation model is based on a linear regression with piecewise constant parameters,

y_t = φ_t^T θ(i) + e_t,   Cov(e_t) = λ(i) R_t,   k_{i−1} < t ≤ k_i.   (7.2)

Here θ(i) is the d-dimensional parameter vector in segment i, φ_t is the regressor and k_i denotes the change times. The measurement vector is assumed to have dimension p. The noise e_t in (7.2) is assumed to be Gaussian with variance λ(i)R_t, where λ(i) is a possibly segment dependent scaling of the noise. We will assume R_t to be known and treat the scaling as a possibly unknown parameter. The problem is now to estimate the number of segments n and the sequence of change times, denoted k^n = (k_1, k_2, ..., k_n). Note that both the number n and the positions of the change times k_i are considered unknown.
Two important special cases of (7.2) are a changing mean model, where φ_t = 1, and an auto-regression, where φ_t = (−y_{t−1}, ..., −y_{t−d})^T.
For the analysis in Section 7.A, and for defining the prior on each segmentation, the following equivalent state space model turns out to be more convenient:

θ_{t+1} = (1 − δ_t) θ_t + δ_t u_t,
y_t = φ_t^T θ_t + e_t.   (7.3)

Here δ_t is a binary variable, which equals one when the parameter vector changes and is zero otherwise, and u_t is a sequence of unknown parameter vectors. Putting δ_t = 0 into (7.3) gives a standard regression model with constant parameters, but when δ_t = 1 the model is assigned a completely new parameter vector u_t taken at random. Thus, models (7.3) and (7.2) are equivalent. For convenience, it is assumed that k_0 = 0 and δ_0 = 1, so the first segment begins at time 1. The segmentation problem can be formulated as estimating the number of jumps n and the jump instants k^n, or alternatively the jump parameter sequence δ^N = (δ_1, ..., δ_N).
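As a concrete illustration, the following Python sketch simulates data from the changing regression model for the changing mean special case (φ_t = 1). The jump probability q, the noise level and the distribution used for the new parameter vectors are arbitrary choices made for this example, not values prescribed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200        # number of samples
q = 0.02       # jump probability (illustrative choice)
sigma = 0.5    # measurement noise standard deviation (illustrative)

theta = rng.normal(0.0, 3.0)          # initial parameter (new segment at t = 1)
y = np.zeros(N)
jump_times = []

for t in range(N):
    if t > 0 and rng.random() < q:    # delta_t = 1: draw a completely new parameter
        theta = rng.normal(0.0, 3.0)
        jump_times.append(t)
    # changing mean model: phi_t = 1, so y_t = theta + e_t
    y[t] = theta + sigma * rng.normal()

print("true change times:", jump_times)
```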
The models (7.3) and (7.2) will be referred to as changing regressions, because they change between different regression models. The most important feature of the changing regression model is that the jumps divide the measurements into a number of independent segments. This follows since the parameter vectors in the different segments are independent; they are two different samples of the stochastic process {u_t}.

A related model, studied in Andersson (1985), is a jumping regression model. The difference to the approach herein is that the changes are added to the parameter vector. In (7.3), this would mean that the parameter variation model is θ_{t+1} = θ_t + δ_t u_t. We then lose the property of independent segments, and the optimal algorithms proposed here are only sub-optimal.
7.2.2 Notation
Given a segmentation k^n, it will be useful to introduce the compact notation Y(i) for the measurements in the ith segment, that is, Y(i) = (y_{k_{i−1}+1}, ..., y_{k_i}). The least squares estimate and its covariance matrix for the ith segment are denoted θ̂(i) and P(i):
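For known R_t, one consistent choice of these definitions is the weighted least squares form below, written as a sketch that agrees with how V(i), D(i) and N(i) are used later in the chapter; the exact statement of (7.4)-(7.8) is assumed rather than quoted.

θ̂(i) = ( Σ_{t=k_{i−1}+1}^{k_i} φ_t R_t^{−1} φ_t^T )^{−1} Σ_{t=k_{i−1}+1}^{k_i} φ_t R_t^{−1} y_t,
P(i) = ( Σ_{t=k_{i−1}+1}^{k_i} φ_t R_t^{−1} φ_t^T )^{−1},

with the per-segment statistics

V(i) = Σ_{t=k_{i−1}+1}^{k_i} (y_t − φ_t^T θ̂(i))^T R_t^{−1} (y_t − φ_t^T θ̂(i)),
D(i) = −log det P(i),
N(i) = k_i − k_{i−1}.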
7.3 Statistical criteria

7.3.1 The MML estimator
The likelihood of the data y^N given all parameters is denoted p(y^N | k^n, θ^n, λ^n). We will assume independent Gaussian noise distributions, so

p(e^N) = ∏_{i=1}^{n} ∏_{t=k_{i−1}+1}^{k_i} (2πλ(i))^{−p/2} (det R_t)^{−1/2} exp(−e_t^T R_t^{−1} e_t / (2λ(i))).

Then we have

−2 log p(y^N | k^n, θ^n, λ^n) = Np log(2π) + Σ_{t=1}^{N} log det R_t + Σ_{i=1}^{n} N(i) p log λ(i) + Σ_{i=1}^{n} (1/λ(i)) Σ_{t=k_{i−1}+1}^{k_i} e_t^T R_t^{−1} e_t,   (7.9)

where e_t = y_t − φ_t^T θ(i) for t in segment i.
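A minimal numerical sketch of (7.9) for the scalar changing mean case (φ_t = 1, R_t = 1, p = 1): the function name and the simplifications are this example's own, and the segmentation k, parameters theta and scalings lam are supplied by the caller.

```python
import numpy as np

def neg2_loglik(y, k, theta, lam):
    """-2 log p(y | k^n, theta^n, lambda^n) as in (7.9), scalar case with R_t = 1."""
    N = len(y)
    val = N * np.log(2 * np.pi)          # Np log(2*pi) with p = 1
    bounds = [0] + list(k)               # k_0 = 0, k_n = N
    for i in range(len(k)):
        seg = y[bounds[i]:bounds[i + 1]]
        Ni = len(seg)
        resid = seg - theta[i]           # e_t = y_t - phi_t^T theta(i) with phi_t = 1
        val += Ni * np.log(lam[i]) + np.sum(resid ** 2) / lam[i]
        # the sum over log det R_t vanishes since R_t = 1
    return val

# usage: two segments of a length-10 signal
y = np.concatenate([np.zeros(5), 2 * np.ones(5)]) + 0.1 * np.random.default_rng(1).normal(size=10)
print(neg2_loglik(y, k=[5, 10], theta=[0.0, 2.0], lam=[0.01, 0.01]))
```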
Here and in the sequel, p is the dimension of the measurement vector y_t.
There are two ways of eliminating the nuisance parameters θ^n, λ^n, leading to the marginalized and generalized likelihoods, respectively. The latter is the standard approach, where the nuisance parameters are removed by minimization of (7.9). A relation between the two is given in Section 7.4. See Wald (1947) for a discussion on generalized and marginalized (or weighted) likelihoods.
We next investigate the use of the marginalized likelihood, where (7.9) is integrated with respect to a prior distribution of the nuisance parameters. The likelihood given only k^n is then given by

p(y^N | k^n) = ∫∫ p(y^N | k^n, θ^n, λ^n) p(θ^n | λ^n) p(λ^n) dθ^n dλ^n.   (7.10)

In this expression, the prior for θ^n, p(θ^n | λ^n), is technically a function of the noise variance scaling λ^n, but is usually chosen as an independent function. The maximum likelihood estimator is given by maximization of p(y^N | k^n). Finally, the a posteriori probabilities can be computed from Bayes' law,

p(k^n | y^N) = p(y^N | k^n) p(k^n) / p(y^N).   (7.11)
The prior p(k^n) = p(k^n | n) p(n) or, equivalently, p(δ^N) on the segmentation is a user's choice (in fact the only one). A natural and powerful possibility is to use p(δ^N) and assume a fixed probability q of a jump at each new time instant. That is, consider the jump sequence δ^N as independent Bernoulli variables δ_t ∈ Be(q), which means

δ_t = 1 with probability q,
δ_t = 0 with probability 1 − q.
It might be useful in some applications to tune the jump probability q above, because it controls the number of jumps estimated. Since there is a one-to-one correspondence between k^n and δ^N, both priors are given by

p(k^n) = p(δ^N) = q^n (1 − q)^{N−n}.   (7.12)

A q less than 0.5 penalizes a large number of segments. A non-informative prior p(k^n) = 0.5^N is obtained with q = 0.5. In this case, the MAP estimator equals the Maximum Likelihood (ML) estimator, which follows from (7.11).
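The effect of q on the prior (7.12) is easy to see numerically. The short sketch below compares the log prior of a one-segment and a five-segment hypothesis for a signal of length N = 100; the values are illustrative.

```python
import numpy as np

def log_prior(n, N, q):
    """log p(k^n) = n log q + (N - n) log(1 - q), from (7.12)."""
    return n * np.log(q) + (N - n) * np.log(1 - q)

N = 100
for q in (0.5, 0.05):
    print(f"q={q}: 1 segment -> {log_prior(1, N, q):.1f},  5 segments -> {log_prior(5, N, q):.1f}")
# For q = 0.5 the prior is flat (MAP = ML); for q < 0.5 each extra segment
# costs log((1-q)/q) in log probability.
```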
7.3.2 The a posteriori probabilities
In Appendix 7.B, the a posteriori probabilities are derived in three theorems for the three different cases of treating the measurement covariance: completely known, known except for a constant scaling, and finally known with an unknown changing scaling. The case of a completely unknown covariance matrix is not solved in the literature. These are generalizations and extensions of results for changing mean models (φ_t = 1) presented in Chapter 3; see also Smith (1975) and Lee and Heghinian (1978). Appendix 7.B also contains a discussion and motivation of the particular prior distributions used in marginalization. The different steps in the MAP estimator can be summarized as follows; see also (7.16).
Filter bank segmentation

• Examine every possible segmentation, parameterized in the number of jumps n and the jump times k^n, separately.
• For each segmentation, compute the best model in each segment, parameterized in the least squares estimates θ̂(i) and their covariance matrices P(i).
• Compute the sum of squared prediction errors V(i) and D(i) = −log det P(i) in each segment.
• The MAP estimate of the model structure for the three different assumptions on the noise scaling (known λ(i) = λ_0, unknown but constant λ(i) = λ, and finally unknown and changing λ(i)) is given in equations (7.13), (7.14) and (7.15), respectively.
Data           y_1, ..., y_{k_1}    y_{k_1+1}, ..., y_{k_2}    ...    y_{k_{n−1}+1}, ..., y_{k_n}
Segmentation   Segment 1            Segment 2                  ...    Segment n
LS estimates   θ̂(1), P(1)           θ̂(2), P(2)                 ...    θ̂(n), P(n)
Statistics     V(1), D(1)           V(2), D(2)                 ...    V(n), D(n)                    (7.16)
The required steps in computing the MAP estimated segmentation are as follows. First, every possible segmentation of the data is examined separately. For each segmentation, one model for every segment is estimated and the test statistics are computed. Finally, one of equations (7.13)-(7.15) is evaluated. In all cases, constants in the a posteriori probabilities are omitted. The difference between the three approaches is thus basically only how the sum of squared prediction errors is treated. A prior probability q causes a penalty term increasing linearly in n for q < 0.5. As noted before, q = 0.5 corresponds to ML estimation.
The derivations of (7.13) to (7.15) are valid only if all terms are well defined. The condition is that P(i) has full rank for all i, and that the denominator under V(i) is positive. That is, Np − nd − 4 > 0 in (7.14) and N(i)p − d − 4 > 0 in (7.15). The segments must therefore be forced to be long enough.
7.3.3 On the choice of priors

The Gaussian assumption on the noise is a standard one, partly because it gives analytical expressions and partly because it has proven to work well in practice. Other alternatives are rarely seen. The Laplacian distribution is shown in Wu and Fitzgerald (1995) to also give an analytical solution in the case of unknown mean models, where it was found to be less sensitive to large measurement errors.
The standard approach used here for marginalization is to consider both a Gaussian and a non-informative prior in parallel. We often give priority to a non-informative prior on θ, using a flat density function, in our aim to have as few non-intuitive design parameters as possible. That is, p(θ^n | λ^n) = C is an arbitrary constant in (7.10). The use of non-informative priors, and especially improper ones, is sometimes criticized; see Aitkin (1991) for an interesting discussion. Specifically, here the flat prior introduces an arbitrary term n log C in the log likelihood. The idea of using a flat prior, or non-informative prior, in marginalization is perhaps best explained by an example.
Example 7.1 Marginalized likelihood for variance estimation
Suppose we have t observations from a Gaussian distribution, y_t ∈ N(μ, λ), so the likelihood p(y^t | μ, λ) is Gaussian. We want to compute the likelihood conditioned on just λ using marginalization: p(y^t | λ) = ∫ p(y^t | μ, λ) p(μ) dμ. Two alternative priors are a Gaussian, μ ∈ N(μ_0, P_0), and a flat prior, p(μ) = C. In both cases, we end up with an inverse Wishart density function (3.54). For the flat prior, its maximum is

λ̂ = (1/(t − 1)) Σ_{k=1}^{t} (y_k − ȳ)^2,

where ȳ is the sample average. Note the scaling factor 1/(t − 1), which makes the estimate unbiased. The joint likelihood estimate of both mean and variance gives a variance estimator with scaling factor 1/t; the prior thus induces a bias in the estimate.

Thus, a flat prior eliminates the bias induced by the prior. We remark that the likelihood, interpreted as a conditional density function, is proper, and it does not depend upon the constant C.
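A quick numerical check of the bias statement, as a sketch with arbitrary simulation settings:

```python
import numpy as np

rng = np.random.default_rng(0)
t, lam_true, runs = 5, 4.0, 200_000

est_ml, est_marg = [], []
for _ in range(runs):
    y = rng.normal(0.0, np.sqrt(lam_true), size=t)
    s = np.sum((y - y.mean()) ** 2)
    est_ml.append(s / t)          # joint ML of (mu, lambda): biased low
    est_marg.append(s / (t - 1))  # flat-prior marginalized ML: unbiased

print(np.mean(est_ml), np.mean(est_marg))  # roughly 3.2 vs 4.0 for these settings
```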
The use of a flat prior can be motivated as follows:

• The data dependent terms in the log likelihood increase like log N. That is, whatever the choice of C, the prior dependent term will be insignificant for a large amount of data.
• The choice C ≈ 1 can be shown to give approximately the same likelihood as a proper informative Gaussian prior would give if the true parameters were known and used in the prior; see Gustafsson (1996), where an example is given.
More precisely, with the prior N(θ_0, P_0), where θ_0 is the true value of θ(i), the constant should be chosen as C = det P_0. The uncertainty about θ_0 reflected in P_0 should be much larger than the data information in P(i) if one wants the data to speak for themselves. Still, the choice of P_0 is ambiguous: the larger its value, the higher the penalty on a large number of segments. Since the true value θ_0 is not known, this discussion seems to validate the use of a flat prior with the choice C = 1, which has also been confirmed to work well by simulations. An unknown noise variance is assigned a flat prior as well, with the same pragmatic motivation.
Example 7.2 Lindley's paradox
Consider the hypothesis test

H_0 : y ∈ N(0, 1),
H_1 : y ∈ N(θ, 1),

and assume that the prior on θ is N(θ_0, P_0). Equation (5.98) gives, for scalar measurements, the likelihood ratio. Here we have N = 1, P_1 = (P_0^{−1} + 1)^{−1} and θ̂_1 = P_1(P_0^{−1}θ_0 + y). The likelihood ratio p(y | H_1)/p(y | H_0) then tends to zero as P_0 grows, since the whole expression behaves like 1/√P_0. This fact is not influenced by the number of data, the true mean, or θ_0. That is, the more non-informative the prior, the more H_0 is favored!
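The effect is easy to reproduce numerically. The sketch below uses the marginal density of y under H_1, y ∈ N(θ_0, P_0 + 1), which follows from integrating θ out against its prior; the particular values of y and θ_0 are arbitrary.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

y, theta0 = 1.5, 0.0
for P0 in (1.0, 1e2, 1e4, 1e6):
    p_h0 = normal_pdf(y, 0.0, 1.0)          # H0: y ~ N(0, 1)
    p_h1 = normal_pdf(y, theta0, P0 + 1.0)  # H1 with theta integrated out: y ~ N(theta0, P0 + 1)
    print(f"P0 = {P0:>9.0f}:  p(y|H1)/p(y|H0) = {p_h1 / p_h0:.4f}")
# The ratio decays like 1/sqrt(P0): the flatter the prior, the more H0 is favored.
```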
7.4 Information based criteria
The information based approach of this section can be called a penalized Maximum Generalized Likelihood (MGL) approach.
7.4.1 The MGL estimator

It is straightforward to show that the minimum of (7.9) with respect to θ^n, assuming a known λ(i), is attained at the least squares estimates θ̂(i); this gives the generalized likelihood (7.17), and the corresponding criterion for an unknown but constant noise scaling is (7.18). Finally, for a changing noise scaling,

MGL(k^n) = min_{θ^n, λ^n} −2 log p(y^N | k^n, θ^n, λ^n).   (7.19)
In summary, the counterparts to the MML estimates (7.13)-(7.15) are given by the generalized likelihoods (7.17)-(7.19).

7.4.2 MGL with penalty term

The MGL estimator in itself provides no trade-off between model complexity and data fit.
An attempt to satisfy the parsimony principle is to add a penalty term to the generalized likelihoods (7.17)-(7.19). A general form of suggested penalty terms is n(d + 1)γ(N), which is proportional to the number of parameters used to describe the signal (here the change time itself is counted as one parameter). Penalty terms occurring in model order selection problems can be used in this application as well, like Akaike's AIC (Akaike, 1969) or the equivalent criteria: Akaike's BIC (Akaike, 1977), Rissanen's Minimum Description Length (MDL) approach (Rissanen, 1989) and Schwartz' criterion (Schwartz, 1978). The penalty term in AIC is 2n(d + 1) and in BIC n(d + 1) log N.
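As a sketch of how such penalty terms enter, the following hypothetical helper combines a generalized likelihood value with the AIC or BIC penalty; mgl is assumed to be the value of one of (7.17)-(7.19) computed elsewhere, and the function itself is not part of the text.

```python
import numpy as np

def penalized_mgl(mgl, n, d, N, rule="BIC"):
    """Add a model complexity penalty to a generalized likelihood value.

    mgl : value of the generalized likelihood criterion for one segmentation
    n   : number of segments, d : parameters per segment, N : number of data
    """
    if rule == "AIC":
        penalty = 2 * n * (d + 1)
    elif rule == "BIC":
        penalty = n * (d + 1) * np.log(N)
    else:
        raise ValueError("rule must be 'AIC' or 'BIC'")
    return mgl + penalty
```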
AIC is proposed in Kitagawa and Akaike (1978) for auto-regressive models with a changing noise variance (one more parameter per segment), leading to the criterion (7.23),
and BIC is suggested in Yao (1988) for a changing mean model (φ_t = 1) and unknown constant noise variance, giving the criterion (7.24). The choice between these penalty terms is not commented upon in Kitagawa and Akaike (1978).
The MDL theory provides a nice interpretation of the segmentation problem: choose the segments such that the fewest possible data bits are used to describe the signal up to a certain accuracy, given that both the parameter vectors and the prediction errors are stored with finite accuracy.
Both AIC and BIC are based on an assumption of a large number of data, and their use in segmentation, where each segment could be quite short, is questioned in Kitagawa and Akaike (1978). Simulations in Djuric (1994) indicate that AIC and BIC tend to over-segment data in a simple example where marginalized ML works fine.
7.4.3 Relation to MML

A comparison of the generalized likelihoods (7.17)-(7.19) with the marginalized likelihoods (7.13)-(7.15) (assuming q = 1/2) shows that the penalty term introduced by marginalization is Σ_i D(i) in all cases. It is therefore interesting to study this term in more detail.
Asymptotically, this term behaves like the BIC penalty, and BIC gives a weakly consistent estimate of the number of change times in segmentation of changing mean models (Yao, 1988). The asymptotic link with BIC supports the use of marginalized likelihoods.
7.5 On-line local search for optimum
Computing the exact likelihood or information based estimate is computationally intractable because of the exponential complexity. This section reviews local search techniques, while the next section comments on numerical methods.

Figure 7.1 The tree of jump sequences. A path marked 0 corresponds to no jump, while 1 in the δ-parameterization of the jump sequence corresponds to a jump.
7.5.1 Local tree search
In Section 7.A, an exact pruning possibility with quadratic in time complexity is described. Here, a natural recursive (linear in time) approximate algorithm will be given. The complexity of the problem can be compared to the growing tree in Figure 7.1. The algorithm will use terminology from this analogy, like cutting, pruning and merging branches. Generally, the global maximum can be found only by searching through the whole tree. However, the following arguments indicate heuristically how the complexity can be decreased dramatically.

At time t, every branch splits into two branches, where one corresponds to a jump. Past data contain no information about what happens after a jump. Therefore, only one sequence among all those with a jump at a given time instant has to be considered, i.e. the most likely one. This is the point in the first step, after which only one new branch in the tree is started at each time instant. That is, there are only N branches left. This exploitation of a finite memory property has much in common with the famous Viterbi algorithm in equalization; see Algorithm 5.5 or the articles Viterbi (1967) and Forney (1973).
It seems to be a waste of computational power to keep updating probabilities for sequences which have been unlikely for a long time. However, one still cannot be sure that one of them will not start to grow and become the MAP estimate. The solution offered in Section 7.A is to compute a common upper bound on the a posteriori probabilities. If this bound does not exceed the MAP estimate's probability, which is normally the case, one can be sure that the true MAP estimate is found. The approximation in the following algorithm is to simply reject these sequences.

The following algorithm is a straightforward extension of Algorithm 4.1.
Algorithm 7.1 Recursive parameter segmentation
1. Choose an optimality criterion. The options are the a posteriori probabilities as in Theorem 7.3, 7.4 or 7.5, or the information criteria AIC (7.23) or BIC (7.24).

2. Compute recursively the optimality criterion using a bank of least squares estimators, each one matched to a particular segmentation.

3. Use the following rules for maintaining the hypotheses and keeping the number of considered sequences (M) fixed:
   a) Let only the most probable sequence split.
   b) Cut off the least probable sequence, so only M are left.
   c) Assume a minimum segment length: let the most probable sequence split only if it is not too young. A suitable default value is 0.
   d) Assure that sequences are not cut off immediately after they are born: cut off the least probable sequences among those that are older than a certain minimum lifelength, until only M are left. This should mostly be chosen as large as possible.
The last two restrictions are important for performance. A useful rule when tuning the local search parameters is to simulate the signal without noise.
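A minimal sketch of the recursive search for the scalar changing mean case with known noise variance follows. The criterion used is the sum of squared residuals plus the jump penalty 2 log((1−q)/q) per jump, consistent with the prior (7.12); the function name, the pruning constants and the omission of rule d) (the minimum lifelength) are illustrative simplifications, not the book's exact equations (7.13)-(7.15).

```python
import numpy as np

def segment_online(y, lam=1.0, q=0.1, M=10, min_seg_len=3):
    """Recursive segmentation of a scalar signal with piecewise constant mean.

    Each hypothesis tracks its jump times, the accumulated criterion of its
    closed segments, and running sums for its current (open) segment.
    """
    jump_penalty = 2.0 * np.log((1.0 - q) / q)
    hyps = [dict(jumps=[], crit=0.0, n=0, s=0.0, s2=0.0)]

    def seg_cost(h):
        # sum of squared residuals of the open segment around its LS mean
        if h["n"] == 0:
            return 0.0
        return (h["s2"] - h["s"] ** 2 / h["n"]) / lam

    for t, yt in enumerate(y):
        # rule a) and c): only the most probable, sufficiently old sequence splits
        hyps.sort(key=lambda h: h["crit"] + seg_cost(h))
        best = hyps[0]
        new_hyps = []
        if best["n"] >= min_seg_len:
            new_hyps.append(dict(jumps=best["jumps"] + [t],
                                 crit=best["crit"] + seg_cost(best) + jump_penalty,
                                 n=0, s=0.0, s2=0.0))
        # measurement update of every hypothesis' open segment
        for h in hyps + new_hyps:
            h["n"] += 1
            h["s"] += yt
            h["s2"] += yt ** 2
        hyps = hyps + new_hyps
        # rule b): cut off the least probable sequences, keeping only M
        hyps.sort(key=lambda h: h["crit"] + seg_cost(h))
        hyps = hyps[:M]

    return hyps[0]["jumps"]

# usage on a signal with a single mean change at t = 50
rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(50), 3 * np.ones(50)]) + rng.normal(scale=0.5, size=100)
print(segment_online(y, lam=0.25, q=0.05))   # expected output close to [50]
```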
The output of the algorithm at time t is the parameter estimate of the most probable sequence, or possibly a weighted sum of all estimates. However, it should be pointed out that the fixed interval smoothing estimate is readily available by back-tracking the history of the most probable sequence, as can be realized from (7.16). Algorithm 7.1 is similar to the one proposed in Andersson (1985); that algorithm is, however, ad hoc, and works only for the case of known noise.
Section 4.3 contains some illustrative examples, while Section 7.7 uses the algorithm in a number of applications.