Suppose that X can be generated via the composition method. Thus, we assume that there exists a random variable Y taking values in {1, …, m}, say, with known probabilities {p_i, i = 1, …, m}, and we assume that it is easy to sample from the conditional distribution of X given Y. The events {Y = i}, i = 1, …, m form disjoint subregions, or strata (singular: stratum), of the sample space Ω, hence the name stratification. Using the conditioning formula (1.11), we can write

$$\ell = \mathbb{E}[H(\mathbf{X})] = \sum_{i=1}^{m} p_i \, \mathbb{E}[H(\mathbf{X}) \mid Y = i] ,$$

which suggests the stratified sampling estimator

$$\hat{\ell}^{\,s} = \sum_{i=1}^{m} p_i \, \frac{1}{N_i} \sum_{j=1}^{N_i} H(\mathbf{X}_{ij}) ,$$

where X_{ij} is the j-th observation from the conditional distribution of X given Y = i. Here N_i is the sample size assigned to the i-th stratum. The variance of the stratified sampling estimator is given by

$$\operatorname{Var}\big(\hat{\ell}^{\,s}\big) = \sum_{i=1}^{m} \frac{p_i^2 \, \sigma_i^2}{N_i} , \tag{5.35}$$

where σ_i² = Var(H(X) | Y = i).
How the strata should be chosen depends very much on the problem at hand. However, for a given particular choice of the strata, the sample sizes {N_i} can be obtained in an optimal manner, as given in the next theorem.
Theorem 5.5.1 (Stratified Sampling) Assuming that a maximum number of N samples can be collected, that is, ∑_{i=1}^m N_i = N, the optimal value of N_i is given by

$$N_i^* = N \, \frac{p_i \sigma_i}{\sum_{j=1}^{m} p_j \sigma_j} , \tag{5.36}$$

which gives a minimal variance of

$$\operatorname{Var}\big(\hat{\ell}^{\,s}\big) = \frac{1}{N} \left( \sum_{i=1}^{m} p_i \sigma_i \right)^{2} . \tag{5.37}$$
Proof: The theorem is straightforwardly proved using Lagrange multipliers and is left as an exercise.
Theorem 5.5.1 asserts that the minimal variance of ℓ̂^s is attained for sample sizes N_i that are proportional to p_i σ_i. A difficulty is that although the probabilities p_i are assumed to be known, the standard deviations {σ_i} are usually unknown. In practice, one would estimate the {σ_i} from "pilot" runs and then proceed to estimate the optimal sample sizes, N_i^*, from (5.36).

A simple stratification procedure, which can achieve variance reduction without requiring prior knowledge of σ_i² and H(X), is presented next.
Proposition 5.5.1 Let the sample sizes N_i be proportional to p_i, that is, N_i = p_i N, i = 1, …, m. Then

$$\operatorname{Var}\big(\hat{\ell}^{\,s}\big) \leqslant \operatorname{Var}\big(\hat{\ell}\big) ,$$

where ℓ̂ denotes the CMC estimator.

Proof: Substituting N_i = p_i N in (5.35) yields Var(ℓ̂^s) = (1/N) ∑_i p_i σ_i². The result now follows from

$$N \operatorname{Var}\big(\hat{\ell}\big) = \operatorname{Var}(H(\mathbf{X})) \geqslant \mathbb{E}\big[\operatorname{Var}(H(\mathbf{X}) \mid Y)\big] = \sum_{i=1}^{m} p_i \sigma_i^2 = N \operatorname{Var}\big(\hat{\ell}^{\,s}\big) ,$$

where the inequality uses the decomposition Var(H(X)) = E[Var(H(X) | Y)] + Var(E[H(X) | Y]).
Proposition 5.5.1 states that the stratified estimator with proportional allocation is more accurate than the CMC estimator. It effects stratification by favoring those events {Y = i} whose probabilities p_i are largest. Intuitively, this cannot, in general, be an optimal assignment, since information on σ_i² and H(X) is not used.
5.6 IMPORTANCE SAMPLING

Suppose we want to estimate

$$\ell = \mathbb{E}_f[H(\mathbf{X})] = \int H(\mathbf{x}) \, f(\mathbf{x}) \, d\mathbf{x} , \tag{5.39}$$

where H is the sample performance and f is the probability density of X. For reasons that will become clear shortly, we add a subscript f to the expectation to indicate that it is taken with respect to the density f.
Let g be another probability density such that H f is dominated by g. That is, g(x) = 0 ⇒ H(x) f(x) = 0. Using the density g, we can represent ℓ as

$$\ell = \int H(\mathbf{x}) \, \frac{f(\mathbf{x})}{g(\mathbf{x})} \, g(\mathbf{x}) \, d\mathbf{x} = \mathbb{E}_g\!\left[ H(\mathbf{X}) \, \frac{f(\mathbf{X})}{g(\mathbf{X})} \right] , \tag{5.40}$$

where the subscript g means that the expectation is taken with respect to g. Such a density is called the importance sampling density, proposal density, or instrumental density (as we use g as an instrument to obtain information about ℓ). Consequently, if X_1, …, X_N is a random sample from g, that is, X_1, …, X_N are iid random vectors with density g, then

$$\hat{\ell} = \frac{1}{N} \sum_{k=1}^{N} H(\mathbf{X}_k) \, \frac{f(\mathbf{X}_k)}{g(\mathbf{X}_k)} \tag{5.41}$$
is an unbiased estimator of ℓ. This estimator is called the importance sampling estimator. The ratio of densities,

$$W(\mathbf{x}) = \frac{f(\mathbf{x})}{g(\mathbf{x})} , \tag{5.42}$$
is called the likelihood ratio. For this reason the importance sampling estimator is also called the likelihood ratio estimator. In the particular case where there is no change of measure, that is, g = f, we have W = 1, and the likelihood ratio estimator in (5.41) reduces to the usual CMC estimator.
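The estimator (5.41) takes only a few lines of code. The following sketch estimates ℓ = P(X ≥ γ) for an exponential X, using an exponential proposal with a larger mean; the particular parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

u, v, gamma, N = 1.0, 3.0, 4.0, 100_000   # nominal mean, proposal mean, level

x = rng.exponential(v, size=N)            # X_1, ..., X_N iid from g = Exp(1/v)
W = (np.exp(-x / u) / u) / (np.exp(-x / v) / v)   # likelihood ratio W = f/g
ell_hat = np.mean((x >= gamma) * W)       # importance sampling estimator (5.41)

print(ell_hat, np.exp(-gamma / u))        # estimate vs. exact value e^{-gamma/u}
```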
5.6.1 Weighted Samples
The likelihood ratios need only be known up to a constant, that is, W(x) = c w(x) for some known function w(·). Since E_g[W(X)] = 1, we can write ℓ = E_g[H(X) W(X)] as

$$\ell = \frac{\mathbb{E}_g[H(\mathbf{X}) \, w(\mathbf{X})]}{\mathbb{E}_g[w(\mathbf{X})]} .$$
This suggests, as an alternative to the standard likelihood ratio estimator (5.41), the following weighted sample estimator:

$$\hat{\ell}_w = \frac{\sum_{k=1}^{N} H(\mathbf{X}_k) \, w_k}{\sum_{k=1}^{N} w_k} . \tag{5.43}$$

Here the {w_k}, with w_k = w(X_k), are interpreted as weights of the random sample {X_k}, and the sequence {(X_k, w_k)} is called a weighted (random) sample from g(x).
Similar to the regenerative ratio estimator in Chapter 4, the weighted sample estimator (5.43) introduces some bias, which tends to 0 as N increases. Loosely speaking, we may view the weighted sample {(X_k, w_k)} as a representation of f(x) in the sense that ℓ = E_f[H(X)] ≈ ℓ̂_w for any function H(·).
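A short sketch of the weighted sample estimator (5.43), assuming a target density known only up to its normalizing constant (the exponential setup and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Target known up to a constant: f(x) proportional to exp(-x/u) on x >= 0,
# so the computable weight is w = c W with c unknown.
u, v, N = 1.0, 2.0, 100_000

x = rng.exponential(v, size=N)                # weighted sample from proposal g
w = np.exp(-x / u) / (np.exp(-x / v) / v)     # unnormalized likelihood ratio

# Weighted sample estimator (5.43) of ell = E_f[X]: the unknown c cancels.
ell_w = np.sum(x * w) / np.sum(w)
print(ell_w)                                  # close to E_f[X] = u = 1
```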
5.6.2 The Variance Minimization Method
Since the choice of the importance sampling density g is crucially linked to the variance of the estimator ℓ̂ in (5.41), we consider next the problem of minimizing the variance of ℓ̂ with respect to g, that is,

$$\min_{g} \operatorname{Var}_g\!\left( H(\mathbf{X}) \, \frac{f(\mathbf{X})}{g(\mathbf{X})} \right) . \tag{5.44}$$
It is not difficult to prove (see, for example, Rubinstein and Melamed [31] and Problem 5.13) that the solution of the problem (5.44) is

$$g^*(\mathbf{x}) = \frac{|H(\mathbf{x})| \, f(\mathbf{x})}{\int |H(\mathbf{x})| \, f(\mathbf{x}) \, d\mathbf{x}} . \tag{5.45}$$

In particular, if H(x) ≥ 0, which we will assume from now on, then

$$g^*(\mathbf{x}) = \frac{H(\mathbf{x}) \, f(\mathbf{x})}{\ell} \tag{5.46}$$

and

$$\operatorname{Var}_{g^*}\big(\hat{\ell}\big) = \operatorname{Var}_{g^*}\big( H(\mathbf{X}) \, W(\mathbf{X}) \big) = \operatorname{Var}_{g^*}(\ell) = 0 .$$

The density g* as per (5.45) and (5.46) is called the optimal importance sampling density.
EXAMPLE 5.8

Let X ∼ Exp(u⁻¹) and H(X) = I_{{X ≥ γ}} for some γ > 0. Let f denote the pdf of X. Consider the estimation of

$$\ell = \mathbb{E}_f\big[ I_{\{X \geqslant \gamma\}} \big] = \mathbb{P}(X \geqslant \gamma) = \mathrm{e}^{-\gamma u^{-1}} .$$

We have

$$g^*(x) = \frac{H(x) \, f(x)}{\ell} = \frac{u^{-1} \, \mathrm{e}^{-x u^{-1}} \, I_{\{x \geqslant \gamma\}}}{\mathrm{e}^{-\gamma u^{-1}}} = u^{-1} \, \mathrm{e}^{-(x - \gamma) u^{-1}} \, I_{\{x \geqslant \gamma\}} .$$

Thus, the optimal importance sampling distribution of X is the shifted exponential distribution. Note that H f is dominated by g* but f itself is not dominated by g*. Since g* is optimal, the likelihood ratio estimator ℓ̂ is constant. Namely, with N = 1,

$$\hat{\ell} = H(X) \, \frac{f(X)}{g^*(X)} = \mathrm{e}^{-\gamma u^{-1}} = \ell .$$
It is important to realize that, although (5.41) is an unbiased estimator for any pdf g dominating H f, not all such pdfs are appropriate. One of the main rules for choosing a good importance sampling pdf is that the estimator (5.41) should have finite variance. This is equivalent to the requirement that

$$\mathbb{E}_g\!\left[ H^2(\mathbf{X}) \, \frac{f^2(\mathbf{X})}{g^2(\mathbf{X})} \right] = \mathbb{E}_f\!\left[ H^2(\mathbf{X}) \, \frac{f(\mathbf{X})}{g(\mathbf{X})} \right] < \infty . \tag{5.47}$$

This suggests that g should not have a "lighter tail" than f and that, preferably, the likelihood ratio f/g should be bounded.
In general, implementation of the optimal importance sampling density g* as per (5.45) and (5.46) is problematic. The main difficulty lies in the fact that to derive g*(x) one needs to know ℓ. But ℓ is precisely the quantity we want to estimate from the simulation!
In most simulation studies the situation is even worse, since the analytical expression for the sample performance H is unknown in advance. To overcome this difficulty, one can perform a pilot run with the underlying model, obtain a sample H(X_1), …, H(X_N), and then use it to estimate g*. It is important to note that sampling from such an artificially constructed density may be a very complicated and time-consuming task, especially when g is a high-dimensional density.
Remark 5.6.1 (Degeneracy of the Likelihood Ratio Estimator) The likelihood ratio estimator ℓ̂ in (5.41) suffers from a form of degeneracy in the sense that the distribution of W(X) under the importance sampling density g may become increasingly skewed as the dimensionality n of X increases. That is, W(X) may take values close to 0 with high probability, but may also take very large values with a small but significant probability. As a consequence, the variance of W(X) under g may become very large for large n. As an example of this degeneracy, assume for simplicity that the components in X are iid, under both f and g. Hence, both f(x) and g(x) are the products of their marginal pdfs. Suppose the marginal pdfs of each component X_i are f_1 and g_1, respectively. We can then write W(X) as

$$W(\mathbf{X}) = \prod_{i=1}^{n} \frac{f_1(X_i)}{g_1(X_i)} = \exp\!\left( \sum_{i=1}^{n} \ln \frac{f_1(X_i)}{g_1(X_i)} \right) . \tag{5.48}$$
Using the law of large numbers, the random variable ∑_{i=1}^n ln(f_1(X_i)/g_1(X_i)) is approximately equal to n E_{g_1}[ln(f_1(X)/g_1(X))] for large n. Hence, since E_{g_1}[ln(f_1(X)/g_1(X))] is strictly negative whenever g_1 ≠ f_1 (it equals minus the cross-entropy distance between g_1 and f_1 defined in Section 5.6.3 below), W(X) tends to 0 exponentially fast as n grows, even though its expectation under g is 1. Restricting the change of measure to only a few important components of X can therefore serve as a dimension-reduction technique.
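The skewness of W(X) is easy to observe numerically. In the sketch below (an assumed setup with iid exponential components, f_1 with mean 1 and g_1 with mean 2), the sample mean of W stays near 1 while its median collapses to 0 and its maximum explodes as n grows:

```python
import numpy as np

rng = np.random.default_rng(3)

u, v, N = 1.0, 2.0, 10_000        # f1 = Exp(mean 1), g1 = Exp(mean 2)

for n in [1, 10, 50, 100]:
    x = rng.exponential(v, size=(N, n))               # iid components under g
    logW = n * np.log(v / u) + np.sum(-x / u + x / v, axis=1)
    W = np.exp(logW)                                  # W(X) as in (5.48)
    print(n, W.mean(), np.median(W), W.max())
```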
When the pdf f belongs to some parametric family of distributions, it is often convenient to choose the importance sampling distribution from the same family. In particular, suppose that f(·) = f(·; u) belongs to the family

$$\mathscr{F} = \{ f(\cdot; \mathbf{v}), \ \mathbf{v} \in \mathscr{V} \} .$$

Then the problem of finding an optimal importance sampling density in this class reduces to the following parametric minimization problem:

$$\min_{\mathbf{v} \in \mathscr{V}} \operatorname{Var}_{\mathbf{v}}\big( H(\mathbf{X}) \, W(\mathbf{X}; \mathbf{u}, \mathbf{v}) \big) , \tag{5.50}$$

where W(X; u, v) = f(X; u)/f(X; v) is the likelihood ratio of f(·; u) with respect to f(·; v). Since ℓ does not depend on v, (5.50) is equivalent to minimizing the second moment:

$$\min_{\mathbf{v} \in \mathscr{V}} V(\mathbf{v}) = \min_{\mathbf{v} \in \mathscr{V}} \mathbb{E}_{\mathbf{v}}\big[ H^2(\mathbf{X}) \, W^2(\mathbf{X}; \mathbf{u}, \mathbf{v}) \big] \tag{5.51}$$

or, rewriting the expectation with respect to f(·; u),

$$\min_{\mathbf{v} \in \mathscr{V}} V(\mathbf{v}) = \min_{\mathbf{v} \in \mathscr{V}} \mathbb{E}_{\mathbf{u}}\big[ H^2(\mathbf{X}) \, W(\mathbf{X}; \mathbf{u}, \mathbf{v}) \big] . \tag{5.52}$$

We shall call either of the equivalent problems (5.50) and (5.51) the variance minimization (VM) problem, and we shall call the parameter vector *v that minimizes programs (5.50)-(5.51) the optimal VM reference parameter vector. We refer to u as the nominal parameter.

The sample average version of (5.51)-(5.52) is

$$\min_{\mathbf{v} \in \mathscr{V}} \hat{V}(\mathbf{v}) = \min_{\mathbf{v} \in \mathscr{V}} \frac{1}{N} \sum_{k=1}^{N} H^2(\mathbf{X}_k) \, W(\mathbf{X}_k; \mathbf{u}, \mathbf{v}) , \tag{5.53}$$

where

$$W(\mathbf{X}_k; \mathbf{u}, \mathbf{v}) = \frac{f(\mathbf{X}_k; \mathbf{u})}{f(\mathbf{X}_k; \mathbf{v})} \tag{5.54}$$

and the sample X_1, …, X_N is from f(x; u). Note that as soon as the sample X_1, …, X_N is available, the function V̂(v) becomes a deterministic one.
Since in typical applications both functions V(v) and V̂(v) are convex and differentiable with respect to v, and since one can typically interchange the expectation and differentiation operators (see Rubinstein and Shapiro [32]), the solutions of programs (5.51)-(5.52) and (5.53) may be obtained by setting the corresponding gradients with respect to v equal to zero and solving the resulting systems of equations.
EXAMPLE 5.9
Consider estimating ℓ = E[X], where X ∼ Exp(u⁻¹). Choosing f(x; v) = v⁻¹ exp(-x v⁻¹), x ≥ 0, as the importance sampling pdf, the program (5.51) reduces to

$$\min_{v > u/2} V(v) = \min_{v > u/2} \mathbb{E}_u\big[ X^2 \, W(X; u, v) \big] = \min_{v > u/2} \frac{2 v}{u^2 \, (2/u - 1/v)^3} , \tag{5.55}$$

where the restriction v > u/2 ensures that the expectation is finite; the corresponding sample average version is

$$\min_{v} \hat{V}(v) = \min_{v} \frac{1}{N} \sum_{k=1}^{N} X_k^2 \, W(X_k; u, v) . \tag{5.56}$$

The optimal reference parameter *v, obtained by differentiating (5.55) with respect to v, is given by

$$ {}^{*}v = 2u .$$

We see that *v is exactly two times larger than u. Solving the sample average version (5.56) (numerically), one should find that, for large N, its optimal solution *v̂ will be close to the true parameter *v = 2u.
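The sample average version (5.56) can be solved numerically with any one-dimensional optimizer. A sketch, assuming SciPy is available (the sample size and search interval are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)

u, N = 1.0, 100_000
x = rng.exponential(u, size=N)                 # X_1, ..., X_N from f(.; u)

def V_hat(v):
    # sample version of V(v) = E_u[X^2 W(X; u, v)], with W = f(x; u)/f(x; v)
    W = (v / u) * np.exp(-x / u + x / v)
    return np.mean(x**2 * W)

res = minimize_scalar(V_hat, bounds=(0.6, 10.0), method="bounded")
print(res.x)                                   # close to *v = 2u = 2
```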
EXAMPLE 5.10 Example 5.8 (Continued)
Consider again estimating ℓ = P_u(X ≥ γ) = exp(-γ u⁻¹). In this case, using the family {f(x; v), v > 0} defined by f(x; v) = v⁻¹ exp(-x v⁻¹), x ≥ 0, the program (5.51) reduces to

$$\min_{v > u/2} V(v) = \min_{v > u/2} \mathbb{E}_u\big[ I_{\{X \geqslant \gamma\}} \, W(X; u, v) \big] = \min_{v > u/2} \frac{v}{u^2} \, \frac{\mathrm{e}^{-\gamma (2/u - 1/v)}}{2/u - 1/v} .$$

The optimal reference parameter *v is given by

$$ {}^{*}v = \tfrac{1}{2} \left( \gamma + u + \sqrt{\gamma^2 + u^2} \right) = \gamma + \frac{u}{2} + \mathcal{O}\big( (u/\gamma)^2 \big) ,$$

where O(z²) is a function of z such that

$$\lim_{z \to 0} \frac{\mathcal{O}(z^2)}{z^2} = \text{constant} .$$

We see that for γ ≫ u, *v is approximately equal to γ.
It is important to note that in this case the sample version (5.56) (or (5.53)-(5.54)) is meaningful only for small γ, in particular for those γ for which ℓ is not a rare-event probability. For very small ℓ, a tremendously large sample size N is needed (because of the indicator function I_{{X ≥ γ}}), and thus the importance sampling estimator ℓ̂ is useless. We shall discuss the estimation of rare-event probabilities in more detail in Chapter 8.
Observe that the VM problem (5.51) can also be written as

$$\min_{\mathbf{v} \in \mathscr{V}} V(\mathbf{v}) = \min_{\mathbf{v} \in \mathscr{V}} \mathbb{E}_{\mathbf{w}}\big[ H^2(\mathbf{X}) \, W(\mathbf{X}; \mathbf{u}, \mathbf{v}) \, W(\mathbf{X}; \mathbf{u}, \mathbf{w}) \big] , \tag{5.57}$$

where w is an arbitrary reference parameter. Note that (5.57) is obtained from (5.52) by multiplying and dividing the integrand by f(x; w). We now replace the expected value in (5.57) by its sample (stochastic) counterpart and then take the optimal solution of the associated Monte Carlo program as an estimator of *v. Specifically, the stochastic counterpart of (5.57) is
$$\min_{\mathbf{v} \in \mathscr{V}} \hat{V}(\mathbf{v}) = \min_{\mathbf{v} \in \mathscr{V}} \frac{1}{N} \sum_{k=1}^{N} H^2(\mathbf{X}_k) \, W(\mathbf{X}_k; \mathbf{u}, \mathbf{v}) \, W(\mathbf{X}_k; \mathbf{u}, \mathbf{w}) , \tag{5.58}$$

where X_1, …, X_N is an iid sample from f(·; w) and w is an appropriately chosen trial parameter. Solving the stochastic program (5.58) thus yields an estimate, say *v̂, of *v. In some cases it may be useful to iterate this procedure, that is, to use *v̂ as a trial vector w in (5.58), and so obtain a better estimate.
Once the reference parameter v = *v̂ is determined, ℓ is estimated via the likelihood ratio estimator

$$\hat{\ell} = \frac{1}{N_1} \sum_{k=1}^{N_1} H(\mathbf{X}_k) \, W(\mathbf{X}_k; \mathbf{u}, \mathbf{v}) , \tag{5.59}$$

where X_1, …, X_{N_1} is a random sample from f(·; v). Typically, the sample size N_1 in (5.59) is larger than that used for estimating the reference parameter. We call (5.59) the standard likelihood ratio (SLR) estimator.
5.6.3 The Cross-Entropy Method
An alternative approach for choosing an "optimal" reference parameter vector in (5.59) is based on the Kullback-Leibler cross-entropy, or simply cross-entropy (CE), mentioned in (1.59). For clarity we repeat that the CE distance between two pdfs g and h is given (in the continuous case) by

$$\mathcal{D}(g, h) = \mathbb{E}_g\!\left[ \ln \frac{g(\mathbf{X})}{h(\mathbf{X})} \right] = \int g(\mathbf{x}) \ln \frac{g(\mathbf{x})}{h(\mathbf{x})} \, d\mathbf{x} .$$

Recall that D(g, h) ≥ 0, with equality if and only if g = h.
The general idea is to choose the importance sampling density, say h, such that the CE distance between the optimal importance sampling density g* in (5.45) and h is minimal. We call this the CE optimal pdf. Thus, this pdf solves the following functional optimization program:

$$\min_{h} \ \mathcal{D}\big( g^*, h \big) . \tag{5.60}$$

If we optimized over all densities h, the solution would trivially be h = g*, which is of no practical use. Instead, we restrict h to the parametric family {f(·; v), v ∈ 𝒱}. Writing out D(g*, f(·; v)) with g* as in (5.46) and discarding the terms that do not depend on v, one finds that minimizing the CE distance is equivalent to the maximization program

$$\max_{\mathbf{v} \in \mathscr{V}} D(\mathbf{v}) = \max_{\mathbf{v} \in \mathscr{V}} \mathbb{E}_{\mathbf{u}}\big[ H(\mathbf{X}) \ln f(\mathbf{X}; \mathbf{v}) \big] . \tag{5.61}$$

Since typically D(v) is concave and differentiable with respect to v (see Rubinstein and Shapiro [32]), the solution to (5.61) may be obtained by solving

$$\mathbb{E}_{\mathbf{u}}\big[ H(\mathbf{X}) \, \nabla \ln f(\mathbf{X}; \mathbf{v}) \big] = \mathbf{0} , \tag{5.62}$$

provided that the expectation and differentiation operators can be interchanged. The sample counterpart of (5.62) is

$$\frac{1}{N} \sum_{k=1}^{N} H(\mathbf{X}_k) \, \nabla \ln f(\mathbf{X}_k; \mathbf{v}) = \mathbf{0} . \tag{5.63}$$
By analogy to the VM program (5.51), we call (5.61) the CE program, and we call the parameter vector v* that maximizes (5.61) the optimal CE reference parameter vector.
Arguing as in (5.57), it is readily seen that (5.61) is equivalent to the following program:

$$\max_{\mathbf{v}} D(\mathbf{v}) = \max_{\mathbf{v}} \mathbb{E}_{\mathbf{w}}\big[ H(\mathbf{X}) \, W(\mathbf{X}; \mathbf{u}, \mathbf{w}) \ln f(\mathbf{X}; \mathbf{v}) \big] , \tag{5.64}$$

where W(X; u, w) is again the likelihood ratio and w is an arbitrary tilting parameter.
Similar to (5.58), we can estimate v* as the solution of the stochastic program

$$\max_{\mathbf{v}} \hat{D}(\mathbf{v}) = \max_{\mathbf{v}} \frac{1}{N} \sum_{k=1}^{N} H(\mathbf{X}_k) \, W(\mathbf{X}_k; \mathbf{u}, \mathbf{w}) \ln f(\mathbf{X}_k; \mathbf{v}) , \tag{5.65}$$

where X_1, …, X_N is a random sample from f(·; w). As in the VM case, we mention the possibility of iterating this procedure, that is, using the solution of (5.65) as a trial parameter for the next iteration.
Since in typical applications the function D̂ in (5.65) is concave and differentiable with respect to v (see [32]), the solution of (5.65) may be obtained by solving (with respect to v) the following system of equations:

$$\frac{1}{N} \sum_{k=1}^{N} H(\mathbf{X}_k) \, W(\mathbf{X}_k; \mathbf{u}, \mathbf{w}) \, \nabla \ln f(\mathbf{X}_k; \mathbf{v}) = \mathbf{0} , \tag{5.66}$$

where the gradient is with respect to v.
Our extensive numerical studies show that for moderate dimensions n, say n ≤ 50, the optimal solutions of the CE programs (5.64) and (5.65) (or (5.66)) and their VM counterparts (5.57) and (5.58) are typically nearly the same. However, for high-dimensional problems (n > 50), we found numerically that the importance sampling estimator ℓ̂ in (5.59) based on VM updating of v outperforms its CE counterpart in both variance and bias. The latter is caused by the degeneracy of W, to which, we found, CE is more sensitive.
The advantage of the CE program is that it can often be solved analytically. In particular, this happens when the distribution of X belongs to an exponential family of distributions; see Section A.3 of the Appendix. Specifically (see (A.16)), for a one-dimensional exponential family parameterized by the mean, the CE optimal parameter is always

$$v^* = \frac{\mathbb{E}_u[H(X) \, X]}{\mathbb{E}_u[H(X)]} \tag{5.67}$$

or, for an arbitrary tilting parameter w,

$$v^* = \frac{\mathbb{E}_w[H(X) \, W(X; u, w) \, X]}{\mathbb{E}_w[H(X) \, W(X; u, w)]} , \tag{5.68}$$

and the corresponding sample-based updating formula is

$$\hat{v} = \frac{\sum_{k=1}^{N} H(X_k) \, W(X_k; u, w) \, X_k}{\sum_{k=1}^{N} H(X_k) \, W(X_k; u, w)} , \tag{5.69}$$

whose particular case for w = u, in which all likelihood ratios are 1, reads

$$\hat{v} = \frac{\sum_{k=1}^{N} H(X_k) \, X_k}{\sum_{k=1}^{N} H(X_k)} . \tag{5.70}$$

Observe also that, because of the degeneracy of W, one would always prefer the estimator (5.70) to (5.69), especially for high-dimensional problems. But as we shall see below, this is not always feasible, particularly when estimating rare-event probabilities in Chapter 8.
EXAMPLE 5.11 Example 5.9 (Continued)

Consider again the estimation of ℓ = E[X], where X ∼ Exp(u⁻¹) and f(x; v) = v⁻¹ exp(-x v⁻¹), x ≥ 0. Solving (5.62), we find that the optimal reference parameter v* is equal to

$$v^* = \frac{\mathbb{E}_u[X \cdot X]}{\mathbb{E}_u[X]} = \frac{2 u^2}{u} = 2u .$$

Thus, v* is exactly the same as *v. For the sample average version of (5.62), we should find that for large N its optimal solution v̂* is close to the optimal parameter v* = 2u.
EXAMPLE 5.12 Example 5.10 (Continued)
Consider again the estimation of ℓ = P_u(X ≥ γ) = exp(-γ u⁻¹). In this case, we readily find from (5.67) that the optimal reference parameter is v* = γ + u. Note that, similar to the VM case, for γ ≫ u, the optimal reference parameter is approximately γ.
Note that in the above example, similar to the VM problem, the CE sample version (5.66) is meaningful only when γ is chosen such that ℓ is not a rare-event probability. In Chapter 8 we present a general procedure for estimating rare-event probabilities of the form ℓ = P_u(S(X) ≥ γ) for an arbitrary function S(x) and level γ.
EXAMPLE 5.13 Finite Support Discrete Distributions
Let X be a discrete random variable with finite support, that is, X can only take a finite number of values, say a_1, …, a_m. Let u_i = P(X = a_i), i = 1, …, m, and define u = (u_1, …, u_m). The distribution of X is thus trivially parameterized by the vector u. We can write the density of X as

$$f(x; \mathbf{u}) = \sum_{i=1}^{m} u_i \, I_{\{x = a_i\}} .$$

From the discussion at the beginning of this section we know that the optimal CE and VM parameters coincide, since we optimize over all densities on {a_1, …, a_m}. By (5.45) the VM (and CE) optimal density is given by

$$g^*(x) = \frac{H(x) \, f(x; \mathbf{u})}{\mathbb{E}_{\mathbf{u}}[H(X)]} ,$$

so that

$$v_i^* = \frac{\mathbb{E}_{\mathbf{u}}\big[ H(X) \, I_{\{X = a_i\}} \big]}{\mathbb{E}_{\mathbf{u}}[H(X)]} = \frac{\mathbb{E}_{\mathbf{w}}\big[ H(X) \, W(X; \mathbf{u}, \mathbf{w}) \, I_{\{X = a_i\}} \big]}{\mathbb{E}_{\mathbf{w}}\big[ H(X) \, W(X; \mathbf{u}, \mathbf{w}) \big]} \tag{5.71}$$

for any reference parameter w, provided that E_w[H(X) W(X; u, w)] > 0. The vector v* can be estimated from the stochastic counterpart of (5.71), that is, as

$$\hat{v}_i = \frac{\sum_{k=1}^{N} H(X_k) \, W(X_k; \mathbf{u}, \mathbf{w}) \, I_{\{X_k = a_i\}}}{\sum_{k=1}^{N} H(X_k) \, W(X_k; \mathbf{u}, \mathbf{w})} , \tag{5.72}$$

where X_1, …, X_N is an iid sample from the density f(·; w).
A similar result holds for a random vector X = (X_1, …, X_n), where X_1, …, X_n are independent discrete random variables with finite support, characterized by the parameter vectors u_1, …, u_n. Because of the independence assumption, the CE problem (5.64) separates into n subproblems of the form above, and all the components of the optimal CE reference parameter v* = (v_1*, …, v_n*), which is now a vector of vectors, follow from (5.72). Note that in this case the optimal VM and CE reference parameters are usually not equal, since we are not optimizing the CE over all densities. See, however, Proposition 4.2 in Rubinstein and Kroese [29] for an important case where they do coincide and yield a zero-variance likelihood ratio estimator.

The updating rule (5.72), which involves discrete finite support distributions, and in particular the Bernoulli distribution, will be extensively used for combinatorial optimization problems later on in the book.
EXAMPLE 5.14

Consider the bridge network in Figure 5.1, and let

$$S(\mathbf{X}) = \min( X_1 + X_4, \ X_1 + X_3 + X_5, \ X_2 + X_3 + X_4, \ X_2 + X_5 ) .$$

Suppose we wish to estimate the probability that the shortest path from node A to node B has a length of at least γ; that is, with H(x) = I_{{S(x) ≥ γ}}, we want to estimate

$$\ell = \mathbb{E}_{\mathbf{u}}[H(\mathbf{X})] = \mathbb{P}_{\mathbf{u}}(S(\mathbf{X}) \geqslant \gamma) = \mathbb{E}_{\mathbf{u}}\big[ I_{\{S(\mathbf{X}) \geqslant \gamma\}} \big] .$$
We assume that the components {X_i} are independent, that X_i ∼ Exp(u_i⁻¹), i = 1, …, 5, and that γ is chosen such that ℓ is not a rare-event probability. Thus, here the CE updating formula (5.69) and its particular case (5.70) (with w = u) apply. We shall show that this yields substantial variance reduction. The likelihood ratio in this case is

$$W(\mathbf{x}; \mathbf{u}, \mathbf{v}) = \prod_{i=1}^{5} \frac{u_i^{-1} \, \mathrm{e}^{-x_i / u_i}}{v_i^{-1} \, \mathrm{e}^{-x_i / v_i}} = \exp\!\left( - \sum_{i=1}^{5} x_i \left( \frac{1}{u_i} - \frac{1}{v_i} \right) \right) \prod_{i=1}^{5} \frac{v_i}{u_i} .$$
As a concrete example, let the nominal parameter vector u be equal to (1, 1, 0.3, 0.2, 0.1) and let γ = 1.5. We will see that this probability ℓ is approximately 0.06. Note that the typical length of the shortest path from A to B is smaller than γ = 1.5; hence, using importance sampling instead of CMC should be beneficial. The idea is to estimate the optimal parameter vector v* without using likelihood ratios, that is, using (5.70), since likelihood ratios, as in (5.69) (with quite arbitrary w, say a guessed initial trial vector), would typically make the estimator of v* unstable, especially for high-dimensional problems.
Denote by v̂_1 the CE estimator of v* obtained from (5.70). We can iterate (repeat) this procedure, say for T iterations, using (5.69) and starting with w = v̂_1. Once the final reference vector v̂_T is obtained, we estimate ℓ via a larger sample from f(x; v̂_T), say of size N_1, using the SLR estimator (5.59). Note, however, that for high-dimensional problems, iterating in this way could lead to an unstable final estimator v̂_T. In short, a single iteration with (5.70) might often be the best alternative.
Table 5.1 presents the performance of the estimator (5.59), starting from w = u = (1, 1, 0.3, 0.2, 0.1) and then iterating (5.69) three times. Note again that in the first iteration we generate a sample X_1, …, X_N from f(x; u) and then apply (5.70) to obtain an estimate v̂ = (v̂_1, …, v̂_5) of the CE optimal reference parameter vector v*. The sample sizes for updating v̂ and calculating the estimator ℓ̂ were N = 10³ and N_1 = 10⁵, respectively. In the table, RE denotes the estimated relative error.
Table 5.1 Iterating the five-dimensional parameter vector v̂.

 t    v̂_1      v̂_2      v̂_3      v̂_4      v̂_5      ℓ̂        RE
 0    1.0000   1.0000   0.3000   0.2000   0.1000     -       0.0121
 1    2.4450   2.3274   0.2462   0.2113   0.1030   0.0631    0.0082
 2    2.3850   2.3894   0.3136   0.2349   0.1034   0.0644    0.0079
 3    2.3559   2.3902   0.3472   0.2322   0.1047   0.0646    0.0080
Note that v̂ already converged after the first step, so using likelihood ratios in Steps 2 and 3 did not add anything to the quality of v̂. It also follows from the results of Table 5.1 that CE outperforms CMC (compare the relative errors 0.008 and 0.0121 for CE and CMC, respectively). To obtain a similar relative error of 0.008 with CMC would require a sample size of approximately 2.5 × 10⁵ instead of 10⁵; we thus obtained a reduction by a factor of 2.5 when using the CE estimation procedure. As we shall see in Chapter 8, for smaller probabilities a variance reduction of several orders of magnitude can be achieved.
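The whole procedure of this example fits in a few lines. The sketch below performs a single (5.70) update from a pilot sample and then computes the SLR estimator (5.59); up to Monte Carlo noise, the output should resemble the first row of Table 5.1 (the random seed and code organization are, of course, arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

u = np.array([1.0, 1.0, 0.3, 0.2, 0.1])      # nominal means
gamma, N, N1 = 1.5, 1_000, 100_000           # level, pilot and final sample sizes

def S(x):
    # length of the shortest path from A to B in the bridge network
    return np.minimum.reduce([x[:, 0] + x[:, 3],
                              x[:, 0] + x[:, 2] + x[:, 4],
                              x[:, 1] + x[:, 2] + x[:, 3],
                              x[:, 1] + x[:, 4]])

# Step 1: single CE update (5.70) with w = u (no likelihood ratios needed).
x = rng.exponential(u, size=(N, 5))
H = (S(x) >= gamma).astype(float)
v = (H[:, None] * x).sum(axis=0) / H.sum()

# Step 2: SLR estimator (5.59) with a larger sample from f(x; v).
x1 = rng.exponential(v, size=(N1, 5))
logW = np.sum(np.log(v / u) - x1 / u + x1 / v, axis=1)   # log W(x; u, v)
ell_hat = np.mean((S(x1) >= gamma) * np.exp(logW))
print(v, ell_hat)                            # v near Table 5.1; ell_hat near 0.06
```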
5.7 SEQUENTIAL IMPORTANCE SAMPLING
Sequential importance sampling (SIS), also called dynamic importance sampling, is simply importance sampling carried out in a sequential manner. To explain the SIS procedure, consider the expected performance ℓ in (5.39) and its likelihood ratio estimator ℓ̂ in (5.41), with f(x) the "target" and g(x) the importance sampling, or proposal, pdf. Suppose that (a) X is decomposable, that is, it can be written as a vector X = (X_1, …, X_n), where each of the X_i may be multidimensional, and (b) it is easy to sample from g(x) sequentially. Specifically, suppose that g(x) is of the form

$$g(\mathbf{x}) = g_1(x_1) \, g_2(x_2 \mid x_1) \cdots g_n(x_n \mid x_1, \ldots, x_{n-1}) , \tag{5.74}$$

where it is easy to generate X_1 from density g_1(x_1) and, conditional on X_1 = x_1, the second component from density g_2(x_2 | x_1), and so on, until one obtains a single random vector X from g(x). Repeating this independently N times, each time sampling from g(x), one obtains a random sample X_1, …, X_N from g(x) and estimates ℓ according to (5.41).
To further simplify the notation, we abbreviate (x_1, …, x_t) to x_{1:t} for all t. In particular, x_{1:n} = x. Typically, t can be viewed as a (discrete) time parameter and x_{1:t} as a path or trajectory. By the product rule of probability (1.4), the target pdf f(x) can also be written sequentially, that is,

$$f(\mathbf{x}) = f(x_1) \, f(x_2 \mid x_1) \cdots f(x_n \mid \mathbf{x}_{1:n-1}) . \tag{5.75}$$
From (5.74) and (5.75) it follows that we can write the likelihood ratio in product form as

$$W(\mathbf{x}) = \frac{f(x_1) \, f(x_2 \mid x_1) \cdots f(x_n \mid \mathbf{x}_{1:n-1})}{g_1(x_1) \, g_2(x_2 \mid x_1) \cdots g_n(x_n \mid \mathbf{x}_{1:n-1})} \tag{5.76}$$

or, if W_t(x_{1:t}) denotes the likelihood ratio up to time t, recursively as

$$W_t(\mathbf{x}_{1:t}) = u_t \, W_{t-1}(\mathbf{x}_{1:t-1}) , \quad t = 1, \ldots, n , \tag{5.77}$$

with initial weight W_0(x_{1:0}) = 1 and incremental weights u_1 = f(x_1)/g_1(x_1) and

$$u_t = \frac{f(x_t \mid \mathbf{x}_{1:t-1})}{g_t(x_t \mid \mathbf{x}_{1:t-1})} , \quad t = 2, \ldots, n . \tag{5.78}$$
In order to update the likelihood ratio recursively, as in (5.78), one needs to know the conditional pdfs f(x_t | x_{1:t-1}), and hence the marginal pdfs f(x_{1:t}). This may not be easy when f does not have a Markov structure, as it requires integrating f(x) over all x_{t+1}, …, x_n. Instead, one can introduce a sequence of auxiliary pdfs f_1, f_2, …, f_n that are easily evaluated and such that each f_t(x_{1:t}) is a good approximation to f(x_{1:t}). The terminating pdf f_n must be equal to the original f. Since

$$f(\mathbf{x}) = f_n(\mathbf{x}_{1:n}) = \prod_{t=1}^{n} \frac{f_t(\mathbf{x}_{1:t})}{f_{t-1}(\mathbf{x}_{1:t-1})} ,$$

the incremental weights in (5.77) can be taken as

$$u_t = \frac{f_t(\mathbf{x}_{1:t})}{f_{t-1}(\mathbf{x}_{1:t-1}) \, g_t(x_t \mid \mathbf{x}_{1:t-1})} \tag{5.79}$$

for t = 1, …, n, where we put f_0(x_{1:0}) = 1.
Remark 5.7.1 Note that the incremental weights u_t only need to be defined up to a constant, say c_t, for each t. In this case the likelihood ratio W(x) is known up to a constant as well, say W(x) = C w(x), where 1/C = E_g[w(X)] can be estimated via the corresponding sample mean. In other words, when the normalization constant is unknown, one can still estimate ℓ using the weighted sample estimator (5.43) rather than the likelihood ratio estimator (5.41).
Summarizing, the SIS method can be written as follows.

Algorithm 5.7.1 (SIS Algorithm)

1. For each finite t = 1, …, n, sample X_t from g_t(x_t | x_{1:t-1}).
2. Compute w_t = u_t w_{t-1}, where w_0 = 1 and u_t is given by (5.79).
EXAMPLE 5.15 Random Walk

Consider the random walk on the integers of Example 1.10 on page 19, with probabilities p and q for jumping up or down, respectively. Suppose that p < q, so that the walk has a drift toward -∞. Our goal is to estimate the rare-event probability ℓ of reaching state K before state 0, starting from state k, with 0 < k ≪ K, where K is a large number. As an intermediate step, consider first the probability of reaching K in exactly n steps, that is, P(A_n) = E[I_{A_n}], where A_n = {X_n = K}. We have

$$f(\mathbf{x}_{1:n}) = f(x_1 \mid k) \, f(x_2 \mid x_1) \, f(x_3 \mid x_2) \cdots f(x_n \mid x_{n-1}) ,$$

where the conditional probabilities are either p (for upward jumps) or q (for downward jumps). If we simulate the random walk with different upward and downward probabilities, p̃ and q̃, then the importance sampling pdf g(x_{1:n}) has the same form as f(x_{1:n}) above. Thus, the importance weight after Step t is updated via the incremental weight

$$u_t = \frac{f(x_t \mid x_{t-1})}{g(x_t \mid x_{t-1})} = \begin{cases} p / \tilde{p} & \text{for an upward jump} , \\ q / \tilde{q} & \text{for a downward jump} . \end{cases}$$

In particular, interchanging the jump probabilities, that is, taking p̃ = q and q̃ = p, reverses the drift of the walk and gives an efficient estimator for ℓ.
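A sketch of Algorithm 5.7.1 for this random walk, using the interchanged probabilities p̃ = q and q̃ = p (the particular values of p, q, k, K, and the number of replications are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)

p, q = 0.3, 0.7        # true up/down probabilities (drift toward -infinity)
k, K = 2, 15           # starting state and target level
N = 20_000

pt, qt = q, p          # interchanged probabilities for the proposal g
total = 0.0
for _ in range(N):
    state, w = k, 1.0
    while 0 < state < K:
        if rng.random() < pt:        # upward jump under g
            state += 1
            w *= p / pt              # incremental weight u_t = p / p~
        else:                        # downward jump under g
            state -= 1
            w *= q / qt              # incremental weight u_t = q / q~
    if state == K:                   # walk reached K before 0
        total += w
print(total / N)                     # estimate of the rare-event probability ell
```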
5.7.1 Nonlinear Filtering for Hidden Markov Models
This section describes an application of SIS to nonlinear filtering. Many problems in engineering, applied sciences, statistics, and econometrics can be formulated as hidden Markov models (HMMs). In its simplest form, an HMM is a stochastic process {(X_t, Y_t)}, where X_t (which may be multidimensional) represents the true state of some system and Y_t represents the observed state of the system at a discrete time t. It is usually assumed that {X_t} is a Markov chain, say with initial distribution f(x_0) and one-step transition probabilities f(x_t | x_{t-1}). It is important to note that the actual state of the Markov chain remains hidden, hence the name HMM. All information about the system is conveyed by the process {Y_t}. We assume that, given X_0, …, X_t, the observation Y_t depends only on X_t via some conditional pdf f(y_t | x_t). Note that we have used here a Bayesian style of notation in which all (conditional) probability densities are represented by the same symbol f. We will use this notation throughout the rest of this section. We denote by X_{1:t} = (X_1, …, X_t) and Y_{1:t} = (Y_1, …, Y_t) the unobservable and observable sequences up to time t, respectively, and similarly for their lowercase equivalents.
The HMM is represented graphically in Figure 5.2. This is an example of a Bayesian network. The idea is that edges indicate the dependence structure between the variables. For example, given the states X_1, …, X_t, the random variable Y_t is conditionally independent of X_1, …, X_{t-1}, because there is no direct edge from Y_t to any of these variables. We thus have f(y_t | x_1, …, x_t) = f(y_t | x_t), and more generally the joint pdf of states and observations factorizes as

$$f(\mathbf{x}_{0:t}, \mathbf{y}_{1:t}) = f(x_0) \prod_{s=1}^{t} f(x_s \mid x_{s-1}) \, f(y_s \mid x_s) .$$

Figure 5.2 A graphical representation of the HMM.
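To illustrate the factorization, the following sketch generates a path from a simple HMM, assuming a Gaussian random-walk state with Gaussian observation noise (a common toy model, not one prescribed by the text), and evaluates the joint log-density along the generated path:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)

T, sx, sy = 50, 1.0, 0.5   # horizon, state noise, observation noise

# f(x_t | x_{t-1}) = N(x_{t-1}, sx^2) and f(y_t | x_t) = N(x_t, sy^2)
x = np.empty(T)
y = np.empty(T)                        # y[0] unused; observations start at t = 1
x[0] = rng.normal(0.0, sx)             # initial state from f(x_0)
for t in range(1, T):
    x[t] = rng.normal(x[t - 1], sx)    # hidden Markov chain
    y[t] = rng.normal(x[t], sy)        # Y_t depends only on X_t

# Joint log-density, following the factorization displayed above
log_joint = (norm.logpdf(x[0], 0.0, sx)
             + np.sum(norm.logpdf(x[1:], x[:-1], sx))
             + np.sum(norm.logpdf(y[1:], x[1:], sy)))
print(log_joint)
```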