
ON MERGING HIDDEN MARKOV MODELS WITH DEFORMABLE TEMPLATES

Ram R. Rao and Russell M. Mersereau
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332

rr@eedsp.gatech.edu

ABSTRACT

Hidden Markov modeling has proven extremely useful for statistical analysis of speech signals. There are, however, inherent problems in two-dimensional extensions to HMM's, one of which is the exponential complexity associated with fully 2-D HMM's. In this paper, we propose a new 2-D HMM-like structure obtained by embedding states within regions of a deformable template structure. With this state-embedded deformable template (SEDT), each region of a deformable template has an underlying observation probability distribution. This structure allows for computation of $P[\text{image} \mid \text{template}]$. The template that maximizes this probability provides an optimal segmentation of the image. This segmentation capability will be demonstrated in facial analysis applications.

1. INTRODUCTION

Facial analysis is a difficult problem which has many potential applications. Robust facial analysis systems are an integral part of any model-based coding, facial recognition [1], or visual speech recognition system [2]. Many researchers are attempting to provide a standard framework for tackling these image analysis tasks. Two of the more interesting analysis approaches are deformable templates and hidden Markov modeling. Both of these approaches have advantages and shortcomings.

Deformable templates [3] have been used to model the eyes, lips, and face for applications such as visual speech recognition and face recognition. These templates have certain structural characteristics, such as associating the head with an ellipse, or the lips with four parabolas. They also have energy functions which are often the sum of an image-related energy term and an internal energy term. The image-related term is usually a function of the edge, peak, and valley fields derived from the image. The internal energy is often heuristically designed to keep template parameters within acceptable ranges. Minimization of the energy function yields the template which best matches the image. The main problem with deformable templates is that the energy functions are experimentally designed, and they do not statistically segment the image. There is strong motivation for statistically modeling the pixel values which occur in an image. Since there is a difference between "skin" colors and background colors in head and shoulder images [4], one would like to model these distributions and use this information to segment the image.

This work is supported by the U.S. Army Research Office, Contract DAAL03-92-G-0068.

Hidden Markov models [5] provide a strong statistical framework for analyzing one-dimensional random processes. The key concept behind HMM's is a set of states which have probabilistic output distributions. Two-dimensional HMM's are not as tractable: fully two-dimensional HMM's have been shown to have exponential complexity [6]. One practical solution to this has been to use pseudo-2D HMM's [7]. Essentially, one-dimensional HMM's operate on the rows of the image, and these HMM's are nested in another HMM. Pseudo-2D HMM's, however, cannot incorporate any shape constraints since each row is analyzed independently.

2. STATE-EMBEDDED DEFORMABLE TEMPLATES

Since deformable templates provide a good framework for structurally analyzing an image, and HMM's provide a good framework for statistically analyzing an image, it makes sense to capitalize on the benefits of both. Our solution entails associating a state with each region of a deformable template. These states have observation probability density functions which reflect the probability of observing a particular pixel value while in the state. For example, the head can be modeled by an ellipse with a foreground state and a background state (Figure 1(a)). This makes intuitive sense since the face normally has different statistical characteristics than the background, especially when using color.

Figure 1: (a) SEDT used for facial extraction, $\lambda = (x_1, x_2, y_1, y_2)$; (b) SEDT used for lip tracking, $\lambda = (x_1, x_2, y_1, y_2, y_3)$.

Our SEDT's are specified as follows:

• The variable $\lambda = (\lambda_1, \ldots, \lambda_K)$ parameterizes a deformable template structure. For example, if the template were a rectangle, $K = 4$, and $\lambda$ could be the x and y coordinates of the upper-left and lower-right corners of the rectangle.

• The template divides the image into $N$ regions, $R_0, \ldots, R_{N-1}$. In the case of $N = 2$, we have an image divided into foreground and background regions. Each region has an associated observation probability density function, $b_j(O)$, where $O$ is a (possibly multidimensional) pixel value. $b_j(O)$ can be any parameterized pdf such as a Gaussian or Gaussian mixture.

From this, it follows that:

$$P[I \mid \lambda] = \prod_{j=0}^{N-1} \prod_{(x,y) \in R_j} b_j(I(x,y))$$

where $I$ is the image, and $I(x,y)$ is the (possibly multidimensional) pixel value at location $(x,y)$.

Maximizing $P[I \mid \lambda]$ over $\lambda$ yields the optimal template. Equivalently, we can minimize $-\log P[I \mid \lambda]$. Looking at SEDT's from a deformable template perspective, we can think of $-\log P[I \mid \lambda]$ as our energy function. Alternatively, looking at our solution from an HMM perspective, we can think of the optimal template as being analogous to the optimal state sequence partitioning. The analog of Viterbi training would be to partition the data using the optimal templates, reestimate the output probability distribution functions given the partitioned data, and repeat until convergence.

Figure 2: (a) Original image with the initial and final position of the template (foreground: $\mu_f = 200$, $\sigma_f^2 = 100$; background: $\mu_b = 200$, $\sigma_b^2 = 10$); (b) points for which $P[\text{pixel} \in \text{foreground}] > P[\text{pixel} \in \text{background}]$.
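As a worked instance of the energy function above, assume (purely for illustration) that each $b_j$ is a scalar Gaussian with mean $\mu_j$ and variance $\sigma_j^2$. These two symbols are introduced here only for this sketch. Taking the negative logarithm turns the product over pixels into a sum of per-pixel costs:

```latex
\begin{aligned}
-\log P[I \mid \lambda]
  &= -\sum_{j=0}^{N-1} \sum_{(x,y) \in R_j} \log b_j\bigl(I(x,y)\bigr) \\
  &= \sum_{j=0}^{N-1} \sum_{(x,y) \in R_j}
     \left[ \tfrac{1}{2}\log\bigl(2\pi\sigma_j^2\bigr)
          + \frac{\bigl(I(x,y) - \mu_j\bigr)^2}{2\sigma_j^2} \right]
\end{aligned}
```

Under this assumption each pixel contributes a squared-error term weighted by the variance of the region it falls in. The regions $R_j$ depend on $\lambda$, so moving the template reassigns pixels between the inner sums, and that reassignment is the only way $\lambda$ enters the energy.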

3. SYNTHETIC EXAMPLE

The first test of our template was to find an arbitrarily sized rectangle within an image. The rectangle had pixels with intensity specified by a Gaussian with mean and variance $\mu_f$ and $\sigma_f^2$, respectively. Likewise, the background had intensity specified by a Gaussian with $\mu_b$ and $\sigma_b^2$. Our template was a rectangle specified by the coordinates of its upper-left and lower-right corners.

Starting with an initial template, estimates of the foreground and background pdf's were made. A steepest descent minimization algorithm was then used to minimize $-\log P[I \mid \lambda]$ over $\lambda$. This new template was then used to reestimate the foreground and background pdf's, and the process was repeated until convergence.
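The alternation between reestimation and minimization can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a grayscale image stored as a list of rows, scalar Gaussian pdf's, and a template ordered as (x1, x2, y1, y2), and it swaps the steepest descent step for a greedy one-pixel coordinate search, which is simpler to state for integer corner coordinates. The names `inside`, `region_stats`, `energy`, `is_valid`, `descend`, `fit_template`, and the constant `VAR_FLOOR` are all hypothetical.

```python
import math

VAR_FLOOR = 1.0  # hypothetical floor that keeps estimated variances positive


def inside(x, y, lam):
    """True if pixel (x, y) is in the rectangle lam = (x1, x2, y1, y2)."""
    x1, x2, y1, y2 = lam
    return x1 <= x <= x2 and y1 <= y <= y2


def region_stats(image, lam):
    """Reestimation step: sample (mean, variance) for foreground and background."""
    groups = ([], [])  # groups[0] = foreground pixels, groups[1] = background
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            groups[0 if inside(x, y, lam) else 1].append(value)
    stats = []
    for values in groups:
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, max(var, VAR_FLOOR)))
    return stats[0], stats[1]


def energy(image, lam, fg, bg):
    """-log P[I | lam] for scalar Gaussian foreground/background pdfs."""
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            mean, var = fg if inside(x, y, lam) else bg
            total += 0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)
    return total


def is_valid(lam, width, height):
    """Keep the rectangle inside the image and leave at least one background pixel."""
    x1, x2, y1, y2 = lam
    in_bounds = 0 <= x1 <= x2 < width and 0 <= y1 <= y2 < height
    return in_bounds and (x2 - x1 + 1) * (y2 - y1 + 1) < width * height


def descend(image, lam, fg, bg):
    """Greedy coordinate search (stand-in for steepest descent): nudge one
    parameter by one pixel and keep the move only if the energy decreases."""
    height, width = len(image), len(image[0])
    best = energy(image, lam, fg, bg)
    improved = True
    while improved:
        improved = False
        for i in range(4):
            for step in (-1, 1):
                cand = list(lam)
                cand[i] += step
                cand = tuple(cand)
                if not is_valid(cand, width, height):
                    continue
                e = energy(image, cand, fg, bg)
                if e < best:
                    lam, best, improved = cand, e, True
    return lam


def fit_template(image, lam, max_rounds=20):
    """Alternate reestimation and minimization until the template stops moving."""
    lam = tuple(lam)
    for _ in range(max_rounds):
        fg, bg = region_stats(image, lam)
        new_lam = descend(image, lam, fg, bg)
        if new_lam == lam:
            break
        lam = new_lam
    return lam
```

Calling `fit_template(image, initial_lam)` with an initial rectangle that covers most of the image mirrors the starting condition used in the text. When the hidden rectangle is much brighter than the background, each move that pushes a pure-background row or column out of the poorly fitting foreground estimate lowers the energy, so the template shrinks onto the rectangle.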

This process proved sensitive to the initial placement of the template. Good results were obtained when the initial template completely covered the unknown rectangle, or when it was contained within the unknown rectangle. These template choices work well because either the foreground pdf or the background pdf is reliably estimated initially. Since the position of the rectangle was not known in advance, our system was always started with a rectangle that covered a majority of the input image (Figure 2).

There is a problem with this procedure. Consider the case where the foreground has a lower variance than the background, and they both have equal means. The choice of a large initial template would likely contain pixels from both the foreground and background. Thus, the estimate of the variance of the foreground pdf would approach the variance of the background pdf. When there is a large overlap between the two pdf's, the system will not work well. This can be remedied by altering the reestimation procedure to ensure that there is adequate separation between the two pdf's.

Figure 3: Initialization procedure. (a) Region used to estimate facial distribution for "Chris"; (b) result of applying this distribution to "Haluk" and applying a threshold; (c) probability of a pixel being part of the face for "Haluk" using the distribution derived from (a) (dark region = high probability); (d) "Haluk" image, with initial template position.
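To see why overlapping pdf's are a problem, consider the limiting case of $N = 2$ regions whose estimated densities coincide, $b_0 = b_1 = b$ (a symbol introduced here only for this sketch). The product-form likelihood of Section 2 then no longer depends on the template:

```latex
P[I \mid \lambda]
  = \prod_{(x,y) \in R_0} b\bigl(I(x,y)\bigr) \prod_{(x,y) \in R_1} b\bigl(I(x,y)\bigr)
  = \prod_{(x,y)} b\bigl(I(x,y)\bigr)
```

The right-hand side is the same for every $\lambda$, so the energy surface is flat and the minimization has no direction to follow. Nearly identical pdf's give a nearly flat surface, which is consistent with the poor behavior described above.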

4. FACIAL EXTRACTION

One of our main objectives was to find a robust procedure for extracting the boundary of a person's head in a full-color head and shoulders video sequence. The head was modeled as an ellipse with no rotation, and the foreground and background pdf's were modeled as Gaussian mixtures. Each mixture contained two Gaussians with full covariance matrices.
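For an unrotated ellipse, assigning a pixel to the foreground or background region reduces to a single inequality. The sketch below assumes that the template parameters (x1, x2, y1, y2) are the bounding box of the ellipse; the text does not state this, so treat that reading, and the function name `in_ellipse`, as hypothetical.

```python
def in_ellipse(x, y, lam):
    """True if pixel (x, y) lies inside the axis-aligned ellipse inscribed in
    the box lam = (x1, x2, y1, y2). The bounding-box interpretation of lam is
    an assumption; the box must satisfy x2 > x1 and y2 > y1."""
    x1, x2, y1, y2 = lam
    center_x, center_y = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    semi_x, semi_y = (x2 - x1) / 2.0, (y2 - y1) / 2.0
    # Standard ellipse test: normalized squared distance from the center.
    return ((x - center_x) / semi_x) ** 2 + ((y - center_y) / semi_y) ** 2 <= 1.0
```

Pixels for which `in_ellipse` is true would be scored with the foreground mixture and all others with the background mixture when evaluating the energy function.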

In the development of our system, a number of facts became clear. First, if the foreground and background pdf's are available, minimizing the energy function, $-\log P[I \mid \lambda]$, would successfully segment the face from the background. However, since these distributions are unknown in the initial frame, they must somehow be estimated. Second, if a point on the person's face could be located, a region around this point could be used to estimate the foreground pdf. Assuming everything outside this region was background, we could also estimate a background pdf. The facial border could then be found by iterating between minimizing the energy function and reestimating foreground and background distributions.

Figure 4: Facial extraction. (a) Original image with initial and final placement of template; (b) pixels for which $P_{\text{head}} > P_{\text{background}}$; (c) & (d) probability of head and background, respectively (dark = high probability).

One important task was to develop a subsystem which could locate a point on a person's face. This could be done by first developing a general "face" pdf. Ideally, one would like to collect a large database of faces under varying lighting conditions to estimate a general "face" pdf, but we did not have such a database. We chose to use the facial distribution of one person as an approximation of the facial distribution for a different person. A point in the face was found by applying this pdf to the input image. A threshold was applied to the resulting probability image to find all points which had probability within a certain range of the pixel with maximum probability. The median x and y values of these pixels were taken as a point located in the person's face. The median operation works much better than averaging, and also works better than attempting to find an n by n square of pixels whose joint probability is greatest. It also seems to implicitly use the fact that, for the most part, the face of interest is near the center of the image. This procedure is shown in Figure 3.
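The thresholding and median step can be sketched as below. The input `prob_map` is assumed to hold the general "face" pdf already evaluated at every pixel, and the parameter `ratio` is a hypothetical stand-in for the unspecified "certain range": a pixel is kept when its probability is at least `ratio` times the maximum.

```python
from statistics import median


def locate_face_point(prob_map, ratio=0.5):
    """Return the median (x, y) of pixels whose probability is at least
    `ratio` times the peak. `ratio` is an assumed threshold rule."""
    peak = max(max(row) for row in prob_map)
    xs, ys = [], []
    for y, row in enumerate(prob_map):
        for x, p in enumerate(row):
            if p >= ratio * peak:
                xs.append(x)
                ys.append(y)
    # Medians are taken separately per axis, so a few stray high-probability
    # pixels far from the main cluster barely move the result.
    return median(xs), median(ys)
```

This makes the robustness argument concrete: an isolated background pixel that happens to match the face distribution shifts a mean in proportion to its distance from the face, whereas it changes each median by at most one rank.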

Figure 4 shows the convergence of the template to the final head border. Image-specific distributions for the foreground and background are estimated using the initial template. A steepest descent minimization algorithm is then used to minimize $-\log P[I \mid \lambda]$. This process is repeated until convergence. Comparing Figure 4(c) and Figure 3(c) shows the difference between using a general facial distribution and one matched to the actual image. Notice how the facial region is much darker in Figure 4, indicating a higher probability.

Figure 5: Results of lip tracking algorithm (top); pixels for which $P_{\text{lips}} > P_{\text{background}}$ (bottom).

5. LIP TRACKING

Another goal of our research is to develop a robust lip analysis system. As a first step, we wanted to test the ability of SEDT's to track the border of the lips through a video sequence. Our template is shown in Figure 1(b). The template has two parabolas which are embedded in a rectangle. There are a total of five parameters: four for the rectangle, and one to specify the vertical position of the intersection of the parabolas.

Our test consisted of manually placing the template in frame 1 of the video sequence, and estimating the foreground and background distributions. These distributions were applied to successive frames, and a minimization algorithm was run to find the optimal template. As shown in Figure 5, the results are very promising. Likewise, the inner contour of the lips can be tracked by estimating the distribution of the mouth opening, and considering the lips themselves to be background.

6. CONCLUSION

In this paper, we have presented an extension to deformable templates which allows for statistical segmentation of images. The system performed well on many foreground/background segmentation tasks including facial extraction and lip tracking. Our method capitalizes on the statistical segmentation properties of HMM's and incorporates the shape coherence properties of deformable templates. Work remains in finding automatic methods for initializing the templates, particularly for the lip tracking algorithm. It is also necessary to assess which color spaces and parameter sets work best and which ones are most invariant to varying lighting conditions and differing speakers.

7. REFERENCES

[1] R. Chellappa, C. Wilson, and S. Sirohey, "Human and machine recognition of faces: A survey," Proceedings of the IEEE, vol. 83, pp. 705-740, May 1995.

[2] M. Hennecke, K. Prasad, and D. Stork, "Using deformable templates to infer visual speech dynamics," in Proceedings of the 28th Annual Asilomar Conference on Signals, Systems, and Computers, (Pacific Grove, CA), November 1994.

[3] A. Yuille, P. Hallinan, and D. Cohen, "Feature extraction from faces using deformable templates," International Journal of Computer Vision, vol. 8, no. 2, pp. 99-111, 1992.

[4] H. M. Hunke, "Locating and tracking of human faces with neural networks," Tech. Rep. CMU-CS-94-155, Carnegie Mellon University, August 1994.

[5] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[6] E. Levin and R. Pieraccini, "Dynamic planar warping for optical character recognition," in Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. III-149 to III-152, 1992.

[7] O. Agazzi and S. Kuo, "Hidden Markov model based optical character recognition in the presence of deterministic transformations," Pattern Recognition, vol. 26, no. 12, pp. 1813-26, 1993.
