ON MERGING HIDDEN MARKOV MODELS WITH DEFORMABLE TEMPLATES
Ram R. Rao and Russell M. Mersereau
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332
rr@eedsp.gatech.edu
ABSTRACT

Hidden Markov modeling has proven extremely useful for statistical analysis of speech signals. There are, however, inherent problems in two-dimensional extensions to HMM’s, one of which is the exponential complexity associated with fully 2-D HMM’s. In this paper, we propose a new 2-D HMM-like structure obtained by embedding states within regions of a deformable template structure. With this state-embedded deformable template (SEDT), each region of a deformable template has an underlying observation probability distribution. This structure allows for computation of P[image | template]. The template that maximizes this probability provides an optimal segmentation of the image. This segmentation capability is demonstrated in facial analysis applications.
1. INTRODUCTION

Facial analysis is a difficult problem which has many potential applications. Robust facial analysis systems are an integral part of any model-based coding, facial recognition [1], or visual speech recognition system [2]. Many researchers are attempting to provide a standard framework for tackling these image analysis tasks. Two of the more interesting analysis approaches are deformable templates and hidden Markov modeling. Both of these approaches have advantages and shortcomings.
Deformable templates [3] have been used to model the eyes, lips, and face for applications such as visual speech recognition and face recognition. These templates have certain structural characteristics, such as associating the head with an ellipse, or the lips with four parabolas. They also have energy functions which are often the sum of an image-related energy term and an internal energy term. The image-related term is usually a function of the edge, peak, and valley fields derived from the image. The internal energy is often heuristically designed to keep template parameters within acceptable ranges. Minimization of the energy function yields the template which best matches the image. The main problem with deformable templates is that the energy functions are experimentally designed, and they do not statistically segment the image.

There is strong motivation for statistically modeling the pixel values which occur in an image. Since there is a difference between “skin” colors and background colors in head and shoulder images [4], one would like to model these distributions and use this information to segment the image.

Hidden Markov models [5] provide a strong statistical framework for analyzing one-dimensional random processes. The key concept behind HMM’s is a set of states which have probabilistic output distributions. Two-dimensional HMM’s are not as tractable. Fully two-dimensional HMM’s have been shown to have exponential complexity [6]. One practical solution to this has been to use pseudo-2D HMM’s [7]. Essentially, one-dimensional HMM’s operate on the rows of the image, and these HMM’s are nested in another HMM. Pseudo-2D HMM’s, however, cannot incorporate any shape constraints since each row is analyzed independently.

This work is supported by the U.S. Army Research Office, Contract DAAL03-92-G-0068.
2. STATE-EMBEDDED DEFORMABLE TEMPLATES

Since it seems that deformable templates provide a good framework for structurally analyzing an image, and HMM’s provide a good framework for statistically analyzing an image, it makes sense to capitalize on the benefits of both. Our solution entails associating a state with each region of a deformable template. These states have observation probability density functions which reflect the probability of observing a particular pixel value while in the state. For example, the head can be modeled by an ellipse with a foreground state and a background state (Figure 1a). This makes intuitive sense since the face normally has different statistical characteristics than the background, especially when using color.

Figure 1: (a) SEDT used for facial extraction, $\lambda = (x_1, x_2, y_1, y_2)$; (b) SEDT used for lip tracking, $\lambda = (x_1, x_2, y_1, y_2, y_3)$.
Our SEDT’s are specified as follows:

• The variable $\lambda = (\lambda_1, \ldots, \lambda_K)$ parameterizes a deformable template structure. For example, if the template were a rectangle, $K = 4$, and $\lambda$ could be the x and y coordinates of the upper-left and lower-right corners of the rectangle.

• The template divides the image into $N$ regions $R_0, \ldots, R_{N-1}$. In the case of $N = 2$, we have an image divided into foreground and background regions. Each region $R_j$ has an associated observation probability density function, $b_j(Q)$, where $Q$ is a (possibly multidimensional) pixel value. $b_j(Q)$ can be any parameterized pdf such as a Gaussian or Gaussian mixture.

From this, it follows that:

$$P[I \mid \lambda] = \prod_{j=0}^{N-1} \prod_{(x,y) \in R_j} b_j(I(x,y))$$

where $I$ is the image, and $I(x,y)$ is the (possibly multidimensional) pixel value at location $(x, y)$.
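The likelihood of an image given a template can be evaluated directly once each pixel carries a region label. The following is a minimal sketch, assuming a rectangle template with parameters ordered as (x1, y1, x2, y2), scalar pixel values, and one-dimensional Gaussian pdf's; the function names and the toy image are illustrative and do not come from the paper.

```python
import numpy as np

def gaussian_log_pdf(mean, var):
    """Return a function computing the log of a 1-D Gaussian density."""
    def log_b(q):
        return -0.5 * (np.log(2.0 * np.pi * var) + (q - mean) ** 2 / var)
    return log_b

def rectangle_labels(shape, lam):
    """Label map for a rectangle template: 1 inside (foreground), 0 outside.

    lam = (x1, y1, x2, y2) holds the upper-left and lower-right corners,
    both inclusive. This parameter ordering is an assumption.
    """
    x1, y1, x2, y2 = lam
    labels = np.zeros(shape, dtype=int)
    labels[y1:y2 + 1, x1:x2 + 1] = 1
    return labels

def neg_log_likelihood(image, labels, log_pdfs):
    """-log P[I | lambda]: sum of -log b_j(I(x, y)) over regions R_j."""
    total = 0.0
    for j, log_b in enumerate(log_pdfs):
        total += log_b(image[labels == j]).sum()
    return -total

# Toy image: background pixels equal 0, a bright rectangle equals 10.
image = np.zeros((8, 8))
true_lam = (2, 3, 5, 6)
image[rectangle_labels(image.shape, true_lam) == 1] = 10.0

# Index 0 is the background pdf, index 1 the foreground pdf.
log_pdfs = [gaussian_log_pdf(0.0, 1.0), gaussian_log_pdf(10.0, 1.0)]

e_true = neg_log_likelihood(image, rectangle_labels(image.shape, true_lam), log_pdfs)
e_shifted = neg_log_likelihood(image, rectangle_labels(image.shape, (3, 3, 6, 6)), log_pdfs)
print(e_true < e_shifted)
```

Working in the log domain turns the double product into a sum, which avoids numerical underflow when the image has many pixels.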
Maximizing $P[I \mid \lambda]$ over $\lambda$ yields the optimal template. Equivalently, we can minimize $-\log P[I \mid \lambda]$. Looking at SEDT’s from a deformable template perspective, we can think of $-\log P[I \mid \lambda]$ as our energy function. Alternatively, looking at our solution from an HMM perspective, we can think of the optimal template as being analogous to the optimal state sequence partitioning. The analog of Viterbi training would be to partition the data using the optimal templates, reestimate the output probability distribution functions given the partitioned data, and repeat until convergence.

Figure 2: (a) Original image with the initial and final position of the template (foreground: $\mu_f = 200$, $\sigma_f^2 = 100$; background: $\mu_b = 200$, $\sigma_b^2 = 10$); (b) points for which P[pixel ∈ foreground] > P[pixel ∈ background].
3. SYNTHETIC EXAMPLE

The first test of our template was to find an arbitrarily sized rectangle within an image. The rectangle had pixels with intensity specified by a Gaussian with mean and variance $\mu_f$ and $\sigma_f^2$, respectively. Likewise, the background had intensity specified by a Gaussian with $\mu_b$ and $\sigma_b^2$. Our template was a rectangle specified by the coordinates of its upper-left and lower-right corners.

Starting with an initial template, estimates of the foreground and background pdf’s were made. A steepest descent minimization algorithm was then used to minimize $-\log P[I \mid \lambda]$ over $\lambda$. This new template was then used to reestimate the foreground and background pdf’s, and the process was repeated until convergence.

This process proved sensitive to the initial placement of the template. Good results were obtained when the initial template completely covered the unknown rectangle, or when it was contained within the unknown rectangle. These template choices work well because either the foreground pdf or the background pdf is reliably estimated initially. Since we did not know the position of the rectangle, our system was always started with a rectangle that covered a majority of the input image (Figure 2).
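The alternation between pdf estimation and template minimization can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the steepest descent step is replaced by a greedy search that moves one corner coordinate by one pixel at a time, the parameter ordering (x1, y1, x2, y2) is assumed, and the synthetic means (200 and 50) are chosen for a well-separated demonstration rather than taken from Figure 2. The initial template must leave some pixels outside it so that a background pdf can be estimated.

```python
import numpy as np

def log_gauss(q, mean, var):
    """Elementwise log of a 1-D Gaussian density."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (q - mean) ** 2 / var)

def inside(shape, lam):
    """Boolean foreground mask for a rectangle lam = (x1, y1, x2, y2), inclusive."""
    x1, y1, x2, y2 = lam
    mask = np.zeros(shape, dtype=bool)
    mask[y1:y2 + 1, x1:x2 + 1] = True
    return mask

def estimate(image, mask):
    """Reestimate (mean, variance) for foreground and background from a partition."""
    params = []
    for region in (image[mask], image[~mask]):
        params.append((region.mean(), max(region.var(), 1e-6)))
    return params  # [(mu_f, var_f), (mu_b, var_b)]

def energy(image, lam, params):
    """-log P[I | lambda] for the two-region rectangle template."""
    mask = inside(image.shape, lam)
    (mu_f, var_f), (mu_b, var_b) = params
    return -(log_gauss(image[mask], mu_f, var_f).sum()
             + log_gauss(image[~mask], mu_b, var_b).sum())

def valid(lam, shape):
    x1, y1, x2, y2 = lam
    h, w = shape
    return 0 <= x1 < x2 < w and 0 <= y1 < y2 < h

def minimize_template(image, lam, params):
    """Greedy +/-1 coordinate search (a stand-in for steepest descent)."""
    lam = list(lam)
    best = energy(image, lam, params)
    improved = True
    while improved:
        improved = False
        for k in range(4):
            for step in (-1, 1):
                trial = list(lam)
                trial[k] += step
                if not valid(trial, image.shape):
                    continue
                e = energy(image, trial, params)
                if e < best:
                    lam, best, improved = trial, e, True
    return tuple(lam)

def fit_sedt(image, lam, max_iter=20):
    """Alternate pdf reestimation and template minimization until the template is stable."""
    lam = tuple(lam)
    for _ in range(max_iter):
        params = estimate(image, inside(image.shape, lam))
        new_lam = minimize_template(image, lam, params)
        if new_lam == lam:
            break
        lam = new_lam
    return lam, params

# Synthetic test image (illustrative values only).
rng = np.random.default_rng(0)
image = rng.normal(50.0, 5.0, size=(40, 40))
true_lam = (12, 15, 24, 27)
fg = inside(image.shape, true_lam)
image[fg] = rng.normal(200.0, 5.0, size=fg.sum())

# Start from a template covering a majority of the image, as in the text.
lam_hat, params = fit_sedt(image, (2, 2, 37, 37))
print(lam_hat, params)
```

Starting from a large template gives a reliable background estimate from the pixels outside it, which is what lets the first minimization pass shrink the rectangle onto the bright region.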
Figure 3: Initialization procedure. (a) Region used to estimate facial distribution for “Chris”; (b) result of applying this distribution to “Haluk” and applying threshold; (c) probability of pixel being part of face for “Haluk” using distribution derived from (a) (dark region = high probability); (d) “Haluk” image, with initial template position.

There is a problem with this procedure. Consider the case where the foreground has a lower variance than the background, and they both have equal means. A large initial template would likely contain pixels from both the foreground and background. Thus, the estimate of the variance of the foreground pdf would approach the variance of the background pdf. When there is a large overlap between the two pdf’s, the system will not work well. This can be remedied by altering the reestimation procedure to ensure that there is adequate separation between the two pdf’s.
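The variance argument for an oversized initial template can be made explicit. As a sketch, assume the initial template contains a fraction $\alpha$ of true foreground pixels and a fraction $1 - \alpha$ of background pixels ($\alpha$ is a symbol introduced here for illustration), and that the region is large enough for sample statistics to be close to their expectations. By the law of total variance, the variance estimated for the foreground pdf is then approximately:

```latex
\hat{\sigma}_f^2 \;\approx\; \alpha\,\sigma_f^2 + (1-\alpha)\,\sigma_b^2
                 + \alpha(1-\alpha)\,(\mu_f-\mu_b)^2
```

With equal means the last term vanishes, and for a large template (small $\alpha$) the estimate tends toward $\sigma_b^2$, so the estimated foreground and background pdf's nearly coincide.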
4. FACIAL EXTRACTION

One of our main objectives was to find a robust procedure for extracting the boundary of a person’s head in a full-color head and shoulders video sequence. The head was modeled as an ellipse with no rotation, and the foreground and background pdf’s were modeled as Gaussian mixtures. Each mixture contained two Gaussians with full covariance matrices.
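The two ingredients of this model, an unrotated ellipse region and a Gaussian mixture with full covariance matrices, can be sketched as below. The ellipse is assumed to be parameterized by its bounding box (x1, x2, y1, y2), following the parameter list in the Figure 1(a) caption; the function names are hypothetical, and the mixture parameters are taken as given rather than estimated.

```python
import numpy as np

def ellipse_mask(shape, lam):
    """Foreground mask for an axis-aligned ellipse inscribed in the box lam.

    lam = (x1, x2, y1, y2) is assumed to be the ellipse's bounding box.
    """
    x1, x2, y1, y2 = lam
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    ax, ay = 0.5 * (x2 - x1), 0.5 * (y2 - y1)
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    return ((xx - cx) / ax) ** 2 + ((yy - cy) / ay) ** 2 <= 1.0

def gmm_log_pdf(pixels, weights, means, covs):
    """Log density of a Gaussian mixture with full covariances.

    pixels: (n, d) array of colour vectors; returns an (n,) array.
    """
    pixels = np.atleast_2d(pixels)
    d = pixels.shape[1]
    comps = []
    for w, mu, cov in zip(weights, means, covs):
        diff = pixels - mu
        sol = np.linalg.solve(cov, diff.T).T           # cov^{-1} (q - mu)
        maha = np.einsum("ij,ij->i", diff, sol)        # Mahalanobis distances
        _, logdet = np.linalg.slogdet(cov)
        comps.append(np.log(w) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha))
    comps = np.stack(comps)                            # (n_components, n)
    peak = comps.max(axis=0)                           # log-sum-exp for stability
    return peak + np.log(np.exp(comps - peak).sum(axis=0))

def ellipse_energy(image, lam, fg_gmm, bg_gmm):
    """-log P[I | lambda] for a colour image of shape (H, W, d)."""
    mask = ellipse_mask(image.shape[:2], lam)
    return -(gmm_log_pdf(image[mask], *fg_gmm).sum()
             + gmm_log_pdf(image[~mask], *bg_gmm).sum())

mask = ellipse_mask((20, 20), (5, 15, 4, 16))
unit = ([0.5, 0.5], [np.zeros(3), np.zeros(3)], [np.eye(3), np.eye(3)])
log_p0 = gmm_log_pdf(np.zeros((1, 3)), *unit)[0]
```

Each mixture is passed as a (weights, means, covariances) tuple, so the same energy function works for any number of components.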
In the development of our system, a number of facts became clear. First, if the foreground and background pdf’s are available, minimizing the energy function, $-\log P[I \mid \lambda]$, would successfully segment the face from the background. However, since these distributions are unknown in the initial frame, they must somehow be estimated. Second, if a point on the person’s face could be located, a region around this point could be used to estimate the foreground pdf. Assuming everything outside this region was background, we could also estimate a background pdf. The facial border could then be found by iterating between minimizing the energy function and reestimating foreground and background distributions.

Figure 4: Facial extraction. (a) Original image with initial and final placement of template; (b) pixels for which $P_{\mathrm{head}} > P_{\mathrm{background}}$; (c) and (d) probability of head and background, respectively (dark = high probability).
One important task was to develop a subsystem which could locate a point on a person’s face. This could be done by first developing a general “face” pdf. Ideally, one would like to collect a large database of faces under varying lighting conditions to estimate a general “face” pdf, but we did not have such a large database. We chose to use the facial distribution of one person as an approximation of the facial distribution for a different person. A point in the face was found by applying this pdf to the input image. A threshold was applied to the new image to find all points which had probability within a certain range of the pixel with maximum probability. The median x and y values of these pixels would be located in the person’s face. The median operation works much better than averaging, and also works better than attempting to find an n by n square of pixels whose joint probability is greatest. It also seems to implicitly use the fact that, for the most part, the face of interest is near the center of the image. This procedure is shown in Figure 3.
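The threshold-and-median step can be sketched in a few lines. The text does not say how "within a certain range" of the maximum is defined, so this sketch assumes a relative threshold (a fraction of the maximum probability); the function name and the `rel_range` parameter are hypothetical.

```python
import numpy as np

def locate_face_point(prob, rel_range=0.5):
    """Return the median (x, y) of pixels whose face probability is near the maximum.

    prob: (H, W) array of per-pixel face probabilities.
    rel_range: keep pixels with prob >= rel_range * prob.max() (an assumed rule).
    """
    thresh = rel_range * prob.max()
    ys, xs = np.nonzero(prob >= thresh)
    return float(np.median(xs)), float(np.median(ys))

# A 3x3 high-probability blob plus one distant outlier that passes the threshold.
prob = np.zeros((9, 9))
prob[2:5, 5:8] = 1.0
prob[8, 0] = 0.9
point = locate_face_point(prob)
```

In this toy case the median lands at the center of the blob, whereas the mean of the selected coordinates would be pulled toward the outlier, which illustrates why the median is the more robust choice.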
Figure 4 shows the convergence of the template to the final head border. Image-specific distributions for the foreground and background are estimated using the initial template. A steepest descent minimization algorithm is then used to minimize $-\log P[I \mid \lambda]$. This process is repeated until convergence. Comparing Figure 4(c) and Figure 3(c) shows the difference between using a general facial distribution, and one matched to the actual image. Notice how the facial region is much darker in Figure 4, indicating a higher probability.

Figure 5: Results of lip tracking algorithm (top); pixels for which $P_{\mathrm{lips}} > P_{\mathrm{background}}$ (bottom).
5. LIP TRACKING

Another goal of our research is to develop a robust lip analysis system. As a first step, we wanted to test the ability of SEDT’s to track the border of the lips through a video sequence. Our template is shown in Figure 1(b). The template has two parabolas which are embedded in a rectangle. There are a total of five parameters: four for the rectangle, and one to specify the vertical position of the intersection of the parabolas.

Our test consisted of manually placing the template in frame 1 of the video sequence, and estimating the foreground and background distributions. These distributions were applied to successive frames, and a minimization algorithm was run to find the optimal template. As shown in Figure 5, the results are very promising. Likewise, the inner contour of the lips can be tracked by estimating the distribution of the mouth opening, and considering the lips themselves to be background.
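One possible reading of the five-parameter lip template is sketched below. The exact geometry is not given in the text, so this is an assumption: the upper parabola has its vertex at the top-center of the rectangle, the lower parabola has its vertex at the bottom-center, and both pass through the mouth corners on the left and right edges at height y3. Image rows are taken to increase downward, and the function name is hypothetical.

```python
import numpy as np

def lip_mask(shape, lam):
    """Foreground (lip) mask for an assumed two-parabola template.

    lam = (x1, x2, y1, y2, y3): rectangle edges x1 < x2 and y1 < y2, plus the
    row y3 at which the two parabolas intersect on the rectangle's sides.
    """
    x1, x2, y1, y2, y3 = lam
    cx = 0.5 * (x1 + x2)
    half_w = 0.5 * (x2 - x1)
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    t = ((xx - cx) / half_w) ** 2        # 0 at the centre, 1 at the mouth corners
    upper = y1 + (y3 - y1) * t           # upper parabola: y1 at centre, y3 at corners
    lower = y2 + (y3 - y2) * t           # lower parabola: y2 at centre, y3 at corners
    return (t <= 1.0) & (yy >= upper) & (yy <= lower)

mask = lip_mask((20, 20), (4, 16, 5, 15, 10))
```

The resulting mask can be used in place of a rectangle or ellipse mask when evaluating $-\log P[I \mid \lambda]$, with the pdf's estimated once in the first frame and held fixed for later frames as described above.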
6. CONCLUSION

In this paper, we have presented an extension to deformable templates which allows for statistical segmentation of images. The system performed well on many foreground/background segmentation tasks including facial extraction and lip tracking. Our method capitalizes on the statistical segmentation properties of HMM’s and incorporates the shape coherence properties of deformable templates. Work remains in finding automatic methods for initializing the templates, particularly for the lip tracking algorithm. It is also necessary to assess which color spaces and parameter sets work best and which ones are most invariant to varying lighting conditions and differing speakers.
7. REFERENCES

[1] R. Chellappa, C. Wilson, and S. Sirohey, “Human and machine recognition of faces: A survey,” Proceedings of the IEEE, vol. 83, pp. 705-740, May 1995.

[2] M. Hennecke, K. Prasad, and D. Stork, “Using deformable templates to infer visual speech dynamics,” in Proceedings of the 28th Annual Asilomar Conference on Signals, Systems, and Computers, (Pacific Grove, CA), November 1994.

[3] A. Yuille, P. Hallinan, and D. Cohen, “Feature extraction from faces using deformable templates,” International Journal of Computer Vision, vol. 8, no. 2, pp. 99-111, 1992.

[4] H. M. Hunke, “Locating and tracking of human faces with neural networks,” Tech. Rep. CMU-CS-94-155, Carnegie Mellon University, August 1994.

[5] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[6] E. Levin and R. Pieraccini, “Dynamic planar warping for optical character recognition,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. III-149–III-152, 1992.

[7] O. Agazzi and S. Kuo, “Hidden Markov model based optical character recognition in the presence of deterministic transformations,” Pattern Recognition, vol. 26, no. 12, pp. 1813-26, 1993.