Crane Gesture Recognition Using Pseudo 3-D Hidden Markov Models
Stefan Müller, Stefan Eickeler, Gerhard Rigoll
Gerhard-Mercator-University Duisburg, Department of Computer Science, Faculty of Electrical Engineering
47057 Duisburg, Germany
e-mail: {stm,eickeler,rigoll}@fb9-ti.uni-duisburg.de
Abstract
A recognition technique based on novel pseudo 3-D Hidden Markov Models, which can integrate spatially as well as temporally derived features, is presented in this paper. The approach allows the recognition of dynamic gestures such as waving hands as well as static gestures such as standing in a special pose. Pseudo 3-D Hidden Markov Models (P3DHMMs) are an extension of the pseudo 2-D case, which has been successfully used for the classification of images and the recognition of faces. In the P3DHMM case the so-called superstates contain P2DHMMs and thus whole image sequences can be generated by these models. Our approach has been evaluated on a crane signal database, which consists of 12 different predefined gestures for maneuvering cranes.
1 Introduction
There are many recent publications which report on the use of Hidden Markov Models (HMMs) for the recognition of human actions in image sequences. For example, Yamato et al. [1], in what is probably the first publication addressing this problem, use discrete HMMs and thus a sequence of VQ-labels in order to recognize six classes representing tennis strokes. In their approach several preprocessing steps including low-pass filtering, background subtraction and binarization are applied to each image of a sequence. The outcome of these steps is a two-level image, where the pose of the human is roughly extracted. Prior to the calculation of the features themselves, size normalization and a centering step are applied to the binarized image. The features themselves are the numbers of black pixels in a mesh, i.e. a subsampled image, arranged in a feature vector. These features are vector quantized and thus the image sequence becomes a sequence of VQ-labels, which can be processed by a discrete HMM (at that time the preferred modeling technique).
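The mesh features and the vector quantization step described above can be illustrated with a short sketch. It is not the implementation of Yamato et al. [1]: the mesh size, the toy codebook, and the function names `mesh_features` and `vq_label` are assumptions made only for this example.

```python
import numpy as np

def mesh_features(binary_image, mesh_shape=(4, 4)):
    """Count foreground pixels in each cell of a regular mesh."""
    rows, cols = mesh_shape
    h, w = binary_image.shape
    # Crop so that the image divides evenly into cells (assumption of this sketch).
    cell_h, cell_w = h // rows, w // cols
    cropped = binary_image[:rows * cell_h, :cols * cell_w]
    cells = cropped.reshape(rows, cell_h, cols, cell_w)
    # Sum inside every cell and arrange the counts in one feature vector.
    return cells.sum(axis=(1, 3)).astype(float).ravel()

def vq_label(feature, codebook):
    """Return the index of the nearest codebook vector (Euclidean distance)."""
    distances = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(distances))
```

Applying `vq_label` to the mesh features of every frame turns an image sequence into the sequence of VQ-labels that a discrete HMM consumes. In practice the codebook would be trained on data, for instance with k-means.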
Schuster and Rigoll also applied discrete HMMs to the task of image sequence recognition in [2]. Their approach utilizes a much simpler preprocessing, which leads to a system with real-time capabilities. The color images of a sequence are subsampled for each RGB plane separately and horizontal or vertical stripes are directly fed into a vector quantizer. Alternatively, the same steps are applied to a difference image sequence. This real-time capable system has been evaluated on a ten-class database, which consists of gestures such as nod-no, nod-yes, kotow and clapping.
The system mentioned above has been improved by utilizing continuous HMMs in conjunction with geometric moments calculated on difference images. As reported in [3], the improved system is capable of classifying 24 gestures with a recognition accuracy of > 90%.
Continuous HMMs in combination with moments are also used by Starner et al. in [4]. This system recognizes American Sign Language by extracting the hands of a person from images and performs a second moment analysis on the extracted blobs. Besides the components derived from the extracted shapes of the hands, dynamic features such as the change of the position between frames are also part of the feature vector.
Most of the systems mentioned previously rely heavily on the existence of motion or moving body parts, due to the calculation of e.g. moments on the difference images. In order to overcome this limitation, we propose the usage of pseudo 3-D HMMs, which are able to integrate features derived from temporal as well as spatial information and which can also perform an elastic matching on the individual images. This is different from the previously mentioned approaches, because either VQ-labels are assigned to whole images ([1, 2]) or global features are calculated ([3], [4]) and thus no elastic matching on the image itself is performed. The elastic matching procedure should also allow a position-invariant recognition of gestures.
This paper is organized as follows. Section 2 gives an introduction to pseudo 3-D HMMs and describes the feature extraction used in the experiments. Section 3 presents experimental results. A summary is given in Section 4.
2 Pseudo 3-D HMMs for the Stochastic Modeling of Three-Dimensional Data
Hidden Markov Models are finite non-deterministic state machines which have been successfully applied to continuous speech [5] and online handwriting recognition [6]. They consist of a fixed number of states with associated output probability density functions (pdfs) as well as transition probabilities $a_{ij} = \Pr(q_t = s_j \mid q_{t-1} = s_i)$, where $q_t$ denotes the actual state at time $t$, $s_j$ is a distinct state and $\vec{o}$ denotes a feature vector. Especially large feature vectors consisting of inhomogeneous components are often divided into statistically independent streams (see e.g. [7]) and thus for $S$ streams and given stream weights $\gamma_s$ the pdf $b_j(\vec{o})$ of state $s_j$ can be calculated as

$$ b_j(\vec{o}) = \prod_{s=1}^{S} \left[ b_{js}(\vec{o}_s) \right]^{\gamma_s} \qquad (1) $$
For every stream $s$, the pdfs $b_{js}(\vec{o}_s)$ are usually given by finite Gaussian mixtures of the form

$$ b_{js}(\vec{o}_s) = \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}(\vec{o}_s; \vec{\mu}_{jsm}, \Sigma_{jsm}) \qquad (2) $$

where $c_{jsm}$ is the mixture coefficient for the $m$th mixture in stream $s$ and $\mathcal{N}(\vec{o}_s; \vec{\mu}_{jsm}, \Sigma_{jsm})$ is a multivariate Gaussian density with mean vector $\vec{\mu}_{jsm}$ and covariance matrix $\Sigma_{jsm}$. The use of streams allows the integration of features derived from temporal as well as spatial data into a single model. Furthermore, the stream weights provide the opportunity to adjust the influence of temporal and spatial features.
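As a minimal sketch of how Eqs. 1 and 2 combine, the following code evaluates a stream-weighted output density in the log domain, where the product over streams becomes a weighted sum. The function names, the argument layout (one tuple of mixture coefficients, means and covariances per stream) and the use of full covariance matrices are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def log_gaussian(o, mean, cov):
    """Log density of a multivariate Gaussian N(o; mean, cov)."""
    d = o.shape[0]
    diff = o - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def log_stream_pdf(o_s, coeffs, means, covs):
    """Eq. 2: log of a finite Gaussian mixture for one stream."""
    logs = np.array([np.log(c) + log_gaussian(o_s, m, cov)
                     for c, m, cov in zip(coeffs, means, covs)])
    top = logs.max()
    return top + np.log(np.exp(logs - top).sum())   # log-sum-exp for stability

def log_state_pdf(streams, mixtures, stream_weights):
    """Eq. 1 in the log domain: sum over streams of weight * log b_js(o_s)."""
    return sum(w * log_stream_pdf(o_s, *mix)
               for o_s, mix, w in zip(streams, mixtures, stream_weights))
```

With one stream for the static features and one for the dynamic features, raising or lowering a stream weight changes how strongly that stream influences the state score.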
An HMM $\lambda(\vec{\pi}, \mathbf{A}, \vec{b})$ with $N$ states is fully described by the $N \times N$-dimensional transition matrix $\mathbf{A}$ with elements $a_{ij}$, the $N$-dimensional output pdf vector $\vec{b}$ and the initial state distribution vector $\vec{\pi}$, which consists of the probabilities $\pi_j = \Pr(q_{t=1} = s_j)$. After the model $\lambda$ has been trained using the Baum-Welch algorithm, feature sequences $\vec{O} = \vec{o}_1, \ldots, \vec{o}_T$ can be scored according to

$$ \Pr(\vec{O} \mid \lambda) = \sum_{q} \pi_{q_1} b_{q_1}(\vec{o}_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} \, b_{q_t}(\vec{o}_t) \qquad (3) $$
Usually the likelihood $\Pr(\vec{O} \mid \lambda)$ is estimated by the Viterbi algorithm, which is an approximation based on the most likely state sequence (the sum over all state sequences $q$ in Eq. 3 is replaced by a maximum). For recognition tasks, $\Pr(\vec{O} \mid \lambda)$ is used to assign an unknown pattern to the class $p^\star$ which satisfies Eq. 4:

$$ p^\star = \operatorname*{argmax}_{p} \Pr(\vec{O} \mid \lambda_p) \qquad (4) $$

A very detailed explanation of the HMM framework is given by Rabiner in [5].
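The scoring and classification steps of Eqs. 3 and 4 can be sketched as follows. The code works in the log domain and assumes that the output densities have already been evaluated, so that `log_B[t, j]` holds the log of the output pdf of state j for the observation at time t. The function names and this input layout are assumptions of the example.

```python
import numpy as np

def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    top = np.max(x, axis=axis, keepdims=True)
    safe = np.where(np.isfinite(top), top, 0.0)
    with np.errstate(divide="ignore"):
        return np.squeeze(safe, axis=axis) + np.log(np.sum(np.exp(x - safe), axis=axis))

def forward_loglik(log_pi, log_A, log_B):
    """Eq. 3: log Pr(O | lambda), summing over all state sequences."""
    alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        alpha = _logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return float(_logsumexp(alpha, axis=0))

def viterbi_loglik(log_pi, log_A, log_B):
    """Viterbi approximation: score of the single most likely state sequence."""
    delta = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    return float(np.max(delta))

def classify(scores):
    """Eq. 4: return the class label whose model gives the highest score."""
    return max(scores, key=scores.get)
```

A recognizer would compute one score per class model, for example `scores = {label: viterbi_loglik(*model, log_B_for(model)) ...}` with a hypothetical helper `log_B_for`, and pass the dictionary to `classify`. The Viterbi score never exceeds the forward score, because a maximum over state sequences is bounded by their sum.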
It has been shown that HMMs can be applied successfully not only to time series problems, but also to pattern recognition problems with the pattern varying in space rather than in time. Therefore, HMMs have recently been applied to image recognition problems with promising results [8, 9]. In both publications pseudo 2-D HMMs, which are also known as planar HMMs, have been utilized. A P2DHMM is an extension of the one-dimensional HMM paradigm, which has been developed in order to model two-dimensional data. They are called pseudo due to the fact that the state alignments of consecutive columns are calculated independently of each other. P2DHMMs are stochastic state machines with a two-dimensional arrangement of the states, as outlined in Fig. 1. The states in horizontal direction are denoted as superstates, and each superstate consists of a one-dimensional HMM in vertical direction. The P2DHMM shown in Fig. 1 can be trained from data, after features have been extracted, using the segmental k-means algorithm. Once the models have been trained for each class, the recognition procedure is accomplished by calculating the class-dependent probability that the (unclassified) data has been generated by the corresponding HMM. For this procedure, the doubly embedded Viterbi algorithm can be utilized, which has been proposed by Kuo and Agazzi in [8]. Alternatively, Samaria shows in [10] that a P2DHMM can be transformed into an equivalent one-dimensional HMM by the insertion of special start-of-line states and features. Fig. 2 shows an augmented 6 × 6 P2DHMM with start-of-line states (indicated by a cross). These states generate a high probability for the emission of start-of-line features. When using the structure in Fig. 2, one has to ensure that the value of the start-of-line feature is different from all possible ordinary features. These equivalent HMMs can be trained by the standard Baum-Welch algorithm and the recognition step can be carried out using the standard Viterbi algorithm.

Figure 1. Pseudo 2-D Hidden Markov Model

Figure 2. Augmented 6 × 6 P2DHMM with start-of-line marker states

Figure 3. Pseudo three-dimensional Hidden Markov Model
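To make the idea of the equivalent one-dimensional HMM more concrete, the sketch below builds the set of allowed transitions for an augmented P2DHMM. The exact topology is an assumption of this example and is not taken from [10] or from Fig. 2: every superstate gets one start-of-line marker state followed by a linear left-to-right chain of ordinary states, and the last state of a chain leads either back to the marker of the same superstate or on to the marker of the next one. The function name `augmented_p2dhmm_topology` is hypothetical.

```python
import numpy as np

def augmented_p2dhmm_topology(n_super, n_sub):
    """Boolean matrix of allowed transitions in the equivalent 1-D HMM."""
    block = n_sub + 1                      # one marker state plus n_sub ordinary states
    n = n_super * block
    allowed = np.zeros((n, n), dtype=bool)
    for k in range(n_super):
        marker = k * block
        first, last = marker + 1, marker + n_sub
        allowed[marker, first] = True      # marker emits once, then the line model starts
        for s in range(first, last + 1):
            allowed[s, s] = True           # stay in the same state
            if s < last:
                allowed[s, s + 1] = True   # advance within the line
        allowed[last, marker] = True       # next line, same superstate
        if k + 1 < n_super:
            allowed[last, marker + block] = True   # next line, next superstate
    return allowed
```

Under these assumptions a 6 × 6 model has 6 · (6 + 1) = 42 states. A transition matrix for Baum-Welch training could be initialized by spreading probability mass uniformly over the `True` entries of each row.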
The natural extension of the two-dimensional case leads to a structure as shown in Fig. 3, which shows a pseudo 3-D HMM. Each superstate now consists of a P2DHMM. We implemented the structure in Fig. 3 by applying the technique suggested by Samaria twice, i.e. by additionally inserting special start-of-image states and features. Due to this implementation technique, the P3DHMM shown in Fig. 3 can be trained from data by applying standard HMM techniques.
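On the feature side, applying the marker technique twice amounts to flattening an image sequence into a single observation sequence in which every image is preceded by a start-of-image feature and every line by a start-of-line feature. The following sketch shows one way to do this. The array layout, the function name `flatten_sequence` and the representation of markers as vectors filled with a reserved value are assumptions of the example; the caller has to choose marker values that cannot occur among ordinary features.

```python
import numpy as np

def flatten_sequence(frames, sol_value, soi_value):
    """Flatten features of shape (T, n_lines, n_windows, D) into one sequence.

    A start-of-image marker precedes every frame and a start-of-line
    marker precedes every line of every frame.
    """
    frames = np.asarray(frames, dtype=float)
    d = frames.shape[-1]
    sol = np.full((1, d), float(sol_value))   # start-of-line marker vector
    soi = np.full((1, d), float(soi_value))   # start-of-image marker vector
    parts = []
    for frame in frames:
        parts.append(soi)
        for line in frame:
            parts.append(sol)
            parts.append(line)
    return np.vstack(parts)
```

The resulting sequence has T · (1 + n_lines · (1 + n_windows)) rows and can be passed to standard one-dimensional HMM training and decoding routines.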
The feature extraction used throughout this paper is based on the discrete cosine transform (DCT). Each image of a sequence is scanned with a sampling window top to bottom and left to right. The pixels $f(x, y)$ in the sampling window of size 16 × 16 are transformed using the DCT according to the equation:

$$ C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{15} \sum_{y=0}^{15} f(x, y) \cos\left(\frac{(2x + 1)u\pi}{32}\right) \cos\left(\frac{(2y + 1)v\pi}{32}\right) \qquad (5) $$

A triangle-shaped mask extracts the first 15 coefficients ($u + v \le 4$), which are arranged in a vector. These DCT coefficients are calculated on the individual images of a sequence (static feature component) as well as on the difference images (dynamic feature component). Due to the utilization of the HMM framework, both features can be integrated by using feature streams and by assigning stream weights in order to control the influence of the individual streams (see also Eq. 1).
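The feature extraction can be sketched as follows. The code evaluates Eq. 5 as a matrix product and keeps the 15 coefficients with u + v ≤ 4. Two points are assumptions of this example because the text does not state them: the normalization factors α(u) are taken to be those of the orthonormal DCT, and the step between neighboring sampling windows is left as a free parameter `step`.

```python
import numpy as np

N = 16  # window size implied by the summation limits in Eq. 5

def dct_matrix(n=N):
    """Row u holds alpha(u) * cos((2x + 1) * u * pi / (2n)) for x = 0..n-1."""
    x = np.arange(n)
    u = x[:, None]
    basis = np.cos((2 * x[None, :] + 1) * u * np.pi / (2 * n))
    alpha = np.full(n, np.sqrt(2.0 / n))
    alpha[0] = np.sqrt(1.0 / n)          # assumed orthonormal scaling
    return alpha[:, None] * basis

def triangle_coefficients(window, max_order=4):
    """Eq. 5 followed by the triangular mask u + v <= max_order."""
    d = dct_matrix(window.shape[0])
    c = d @ window @ d.T                 # c[u, v] = C(u, v) with f(x, y) = window[x, y]
    return np.array([c[u, v]
                     for u in range(max_order + 1)
                     for v in range(max_order + 1 - u)])

def scan_image(image, step=N):
    """Scan top to bottom, then left to right; returns (n_columns, n_rows, 15)."""
    h, w = image.shape
    columns = []
    for left in range(0, w - N + 1, step):
        column = [triangle_coefficients(image[top:top + N, left:left + N].astype(float))
                  for top in range(0, h - N + 1, step)]
        columns.append(column)
    return np.array(columns)
```

The static stream is obtained by calling `scan_image` on each image, and the dynamic stream by calling it on the difference between an image and its predecessor. For a sequence without motion the dynamic coefficients are all zero, which is why the static stream is needed for gestures such as stop.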
3 Experiments and Results
In order to obtain a detailed evaluation of the P3DHMM approach, experiments on a crane signal database consisting of 12 classes have been performed. Crane signals are a well-defined set of gestures which allow a crane to be maneuvered in the presence of obstacles or problematic environments (see also [11]). Fig. 4 shows the 12 classes slew left (right), travel to (from) me, extend (retract) jib, jib up (down), hoist, lower, stop and emergency stop, where the latter two classes are examples of static gestures with hardly any movement involved. Five individuals performed each of the 12 gestures several times; two repetitions of each gesture form the training set, whereas the remaining repetitions are used for testing. Fig. 5 illustrates the two classes jib up and jib down in the upper and lower row, respectively, taken from the stm set. Table 1 shows the recognition accuracies achieved in the experiments and also presents results on the crane signal task using one-dimensional HMMs and geometric moments as described in [3]. In the experiments, four superstates with (5 × 5) P2DHMMs per superstate have been used as the configuration of the P3DHMMs. Note that the P3DHMM approach shows a slightly higher average recognition accuracy (90.88% versus 90.74%) compared to the one-dimensional case. However, there are two more important reasons for using P3DHMMs: One is the fact that static and dynamic gestures can now be mixed and handled with the same recognition paradigm. The other is the possibility that, due to the warping capabilities of the P3DHMM, an elastic matching can be performed on the individual images, which results in a position- and size-invariant gesture recognition mode.

Figure 4. Definition of the twelve crane signals slew left (right), travel to (from) me, extend (retract) jib, jib up (down), hoist, lower, stop and emergency stop [11].
Set       1D HMM    P3DHMM
ste       100%      88.6%
stm       85.3%     91.2%
ank       100%      100%
bw        88.2%     94.1%
jmr       80.5%     80.5%
average   90.74%    90.88%

Table 1. Recognition accuracies achieved in the experiments

4 Summary
Image sequence recognition based on novel pseudo three-dimensional Hidden Markov Models has been presented. The modeling technique allows the integration of spatially and temporally derived features in an elegant way and is also capable of recognizing static gestures where hardly any body movement is involved. Compared to an approach based on one-dimensional HMMs and geometric moments, the P3DHMMs showed a slightly better recognition accuracy on a 12-class crane signal task. Due to the warping capabilities of the P3DHMMs, the proposed approach leads to a position-independent recognition mode. However, this has not been fully evaluated yet and the present publication mainly shows the feasibility of this modeling approach.
References
[1] J. Yamato, J. Ohya, and K. Ishii, “Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model”, In Proc. IEEE Int. Conference on Computer Vision and Pattern Recognition, 1992, pp. 379–385.
[2] M. Schuster and G. Rigoll, “Fast Online Video Image Sequence Recognition with Statistical Methods”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Atlanta, 1996, pp. 3450–3453.
[3] G. Rigoll and A. Kosmala, “New Improved Feature Extraction Methods for Real-Time High Performance Image Sequence Recognition”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Munich, 1997, pp. 3373–3376.
[4] T. Starner, J. Weaver, and A. Pentland, “Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 20, No. 12, Dec. 1998, pp. 1371–1375.
[5] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proc. of the IEEE, Vol. 77, No. 2, Feb. 1989, pp. 257–285.
[6] K. S. Nathan, J. R. Bellegarda, D. Nahamoo, and E. J. Bellegarda, “On-line Handwriting Recognition Using Continuous Parameter Hidden Markov Models”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Minneapolis, 1993, Vol. 5, pp. 121–124.
[7] V. N. Gupta, M. Lennig, and P. Mermelstein, “Integration of Acoustic Information in a Large Vocabulary Word Recognizer”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Dallas, 1987, pp. 697–700.
[8] S. Kuo and O. Agazzi, “Keyword Spotting in Poorly Printed Documents Using Pseudo 2-D Hidden Markov Models”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, No. 8, 1994, pp. 842–848.
[9] S. Eickeler, S. Müller, and G. Rigoll, “High Quality Face Recognition in JPEG Compressed Images”, In Proc. IEEE Int. Conference on Image Processing, Kobe, 1999.
[10] F. S. Samaria, “Face Recognition Using Hidden Markov Models”, Ph.D. Thesis, Cambridge University, 1994.
[11] A. Parrish, “Mechanical Engineer's Reference Book”, Butterworth, London, 1980.