Crane Gesture Recognition Using Pseudo 3-D Hidden Markov Models

Stefan Müller, Stefan Eickeler, Gerhard Rigoll

Gerhard-Mercator-University Duisburg, Department of Computer Science, Faculty of Electrical Engineering

47057 Duisburg, Germany. E-mail: {stm,eickeler,rigoll}@fb9-ti.uni-duisburg.de

Abstract

A recognition technique based on novel pseudo 3-D Hidden Markov Models, which can integrate spatially as well as temporally derived features, is presented in this paper. The approach allows the recognition of dynamic gestures such as waving hands as well as static gestures such as standing in a special pose. Pseudo 3-D Hidden Markov Models (P3DHMMs) are an extension of the pseudo 2-D case, which has been successfully used for the classification of images and the recognition of faces. In the P3DHMM case the so-called superstates contain P2DHMMs, and thus whole image sequences can be generated by these models. Our approach has been evaluated on a crane signal database, which consists of 12 different predefined gestures for maneuvering cranes.

1 Introduction

Many recent publications report on the use of Hidden Markov Models (HMMs) for the recognition of human actions in image sequences. For example, Yamato et al. [1], in what is probably the first publication addressing this problem, use discrete HMMs and thus a sequence of VQ labels in order to recognize six classes representing tennis strokes. In their approach several preprocessing steps, including low-pass filtering, background subtraction and binarization, are applied to each image of a sequence. The outcome of these steps is a two-level image in which the pose of the human is roughly extracted. Prior to the calculation of the features themselves, size normalization and a centering step are applied to the binarized image. The features are the numbers of black pixels in the cells of a mesh, i.e. a subsampled image, arranged in a feature vector. These features are vector quantized, and thus the image sequence becomes a sequence of VQ labels, which can be processed by a discrete HMM (at that time the preferred modeling technique).
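The mesh feature and the subsequent vector quantization can be illustrated with a minimal Python sketch. It is not the implementation used in [1]: the function names, the mesh size, the convention that foreground pixels are stored as 1, and the toy codebook are all assumptions made for illustration.

```python
def mesh_features(binary_image, rows, cols):
    """Count foreground pixels (stored as 1) in each cell of a rows x cols mesh."""
    height, width = len(binary_image), len(binary_image[0])
    features = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            features.append(
                sum(binary_image[y][x] for y in range(y0, y1) for x in range(x0, x1))
            )
    return features


def vq_label(vector, codebook):
    """Return the index of the nearest codeword (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


# Hypothetical 4 x 4 binarized image and a two-entry codebook.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
codebook = [[0, 0, 0, 0], [4, 0, 0, 0]]
label = vq_label(mesh_features(image, 2, 2), codebook)
```

Applying this to every image of a sequence yields one label per frame, which is the kind of symbol sequence a discrete HMM consumes.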

Schuster and Rigoll also applied discrete HMMs to the task of image sequence recognition in [2]. Their approach utilizes a much simpler preprocessing, which leads to a system with real-time capabilities. The color images of a sequence are subsampled for each RGB plane separately, and horizontal or vertical stripes are directly fed into a vector quantizer. Alternatively, the same steps are applied to a difference image sequence. This real-time capable system has been evaluated on a ten-class database, which consists of gestures such as nod-no, nod-yes, kotow and clapping.

The system mentioned above has been improved by utilizing continuous HMMs in conjunction with geometric moments calculated on difference images. As reported in [3], the improved system is capable of classifying 24 gestures with a recognition accuracy of > 90%.

Continuous HMMs in combination with moments are also used by Starner et al. in [4]. This system recognizes American Sign Language by extracting the hands of a person from images and performs a second moment analysis on the extracted blobs. Besides the components derived from the extracted shapes of the hands, dynamic features such as the change of the position between frames are also part of the feature vector.

Most of the systems mentioned previously rely heavily on the existence of motion or moving body parts, due to the calculation of e.g. moments on the difference images. In order to overcome this limitation, we propose the usage of pseudo 3-D HMMs, which are able to integrate features derived from temporal as well as spatial information and which can also perform an elastic matching on the individual images. This is different from the previously mentioned approaches, because either VQ labels are assigned to whole images ([1, 2]) or global features are calculated ([3, 4]), and thus no elastic matching on the image itself is performed. The elastic matching procedure should also allow a position-invariant recognition of gestures.

This paper is organized as follows. Section 2 gives an introduction to pseudo 3-D HMMs and describes the feature extraction used in the experiments. Section 3 presents experimental results. A summary is given in Section 4.

2 Pseudo 3-D HMMs for the Stochastic Modeling of Three-Dimensional Data

Hidden Markov Models are finite non-deterministic state machines which have been successfully applied to continuous speech [5] and online handwriting recognition [6]. They consist of a fixed number of states with associated output probability density functions (pdfs) as well as transition probabilities $a_{ij} = \Pr(q_t = s_j \mid q_{t-1} = s_i)$, where $q_t$ denotes the actual state at time $t$, $s_j$ is a distinct state and $\vec{o}$ denotes a feature vector. Especially large feature vectors consisting of inhomogeneous components are often divided into statistically independent streams (see e.g. [7]), and thus for $S$ streams and given stream weights $\gamma_s$ the pdf $b_j(\vec{o})$ of state $s_j$ can be calculated as

$$b_j(\vec{o}) = \prod_{s=1}^{S} \left[ b_{js}(\vec{o}_s) \right]^{\gamma_s} \qquad (1)$$

For every stream $s$, the pdfs $b_{js}(\vec{o}_s)$ are usually given by finite Gaussian mixtures of the form

$$b_{js}(\vec{o}_s) = \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(\vec{o}_s; \vec{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm}) \qquad (2)$$

where $c_{jsm}$ is the mixture coefficient for the $m$th mixture in stream $s$ and $\mathcal{N}(\vec{o}_s; \vec{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm})$ is a multivariate Gaussian density with mean vector $\vec{\mu}_{jsm}$ and covariance matrix $\boldsymbol{\Sigma}_{jsm}$. The use of streams allows the integration of features derived from temporal as well as spatial data into a single model. Furthermore, the stream weights provide the opportunity to adjust the influence of temporal and spatial features.
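The interplay of Eq. 1 and Eq. 2 can be sketched in a few lines of Python. The sketch works in the log domain, where the weighted product over streams becomes a weighted sum, and it assumes diagonal covariance matrices for brevity; that restriction, the function names and the data layout are illustrative assumptions and are not taken from the paper.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log of a multivariate Gaussian density with diagonal covariance (assumed)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )


def log_mixture(o, weights, means, variances):
    """Log of Eq. 2: log sum_m c_m N(o; mu_m, Sigma_m), computed via log-sum-exp."""
    terms = [
        math.log(c) + log_gaussian_diag(o, mu, var)
        for c, mu, var in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_emission(streams, mixtures, stream_weights):
    """Log of Eq. 1: sum_s gamma_s * log b_js(o_s) for one state j."""
    return sum(
        gamma * log_mixture(o_s, *mix)
        for o_s, mix, gamma in zip(streams, mixtures, stream_weights)
    )


# One state with two streams (e.g. static and dynamic features), each modelled
# by a single standard-normal mixture component; all values are made up.
mix = ([1.0], [[0.0]], [[1.0]])
score = log_emission([[0.0], [0.5]], [mix, mix], [1.0, 0.5])
```

Setting a stream weight to zero removes that stream from the score, which is one way to see how the weights trade off temporal against spatial evidence.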

An HMM $\lambda(\vec{\pi}, \mathbf{a}, \vec{b})$ with $N$ states is fully described by the $N \times N$-dimensional transition matrix $\mathbf{a}$ with elements $a_{ij}$, the $N$-dimensional output pdf vector $\vec{b}$ and the initial state distribution vector $\vec{\pi}$, which consists of the probabilities $\pi_j = \Pr(q_1 = s_j)$. After the model $\lambda$ has been trained using the Baum-Welch algorithm, feature sequences $\vec{O} = \vec{o}_1, \ldots, \vec{o}_T$ can be scored according to

$$\Pr(\vec{O} \mid \lambda) = \sum_{q} \pi_{q_1} b_{q_1}(\vec{o}_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(\vec{o}_t) \qquad (3)$$

Usually the likelihood $\Pr(\vec{O} \mid \lambda)$ is estimated by the Viterbi algorithm, which is an approximation based on the most likely state sequence (i.e. the sum over all state sequences in Eq. 3 is replaced by a maximum). For recognition tasks, $\Pr(\vec{O} \mid \lambda)$ is used to assign an unknown pattern to the class $p^\star$ which satisfies Eq. 4:

$$p^\star = \operatorname*{argmax}_{p} \Pr(\vec{O} \mid \lambda_p) \qquad (4)$$

A very detailed explanation of the HMM framework is given by Rabiner in [5].
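As a small illustration of Eq. 3, its Viterbi approximation and the decision rule of Eq. 4, the following Python sketch scores a toy two-state model. It uses plain probabilities for readability, whereas practical systems work with log probabilities to avoid underflow; the function names and all numbers are assumptions chosen for the example.

```python
def forward_likelihood(pi, a, b):
    """Eq. 3 via the forward recursion; b[t][j] is the pdf value of state j at time t."""
    n = len(pi)
    alpha = [pi[j] * b[0][j] for j in range(n)]
    for t in range(1, len(b)):
        alpha = [
            sum(alpha[i] * a[i][j] for i in range(n)) * b[t][j] for j in range(n)
        ]
    return sum(alpha)


def viterbi_likelihood(pi, a, b):
    """Viterbi approximation: the sum over predecessor states becomes a maximum."""
    n = len(pi)
    delta = [pi[j] * b[0][j] for j in range(n)]
    for t in range(1, len(b)):
        delta = [
            max(delta[i] * a[i][j] for i in range(n)) * b[t][j] for j in range(n)
        ]
    return max(delta)


def classify(scores):
    """Eq. 4: pick the class whose model gives the highest score."""
    return max(scores, key=scores.get)


# Toy left-to-right model with two states and two observations.
pi = [1.0, 0.0]
a = [[0.5, 0.5], [0.0, 1.0]]
b = [[0.9, 0.1], [0.2, 0.8]]
exact = forward_likelihood(pi, a, b)
approx = viterbi_likelihood(pi, a, b)

# Hypothetical per-class log scores for one unknown sequence.
best = classify({"hoist": -120.5, "lower": -98.2, "stop": -143.0})
```

The Viterbi score never exceeds the exact likelihood, since it keeps only the single best path out of all paths summed in Eq. 3.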

It has been shown that HMMs can be applied successfully not only to time series problems but also to pattern recognition problems in which the pattern varies in space rather than in time. Therefore, HMMs have recently been applied to image recognition problems with promising results [8, 9]. In both publications pseudo 2-D HMMs (P2DHMMs), which are also known as planar HMMs, have been utilized. A P2DHMM is an extension of the one-dimensional HMM paradigm, which has been developed in order to model two-dimensional data. They are called pseudo due to the fact that the state alignments of consecutive columns are calculated independently of each other. P2DHMMs are stochastic state machines with a two-dimensional arrangement of the states, as outlined in Fig. 1. The states in horizontal direction are denoted as superstates, and each superstate consists of a one-dimensional HMM in vertical direction.

Figure 1. Pseudo 2-D Hidden Markov Model

The P2DHMM shown in Fig. 1 can be trained from data, after features have been extracted, using the segmental k-means algorithm. Once the models have been trained for each class, the recognition procedure is accomplished by calculating the class-dependent probability that the (unclassified) data has been generated by the corresponding HMM. For this procedure, the doubly embedded Viterbi algorithm can be utilized, which has been proposed by Kuo and Agazzi in [8]. Alternatively, Samaria shows in [10] that a P2DHMM can be transformed into an equivalent one-dimensional HMM by the insertion of special start-of-line states and features. Fig. 2 shows an augmented 6 × 6 P2DHMM with start-of-line states (indicated by a cross). These states generate a high probability for the emission of start-of-line features. When using the structure in Fig. 2, one has to ensure that the value of the start-of-line feature is different from all possible ordinary features. These equivalent HMMs can be trained by the standard Baum-Welch algorithm, and the recognition step can be carried out using the standard Viterbi algorithm.

Figure 2. Augmented 6 × 6 P2DHMM with start-of-line marker states

Figure 3. Pseudo three-dimensional Hidden Markov Model

The natural extension of the two-dimensional case leads to a structure as shown in Fig. 3, which shows a pseudo 3-D HMM. Each superstate now consists of a P2DHMM. We implemented the structure in Fig. 3 by applying the technique suggested by Samaria twice, i.e. by additionally inserting special start-of-image states and features. Due to this implementation technique, the P3DHMM shown in Fig. 3 can be trained from data by applying standard HMM techniques.
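On the data side, applying the marker technique twice amounts to flattening an image sequence into a single one-dimensional observation sequence that contains start-of-image and start-of-line markers. The Python sketch below shows one possible layout. The marker names, the tuple representation and the choice to emit the start-of-image marker before the first start-of-line marker are assumptions for illustration; the paper does not specify these details.

```python
# Hypothetical marker tags; they only need to differ from ordinary features.
START_OF_IMAGE = "SOI"
START_OF_LINE = "SOL"
FEATURE = "FEATURE"


def linearize(sequence):
    """Flatten sequence[t][line][k] (feature vectors) into one observation list.

    Each image contributes a start-of-image marker, and each scan line inside
    an image contributes a start-of-line marker followed by its feature vectors.
    """
    observations = []
    for image in sequence:
        observations.append((START_OF_IMAGE, None))
        for line in image:
            observations.append((START_OF_LINE, None))
            for vector in line:
                observations.append((FEATURE, vector))
    return observations


# Two images, each with two scan lines of three (one-dimensional) feature vectors.
toy_sequence = [
    [[[0.1], [0.2], [0.3]], [[0.4], [0.5], [0.6]]],
    [[[0.7], [0.8], [0.9]], [[1.0], [1.1], [1.2]]],
]
flat = linearize(toy_sequence)
```

Because the result is an ordinary one-dimensional sequence, it can be passed to standard Baum-Welch training and Viterbi decoding, provided the marker observations are encoded so that only the corresponding marker states emit them with high probability.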

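The feature extraction used throughout this paper is based on the discrete cosine transform (DCT). Each image of a sequence is scanned with a sampling window from top to bottom and from left to right. The pixels $f(x, y)$ in the sampling window of size $16 \times 16$ are transformed using the DCT according to the equation

$$C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{15} \sum_{y=0}^{15} f(x, y) \cos\left[\frac{(2x+1)u\pi}{32}\right] \cos\left[\frac{(2y+1)v\pi}{32}\right] \qquad (5)$$

where $\alpha(u)$ and $\alpha(v)$ denote the normalization factors of the DCT.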

A triangle-shaped mask extracts the first 15 coefficients ($u + v \le 4$), which are arranged in a vector. These DCT coefficients are calculated on the individual images of a sequence (static feature component) as well as on the difference images (dynamic feature component). Owing to the HMM framework, both feature components can be integrated by using feature streams and by assigning stream weights in order to control the influence of the individual streams (see also Eq. 1).
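The feature computation of Eq. 5 together with the triangular mask can be sketched as follows. The sketch assumes the orthonormal DCT normalization (a factor of sqrt(1/16) for index 0 and sqrt(2/16) otherwise) and an arbitrary ordering of the 15 coefficients inside the vector; neither detail is stated in the text, and the function names are illustrative.

```python
import math

N = 16  # window size implied by the summation limits in Eq. 5


def alpha(k):
    """Assumed orthonormal DCT normalization factor."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct_coefficient(block, u, v):
    """C(u, v) of Eq. 5 for a 16 x 16 block indexed as block[x][y]."""
    total = 0.0
    for x in range(N):
        cx = math.cos((2 * x + 1) * u * math.pi / (2 * N))
        for y in range(N):
            total += block[x][y] * cx * math.cos((2 * y + 1) * v * math.pi / (2 * N))
    return alpha(u) * alpha(v) * total


def triangle_features(block, limit=4):
    """Keep the coefficients with u + v <= limit (15 values for limit = 4)."""
    return [
        dct_coefficient(block, u, v)
        for u in range(limit + 1)
        for v in range(limit + 1 - u)
    ]


def difference_block(current, previous):
    """Pixel-wise difference of two blocks (input for the dynamic stream)."""
    return [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(current, previous)]


# A constant block: only the DC coefficient C(0, 0) is non-zero.
flat_block = [[1.0] * N for _ in range(N)]
static = triangle_features(flat_block)
dynamic = triangle_features(difference_block(flat_block, flat_block))
```

The static and dynamic vectors obtained this way would form the two streams whose influence is balanced by the stream weights of Eq. 1.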

3 Experiments and Results

In order to obtain a detailed evaluation of the P3DHMM approach, experiments on a crane signal database consisting of 12 classes have been performed. Crane signals are a well-defined set of gestures which allow a crane to be maneuvered in the presence of obstacles or in problematic environments (see also [11]). Fig. 4 shows the 12 classes slew left (right), travel to (from) me, extend (retract) jib, jib up (down), hoist, lower, stop and emergency stop, where the latter two classes represent two examples of static gestures with hardly any movement involved. Five individuals performed each of the 12 gestures several times; two repetitions of each gesture form the training set, whereas the remaining repetitions are used for testing. Fig. 5 illustrates the two classes jib up and jib down in the upper and lower row, respectively, taken from the stm set. Table 1 shows the recognition accuracies achieved in the experiments and also presents results on the crane signal task using one-dimensional HMMs and geometric moments as described in [3]. In the experiments, four superstates with (5 × 5) P2DHMMs per superstate have been used as the configuration of the P3DHMMs. Note that the P3DHMM approach shows a slightly higher average recognition accuracy compared to the one-dimensional case (90.88% versus 90.74%, see Table 1). However, there are two more important reasons for using P3DHMMs. One is the fact that static and dynamic gestures can now be mixed and handled with the same recognition paradigm. The other is that, due to the warping capabilities of the P3DHMM, an elastic matching can be performed on the individual images, which results in a position- and size-invariant gesture recognition mode.

Figure 4. Definition of the twelve crane signals slew left (right), travel to (from) me, extend (retract) jib, jib up (down), hoist, lower, stop and emergency stop [11].

4 Summary

Image sequence recognition based on novel pseudo three-dimensional Hidden Markov Models has been presented. The modeling technique allows the integration of spatially and temporally derived features in an elegant way and is also capable of recognizing static gestures where hardly any body movement is involved. Compared to an approach based on one-dimensional HMMs and geometric moments, the P3DHMMs showed a slightly better recognition accuracy on a 12-class crane signal task. Due to the warping capabilities of the P3DHMMs, the proposed approach leads to a position-independent recognition mode. However, this has not been fully evaluated yet, and the present publication mainly shows the feasibility of this modeling approach.

Set        1D HMM    P3DHMM
ste        100%      88.6%
stm        85.3%     91.2%
ank        100%      100%
bw         88.2%     94.1%
jmr        80.5%     80.5%
average    90.74%    90.88%

Table 1. Recognition accuracies achieved in the experiments

References

[1] J. Yamato, J. Ohya, and K. Ishii, “Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model”, In Proc. IEEE Int. Conference on Computer Vision and Pattern Recognition, 1992, pp. 379–385.

[2] M. Schuster and G. Rigoll, “Fast Online Video Image Sequence Recognition with Statistical Methods”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Atlanta, 1996, pp. 3450–3453.

[3] G. Rigoll and A. Kosmala, “New Improved Feature Extraction Methods for Real-Time High Performance Image Sequence Recognition”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Munich, 1997, pp. 3373–3376.

[4] T. Starner, J. Weaver, and A. Pentland, “Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 20, No. 12, Dec. 1998, pp. 1371–1375.

[5] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proc. of the IEEE, Vol. 77, No. 2, Feb. 1989, pp. 257–285.

[6] K. S. Nathan, J. R. Bellegarda, D. Nahamoo, and E. J. Bellegarda, “On-line Handwriting Recognition Using Continuous Parameter Hidden Markov Models”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Minneapolis, 1993, Vol. 5, pp. 121–124.

[7] V. N. Gupta, M. Lennig, and P. Mermelstein, “Integration of Acoustic Information in a Large Vocabulary Word Recognizer”, In Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Dallas, 1987, pp. 697–700.

[8] S. Kuo and O. Agazzi, “Keyword Spotting in Poorly Printed Documents Using Pseudo 2-D Hidden Markov Models”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, No. 8, 1994, pp. 842–848.

[9] S. Eickeler, S. Müller, and G. Rigoll, “High Quality Face Recognition in JPEG Compressed Images”, In Proc. IEEE Int. Conference on Image Processing, Kobe, 1999.

[10] F. S. Samaria, “Face Recognition Using Hidden Markov Models”, Ph.D. Thesis, Cambridge University, 1994.

[11] A. Parrish, “Mechanical Engineer's Reference Book”, Butterworth, London, 1980.

Figure 5. Image sequences of the classes jib up (upper row) and jib down (lower row), taken from the stm set
