Automated Posture Segmentation in Continuous Finger Spelling Recognition

Nhat Thanh Nguyen
Human Machine Interaction Laboratory
College of Technology, Vietnam National University, Hanoi

The Duy Bui
Human Machine Interaction Laboratory
College of Technology, Vietnam National University, Hanoi
Abstract— Recognizing continuous finger spelling plays an important role in understanding sign language. There are two major phases in recognizing continuous finger spelling: posture segmentation and posture recognition. In the former, a continuous gesture sequence is decomposed into segments, which are then used by the latter to identify the corresponding characters. Among all the segments, besides valid postures corresponding to characters, there are also many movement epentheses, which appear between pairs of postures to move the hands from the end of one posture to the beginning of the next. In this paper, we propose a framework to split a continuous movement sequence into segments as well as to identify valid postures and movement epentheses. By using a velocity filter and a signing rate filter, we obtain very good results, with both high recall and high precision.
Keywords- finger spelling recognition, posture segmentation,
velocity filter, signing rate filter, maximal matching
I. INTRODUCTION

Sign language, a non-verbal language, is a primary means of communication in the deaf community. Different from speech, sign language uses finger spelling and gestures to convey information. Automatic sign language recognition and interpretation concentrate on understanding human signs and translating them into text or speech, which might help to overcome the difficulties in communication between deaf people and the rest of the world. These systems are often developed with two main approaches: vision based and device based. Corresponding to the two approaches, time-serial data is obtained as the input of such systems in two different formats. The vision based approach uses video cameras to capture the gestures of users, while the device based approach depends on sensing gloves to get hand parameters such as joint angles and hand position.

Sign language is presented by sequential gestures in which some gestures carry information, while others are movement epentheses. Movement epentheses are movements that are added between two consecutive valid signs to move the hand from the end of one sign to the beginning of the next. The question that arises is how to identify and locate valid gestures in the time-serial data. Segmentation is one solution to this problem. Segmentation has been considered a critical phase that determines the quality of the later processing of sign language recognition and interpretation systems.
The way to differentiate between meaningful signs and movement epentheses depends on whether the sign language expression manner is gestures or finger spelling. In gestures, a valid segment is where the hand posture, expressed by hand shape, hand position, and hand orientation, together with the movement trajectory of one or both hands, forms a meaningful word or phrase. A movement epenthesis is where the hands transit from the end-point of one sign to the beginning-point of the next sign. Many studies use hand velocity as the cue for gesture segmentation. Tanibata et al. [10] separated valid signs and motion epentheses with the assumption that valid signs have small velocities while motion epentheses have large velocities. In addition, large changes of hand motion were considered as cues for borders. Sagawa and Takeuchi [6] proposed a similar approach. Besides, they also considered the noise made by unstable gestures. They excluded it by comparing the sum of the maximum velocities of two adjacent candidates to predefined thresholds. Meaningful gestures and transitions are separated by acceleration, with meaningful gestures having the minimum acceleration. This method was applied to 100 words of Japanese Sign Language and achieved 80.2% accuracy. Another approach in [14] uses time-varying parameters (TVPs) as the cues to detect correct postures, which have the number of TVPs dropping below a threshold. Gaolin Fang et al. [4] proposed a more effective method. A Simple Recurrent Network (SRN) was used to classify gestures into three output units: the left boundary, the right boundary, and the interior of segments. Using the SRN independently, the segmentation accuracy was 87%. Hence, a self-organizing feature map (SOFM) was added. It was used as the feature extraction network providing inputs for the SRN. It can determine the left boundary and right boundary, used as constraints in the segmentation. With this method, the segmentation recall reaches 98.8%.
Besides gestures, finger spelling plays an important role in sign language. In this manner, finger postures corresponding to letters of the alphabet are presented sequentially and conform to the spelling rules to make words. Segmentation for finger spelling concentrates on marking the points where valid postures occur. Some approaches require unnatural performance to mark the valid segments, such as inserting marks [2] or keeping a posture for about one minute before it is recognized [16]. Harling et al. [12] calculated hand tension and used it as a cue to detect valid segments. The transitions have less tension than valid postures. The transition from one
intentional posture to the next intentional posture goes through a relaxed hand state, which is used to detect letter borders. However, this method was only tested on a small scale and is only suitable for device based systems. Wu et al. [8] used the differences between current frames to separate the moving and the steady parts. H. Birk et al. [5] used two motion cues to solve the temporal segmentation. In addition, they considered the case where the same letter is repeated, e.g. the letter l in hello. The third cue is based on the observation that there is a small amount of motion between the repeated letters. The three cues are combined with an AND operation to obtain the final decision. R. Erenshteyn et al. [13] recognized letters in real time and then used two filters for segmentation of dynamic signing. A low pass filter relies on the difference between frames. Derivative analysis provides the foundation for the second filter. In this filter, the end point of a letter is where there is the greatest variation of recognition results and an additional minimum proximity heuristic is met. The recognition is performed at the midpoints of the segments. The segmentation accuracies of the two filters are 87.8% and 92.3%, respectively. Nevertheless, the first filter leaves many redundant segments, and the second deletes extra middle points.
We propose in this paper a framework to split a continuous movement sequence in finger spelling into segments as well as to identify valid postures and movement epentheses. In our framework, a number of techniques are applied sequentially to identify valid segments in the time-serial data. Firstly, hand velocity is calculated to find the stable candidates where velocities fall under a certain threshold. Then, we apply a filter based on the signing rate, characterized by the posturing duration, to remove redundant segments. A representative value of each valid segment is calculated to be the input of a letter recognition system. After that, words are segmented from the sequence of recognized letters based on maximal matching against predefined words in a Vietnamese dictionary. With our framework, a sentence presented by finger spelling can be segmented quickly and correctly. By combining segmentation techniques, our framework has proved to be an effective method with both high precision and high recall. The signing rate filter based on posturing duration works well in eliminating superfluous segments. The problem of two adjacent letters referring to the same value (e.g. "hello", "litter"), as mentioned in [5], is solved by this filter.
The rest of the paper is organized as follows. In Section 2 we present the segmentation framework and related techniques in detail. Section 3 shows the experimental results and discussion.
II. SEGMENTATION FRAMEWORK
The segmentation framework focuses on separating letters and words from the time-serial data of a sentence presented by finger spelling. Firstly, hand velocity is calculated to find the stable candidates where velocities fall under a predefined threshold. However, many redundant segments are found together with valid segments because this technique is very sensitive to noise. Therefore, in the next step, we apply a filter based on the signing rate, characterized by the posturing duration, to remove superfluous candidates. The letter value, the average of a weighted sum of values in a selected segment, is used for recognition. After that, words need to be recognized from the sequence of recognized letters. Word segmentation is not required if the signer is forced to place a special sign after each word. However, that is not a natural way of signing. In order to separate words automatically from a sequence of characters, we use the maximal matching approach with a Vietnamese dictionary (see the sketch below). This approach is also used to correct mis-recognized letters.
A. Letter segmentation by hand velocity
This technique is based on the nature of finger spelling. In this manner, letters are signed sequentially following the spelling grammar. Each letter is presented by a posture described by hand shape and palm orientation. Postures have to be held stable to be recognized. Therefore, postures correspond to segments having low hand velocity, while transitions are the opposite, as mentioned in [5], [6] and [10]. Based on the hand parameters, the hand velocity is calculated and compared with a predefined threshold to find the candidates for the next step.
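A minimal sketch of this step is shown below: consecutive low-velocity frames are grouped into candidate segments. The threshold value and function name are illustrative assumptions rather than the original implementation.

```python
def find_candidate_segments(velocities, threshold=0.05):
    """Group consecutive frames whose velocity is below the threshold.

    Returns a list of (start_frame, end_frame) index pairs, each a
    candidate posture segment.
    """
    segments = []
    start = None
    for t, v in enumerate(velocities):
        if v < threshold:
            if start is None:
                start = t            # A low-velocity run begins here.
        elif start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:            # Run extends to the last frame.
        segments.append((start, len(velocities) - 1))
    return segments
```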
B. Letter segmentation by signing rate filter
Based on the hand velocity technique, most of the valid segments are detected. However, subtle changes of the hand velocity in unstable postures, as well as noise, create many superfluous candidates. Fortunately, a signer has to hold a posture long enough for people to recognize it. Therefore, in order to remove superfluous candidates, we propose a letter signing rate filter. The letter signing rate refers to the duration of signing a posture. At each segment candidate, we calculate the letter signing rate and compare it to experimentally chosen thresholds (a low threshold and a high threshold). A valid segment is one whose posturing duration lies between the two thresholds (see Figure 1).
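A minimal sketch of this filter is given below. It assumes the frame rate is known so segment length in frames can be converted to a posturing duration in milliseconds; the default thresholds follow the values reported later in the experiments (150 ms and 1500 ms) but are only illustrative here.

```python
def signing_rate_filter(segments, frame_rate_hz, low_ms=150, high_ms=1500):
    """Keep only candidates whose posturing duration lies between the
    low and high signing-rate thresholds."""
    valid = []
    for start, end in segments:
        duration_ms = (end - start + 1) * 1000.0 / frame_rate_hz
        if low_ms <= duration_ms <= high_ms:
            valid.append((start, end))
    return valid
```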
Figure 1. The posture is held for a suitable duration, between the signing rate low threshold and the signing rate high threshold.
C. Letter recognition
At each valid segment, we calculate the representative value of the segment, which is used for recognition. This value is calculated as the average of the sum of the corresponding values over a selected number of frames of the segment.
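For illustration only, one plausible reading of this description is averaging the sensor vectors over the central frames of a valid segment, as in the sketch below; the window size and function name are assumptions.

```python
import numpy as np

def representative_value(frames, start, end, window=5):
    """Average the sensor vectors around the middle of a valid segment.

    frames: array of shape (num_frames, num_sensors); the averaged
    vector is what would be passed to the posture classifier.
    """
    mid = (start + end) // 2
    lo = max(start, mid - window // 2)
    hi = min(end, mid + window // 2)
    return np.mean(frames[lo:hi + 1], axis=0)
```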
We applied the classification method described in [17]. Twenty-three letters (A, B, C, D, Đ, E, G, H, I, K, L, M, N, O, P, Q, R, S, T, U, V, X, and Y) of the Vietnamese Sign Language alphabet are recognized with high recognition accuracy.
In this paper, we have not considered letters with diacritical signs (e.g. Â, Ă, Ô, Ơ, Ê, Ư) or tones (level, high rising, low falling, dipping rising, high rising glottalized, and low glottalized) of the Vietnamese alphabet. Each diacritical sign is presented by an independent sign and follows a particular letter to form another letter. Each tone is formed by a sign combined with a motion. Therefore, the segmentation for them is carried out after the recognition phase and needs additional techniques.
III. EXPERIMENT AND DISCUSSION

A. Data collection and pre-processing
We used the 5DT Data Glove 5 [3] as the input device for our system. The data glove has 18 sensors corresponding to ten positions on the fingers (thumb near, thumb far, index near, index far, middle near, middle far, ring near, ring far, little near, little far), four positions between fingers (thumb/index, index/middle, middle/ring, ring/little), and a position on the back of the hand. The sensors measure and return values of finger flexure, finger spread, and the pitch and roll of the hand. After calibration and normalization, the sensor values are in the range from 0 to 1 (see Figure 2).
We performed the experiment on 594 samples of the 23 letters in the Vietnamese alphabet. In the pre-processing phase, the data received from the sensing glove is smoothed by a Gaussian filter (see Figure 3), based on the 1-D Gaussian distribution with the standard deviation of the distribution set to 1 (Equation 1):
$$G(x) = \frac{1}{\sqrt{2\pi}\,\delta} \, e^{-\frac{x^2}{2\delta^2}}, \quad \text{where } \delta = 1 \qquad (1)$$
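A minimal smoothing sketch is shown below: it builds a discrete 1-D Gaussian kernel with δ = 1 and convolves each sensor channel with it. The kernel radius and function names are assumptions for illustration, not the original pre-processing code.

```python
import numpy as np

def gaussian_smooth(data, sigma=1.0, radius=3):
    """Smooth each sensor channel with a 1-D Gaussian kernel (Equation 1).

    data: array of shape (num_frames, num_sensors), values in [0, 1].
    """
    data = np.asarray(data, dtype=float)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    kernel /= kernel.sum()  # Normalize so smoothed values stay in range.
    smoothed = np.empty_like(data)
    for s in range(data.shape[1]):
        smoothed[:, s] = np.convolve(data[:, s], kernel, mode="same")
    return smoothed
```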
Figure 2. Raw data.

Figure 3. Data smoothed with the Gaussian filter.
B. Segmentation with hand velocity and signing rate filter
With the data collected from the Data Glove, the hand velocity was calculated by Equation 2 at every frame:
$$v(t) = \frac{1}{N} \sum_{i=1}^{N} \left| P(i, t) - P(i, t-1) \right| \qquad (2)$$

where P(i, t) is the value of sensor i at frame t, and N is the number of sensors.
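Equation 2 translates directly into a short sketch; it assumes the smoothed sensor data is a NumPy array with one row per frame, and sets v at the first frame to 0 so the result aligns with the frame indices.

```python
import numpy as np

def hand_velocity(data):
    """Compute v(t) as the mean absolute change of the N sensor values
    between consecutive frames (Equation 2)."""
    data = np.asarray(data, dtype=float)
    diffs = np.abs(np.diff(data, axis=0))   # |P(i, t) - P(i, t-1)|
    v = diffs.mean(axis=1)                  # Average over the N sensors.
    return np.concatenate(([0.0], v))       # v(0) set to 0 for alignment.
```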
Adjacent frames form a candidate if the velocities at these frames are lower than a threshold. Almost all segments containing postures are detected by this technique. However, the number of superfluous segments is rather large. The reason is that unstable postures, e.g. the third segment in Figure 5, or slow movement parts of the hand, e.g. the fourth segment in Figure 5, create invalid segments (noise). We found that noise is often slighter and faster than a posture. Besides, segments that are too long are also abnormal. In most cases, they result from wrong velocity segmentation rather than being formed by signers. Therefore, they are often redundant (see Figure 6). We calculate the posturing duration at each segment candidate. In this experiment, a segment is rejected if its posturing duration is lower than the empirically determined threshold of 150 ms or higher than the empirically determined threshold of 1500 ms.

In addition, Vietnamese Sign Language also has to deal with the problem of two adjacent identical letters, similar to the two l's in hello in American Sign Language. As analyzed in [5], there is a small amount of motion between them. Therefore, the signing rate filter, which is based on posturing duration, can solve this problem (Figure 7).
Figure 4. Valid segments of the word "CHUA".

Figure 5. Five segments are detected by hand velocity and two invalid segments are eliminated by the signing rate filter.

Figure 6. The invalid segment which is too long is eliminated by the signing rate filter.

Figure 7. The problem of two adjacent letters referring to the same value (the word "CO OI" in this example) is solved by the signing rate filter.
C. Results
Precision and recall are calculated by Equation 3 and Equation 4, respectively:

$$\text{Precision} = \frac{\text{Number of Valid Segments}}{\text{Number of Detected Segments}} \times 100\% \qquad (3)$$

$$\text{Recall} = \frac{\text{Number of Valid Segments}}{\text{Number of Actual Segments}} \times 100\% \qquad (4)$$
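For completeness, the two measures map onto a few lines of code; the counts are assumed to be gathered by comparing detected segments against manually marked ground-truth segments.

```python
def precision_recall(num_valid, num_detected, num_actual):
    """Equations 3 and 4: precision over detected segments and recall
    over actual (ground-truth) segments, both in percent."""
    precision = 100.0 * num_valid / num_detected if num_detected else 0.0
    recall = 100.0 * num_valid / num_actual if num_actual else 0.0
    return precision, recall
```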
The hand velocity threshold is tested with five values: 0.02, 0.05, 0.10, 0.15, and 0.20. Threshold values lower than 0.02 or higher than 0.20 do not give good results, so we do not include them here. The results are shown in Table 1 and Table 2. The segmentation achieved the highest precision and recall rates with the two middle velocity thresholds (0.05 or 0.10). In the experiment, if the lower velocity threshold (0.02) is chosen, many valid segments are missed. On the other hand, the higher velocity thresholds (0.15 or 0.20) combine adjacent segments into one, which leads to low recall and precision rates.

Using only velocity segmentation, we could detect most of the correct segments in the time-serial data (recall as high as 96.46%); however, the precision rate was rather low (as low as 68.95% for a recall of 94.46%). Applying the signing rate filter afterwards, we kept the recall rate and increased the precision rate considerably (as high as 95.27% for a recall of 94.95%, and 93.78% for a recall of 96.46%). This shows that the combination of techniques obtains better segmentation results than a single technique. The preserved recall rate proves the effectiveness of the chosen signing rate thresholds.
Table 1. Velocity segmentation with different velocity thresholds.

Velocity Threshold | Precision (%) | Recall (%)
0.02               | 57.08         | 83.50
0.15               | 60.87         | 77.78
0.20               | 57.14         | 60.60
Table 2. Velocity and signing rate segmentation with different velocity thresholds.

Velocity Threshold | Precision (%) | Recall (%)
0.02               | 86.26         | 83.50
0.15               | 87.50         | 77.78
0.20               | 81.08         | 60.60
Figure 8. Precision rate of the two segmentation techniques (hand velocity filter vs. hand velocity and signing rate filter) over the velocity thresholds.

Figure 9. Recall rate of the two segmentation techniques over the velocity thresholds.
IV. CONCLUSION
We have proposed in this paper a framework to separate signing postures from time-serial data. We have applied a number of techniques sequentially in order to identify meaningful segments. Hand velocity is calculated first, and stable candidates with low velocity are selected. A filter based on the signing rate is then applied to remove superfluous segments. Finally, recognized letters are grouped together to form words based on a dictionary. With our framework, we have obtained
high recall and precision rates (94.95% and 95.27%, respectively) in separating valid segments from a continuous stream of data when testing with Vietnamese Sign Language.

In the future, we want to carry out experiments on our framework together with different recognition techniques to evaluate the overall recognition rate. We also want to apply our framework to the vision-based approach.
REFERENCES

[1] Durell Bouchard (2006), "Automated Time Series Segmentation for Human Motion Analysis", http://cg.cis.upenn.edu/hms/research/RIVET/AutomatedTimeSeriesSegmentation.pdf

[2] D. Rubine (1991), "Specifying Gestures by Example", Computer Graphics, pp. 329-337.

[3] Fifth Dimension Technologies (2004), "5DT Data Glove Ultra Series, User's Manual", http://www.5DT.com

[4] Gaolin Fang, Wen Gao, Xilin Chen, Chunli Wang, and Jiyong Ma (2001), "Signer-independent Continuous Sign Language Recognition Based on SRN/HMM", Lecture Notes in Computer Science, vol. 2298, pp. 76-85.

[5] H. Birk, T. B. Moeslund, and C. B. Madsen (1997), "Real-Time Recognition of Hand Alphabet Gestures Using Principal Component Analysis", Proc. Scandinavian Conf. Image Analysis, pp. 261-268.

[6] H. Sagawa and M. Takeuchi (2000), "A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence", Proc. Fourth IEEE International Conf. on Automatic Face and Gesture Recognition, pp. 434-439.

[7] J. Kramer and L. Leifer (1978), "The Talking Glove: An Expressive and Receptive Verbal Communication Aid for the Deaf, Deaf-Blind, and Nonvocal", Proc. Third Ann. Conf. Computer Technology, Special Education, Rehabilitation, pp. 335-340.

[8] J. Wu and W. Gao (2001), "The Recognition of Finger-Spelling for Chinese Sign Language", Proc. Gesture Workshop, pp. 96-100.

[9] N. Chaimanonart and D. J. Young (2006), "Remote RF powering system for wireless MEMS strain sensors", IEEE Sensors Journal, vol. 6, no. 2, pp. 484-489.

[10] N. Tanibata and N. Shimada (2002), "Extraction of Hand Features for Recognition of Sign Language Words", Proc. Int'l Conf. Vision Interface, pp. 391-398.

[11] Peter Vamplew and Anthony Adams (1998), "Recognition of sign language gestures using neural networks", Australian Journal of Intelligent Information Processing Systems, pp. 94-102.

[12] Philip A. Harling and Alistair D. N. Edwards (1996), "Hand tension as a gesture segmentation cue", Proc. of Gesture Workshop on Progress in Gestural Interaction, pp. 75-88.

[13] R. Erenshteyn, P. Laskov, R. Foulds, L. Messing, and G. Stem (1996), "Recognition Approach to Gesture Language Understanding", Proc. Int'l Conf. Pattern Recognition, vol. 3, pp. 431-435.

[14] Rung-Huei Liang and Ming Ouhyoung (1998), "A Real-time Continuous Gesture Recognition System for Sign Language", Proc. Third IEEE International Conf. on Automatic Face and Gesture Recognition, pp. 558-567.

[15] Sylvie C. W. Ong and Surendra Ranganath (2005), "Automatic Sign Language Analysis: A Survey and the Future beyond Lexical Meaning", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 873-891.

[16] T. Takahashi and F. Kishino (1991), "Hand Gesture Coding Based on Experiments Using a Hand Gesture Interface Device", SIGCHI Bulletin, pp. 67-73.

[17] The Duy Bui and Thang Long Nguyen (2007), "Recognizing postures in Vietnamese Sign Language with MEMS accelerometers", IEEE Sensors Journal, vol. 7, no. 5, pp. 707-712.