Automated Posture Segmentation in Continuous Finger Spelling Recognition

Nhat Thanh Nguyen
Human Machine Interaction Laboratory
College of Technology, Vietnam National University, Hanoi

The Duy Bui
Human Machine Interaction Laboratory
College of Technology, Vietnam National University, Hanoi
Abstract— Recognizing continuous finger spelling plays an important role in understanding sign language. There are two major phases in recognizing continuous finger spelling: posture segmentation and posture recognition. In the former, a continuous gesture sequence is decomposed into segments, which are then used by the latter to identify the corresponding characters. Among all the segments, besides valid postures corresponding to characters, there are also many movement epentheses, which appear between pairs of postures to move the hands from the end of one posture to the beginning of the next. In this paper, we propose a framework to split a continuous movement sequence into segments as well as to identify valid postures and movement epentheses. By using a velocity filter and a signing rate filter, we obtain very good results, with both high recall and high precision.
Keywords- finger spelling recognition, posture segmentation,
velocity filter, signing rate filter, maximal matching
I. INTRODUCTION

Sign language, a non-verbal language, is a primary means of communication in the deaf community. Different from speech, sign language uses finger spelling and gestures to convey information. Automatic sign language recognition and interpretation concentrate on understanding human signs and translating them into text or speech, which might help to overcome the difficulties in communication between deaf people and the rest of the world. These systems are often developed with two main approaches: vision based and device based. Corresponding to the two approaches, time-serial data is obtained as the input of such systems in two different formats. The vision based approach uses video cameras to capture the gestures of users, while the device based approach depends on sensing gloves to get hand parameters such as joint angles and hand position.

Sign language is presented by sequential gestures in which some gestures carry information, while others are movement epentheses. Movement epentheses are movements that are added between two consecutive valid signs to move the hand from the end of one sign to the beginning of the next. The question that arises is how to identify and locate valid gestures in the time-serial data. Segmentation is one solution to this problem. Segmentation has been considered a critical phase that determines the quality of the later processing of sign language recognition and interpretation systems.
The way to differentiate between meaningful signs and movement epentheses depends on whether the sign language expression manner is gestures or finger spelling. In gestures, a valid segment is where the hand posture, expressed by hand shape, hand position, and hand orientation, together with the movement trajectory of one or both hands, forms a meaningful word or phrase. A movement epenthesis is where the hands transit from the end-point of one sign to the beginning-point of the next sign. Many studies use hand velocity as the cue for gesture segmentation. Tanibata et al. [10] separated valid signs and motion epentheses with the assumption that valid signs have small velocities while motion epentheses have large velocities. In addition, large changes of hand motion were considered as cues for borders. Sagawa and Takeuchi [6] proposed a similar approach. Besides, they also considered the noise made by unstable gestures. They excluded it by comparing the sum of the maximum velocities of two adjacent candidates to predefined thresholds. Meaningful gestures and transitions are separated by acceleration, with meaningful gestures having the minimum acceleration. This method was applied to 100 words of Japanese Sign Language and achieved 80.2% accuracy. Another approach in [14] uses time-varying parameters (TVPs) as the cues to detect correct postures, which have the number of TVPs dropping below a threshold. Gaolin Fang et al. [4] proposed a more effective method. A Simple Recurrent Network (SRN) was used to classify gestures into three output units: the left boundary, the right boundary, and the interior of segments. Using the SRN independently, the segmentation accuracy was 87%. Hence, a self-organizing feature map (SOFM) was added. It was used as the feature extraction network providing inputs for the SRN. It can determine the left boundary and right boundary, used as constraints in the segmentation. With this method, the segmentation recall reaches 98.8%.
Besides gestures, finger spelling plays an important role in sign language. In this manner, finger postures corresponding to letters of the alphabet are presented sequentially and conform to the spelling rules to make words. Segmentation for finger spelling concentrates on marking the points where valid postures occur. Some approaches require unnatural performance to mark the valid segments, such as inserting marks [2] or keeping a posture for about one minute before it is recognized [16]. Harling et al. [12] calculated hand tension and used it as a cue to detect valid segments. The transitions have less tension than valid postures. The transition from one
intentional posture to the next intentional posture goes through a relaxed hand state, which is used to detect letter borders. However, this method was only tested on a small scale and is only suitable for device based systems. Wu et al. [8] used the differences between current frames to separate the moving and the steady parts. H. Birk et al. [5] used two motion cues to solve the temporal segmentation. In addition, they considered the case where the same letter is repeated, e.g. the letter l in hello. The third cue is based on the observation that there is a small amount of motion between the repeated letters. The three cues are combined with an AND operation to obtain the final decision. R. Erenshteyn et al. [13] recognized letters in real time and then used two filters for segmentation of dynamic signing. A low pass filter relies on the difference between frames. Derivative analysis provides the foundation for the second filter. In this filter, the end point of a letter is where there is the greatest variation of recognition results and an additional minimum proximity heuristic is met. The recognition is performed at the midpoints of the segments. The segmentation accuracies of the two filters are 87.8% and 92.3%, respectively. Nevertheless, the first filter leaves many redundant segments, and the second deletes extra middle points.
We propose in this paper a framework to split a continuous movement sequence in finger spelling into segments as well as to identify valid postures and movement epentheses. In our framework, a number of techniques are applied sequentially to identify valid segments in the time-serial data. Firstly, hand velocity is calculated to find the stable candidates where velocities fall under a certain threshold. Then, we apply a filter based on the signing rate, characterized by the posturing duration, to remove redundant segments. A representative value of each valid segment is calculated to be the input of a letter recognition system. After that, words are segmented from the sequence of recognized letters based on maximal matching against predefined words in a Vietnamese dictionary. With our framework, a sentence presented by finger spelling can be segmented quickly and correctly. By combining segmentation techniques, our framework has proved to be an effective method with both high precision and high recall. The signing rate filter based on posturing duration works well in eliminating superfluous segments. The problem of two adjacent letters referring to the same value (e.g. "hello", "litter"), as mentioned in [5], is solved by this filter.
The rest of the paper is organized as follows. In Section 2 we present the segmentation framework and related techniques in detail. Section 3 shows the experimental results and discussion.
II. SEGMENTATION FRAMEWORK
The segmentation framework focuses on separating letters and words from the time-serial data of a sentence presented by finger spelling. Firstly, hand velocity is calculated to find the stable candidates where velocities fall under a predefined threshold. However, many redundant segments are found together with valid segments because this technique is very sensitive to noise. Therefore, in the next step, we apply a filter based on the signing rate, characterized by the posturing duration, to remove superfluous candidates. The letter value, the average of a weighted sum of values in a selected segment, is used for recognition. After that, words need to be recognized from the sequence of recognized letters. Word segmentation is not required if the signer is forced to place a special sign after each word. However, that is not a natural way of signing. In order to separate words automatically from a sequence of characters, we use the maximal matching approach with a Vietnamese dictionary (see the sketch below). This approach is also used to correct mis-recognized letters.
A. Letter segmentation by hand velocity
This technique is based on the nature of finger spelling. In this manner, letters are signed sequentially following the spelling grammar. Each letter is presented by a posture described by hand shape and palm orientation. Postures have to be held stable to be recognized. Therefore, postures correspond to segments having low hand velocity, while transitions are the opposite, as mentioned in [5], [6] and [10]. Based on the hand parameters, the hand velocity is calculated and compared with a predefined threshold to find the candidates for the next step.
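A minimal sketch of this step is shown below: consecutive low-velocity frames are grouped into candidate segments. The threshold value and function name are illustrative assumptions rather than the original implementation.

```python
def find_candidate_segments(velocities, threshold=0.05):
    """Group consecutive frames whose velocity is below the threshold.

    Returns a list of (start_frame, end_frame) index pairs, each a
    candidate posture segment.
    """
    segments = []
    start = None
    for t, v in enumerate(velocities):
        if v < threshold:
            if start is None:
                start = t            # A low-velocity run begins here.
        elif start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:            # Run extends to the last frame.
        segments.append((start, len(velocities) - 1))
    return segments
```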
B. Letter segmentation by signing rate filter
Based on the hand velocity technique, most of the valid segments are detected. However, subtle changes of the hand velocity in unstable postures, as well as noise, create many superfluous candidates. Fortunately, a signer has to hold a posture long enough for people to recognize it. Therefore, in order to remove superfluous candidates, we propose a letter signing rate filter. The letter signing rate refers to the duration of signing a posture. At each segment candidate, we calculate the letter signing rate and compare it to experimentally chosen thresholds (a low threshold and a high threshold). A valid segment is one whose posturing duration lies between the two thresholds (see Figure 1).
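A minimal sketch of this filter is given below. It assumes the frame rate is known so segment length in frames can be converted to a posturing duration in milliseconds; the default thresholds follow the values reported later in the experiments (150 ms and 1500 ms) but are only illustrative here.

```python
def signing_rate_filter(segments, frame_rate_hz, low_ms=150, high_ms=1500):
    """Keep only candidates whose posturing duration lies between the
    low and high signing-rate thresholds."""
    valid = []
    for start, end in segments:
        duration_ms = (end - start + 1) * 1000.0 / frame_rate_hz
        if low_ms <= duration_ms <= high_ms:
            valid.append((start, end))
    return valid
```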
Figure 1. The posture is held for a suitable duration, between the signing rate low threshold and the signing rate high threshold.
C. Letter recognition
At each valid segment, we calculate the representative value of the segment, which is used for recognition. This value is calculated as the average of the sum of the corresponding values over a selected number of frames of the segment.
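For illustration only, one plausible reading of this description is averaging the sensor vectors over the central frames of a valid segment, as in the sketch below; the window size and function name are assumptions.

```python
import numpy as np

def representative_value(frames, start, end, window=5):
    """Average the sensor vectors around the middle of a valid segment.

    frames: array of shape (num_frames, num_sensors); the averaged
    vector is what would be passed to the posture classifier.
    """
    mid = (start + end) // 2
    lo = max(start, mid - window // 2)
    hi = min(end, mid + window // 2)
    return np.mean(frames[lo:hi + 1], axis=0)
```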
We applied the classification method described in [17]. Twenty-three letters (A, B, C, D, Đ, E, G, H, I, K, L, M, N, O, P, Q, R, S, T, U, V, X, and Y) of the Vietnamese Sign Language alphabet are recognized with high recognition accuracy.
In this paper, we have not considered letters with diacritical signs (e.g. Â, Ă, Ô, Ơ, Ê, Ư) or tones (level, high rising, low falling, dipping rising, high rising glottalized, and low glottalized) of the Vietnamese alphabet. Each diacritical sign is presented by an independent sign and follows a particular letter to form another letter. Each tone is formed by a sign combined with a motion. Therefore, the segmentation for them is carried out after the recognition phase and needs additional techniques.
III. EXPERIMENT AND DISCUSSION

A. Data collection and pre-processing
We used the 5DT Data Glove 5 [3] as the input device for our system. The data glove has 18 sensors corresponding to ten positions on the fingers (thumb near, thumb far, index near, index far, middle near, middle far, ring near, ring far, little near, little far), four positions between fingers (thumb/index, index/middle, middle/ring, ring/little), and a position on the back of the hand. The sensors measure and return values of finger flexure, finger spread, and the pitch and roll of the hand. After calibration and normalization, the sensor values are in the range from 0 to 1 (see Figure 2).
We performed the experiment on 594 samples of the 23 letters in the Vietnamese alphabet. In the pre-processing phase, the data received from the sensing glove is smoothed by a Gaussian filter (see Figure 3), based on the 1-D Gaussian distribution with the standard deviation of the distribution set to 1 (Equation 1):
$$G(x) = \frac{1}{\sqrt{2\pi}\,\delta} \, e^{-\frac{x^2}{2\delta^2}}, \quad \text{where } \delta = 1 \qquad (1)$$
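A minimal smoothing sketch is shown below: it builds a discrete 1-D Gaussian kernel with δ = 1 and convolves each sensor channel with it. The kernel radius and function names are assumptions for illustration, not the original pre-processing code.

```python
import numpy as np

def gaussian_smooth(data, sigma=1.0, radius=3):
    """Smooth each sensor channel with a 1-D Gaussian kernel (Equation 1).

    data: array of shape (num_frames, num_sensors), values in [0, 1].
    """
    data = np.asarray(data, dtype=float)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    kernel /= kernel.sum()  # Normalize so smoothed values stay in range.
    smoothed = np.empty_like(data)
    for s in range(data.shape[1]):
        smoothed[:, s] = np.convolve(data[:, s], kernel, mode="same")
    return smoothed
```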
Figure 2. Raw data.

Figure 3. Data smoothed with the Gaussian filter.
B. Segmentation with hand velocity and signing rate filter
With the data collected from the Data Glove, the hand velocity was calculated by Equation 2 at every frame:
$$v(t) = \frac{1}{N} \sum_{i=1}^{N} \left| P(i, t) - P(i, t-1) \right| \qquad (2)$$

where P(i, t) is the value of sensor i at frame t, and N is the number of sensors.
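Equation 2 translates directly into a short sketch; it assumes the smoothed sensor data is a NumPy array with one row per frame, and sets v at the first frame to 0 so the result aligns with the frame indices.

```python
import numpy as np

def hand_velocity(data):
    """Compute v(t) as the mean absolute change of the N sensor values
    between consecutive frames (Equation 2)."""
    data = np.asarray(data, dtype=float)
    diffs = np.abs(np.diff(data, axis=0))   # |P(i, t) - P(i, t-1)|
    v = diffs.mean(axis=1)                  # Average over the N sensors.
    return np.concatenate(([0.0], v))       # v(0) set to 0 for alignment.
```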
Adjacent frames form a candidate if the velocities at these frames are lower than a threshold. Almost all segments containing postures are detected by this technique. However, the number of superfluous segments is rather large. The reason is that unstable postures, e.g. the third segment in Figure 5, or slow movement parts of the hand, e.g. the fourth segment in Figure 5, create invalid segments (noise). We found that noise is often slighter and faster than a posture. Besides, segments that are too long are also abnormal. In most cases, they result from wrong velocity segmentation rather than being formed by signers. Therefore, they are often redundant (see Figure 6). We calculate the posturing duration at each segment candidate. In this experiment, a segment is rejected if its posturing duration is lower than the empirically determined threshold of 150 ms or higher than the empirically determined threshold of 1500 ms.

In addition, Vietnamese Sign Language also has to deal with the problem of two adjacent identical letters, similar to the two l's in hello in American Sign Language. As analyzed in [5], there is a small amount of motion between them. Therefore, the signing rate filter, which is based on posturing duration, can solve this problem (Figure 7).
Figure 4. Valid segments of the word "CHUA".

Figure 5. Five segments are detected by hand velocity and two invalid segments are eliminated by the signing rate filter.

Figure 6. The invalid segment which is too long is eliminated by the signing rate filter.

Figure 7. The problem of two adjacent letters referring to the same value (the word "CO OI" in this example) is solved by the signing rate filter.
C. Results
Precision and recall are calculated by Equation 3 and Equation 4, respectively:

$$\text{Precision} = \frac{\text{Number of Valid Segments}}{\text{Number of Detected Segments}} \times 100\% \qquad (3)$$

$$\text{Recall} = \frac{\text{Number of Valid Segments}}{\text{Number of Actual Segments}} \times 100\% \qquad (4)$$
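For completeness, the two measures map onto a few lines of code; the counts are assumed to be gathered by comparing detected segments against manually marked ground-truth segments.

```python
def precision_recall(num_valid, num_detected, num_actual):
    """Equations 3 and 4: precision over detected segments and recall
    over actual (ground-truth) segments, both in percent."""
    precision = 100.0 * num_valid / num_detected if num_detected else 0.0
    recall = 100.0 * num_valid / num_actual if num_actual else 0.0
    return precision, recall
```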
The hand velocity threshold is tested with five values: 0.02, 0.05, 0.10, 0.15, and 0.20. Threshold values lower than 0.02 or higher than 0.20 do not give good results, so we do not include them here. The results are shown in Table 1 and Table 2. The segmentation achieved the highest precision and recall rates with the two middle velocity thresholds (0.05 or 0.10). In the experiment, if the lower velocity threshold (0.02) is chosen, many valid segments are missed. On the other hand, the higher velocity thresholds (0.15 or 0.20) combine adjacent segments into one, which leads to low recall and precision rates.

Using only velocity segmentation, we could detect most of the correct segments in the time-serial data (recall as high as 96.46%); however, the precision rate was rather low (as low as 68.95% for a recall of 94.46%). Applying the signing rate filter afterwards, we kept the recall rate and increased the precision rate considerably (as high as 95.27% for a recall of 94.95%, and 93.78% for a recall of 96.46%). This shows that the combination of techniques obtains better segmentation results than a single technique. The preserved recall rate proves the effectiveness of the chosen signing rate thresholds.
Table 1. Velocity segmentation with different velocity thresholds.

Velocity Threshold | Precision (%) | Recall (%)
0.02               | 57.08         | 83.50
0.15               | 60.87         | 77.78
0.20               | 57.14         | 60.60
Table 2. Velocity and signing rate segmentation with different velocity thresholds.

Velocity Threshold | Precision (%) | Recall (%)
0.02               | 86.26         | 83.50
0.15               | 87.50         | 77.78
0.20               | 81.08         | 60.60
Figure 8. Precision rate of the two segmentation techniques (hand velocity filter vs. hand velocity and signing rate filter) over the velocity thresholds.

Figure 9. Recall rate of the two segmentation techniques over the velocity thresholds.
IV. CONCLUSION
We have proposed in this paper a framework to separate signing postures from time-serial data. We have applied a number of techniques sequentially in order to identify meaningful segments. Hand velocity is calculated first, and stable candidates with low velocity are selected. A filter based on the signing rate is then applied to remove superfluous segments. Finally, recognized letters are grouped together to form words based on a dictionary. With our framework, we have obtained
high recall and precision rates (94.95% and 95.27%, respectively) in separating valid segments from a continuous stream of data when testing with Vietnamese Sign Language.

In the future, we want to carry out experiments on our framework together with different recognition techniques to evaluate the overall recognition rate. We also want to apply our framework to the vision-based approach.
REFERENCES

[1] Durell Bouchard (2006), "Automated Time Series Segmentation for Human Motion Analysis", http://cg.cis.upenn.edu/hms/research/RIVET/AutomatedTimeSeriesSegmentation.pdf

[2] D. Rubine (1991), "Specifying Gestures by Example", Computer Graphics, pp. 329-337.

[3] Fifth Dimension Technologies (2004), "5DT Data Glove Ultra Series, User's Manual", http://www.5DT.com

[4] Gaolin Fang, Wen Gao, Xilin Chen, Chunli Wang, and Jiyong Ma (2001), "Signer-independent Continuous Sign Language Recognition Based on SRN/HMM", Lecture Notes in Computer Science, vol. 2298, pp. 76-85.

[5] H. Birk, T. B. Moeslund, and C. B. Madsen (1997), "Real-Time Recognition of Hand Alphabet Gestures Using Principal Component Analysis", Proc. Scandinavian Conf. Image Analysis, pp. 261-268.

[6] H. Sagawa and M. Takeuchi (2000), "A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence", Proc. Fourth IEEE International Conf. on Automatic Face and Gesture Recognition, pp. 434-439.

[7] J. Kramer and L. Leifer (1978), "The Talking Glove: An Expressive and Receptive Verbal Communication Aid for the Deaf, Deaf-Blind, and Nonvocal", Proc. Third Ann. Conf. Computer Technology, Special Education, Rehabilitation, pp. 335-340.

[8] J. Wu and W. Gao (2001), "The Recognition of Finger-Spelling for Chinese Sign Language", Proc. Gesture Workshop, pp. 96-100.

[9] N. Chaimanonart and D. J. Young (2006), "Remote RF powering system for wireless MEMS strain sensors", IEEE Sensors Journal, vol. 6, no. 2, pp. 484-489.

[10] N. Tanibata and N. Shimada (2002), "Extraction of Hand Features for Recognition of Sign Language Words", Proc. Int'l Conf. Vision Interface, pp. 391-398.

[11] Peter Vamplew and Anthony Adams (1998), "Recognition of sign language gestures using neural networks", Australian Journal of Intelligent Information Processing Systems, pp. 94-102.

[12] Philip A. Harling and Alistair D. N. Edwards (1996), "Hand tension as a gesture segmentation cue", Proc. of Gesture Workshop on Progress in Gestural Interaction, pp. 75-88.

[13] R. Erenshteyn, P. Laskov, R. Foulds, L. Messing, and G. Stem (1996), "Recognition Approach to Gesture Language Understanding", Proc. Int'l Conf. Pattern Recognition, vol. 3, pp. 431-435.

[14] Rung-Huei Liang and Ming Ouhyoung (1998), "A Real-time Continuous Gesture Recognition System for Sign Language", Proc. Third IEEE International Conf. on Automatic Face and Gesture Recognition, pp. 558-567.

[15] Sylvie C. W. Ong and Surendra Ranganath (2005), "Automatic Sign Language Analysis: A Survey and the Future beyond Lexical Meaning", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 873-891.

[16] T. Takahashi and F. Kishino (1991), "Hand Gesture Coding Based on Experiments Using a Hand Gesture Interface Device", SIGCHI Bulletin, pp. 67-73.

[17] The Duy Bui and Thang Long Nguyen (2007), "Recognizing postures in Vietnamese Sign Language with MEMS accelerometers", IEEE Sensors Journal, vol. 7, no. 5, pp. 707-712.