Road traffic control gesture recognition using Microsoft Kinect
University of Engineering and Technology (Trường Đại học Công Nghệ)
Entry for the "Student Scientific Research" award
Our study concentrates on building an intelligent system for smart vehicles. Specifically, this system identifies the traffic control commands of a police officer in order to propose the right decision to the driver. Our work enables a smart vehicle to detect and recognize a traffic officer on the road. Technically, we use the built-in depth sensor of the Microsoft Kinect to capture images for the recognition system. Unlike images from an RGB camera, a depth image provides depth (i.e., 3D) information and is invariant to color and texture. By incorporating spatial-temporal invariance into the geometric features and applying a machine learning classifier, we are able to predict the traffic control command from the captured depth information. The feature vector is constructed from the relative angles between human body parts, which can be extracted with the Kinect. We present experimental results on a test set of more than 30,000 frames covering 6 kinds of traffic commands. Using both K-means and a Support Vector Machine (SVM) for classification, the better result, about 99.8%, is obtained by the SVM classifier. Moreover, the application of this system runs steadily in real time.
Contents

1 Problem statement
2 Related work
2.1 Human body parts recognition using Microsoft Kinect
2.2 Traffic gesture recognition
2.3 Real-time human pose recognition in parts from single depth images
2.4 Hand tracking and gesture recognition
3 Problem solution
3.1 Proposed approach
3.1.1 Feature vector selection
3.1.2 Training and classification
3.2 Experiment
3.3 Testbed system
4 Conclusion
5 Future work
5.1 Dynamic gesture recognition problem
5.2 On-going method
6 References
Index of Figures and Tables
Figure 1. Traffic gestures and skeletal joints
Figure 2. Block diagram of the Prime Sense reference design [20]
Figure 3. OpenNI's kinematic model of the human body
Figure 4. Chinese traffic gestures
Figure 5. Direct k-means clustering algorithm
Figure 6. Diagram of information flow in our system
Figure 7. The proposed system GUI and the result of gesture recognition
Figure 8. The change diagram of gesture 1 (Table 7) with 950 frames recorded

Table 1. Three types of traffic control command
Table 2. Six traffic gestures defined
Table 3. Rotation joint angles related to each arm
Table 4. Rotation joint angle between two body parts
Table 5. The K-means and SVM detailed accuracy results by class
Table 6. Six dynamic gestures
Table 7. Geometric function G attributes
1 Problem statement
Human traffic control is preferred in developing nations because of the relatively small number of cars, the few major intersections, and the low cost of human traffic controllers [3]. In a human traffic control environment, drivers must follow the directions given by the traffic police officer in the form of body gestures. To improve the safety of drivers, our research team is developing a novel method to automatically recognize traffic control gestures.
A few methods for traffic control gesture recognition have been developed in the literature. Fan Guo et al. [6] recognized police gestures from the corresponding body parts on the color image plane. The detection results of this method were heavily affected by background and outdoor illumination, because the traffic police officer in a complex scene is detected by extracting the officer's reflective traffic vest using color thresholding. Yuan Tao et al. [23] fixed on-body sensors on the back of each hand of the police officer to extract gesture data. Although this accelerometer-based sensor may output accurate hand positions, it is an extra hindrance to the police officer and requires a dedicated communication protocol for vehicles. Meghna Singh et al. [11] used the Radon transform to recognize air marshals' hand gestures for steering aircraft on the runway. However, since a relatively stationary background is required in the video sequence, this method is not practical for traffic scenes.
Human gesture recognition for traffic control can be related to that for human-robot interaction. Bauer et al. [6] presented an interaction system in which a robot asks a human for directions and then interprets the given directions. This system includes a vision component where the full body pose is inferred from a stereo image pair. However, this fitting process is rather slow and does not work in real time. Waldherr et al. [5] presented a template-based hand gesture recognition system for a mobile robot, with gestures for the robot to stop or follow, and rudimentary pointing. As the gesture system is based on a color-based tracker, several limitations are imposed on the types of clothing and the contrast with the background. In [16], Van den Bergh et al. introduced a real-time hand gesture interaction system based on a Time-of-Flight (ToF) camera. Both the depth images from the ToF camera and the color images from the RGB camera are used for a Haarlet-based hand gesture classification. Similar ToF-based systems were also described in the literature [18][5][21]. The use of the ToF camera allows for a recognition system that is robust to all colors of clothing, to background noise, and to other people standing around. However, ToF cameras are expensive and suffer from a very low resolution and a narrow angle of view. M. V. Bergh et al. [13] implemented a pointing hand gesture recognition algorithm based on the Kinect sensor to tell a robot where to go. Although this system can be used for real-time robot control, it cannot be applied directly to the traffic control situation because the meaningful gestures are limited to hand pointing only.
In the Vietnamese traffic control system, a human traffic controller assesses the traffic in the visual range around the intersection. Based on his observation, he makes decisions and gives traffic signals to all incoming vehicle drivers in the form of his arms' directions and movements. In this research, we only consider the directions of the arms for classifying traffic control commands. Based on observations at real traffic intersections in Vietnam, we categorize the control commands into three types, as shown in Table 1.
Type   Control command                                                       Gesture
1      ... directions                                                        Left/right arm raises straight up to ...
2      Stop all vehicles in front of and behind the traffic police officer   Left/right arm raises to the left/right to ...
3      ... behind the traffic police officer                                 Left/right arm raises to the front to ...

Table 1. Three types of traffic control command

From these control command types, six traffic gestures can be constructed. Each traffic gesture is a combination of the arms' directions, as listed in Table 2.
Table 2. Six traffic gestures defined
As stated in the previous section, human body parts, including arm directions, can be represented by a skeleton model consisting of 15 joints, namely: head, neck, torso, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, left hip, right hip, left knee, right knee, left foot, and right foot. Therefore, the recognition of traffic gestures can be done using the skeleton model. Figure 1 depicts two examples of traffic gestures and their skeletal joints.

Since the skeleton model represents human body parts simply as a set of relative joints, the skeleton offers a significant recognition advantage over raw depth and color information. Therefore, instead of directly recognizing human parts from depth and color images, we perform skeleton recognition after preprocessing the Kinect's depth images with the OpenNI library.
Figure 1. Traffic gestures and skeletal joints
In this research, we separate gestures into two types for recognition: static and dynamic. Based on the descriptions in Table 1, the commands of a traffic officer are obviously static gestures. We have completed the system for recognizing static gestures, and we are pursuing an on-going approach to dynamic gesture recognition in order to cover a wider variety of human gestures.

Our completed approach is a real-time human body gesture recognition method for road traffic control. In this method, the 6 body gestures used by a police officer to control the flow of vehicles at a common intersection are recognized using the Microsoft Kinect. In order to recognize the defined gestures, a depth sensor is used to generate a depth map of the scene where the traffic police officer stands. Then, a skeleton representation of the officer's body is computed, and a feature vector is created from the joints of the skeleton model.
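The information flow just described can be condensed into a short sketch. This is only a minimal outline under our own naming: read_depth_frame and extract_skeleton are hypothetical stand-ins for the OpenNI-based acquisition, classifier is a trained model, and feature_vector is developed in Section 3.1.1.

    # Minimal sketch of the recognition pipeline described above.
    # read_depth_frame() and extract_skeleton() are hypothetical wrappers
    # around the Kinect/OpenNI acquisition; classifier is a trained model.
    def recognize_stream(sensor, classifier):
        """Map a stream of depth frames to traffic-command labels."""
        while True:
            depth_map = read_depth_frame(sensor)       # 640x480 depth image
            skeleton = extract_skeleton(depth_map)     # 15 joints via OpenNI
            if skeleton is None:                       # officer not detected
                continue
            features = feature_vector(skeleton)        # 10 relative joint angles (Section 3.1.1)
            yield classifier.predict([features])[0]    # one of the 6 gestures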
2 Related work
2.1 Human body parts recognition using Microsoft Kinect
The approach of using RGB images or video for human detection and recognition faces challenging problems due to variations in pose, clothing, lighting conditions, and the complexity of backgrounds. These result in a drop in detection and recognition accuracy or an increase in computational cost. Therefore, the approach of using 3D reconstruction information obtained from depth cameras has recently received much attention [22][10][9][24]. Depth images have several advantages over 2D intensity images: range images are robust to changes in color and illumination, and they are simple representations of 3D information. However, earlier range sensors were expensive and difficult to use in human environments because of their lasers.
a. Microsoft Kinect for obtaining depth images
Recently, Microsoft launched the Kinect, a peripheral designed as a video-game controlling device for Microsoft's Xbox console. Despite its initial purpose, it facilitates research in human detection, tracking, and activity analysis thanks to its combination of high capability and low cost. The sensor provides a depth resolution similar to ToF cameras, but at a cost several times lower. To obtain the depth information, the device uses Prime Sense's Light Coding technology [19], in which infra-red (IR) light is projected onto the scene as a dot pattern. This projected light pattern creates textures that help find the correspondence between pixels even on shiny or texture-less objects or under harsh lighting conditions. In addition, because the pattern is fixed, there is no time-domain variation other than the movements of the objects in the camera's field of view. This ensures a precision similar to ToF, but Prime Sense's mounted IR receiver is a standard CMOS sensor, which reduces the price of the device drastically.
Figure 2. Block diagram of the Prime Sense reference design [20]
Figure 2 depicts the block diagram of the reference design used by the Kinect sensor [20]. The sensor is composed of an IR emitter, responsible for projecting the light pattern onto the scene, and a depth sensor responsible for capturing the reflected pattern. It is also equipped with a standard RGB sensor that records the scene in visible light. Both the depth and RGB sensors have a resolution of 640x480 pixels. The calibration matching between the depth and RGB pixels and the 3D reconstruction are handled at the chip level.
b. Human body pose recognition using depth images
For human body pose recognition, Prime Sense has created an open-source library, Open Natural Interaction (OpenNI) [15], to promote natural interaction. OpenNI provides several algorithms for using Prime Sense-compliant depth cameras, including the Microsoft Kinect, in natural interaction applications. Some of these algorithms provide the extraction and tracking of a skeleton model of the user who is interacting with the device. The kinematic model is a full skeleton model of the body consisting of 15 joints, as shown in Figure 3. The algorithms provide the 3D positions and orientations of every joint and update them at a rate of 30 fps. Additionally, they provide confidence values for these measures and are able to track up to four skeletons simultaneously.
Figure 3. OpenNI's kinematic model of the human body
Other research using the MS Kinect for human pose estimation has also been reported. In [7], J. Charles et al. proposed a method for learning and recognizing 2D articulated human pose models from a single depth image obtained from the Microsoft Kinect. Although the pose is estimated reliably, the 2D representation of articulated human pose models makes the human activity recognition process more difficult in comparison with the 3D representation of OpenNI. In [14], L. M. Adolfo et al. presented a method for upper body pose estimation with online initialization of pose and anthropometric profile. A likelihood evaluation is implemented to allow the system to run in real time. Although the method in [14] performs better than OpenNI in limb self-occlusion cases, an upper-body-only pose representation is suitable only for a small range of recognition applications. For these reasons, we choose OpenNI to preprocess the depth images from the MS Kinect and obtain the human skeleton models.
2.2 Traffic gesture recognition
The work in [6] presents an approach to recognizing traffic gestures in Chinese traffic. The Chinese traffic police gesture system is defined and regulated by the Chinese Ministry of Public Security. Figure 4 shows 2 of the 10 gesture types.

Figure 4. Chinese traffic gestures

The idea of this recognition system is based on rotation joint angles. As can be seen from Figure 4, these gestures require the upper and lower arms to keep certain angles to the vertical direction by rotating around the shoulder or elbow joints, so the rotation joint angles are used to recognize the gestures, which makes it easy to add a new gesture without changing the existing angles. Since the gestures may not be performed perfectly in a real situation, the angles are set within a certain range rather than at a fixed value. Let θi (i = 1..4) denote the rotation angle related to each arm for the gestures; information about θi is provided in Table 3.
Gesture        Left upper arm (θ1)   Left lower arm (θ2)   Right upper arm (θ3)   Right lower arm (θ4)
Stop signal    [0°, 10°]             [θ2-10°, θ2+10°]      [170°, 180°]           [θ4-10°, θ4+10°]
Move ...       ...                   ...                   ...                    ...

Table 3. Rotation joint angles related to each arm
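To make the range test concrete, the sketch below checks a set of measured arm angles against the stop-signal row of Table 3. The nominal lower-arm angles are assumed values, since Table 3 only constrains them within a tolerance of a nominal angle; all names here are illustrative, not taken from [6].

    # Sketch of angle-range gesture matching in the style of [6].
    # The absolute ranges follow the "Stop signal" row of Table 3; the
    # nominal lower-arm angles below are assumed values.
    NOMINAL_THETA2 = 5.0    # assumed nominal left lower-arm angle (degrees)
    NOMINAL_THETA4 = 175.0  # assumed nominal right lower-arm angle (degrees)

    def matches_stop_signal(theta1, theta2, theta3, theta4, tol=10.0):
        """True if the four arm angles fall inside the stop-signal ranges."""
        return (0.0 <= theta1 <= 10.0                       # left upper arm
                and abs(theta2 - NOMINAL_THETA2) <= tol     # left lower arm
                and 170.0 <= theta3 <= 180.0                # right upper arm
                and abs(theta4 - NOMINAL_THETA4) <= tol)    # right lower arm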
Even though this approach has some advantages, such as no special clothing requirement, recognition of imperfectly performed gestures, and efficient operation on video, it clearly has to work on RGB images, which contain a lot of noise. Relying on the reflective vest of the traffic police officer to detect the officer is also a limitation.
2.3 Real-time human pose recognition in parts from single depth images
In [8], the researchers propose a new method to quickly and accurately predict the 3D positions of body joints from a single depth image, using no temporal information. They take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Their large and highly varied training dataset allows the classifier to estimate body parts invariantly to pose, body shape, clothing, etc. Finally, they generate confidence-scored 3D proposals of several body joints by reprojecting the classification results and finding local modes.
2.4 Hand tracking and gesture recognition
The work of Cristina Manresa et al. [4] (hand tracking and gesture recognition) aims at the control of videogames based on hand gesture recognition. They propose a new algorithm to track and recognize hand gestures for interacting with videogames. This algorithm is based on three steps: hand segmentation, hand tracking, and gesture recognition from the hand posture. For the hand segmentation step, they use the color cue, due to the characteristic color values of human skin, its invariant properties, and its computational simplicity. To prevent errors from hand segmentation, they add hand tracking as a second step. Tracking is performed assuming a constant velocity model and using a pixel-labeling approach. From the tracking process, they extract several hand features that are fed into a finite-state classifier which identifies the hand configuration. The hand can be classified into one of four gesture classes or one of four different movement directions.
3 Problem solution
3.1 Proposed approach
3.1.1 Feature vector selection
3.1.1.1 Synthetic data collection
The human body is capable of an enormous range of poses which are difficult to simulate. Instead, we capture a database from a group of persons. As mentioned before, our work in this paper focuses on the 3 main poses of the traffic police officer illustrated in Figure 1; for the experiments in this paper, we separate each pose into 2 cases, with the left hand and with the right hand, so we have 6 classes of human pose.

Since the classifier uses no temporal information, we are interested only in static poses, not motion. For each person, we record about 1000 frames per pose, with variation in rotation about the vertical axis, left-right mirroring, and scene position. With OpenNI and the depth image, we obtain the coordinates of all 15 skeletal joints. From these joints, we compute the feature vector for each frame and use it as training data. The training data contains about 30,000 vectors, each labeled with its pose tag.
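The collection procedure can be sketched as a short loop. In this sketch, get_skeleton_joints is a hypothetical wrapper around OpenNI's skeleton tracker, feature_vector is defined in Section 3.1.1.2, and the pose labels and CSV format are illustrative choices of ours rather than the study's actual storage format.

    # Sketch of the training-data collection described above.
    # get_skeleton_joints() is a hypothetical OpenNI wrapper; the six
    # pose labels and the CSV layout are illustrative.
    import csv

    POSES = ["up_left", "up_right", "side_left",
             "side_right", "front_left", "front_right"]
    FRAMES_PER_POSE = 1000          # about 1000 frames per pose per person

    def collect(person_id, pose_label, writer):
        recorded = 0
        while recorded < FRAMES_PER_POSE:
            joints = get_skeleton_joints()      # 15 (x, y, z) coordinates
            if joints is None:                  # skeleton tracking failed
                continue
            writer.writerow(list(feature_vector(joints)) + [pose_label, person_id])
            recorded += 1

    with open("training_data.csv", "w", newline="") as f:
        w = csv.writer(f)
        for pose in POSES:
            collect(0, pose, w)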
3.1.1.2 Feature vector selection
We introduce a class of angle-value features expressing geometric relations [12] between certain body points during a gesture. As an example of this kind of feature, consider the test of whether the left arm raises straight up or to the left, by calculating the angle between the left arm and the shoulder. Such geometric features are very robust to spatial variations and allow the identification of logically corresponding events in similar motions. In particular, (user-specified) combinations of such qualitative features become a powerful tool for describing and specifying gestures. In our approach, we define a kind of geometric feature describing the geometric relation between specified points of the kinematic chain for some fixed, isolated gestures. To this end, we need the notion of a feature, which we describe mathematically as a function.

Obviously, any attribute of the feature function is itself a function which expresses the relative position between body parts by calculating the angle between them. For classification purposes, a fixed-size feature vector which is invariant to translation, rotation, and scaling must be extracted for each skeleton. In this research, we propose the relative angles between joints as the feature vector attributes, because this information is invariant in the real-space coordinate system.
Research [17] builds a 3D human model from the depth image and extracts a feature vector based on the angles between body parts and the coordinate axes. Our work simplifies the 3D human model to a skeletal model constructed from a set of points (each joint represents a body part). In our approach, we use the Kinect with the open-source library OpenNI for data collection. This library enables users to capture depth images, detect humans, and recognize them as well. Moreover, the human body parts are segmented and each is represented by a point at the center of the respective part in the depth map. Therefore, the 3D human model is reduced to a human skeletal model. The 15 body parts which construct the human skeleton model are denoted by 15 joints:

o Upper body: head, neck, torso
o Upper left part: left shoulder, left elbow, left hand
o Upper right part: right shoulder, right elbow, right hand
o Lower left part: left hip, left knee, left foot
o Lower right part: right hip, right knee, right foot
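For concreteness, one possible in-memory representation of this skeleton is sketched below. The joint names follow the list above, while the dictionary layout is merely an assumed convenience, not OpenNI's native data structure.

    # One possible representation of the 15-joint skeleton listed above.
    # The dict-of-tuples layout is an assumed convenience, not OpenNI's
    # native data structure.
    JOINT_NAMES = [
        "head", "neck", "torso",
        "left_shoulder", "left_elbow", "left_hand",
        "right_shoulder", "right_elbow", "right_hand",
        "left_hip", "left_knee", "left_foot",
        "right_hip", "right_knee", "right_foot",
    ]

    # Each joint maps to its (x, y, z) position in real-world coordinates;
    # z is the distance from the Kinect to the joint.
    example_skeleton = {name: (0.0, 0.0, 0.0) for name in JOINT_NAMES}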
We now consider a special class of geometrically meaningful feature functions. As an example, a geometric feature may express whether the arm raises to the left or to the front for a fixed gesture. More generally, let A be the degree of the angle between the left arm and the horizontal axis defined from the left shoulder to the right shoulder.

Based on the description of the 6 traffic gestures, only the upper body parts move, so we restrict the feature function to the upper human body by defining each attribute as the relative angle G(b1, b2) between two body parts b1 and b2, where each bi is one part of the upper body.

As can be seen from Figure 3, the arm, the hand, the shoulder, and the backbone (constructed by the neck and torso) form different angles between pairs of them, which distinguish the six gestures. In our approach, the feature vector includes ten attributes, each constructed as the angle between 2 vectors (each vector defined by a start point and an end point, respectively):
- (left elbow, left shoulder) and (left elbow, left hand)
- (right elbow, right shoulder) and (right elbow, right hand)
- (left shoulder, neck) and (left shoulder, left elbow)
- (right shoulder, neck) and (right shoulder, right elbow)
- (neck, head) and (neck, left shoulder)
- (neck, head) and (neck, right shoulder)
- (left shoulder, left hand) and (head, torso)
- (right shoulder, right hand) and (head, torso)
- (left shoulder, left hand) and (left shoulder, right shoulder)
- (right shoulder, right hand) and (left shoulder, right shoulder)
In addition, each of these joints carries three coordinate values: the x axis (vertical axis), the y axis (horizontal axis), and the z axis, known as the depth value of the point (the distance from the Kinect to the point). The angle between two vectors u and v is computed as θ = arccos( (u · v) / (|u| |v|) ).
The concept of such geometric features is simple but powerful, as the example below illustrates. Each attribute of the feature function, expressing the relative angle of two body parts, is measured between two vectors, each constructed from two body joints of the human skeletal model.
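The following sketch computes the ten attributes from the joint pairs listed above, using the arccos formula just given. It is a minimal illustration assuming the skeleton dictionary sketched earlier, not the study's exact implementation.

    # Sketch: computing the 10-attribute feature vector from the joint
    # pairs listed above. Assumes the skeleton dict sketched earlier.
    import numpy as np

    # Each attribute is the angle between vector (a -> b) and vector (c -> d).
    ANGLE_PAIRS = [
        (("left_elbow", "left_shoulder"), ("left_elbow", "left_hand")),
        (("right_elbow", "right_shoulder"), ("right_elbow", "right_hand")),
        (("left_shoulder", "neck"), ("left_shoulder", "left_elbow")),
        (("right_shoulder", "neck"), ("right_shoulder", "right_elbow")),
        (("neck", "head"), ("neck", "left_shoulder")),
        (("neck", "head"), ("neck", "right_shoulder")),
        (("left_shoulder", "left_hand"), ("head", "torso")),
        (("right_shoulder", "right_hand"), ("head", "torso")),
        (("left_shoulder", "left_hand"), ("left_shoulder", "right_shoulder")),
        (("right_shoulder", "right_hand"), ("left_shoulder", "right_shoulder")),
    ]

    def angle(u, v):
        """Angle in radians between u and v: arccos(u.v / (|u||v|))."""
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return float(np.arccos(np.clip(cos, -1.0, 1.0)))

    def feature_vector(skeleton):
        """Map a 15-joint skeleton (name -> (x, y, z)) to 10 joint angles."""
        feats = []
        for (a, b), (c, d) in ANGLE_PAIRS:
            u = np.subtract(skeleton[b], skeleton[a])
            v = np.subtract(skeleton[d], skeleton[c])
            feats.append(angle(u, v))
        return np.array(feats)

Table 4 illustrates the standard and measured values of these attributes for one of the gestures.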
For the gesture "left hand raises to the left":

Vector 1                         Vector 2                        Standard value of angle   Real value of angle
(left elbow, left shoulder)      (left elbow, left hand)         3.141 (≈ π rad)           2.986 (rad)
(right elbow, right shoulder)    (right elbow, right hand)       ...                       ...
(right shoulder, neck)           (right shoulder, right elbow)   ...                       ...
(right shoulder, right ...)      ...                             ...                       ...

Table 4. Rotation joint angle between two body parts
On the other hand, the feature vector F is invariant under global orientation and position, under the size of the skeleton, and under various local spatial deviations such as vertical movements of the shoulder or neck. Of course, F leaves any lower body movements unconsidered. In general, feature functions defined purely in terms of geometric entities that are expressible by joint coordinates are invariant under global transforms such as Euclidean motions and scaling.
3.1.2 Training and classification