AUTOMATED HUMAN ACTIVITY RECOGNITION IN SMART ROOM
HENRY C C TAN
B.Eng (Hons)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2003
Acknowledgments

My heartfelt gratitude goes to Dr Liyanage C De Silva, whose gentle guidance and strong belief in my accomplishing the project have made writing this thesis possible.

Words are just not enough to express my sincere gratitude to my family and my fiancée, Siok Luan, who have been constantly encouraging me and making sacrifices to see me through my postgraduate study during the last couple of years.

I am also indebted to my fellow colleagues in the 'Smart Room' project and the Communications Lab, especially Mr Chathura R De Silva, Mr Jia Kui, Mr Soh Thian Ping, Miss Tin Lay Nwe, Dr Cao Yewen, Dr Huang Lei and Dr Zhi Wanjun. They were my "run to" guys whenever I had a doubt or just wanted a quick answer.

I am very appreciative of the 20 volunteers who enthusiastically participated as subjects for the construction of the human activity database used in this project.

To everyone, I owe you my big THANK YOU.
Table of Contents

Abstract
List of Abbreviations
List of Figures & Tables
Chapter 1 Introduction
1.1 Background
1.2 Human Activity Recognition (HAR)
1.2.1 Related HAR Systems
1.2.2 Proposed HAR Classifiers
1.3 System Proposal
1.4 Thesis Organization
Chapter 2 Formulation of the Person Detector
2.1 Overview
2.2 Review of Motion Segmentation
2.3 Review of Human Object Verification
2.3.1 Object Classification
2.3.2 Face Detection
2.4 The Proposed Person Detector
2.4.1 Motion Segmentation
2.4.2 Object Classification
2.4.3 Face Detection
2.5 Summary
Chapter 3 Formulation of the Head Tracker
3.1 Overview
3.2 Review of Human Tracking
3.2.1 Model-based Tracking
3.2.2 Region-based Tracking
3.2.3 Active-contour-based Tracking
3.2.4 Feature-based Tracking
3.3 The Proposed Head Tracker
3.3.1 Human Object Segmentation & Head Centroid Estimation
3.3.2 Locations Estimation
3.3.3 HAR Features Extraction
3.4 Database from Feature Vectors Extraction
3.5 Method of Classification Performance Estimation
3.6 Summary
Chapter 4 Traditional Activity Classifiers
4.1 Overview
4.2 Nearest Neighbor Rule (NNR)
4.2.1 Review of NNR and k-NNR
4.2.2 Applying k-NNR to HAR
4.3 Hidden Markov Model (HMM)
4.3.1 Review of Discrete HMM
4.3.2 Applying HMM to HAR
4.4 Summary
Chapter 5 Proposed Activity Classifiers
5.1 Overview
5.2 Review of Neural Network (NN)
5.3 Elman Network (EN)
5.3.1 Motivation of using EN
5.3.2 Applying EN to HAR
5.4 HMM-NN Hybrid
5.4.1 Motivation of using HMM-NN
5.4.2 Applying HMM-NN hybrid to HAR
5.5 NN-HMM Hybrid
5.5.1 Motivation of using NN-HMM
5.5.2 Applying NN-HMM hybrid to HAR
5.6 Summary
Chapter 6 Results and Discussions
6.1 Overview
6.2 Experiments and Results
6.2.1 Recognition using the k-NNR
6.2.2 Recognition using the HMM
6.2.3 Recognition using the EN
6.2.4 Recognition using the HMM-NN
6.2.5 Recognition using the NN-HMM
6.3 Discussions
Chapter 7 Conclusions
7.1 The Person Detector
7.2 The Head Tracker
7.3 The Activity Classifier
7.4 Concluding Remarks
Chapter 8 Future Work
8.1 Enhancing the Person Detector
8.2 Enhancing the Head Tracker
8.2.1 True Identity
8.2.2 Automated Systems
8.2.3 Multiple Subjects Tracking
8.3 Enhancing the Activity Classifier
8.3.1 Model Selection
8.3.2 Training Algorithms Issues
8.3.3 Expanding the Database
8.3.4 Continuous Complex HAR
8.3.5 Real-time Activity Classifier
Author's Related Publications
References
Abstract

Traditionally, human activity recognition has been achieved mainly by statistical pattern recognition techniques such as the Nearest Neighbor Rule (NNR), and by state-space methods, e.g. the Hidden Markov Model (HMM). In this work, we propose three novel approaches – the use of the Elman partial Recurrent Neural Network (EN) and two hybrids of Neural Network (NN) and HMM, i.e. HMM-NN and NN-HMM – to recognize ten distinct human activities, e.g. walking, sitting and squatting, in a smart room environment. To achieve this, a three-level framework has been suggested, which first detects and verifies the presence of a person, then tracks the subject's head movement over consecutive frames to extract the difference in coordinates as a feature vector that is invariant to the person's sex, race and physique, and finally classifies the activities performed using one of the three proposed classifiers. For performance comparison, the two traditional classifiers using the NNR and HMM methods were also implemented. Experimental results based on our database of 200 human activity color image sequences show that all three proposed approaches perform better than the conventional methods. The traditional NNR classifier implemented has the lowest recognition accuracy, of only 85.5%, whilst the proposed HMM-NN hybrid attained the best performance, of 96.5%. Estimated time-complexity comparison also indicates that the HMM-NN and NN-HMM hybrids are only a few orders higher than the traditional HMM method. Given the higher trainability, flexibility and discriminative power of the NN, the encouraging results not only reveal the significant performance improvement from augmenting the traditional HMM with an NN, but also demonstrate the greater potential of our proposals over the traditional classifiers in realizing recognition of continuous and complex activity in the increasingly popular human-activity-based applications.
List of Abbreviations

k-NNR   Generalized Nearest Neighbor Rule
YCbCr   Luminance (Y) and Chrominance (CbCr) separated color space
2D, 3D  2- or 3-dimensional
List of Figures & Tables

Figure 1.1: Plan-view of the smart room and the camera locations
Figure 1.2: The three-module framework of our proposed system
Figure 2.1: Flow chart of our proposed Person Detector (PD) module
Figure 2.2: Motion segmentation and human object verification from image captured
Figure 2.3: Our human-skin-color model – area bounded by the four straight lines
Figure 3.1: Flow chart of our proposed Head Tracker (HT) module
Figure 3.2: Human object segmentation and head centroid estimation
Figure 3.3: Geometrical analysis using Cam 1 & 2 images to estimate the absolute locations
Figure 3.4: Exploiting the correspondence between image planes and the floor plan to estimate the absolute location based on the intersection of two x-planes
Figure 3.5: Different cameras offer different views that help to handle occlusions
Figure 3.6: A plan-view of the paths taken by two subjects in the room
Figure 3.7: The differences in coordinates over consecutive frames are extracted as the two features for HAR
Figure 3.8: Snapshots of three activity sequences performed by our subjects of different gender, race and physique
Figure 4.1: Flow chart of k-NNR HAR classifier
Figure 4.2: The graph structure of the three-state ergodic HMM employed
Figure 4.3: Block diagram of the HMM-based HAR classifier
Figure 5.1: Structure of a single-hidden-layer Multi-Layer Perceptron (MLP)
Figure 5.2: Structure of the Elman partial recurrent neural network (EN)
Figure 5.3: Block diagram of EN-based HAR classifier
Figure 5.4: Block diagram of the HMM-NN hybrid classifier for HAR
Figure 5.5: Block diagram of the NN-HMM hybrid classifier for HAR
Figure 6.1: k-NNR recognition rate as a function of the number of nearest neighbors used, k
Figure 6.2: HMM recognition rate as a function of the number of states, S
Figure 6.3: Initial search: EN recognition rate as a function of the number of hidden units
Figure 6.4: Refined search: EN recognition rate as a function of the number of hidden units
Figure 6.5: HMM-NN recognition rate as a function of the number of MLP hidden units
Figure 6.6: NN-HMM recognition rate and labelers' classification rate as functions of the number of MLP hidden units
Table 6.1: Recognition rate and time complexity comparisons for the five HAR classifiers
Table 6.2: Classification results of 200 activity sequences using the proposed HMM-NN hybrid
Chapter 1 Introduction
The primary objective of this thesis is to show that human activity can be recognized efficiently and accurately using the artificial Neural Network (NN) and its combinations with the Hidden Markov Model (HMM), in the forms of HMM-NN and NN-HMM hybrids.
1.1 Background
The attention given to human-motion-based research from the computer-vision community has been on the rise. This is because of the rapid technological development of image-capturing software and hardware, in addition to the omnipresence of reasonably low-cost, high-performance personal computers. These new technological advances have made vision-based research much more affordable and efficient than ever before. The main motivation, however, comes from its application in a myriad of different challenging but rewarding problems that include, but are not limited to, automated surveillance systems, human-machine interaction, content-based retrieval, military simulation, clinical gait analysis and sports [1,2]. Any of these projects usually involves one or more of the major vision-based research areas, such as motion detection, human presence verification, human object tracking, people identification by face recognition or other biometric means, advanced user-interface via facial expression recognition, action logging through motion and posture classification, and human activity recognition.

Growing efforts have been put into combining the various research areas such that more 'intelligence' is bestowed on the computer and its environment, so as to enable them to understand and interact with the human users. Some examples of recent work include the smart classroom by Ren and Xu [3] that recognizes the teacher's natural complex action; the real-time distributed system using multiple calibrated cameras to track 3D locations of people in a smart room by Focken and Stiefelhagen [4]; and the EasyLiving project by Krumm et al. [5] that reports the location and identity of the people in an intelligent living room equipped with two sets of color stereo cameras. It is apparent that there is a common desire to come up with more intelligent machines and discerning vision systems, which synergize to produce 'smart' results unobtainable by each entity itself.
Here at our laboratory, a 'smart room' has also been set up. It is an enclosed office environment that aims to offer services to its users based on the information it perceives. Several major areas of research are conducted here. There is a facial expression recognition project that tries to understand the mood of the users judging from their expressions. Similar to it, but using audio information instead, a speech recognition system distinguishes the users' emotional states. Work on a face recognition system that tries to maintain the identity of the users is also in progress. There are also gesture recognition projects that interpret the users' interaction based on their body and hand gestures. The areas of interest for this thesis research project are the tracking of human objects and the recognition of human activity within the smart room. Knowing the identity of people, their locations and the activity taking place in the room is the most vital prerequisite for many of the services that the smart room can provide. Examples of services are displaying an appropriate message on an LCD panel for a particular user when he enters the room; zooming in for a close-up video capture when someone has been loitering around the cabinet containing 'Confidential/Secret' documents for too long; sending an alarm to the security department if user X is found sitting and using the computer at the desk of user Y who has just left the room – signaling a case of gaining unauthorized access; etc. All these require knowledge of the whereabouts and identities of the users as well as recognition of the human activity in the smart room.
To this end, we have developed algorithms for real-time detection and tracking of persons (in C++) and offline classifiers (in Matlab) for simple human activity in our smart room equipped with multiple cameras. We focus principally on the detection and verification that humans are indeed present, tracking and estimating their locations while maintaining their identities, and, ultimately, recognition of ten distinct activities, which are walking, sitting down, getting up, squatting down and standing up, in both lateral and frontal views.
1.2 Human Activity Recognition (HAR)
According to Polana and Nelson [6], actions can be classified into three categories, namely events, temporal textures and activities. Events are isolated simple motions that do not exhibit any temporal or spatial repetition; temporal textures exhibit only statistical regularity, being motion patterns of indeterminate spatial and temporal extent; whereas activities consist of motion patterns that are temporally periodic and possess compact spatial structure. They cite 'opening a door', 'ripples on water' and 'walking' as examples for the three groups, respectively.
Adhering to the definition of activities, and including also whole-body motion events, e.g. standing up and sitting down, we deal with the recognition of this group of human activity in this work and compare our methods with some approaches in the same area.
1.2.1 Related HAR Systems

More recent studies include Sun et al. [7], wherein a model-based approach is used for motion estimation. The likelihood of the observed motion parameters is computed based on a multivariate Gaussian probabilistic model. The temporal change of the likelihood is modeled using the HMM, which is then applied to recognize eight simple human activities (turning of the body from left to front, front to left, right to front and front to right; stand up; sit down; going to sit but returning to standing; and going to get up but returning to a sitting position) and eight more complex martial art activities. A high recognition rate, above 91%, on 160 test sequences has been reported.
The system by Ben-Arie et al. [8] uses a novel template matching approach called Expansion Matching (EXM) for human body tracking, and a method of multidimensional hash table indexing followed by a sequence-based voting scheme for view-based recognition of eight human activities (jump, kneel, pick, put, run, sit, stand and walk) in various viewing directions. It gives 100% recognition with 40 test videos.
In the work of Ali and Aggarwal [9], using background subtraction and skeletonization to generate stick figure models for tracking, they recognize seven continuous activities (walk, sit, stand up, bend, get up, squat and rise) in 20 sequences in lateral view based on the Nearest Neighbor Rule (NNR). They achieve 76.92% accuracy.
Another NNR classifier, developed by Madabhushi and Aggarwal [10], classifies nine activities in lateral view (stand up, sit down, bend down, get up, walk, rise, squat, side bend and hug) and three in frontal view (frontal rise, squat and side bend), based on the movement of the human head. With 41 test sequences, their classification rate is 83%.
In their earlier work, Madabhushi and Aggarwal [11] also track the head movement, but using a Bayesian framework instead, to recognize ten activities (sit, stand, bend, get up, walk, hug, bend sideways, squat, rise from squat, fall down), some in lateral view and some in frontal view. Using a database of 77 action sequences, of which 39 are used for testing, the success rate is 79.74%.
1.2.2 Proposed HAR Classifiers
As we have observed, the connectionist techniques, and their hybrids in the form of HMM-NN or NN-HMM, have neither been exploited nor reported in the literature of HAR. It is thus the objective of this thesis to approach this long-standing problem with three solutions based on the artificial Neural Network, and eventually compare their performance with that of the popular traditional HAR classifiers – NNR and HMM.
In the first proposal, the classifier based on the Elman model of the partial Recurrent Neural Network (RNN), or simply the Elman Network (EN), is advocated. It is chosen for its internal representation of time; its ability to remember input from the previous frame and to develop an 'understanding' of the context of the input makes it a suitable candidate for the time-varying recognition problem at hand.
The second classifier consists of ten HMMs (each one trained specifically for a class of activity) and a single-hidden-layer Multi-Layer Perceptron (MLP) NN. The MLP, known for its better classification capability than the HMMs, is used to classify the activity based on the likelihood functions for the ten classes computed by the HMMs. This combination is known as the HMM-NN hybrid.
In our final proposal, an NN-HMM hybrid, two MLPs are trained as labelers for ten HMMs, which are time-scale-invariant classifiers at the sequence level. The MLP, being both trainable and discriminative, is better than the ordinary vector quantizer used in the traditional HMM; hence, this proposed hybrid is also expected to perform better than the traditional HMM classifier.
1.3 System Proposal

Our smart room has a setup that includes three fixed Sony CCD color camera installations, one Sony camera adaptor, three Euresys Picolo frame grabbers of rate 25 fps, and one Pentium-4 PC server running on Win 2000, installed with the digital video recording and processing software Video Savant v3.0. Not calibrated for stereo vision, all cameras are used monocularly. Cameras are installed as shown in Figure 1.1. Camera 1 is fixed on the wall directly facing the door; the other two are on one side of the room, Camera 2 closer to the door and Camera 3 farther away. They are all located at about the same height, approximately 1.85 m from the floor. Furniture in the room includes desks with PCs, three cupboards and a filing cabinet.
Figure 1.1: Plan-view of the smart room and the camera locations.
Before human activity can be recognized and interpreted from the digital video recordings, human detection, tracking and posture estimation, from one frame to another in the image sequence, must generally be accomplished [12]. Deriving from this concept, we propose a system that follows a three-level framework for human activity recognition. At the lowest level, the goal is to detect and verify that a human is entering the smart room. At the intermediate level, the aim is to keep track of the human's whereabouts in the room and extract meaningful features for posture estimation and for subsequent activity classification at the topmost level.

To facilitate discussion and implementation, we make each of the three levels a module and name the low, intermediate and top levels the Person Detector (PD), the Head Tracker (HT) and the Activity Classifier (AC) respectively, as depicted in Figure 1.2.
[Figure 1.2 diagram: Video Streams → Person Detector (PD) → Head Tracker (HT) & Feature Estimation → Activity Classifier (AC): EN, HMM-NN or NN-HMM (also NNR or HMM)]

Figure 1.2: The three-module framework of our proposed system.
Briefly, they work as follows. By using background subtraction for the motion segmentation, and shape-based object classification and feature-invariant face detection techniques for human object verification, module PD verifies whether a moving object entering the room is human. As soon as it confirms that a person is indeed present, module HT starts to perform feature-based human tracking for the location estimation and posture estimation needed for subsequent activity recognition.
Adopting the assumption made in the Hydra project [13], we set the constraint that our heads are always "above" our torsos and use the approximated head centroid as the sub-feature for human tracking. This method yields two important elements (the x- and y-coordinates of the approximated head centroid) for posture estimation, and extracts the difference in coordinates over consecutive frames in an activity sequence as the features for recognition of this particular activity via AC.
In addition, as a means to keep track of the whereabouts of the person, or of up to two persons presently, a straightforward geometrical analysis of the image planes and their correspondence to the floor plan is used to estimate the absolute locations of the people in the room. The ability to track and identify the various individuals would be useful for the study of multiple-subject activity recognition and understanding their interactions. The module HT uses the mean Red and Green values of the different subjects for maintaining their identities and handling occlusions.
Finally, the offline module AC, which classifies single-person activity, is implemented based on whichever of our three proposed HAR classifiers (EN, HMM-NN and NN-HMM) gives the highest performance. For comparison purposes, the two traditional HAR algorithms (NNR and HMM) are also used to implement the module AC.
For the training and performance evaluation of all the classifiers, we constructed a common database of 200 activity sequences, comprising ten activities performed by 20 subjects.
1.4 Thesis Organization
The remainder of this thesis is structured as follows.

Chapter 2 studies the techniques available for human detection and explains the formulation of the module PD and how its objectives of detecting and verifying that a person is entering the room are met.

Chapter 3 begins with a survey of the various human tracking approaches for posture estimation and activity recognition. The focus is then on formulating the module HT using the feature-based tracking method to achieve location estimation and feature extraction for the ensuing module, AC.

Chapter 4 reviews the NNR and HMM algorithms and shows how these popular traditional HAR classifiers are integrated into our system, for subsequent comparison with our proposed classifiers.
Chapter 5 starts with an introduction to NN, specifically the MLP. It then describes the motivation and algorithms of the three proposed HAR classifiers. The implementation of AC using these proposals is also presented. In addition, the modified EBP learning rule used to achieve faster convergence for the EN and MLP is detailed.
Chapter 6 documents the main experiments conducted and presents the results obtained. Comparisons between the various classifiers' recognition rates and time complexities are made. The chapter ends with some discussions of these results and of how the proposed classifiers can be improved.
Chapter 7 summarizes the main conclusions of this thesis and, finally,

Chapter 8 sketches our future directions of research.
Chapter 2 Formulation of the Person Detector

2.1 Overview

Human detection is the cornerstone of a human motion analysis system, since the subsequent tracking and recognition processes depend heavily on it. This process usually involves motion segmentation and human object verification.
2.2 Review of Motion Segmentation
Motion segmentation in video sequences is known to be a significant and difficult problem, which primarily aims at detecting regions corresponding to moving objects, such as people, in the scenes. This provides a focus of attention for later processes, because only those changing pixels need to be examined. However, changes from illumination, shadow, repetitive motion from clutter, and even weather if it is an outdoor scene, make motion segmentation all the more difficult to process quickly and reliably.
Many techniques are available for motion segmentation. Using either temporal or spatial information of the images, they can be classified roughly into one of four main approaches. In an example of the statistical motion segmentation method by Wren et al. [14], the human subject is segmented by modeling it as a connected set of blobs, each of which has its own spatial and color statistics. Alternatively, segmentation of the mobile subject can be achieved via optical flow estimation of the object's motion and position. For example, Meygret and Thonnat [15] combine both stereovision and optical flow motion information to segment out mobile 3D objects. Equally popular is the temporal differencing approach to motion segmentation, such as the work by Huwer and Niemann [16], where the difference images of consecutive frames are used to detect continuously moving subjects. Last but not least, the simple but effective background subtraction is also widely used to detect moving human objects in video images. As an example, Seki et al. [17] evaluate the Mahalanobis distance between the averages of the background image vectors and the newly observed image vectors to detect objects. As background subtraction was found to be computationally efficient and robust enough for our use in the indoor office environment, it was adopted as the means of motion segmentation in our PD implementation.
2.3 Review of Human Object Verification
Different moving regions may correspond to different moving objects in a scene. For instance, the image sequences captured in our office environment may include an office cleaner (human) guiding a vacuum cleaner (non-human) to suck dirt from the floor, shoving roller office chairs (non-human) around and sending them into motion. Shadows and moving objects can be mistaken for a mobile human if motion is the only criterion applied. To analyze human activity, it is necessary that we correctly distinguish the actor from the other, non-human moving objects. This can be accomplished in many different ways. We shall only focus on uncomplicated automated vision-based approaches suitable for real-time use, such as employing object classification to verify that a moving object is human and then double-checking the result using face detection techniques.
2.3.1 Object Classification
The goal of object classification is to extract the region corresponding to people from all the moving blobs obtained by the motion segmentation methods discussed before. There are two main categories of approaches to moving object classification, namely shape-based and motion-based classification. In shape-based classification, different descriptions of the shape information of motion regions, such as representations of points, boxes, silhouettes and blobs, are available for classifying moving objects [12]. On the other hand, motion-based classification uses the periodic property of non-rigid articulated human motion as the cue for moving object classification, as in the work of Cutler and Davis [18]. Since not all of our activities of interest are periodic or self-similar, we rely on the shape-based classification method for a clue as to whether the moving object is human or not.
2.3.2 Face Detection
Knowing that the moving object in the scene could be a person, the presence of a human face can almost always confirm it. While there are many image processing techniques to delineate a human face in an image, the feature-invariant approach is the most commonly used. It is relatively faster than many other popular techniques such as the knowledge-based, template-matching and appearance-based methods [19]. The feature-invariant method seeks to localize invariant features of human faces for detection. The underlying assumption is based on the observation that humans can effortlessly detect faces and objects in different poses and lighting conditions, so there must exist properties or features that are invariant over this variability. Using features such as facial features (e.g. eyebrows, eyes, nose, mouth and hairline), facial texture, skin color, and the shape and size of the face, many methods have been proposed and demonstrated to be efficient in localizing and detecting faces [19]. One of our previous studies found that detecting faces via skin color is the fastest among all facial features [20]. Although skin color varies in people from different ethnic groups, studies have shown that all skin colors can be approximated by a map in YCbCr space [21,22]. It was also found that the luminance value (Y) of the skin color is heavily dependent on the camera and the environment, and all major differences between varying skin tones were observed to lie in their color intensities rather than in the facial skin color itself. This implies that the luminance signal Y does not contain useful information pertaining to facial skin-color detection. In this work, we discard the Y signal and employ only the more manageable CbCr color space for skin-color detection, in addition to the facial shape and size feature-invariant face detection methods.
2.4 The Proposed Person Detector
Putting together the above techniques and concepts, we formulate our first module of the system for human detection. The flow chart of our implementation is shown in Figure 2.1.
Figure 2.1: Flow chart of our proposed Person Detector (PD) module.
2.4.1 Motion Segmentation
Motion segmentation is achieved by background subtraction to extract the moving object from the background in each frame captured, as follows. The background, as shown in Figure 2.2(a), was modeled by computing the mean for each pixel in the color images over a sequence of 50 frames, which were taken prior to the presence of any motion, e.g. Figure 2.2(b).
Figure 2.2: Motion segmentation and human object verification from image captured. (a) Background model; (b) motion detection; (c) background subtraction; (d) thresholding, filtering and morphology; (e) skin color detection; (f) face size and shape check.
Next, the background-subtracted image, Figure 2.2(c), was subjected to image thresholding, median filtering to remove 'salt and pepper' noise and, finally, some standard morphological operations to segment out the moving object, as shown in Figure 2.2(d).
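As a concrete illustration, the following is a minimal sketch of this segmentation step (our actual implementation is in C++; the function names, the difference threshold and the morphology settings here are assumptions for illustration only):

```python
import numpy as np
from scipy import ndimage

def build_background_model(frames):
    # Per-pixel mean of ~50 motion-free color frames, cf. Figure 2.2(a).
    return np.stack(frames).mean(axis=0)

def segment_moving_object(frame, background, diff_thresh=30.0):
    # Background subtraction on the color image, summed over channels.
    diff = np.abs(frame.astype(float) - background).sum(axis=2)
    mask = diff > diff_thresh                   # thresholding
    mask = ndimage.median_filter(mask, size=3)  # remove 'salt and pepper' noise
    # Standard morphology: opening then closing to clean up the blob.
    mask = ndimage.binary_opening(mask, iterations=2)
    mask = ndimage.binary_closing(mask, iterations=2)
    return mask
```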
2.4.2 Object Classification
From this segmented region, a straightforward shape-based object classification is carried out. Using the average number of pixels of ten human subjects entering the door (which is the farthest point in the room opposite Camera 1) as the threshold, a moving object is classified as 'possibly a human' if its total number of pixels in the image is greater than our preset threshold, T, which is heuristically chosen as 18000 pixels (out of the possible 442368 pixels for our resolution of 576x768 pixels per image).
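In code, this check reduces to a blob-size test on the segmented mask (a sketch; the helper name is ours):

```python
T = 18000  # preset threshold for a 576x768 image, as chosen above

def possibly_human(mask):
    # 'Possibly a human' if the moving blob exceeds T pixels in total.
    return int(mask.sum()) > T
```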
2.4.3 Face Detection
From the 'possibly a human' segmented region, feature-invariant face detection then follows. Here, a three-tiered approach is proposed. We first detect the skin-color feature in the 'possibly human' blob. Then, for all the skin-color regions detected, if there are any, the face size and then the face shape are compared with our face model to ascertain that the moving object is indeed a human.
A Skin color detection
For the development of our hypothesis, we gathered from the internet skin samples belonging to different ethnic groups such as Caucasian, Negroid, Asian, etc., to form our human-skin-color model. Hence, the resultant skin region in the CbCr space could be approximated by the area surrounded by the four lines, as plotted in Figure 2.3.
Therefore, by applying the image of the segmented blob as a mask over the original image when motion was detected, and examining only the pixels under the mask area in the original image, a pixel is identified as having skin color if its Cr color value, r, satisfies the four straight-line bounds, (2.1), that enclose the skin region of Figure 2.3, where b is the Cb color value of that pixel.

Figure 2.3: Our human-skin-color model – area bounded by the four straight lines.

In this manner, all the skin color regions of the input image taken from Camera 1 are extracted, as shown in Figure 2.2(e).
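A hedged sketch of this skin test follows; the RGB-to-CbCr conversion uses the standard ITU-R BT.601 formulas, but the rectangular CbCr bounds below are illustrative placeholders standing in for the four straight-line bounds of (2.1), whose exact coefficients are given by the model in Figure 2.3:

```python
import numpy as np

def rgb_to_cbcr(img):
    r, g, b = (img[..., i].astype(float) for i in range(3))
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b  # ITU-R BT.601
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr

def skin_mask(img, blob_mask):
    cb, cr = rgb_to_cbcr(img)
    # Placeholder box standing in for the four straight-line bounds of (2.1).
    inside = (cb > 77) & (cb < 127) & (cr > 133) & (cr < 173)
    return inside & blob_mask  # examine only pixels under the blob mask
```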
B Face size check
To differentiate a potential facial region from skin regions that are too small to be considered as a face, a threshold S of an empirical value of about 700 pixels is used. This value is obtained from the average number of pixels of the faces of ten people entering the door. Thus, only the connected pixels of area greater than S will be further examined for consideration as being a human face.
C Elliptical face shape model
Having passed the skin-color and face size checks, the extracted 'potential face area' is further evaluated using a human face shape model. By considering the contours of merged skin color regions that are greater than 700 pixels and approximating them by ellipses, the face can be easily singled out from other skin-colored body parts, e.g. arms, shoulders, etc. As established in [23], if an approximated ellipse has the following property, it is a candidate for a face:

(Major Axis ÷ Minor Axis) < 1.7    (2.2)

As shown in Figure 2.2(f), the major and minor axes of the bounding box are used to approximate those of the 'face' area. If (2.2) is satisfied, we say that a human face is present and a person has been detected.
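Combining the size check (B) and the shape check (C), a minimal sketch of the face verification might look as follows (the bounding-box approximation of the ellipse axes follows Figure 2.2(f); the function names are ours):

```python
from scipy import ndimage

S = 700  # empirical face-size threshold in pixels

def contains_face(skin):
    labels, n = ndimage.label(skin)      # connected skin-color regions
    for i in range(1, n + 1):
        region = labels == i
        if region.sum() <= S:            # face size check (B)
            continue
        ys, xs = region.nonzero()
        h = ys.max() - ys.min() + 1      # bounding-box sides approximate
        w = xs.max() - xs.min() + 1      # the ellipse's major/minor axes
        if max(h, w) / min(h, w) < 1.7:  # elliptical shape check, eq. (2.2)
            return True
    return False
```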
2.5 Summary
In this chapter, the role of human detection as the cornerstone of human activity recognition has been stressed. It is typically achieved by motion segmentation followed by human object verification. In this research, we adopted background subtraction for motion segmentation, and blob-size shape-based object classification together with our recommended three-tiered feature-invariant face detection, using a new skin-color model, face size and face shape, for the human object verification. As there was no complex computation involved, our PD implementation has been shown to be efficient as a real-time algorithm.
Chapter 3 Formulation of the Head Tracker

3.2 Review of Human Tracking
It is a challenge to understand the postures of human objects in the scenes that make up an image sequence. This is because a moving human is a non-rigid object, of deformable form and highly capable of adopting many different poses. Human tracking is particularly useful in human activity recognition, since it serves as a means to prepare data for posture estimation and the ensuing activity recognition. In contrast to the human detection discussed in the previous chapter, human tracking is a higher-level computer vision problem. However, the tracking algorithms within the study of human motion analysis usually have considerable intersection with motion segmentation during processing.
Tracking can be divided into various categories according to different criteria [24]. As far as tracked objects are concerned, tracking may be classified into tracking of human body parts, such as the face, hand and leg, and tracking of the whole human body. If the dimension of the tracking space is of particular interest, there is 2D versus 3D tracking. And if the number of views is considered, there are single-view, multiple-view and omni-directional-view tracking. In addition, tracking can be grouped according to other criteria such as the number of tracked humans (single subject, multiple subjects, human groups), the tracking environment (indoors vs outdoors), the multiplicity of the sensor (monocular vs stereo), the state of the camera (moving vs stationary), etc.

Nevertheless, there are essentially four widely used methods of tracking motion, namely model-based, region-based, active-contour-based and feature-based tracking.
3.2.1 Model-based Tracking
In the model-based tracking method, where the geometric structure of the human body is used, the body segments can be approximated as lines (in the case of the stick figure model), 2D ribbons (in the 2D contour model) or 3D volumes (in the volumetric model). The motion of joints provides a key to motion estimation and recognition in stick-figure model-based tracking. For example, Guo et al. [25] represent the human body structure in the silhouette by a stick figure model which has ten sticks articulated with six joints. The motion estimation problem is transformed into finding a stick figure with minimal energy in a potential field. Prediction and angle constraints on individual joints were also introduced to reduce the complexity of the matching process. In the 2D contour model, the human body representation is directly related to the human body's projection in the image plane. In such a description, human body segments are analogous to 2D ribbons or blobs. For instance, Ju et al. [26] propose a cardboard people model that represents the human limbs by a set of connected planar patches. The parameterized image motion of these patches is constrained to enforce articulated movement and is used to deal with the analysis of articulated motion of human limbs. As the camera's angle poses some restrictions on the 2D model, a considerable amount of research has been done to depict the geometric human body structure in more detail using 3D models, such as elliptical cylinders, cones, spheres, etc. Although the more complex 3D volumetric models yield better results, they require many more parameters, resulting in a much more computationally expensive matching process. However, the advantage of the 3D human model lies in its ability to handle occlusion, and more significant data can be obtained for activity analysis. An example is the work of Kakadiaris and Metaxas [27], wherein they use three calibrated, mutually orthogonal cameras to track human motion in the presence of occlusion. The selection of a time-varying set of cameras at each time frame is based on the visibility of each part and the observability of its predicted motion from a given camera.
3.2.2 Region-based Tracking
This second means of tracking motion is to identify a connected region associated with each moving object in an image and then track it over time using a cross-correlation measure. For example, Pfinder [14] uses small blob features to track a person in an indoor environment. The human body is considered as a combination of blobs representing various body parts such as the head, torso and four limbs. Modeling the human body and background scene with Gaussian distributions, the pixels belonging to the body are assigned to the different body-part blobs using a log-likelihood measure. Hence, by tracking the region of each small blob, the moving person can be tracked. Generally, the region-based tracking approach works well except in two bad situations. The first is when blobs associated with separate people are connected up due to their long shadows. Fortunately, this may be resolved to some extent by making use of color or texture, because shadows are devoid of these properties. The second and more serious issue is the congested situation where people partially occlude one another instead of being spatially isolated. The task of segmenting individual humans may then require tracking systems using multiple cameras.
3.2.3 Active-contour-based Tracking

The active-contour-based tracking method aims at directly extracting the shape of the subjects based on active contour models, also known as snakes. The idea is to have a representation of the bounding contour of the object and to keep updating it dynamically over time. For instance, Peterfreund [28] uses image gradient and optical flow measurements along the contour as the system measurement; his Kalman-filter-based active contour model tracks rigid and non-rigid moving targets, such as people, in spatio-velocity space in the presence of occlusions and image clutter. Clearly, as is the case with other active contour models, it requires a good initial fit – a well-separated contour for each moving object to be tracked. Nevertheless, it is usually less computationally complex than the region-based tracking approach.
3.2.4 Feature-based Tracking
The key idea in the last of the four commonly used motion tracking methods, feature-based tracking, is to track only sub-features instead of the object as a whole. Distinguishable sub-features such as points or lines on the object are selected because they usually remain visible even in the presence of partial occlusion; tracking then involves feature extraction and feature matching. Low-level features such as points are easier to extract than higher-level features like lines and blobs. Thus, there is always a trade-off between feature complexity and tracking efficiency. As a good example of point-feature tracking, Polana and Nelson [29] use the centroid of the rectangular box that bounds the human as the feature point for tracking. As long as the velocities of the centroids can be distinguished, tracking of two subjects is still successful even when occlusion occurs during tracking.
In a nutshell, depending upon whether information about the object's shape is employed in the motion analysis, tracking over time involves feature extraction and establishing feature correspondence between consecutive frames, using either the strategy of matching an a priori shape model or the estimation of features related to position, velocity, shape, texture, color, etc.
3.3 The Proposed Head Tracker
Since, in our case, the head is observed to be always above the torso while executing any of our ten pre-defined classes of activity, and its position and relative movement in the video sequences are very indicative of the activity performed, we choose the head as the sub-feature for tracking. To be exact, the approximated head centroids, represented by x-y coordinate pairs extracted from the image frames captured by Camera 2, are used to form the basis of our feature vectors for tracking, posture estimation and subsequent human activity recognition. The flow chart for the module HT is shown in Figure 3.1.
Figure 3.1: Flow chart of our proposed Head Tracker (HT) module.
3.3.1 Human Object Segmentation & Head Centroid Estimation
As alluded to earlier, human tracking algorithms usually have considerable intersection with motion segmentation during processing; our module HT is no exception. Background subtraction is used again to detect the moving human object. And as before, standard image processing techniques like thresholding, filtering and morphological operations are applied to obtain the human blob of interest, as shown in Figure 3.2(a) to (d).

To obtain the approximated head centroid in each frame, we located the tip of the blob in the frame and started to box-bound the blob until the width ratio of the current row (of pixels belonging to the human blob) to the previous row was smaller than low or larger than high. These two thresholds, low and high, were chosen empirically as 0.91 and 1.07 respectively. In cases where the neck portion was obvious in the image, the low threshold would automatically be the one triggered; otherwise the high threshold would be, which usually implied that the shoulder of the subject had been reached. So, in either case, only the region belonging to the head of the human blob would be bound, as illustrated in Figure 3.2(e). The x- and y-coordinates of the centroid of this bounding box were then used to approximate the centroid of the human head in that frame. Following this, the module goes on to fulfill its next two functions, location estimation and HAR feature extraction.
Figure 3.2: Human object segmentation and head centroid estimation. (a) Background model; (b) motion detection; (c) background subtraction; (d) standard image processing; (e) head centroid estimation.
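A minimal sketch of the head-bounding scan just described (the thesis thresholds 0.91 and 1.07 are used; the helper name and the centroid convention are our assumptions):

```python
import numpy as np

LOW, HIGH = 0.91, 1.07  # empirical row-width ratio thresholds

def head_centroid(blob):
    rows = np.flatnonzero(blob.any(axis=1))
    top, bottom, prev_width = rows[0], rows[0], None  # start at the blob tip
    for y in rows:
        xs = np.flatnonzero(blob[y])
        width = xs[-1] - xs[0] + 1
        if prev_width is not None and not (LOW <= width / prev_width <= HIGH):
            break                      # neck narrowing or shoulder widening
        bottom, prev_width = y, width
    ys, xs = blob[top:bottom + 1].nonzero()
    # Centroid of the box bounding the head region, cf. Figure 3.2(e).
    return (xs.min() + xs.max()) / 2, top + (ys.min() + ys.max()) / 2
```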
3.3.2 Locations Estimation
This function aims to keep track of the people's whereabouts in the room and to have their trajectories plotted onto a 2D floor plan automatically. Because all our cameras are stationary, there exist linear relationships between the pixels in all three image planes. By using simple geometrical analysis, as shown in Figure 3.3, and knowing the correspondence between the image planes and the floor plan, we can estimate the absolute location of each subject on the floor plan based on the intersection of the two x-planes (i.e. image columns) of the human head centroids in the images, as depicted in Figure 3.4.
Figure 3.3: Geometrical analysis using Cam 1 & 2 images to estimate the absolute locations.

Figure 3.4: Exploiting the correspondence between image planes and the floor plan to estimate the absolute location based on the intersection of two x-planes.
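The geometry reduces to intersecting two lines on the floor plan, one per camera, each determined by the column in which the head centroid appears. A minimal sketch, under the assumption that each camera's column-to-floor-line mapping has been established beforehand (the helper names are ours):

```python
import numpy as np

def floor_line(cam_position, column_direction):
    # Line on the floor plan traced by one image column (x-plane):
    # a point (the camera's plan-view position) plus a direction vector.
    return np.asarray(cam_position, float), np.asarray(column_direction, float)

def absolute_location(line1, line2):
    (p1, d1), (p2, d2) = line1, line2
    # Solve p1 + t1*d1 = p2 + t2*d2 for the 2D intersection point.
    t = np.linalg.solve(np.column_stack([d1, -d2]), p2 - p1)
    return p1 + t[0] * d1
```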
A Maintaining identities

When there are two persons in the room, their identities need to be maintained at all times. To achieve this, the mean Red and Green values of the different human blobs in the first frame of the input sequence from all cameras are calculated and saved for comparison and differentiation of the subjects appearing in the later frames, so as to preserve each individual's identity throughout the video sequence. The RGB scheme is chosen because our images are inherently RGB; thus, color conversion is not necessary. And because we found that the mean Blue values of different persons do not vary much, the Blue channel is discarded so as to expedite the whole identity resolution process. To further streamline the real-time algorithm, this and the following occlusion-handling steps are skipped if there is only one human blob detected in the first frame – one person is then assumed for the rest of the sequence.
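A sketch of this identity bookkeeping (the signature and matching helpers are our names; per the thesis, only the mean Red and Green values are kept, with Blue discarded):

```python
import numpy as np

def color_signature(frame, blob):
    pixels = frame[blob]                             # pixels under the blob mask
    return pixels[:, 0].mean(), pixels[:, 1].mean()  # mean R, mean G

def match_identity(sig, stored):
    # Assign the blob to the stored subject with the closest (R, G) signature.
    dists = [np.hypot(sig[0] - s[0], sig[1] - s[1]) for s in stored]
    return int(np.argmin(dists))
```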
B Handling occlusions
Because there are three cameras in our system and each provides different information (see Figure 3.5), occlusion can be easily handled by evaluating the absolute location of each subject using the geometrical analysis on any two cameras as before (default: Cam 1 & 2).

Figure 3.5: Different cameras offer different views that help to handle occlusions.

Figure 3.6: A plan-view of the paths taken by two subjects in the room.
3.3.3 HAR Features Extraction
To understand human postures for activity recognition from images, knowing the absolute positions of the subjects in plan-view is not enough; we also need height information. Applying the 3D people tracking method of Focken and Stiefelhagen [4] is certainly a solution, but we observed that extracting 2D features is sufficient to meet our primary objective of studying the feasibility of connectionist HAR classifiers. We also found that when a camera has a good view of most of the work space, and the human activity is carried out within that space, information from just that camera is enough to classify the activity quite satisfactorily. For the study of HAR, we therefore choose only the images from Camera 2, which sees most of the work space. The estimated head centroids in these images are used to estimate the postures and form the basis of our HAR feature vectors.
On average, we observed that every class of the ten selected human activities could be completed comfortably within 28 frames by all actors. Hence, the head centroid estimation (i.e. bounding of the head and extraction of the x- and y-coordinates) was repeated for 28 frames in every activity sequence to form a 28x2 matrix for every subject. Thus, if we let A be the activity matrix for each activity j performed by an individual i, we could write:

$$A_{i,j} = [\,\mathbf{y}\;\;\mathbf{x}\,]_{i,j},\qquad(3.1)$$

where i = 1, 2, …, 20; j = 1, 2, …, 10; and $\mathbf{y}$ and $\mathbf{x}$ are the 28-element column vectors of the y- and x-coordinates, respectively, of the approximated centroid of the head in all the 28 frames.
To make the system invariant to the build, height and position of the subjects in the images, we conditioned each $A_{i,j}$ by taking the difference in coordinates over consecutive frames. This gave a new 27x2 difference matrix, $D_{i,j}$, or two 27-element feature vectors for an activity j performed by an individual i, $\mathbf{Y}_{i,j}$ and $\mathbf{X}_{i,j}$, i.e.

$$D_{i,j} = [\,y_{k+1}-y_k \;\;\; x_{k+1}-x_k\,]_{i,j} = [\,\mathbf{Y}\;\;\mathbf{X}\,]_{i,j} = [\,\mathbf{Y}_{i,j}\;\;\mathbf{X}_{i,j}\,],\qquad(3.2)$$

where k = 1, 2, …, 27 indexes the consecutive frame pairs in an activity sequence.
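In code, this conditioning is a single differencing step along the time axis (a sketch; the function name is ours):

```python
import numpy as np

def condition_features(A):
    # A is the 28x2 activity matrix [y x] of eq. (3.1) for one sequence.
    D = np.diff(A, axis=0)   # 27x2 difference matrix of eq. (3.2)
    return D[:, 0], D[:, 1]  # the feature vectors Y and X
```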
Figure 3.7: The differences in coordinates over consecutive frames are extracted as the two features for HAR.
As illustrated in Figure 3.7, the x- and y-coordinates of the estimated head centroid indirectly imply the posture that the subject has adopted in a scene. Thus, by tracking the head of a person walking across the room through a sequence of frames and taking the difference in coordinates of the approximated head centroid (i.e. the centroid of the bounding box, represented by the dotted cross hair) over consecutive frames, the results can be extracted to form the feature vectors $\mathbf{Y}_{i,j}$ and $\mathbf{X}_{i,j}$ for recognition of a walking activity in lateral view.
3.4 Database from Feature Vectors Extraction
For training and evaluation of the HAR classifiers, we built a database of 200 human activity sequences – ten different activities performed by 20 subjects, ten from each gender. Subjects are of various heights, builds and races. Refer to Figure 3.8 for some snapshots of the recorded sequences for three of the ten activities.

Figure 3.8: Snapshots of three activity sequences performed by our subjects of different gender, race and physique.

From these 20 subjects and ten classes of activity sequences per subject, we extracted 200 of the 27-element feature vectors, $\mathbf{Y}_{i,j}$ and $\mathbf{X}_{i,j}$, using the steps explained in the previous sub-section. Hence, we could form two feature matrices $D_Y$ and $D_X$ for all our training and testing purposes, where
$$D_Y = [\,\mathbf{Y}_{i,j}\,],\;\forall\,i,j \qquad\text{and}\qquad D_X = [\,\mathbf{X}_{i,j}\,],\;\forall\,i,j.\qquad(3.3)$$

Whenever it is more efficient to represent both features as only one feature vector (e.g. in the NNR and EN cases), the two feature matrices, $D_Y$ and $D_X$, are concatenated and the new feature matrix is denoted as

$$D = \begin{bmatrix} D_Y \\ D_X \end{bmatrix},\qquad(3.4)$$

which is of dimension 54 rows by 200 columns.
3.5 Method of Classification Performance Estimation
In order to accurately evaluate the performance of the proposed HAR classifiers (as well as that of the traditional classifiers), the training samples must not be used to test the classifiers. These feature matrices must be divided into independent training and test datasets. There are many ways to achieve this, but for our relatively small sample size, k-fold Cross-Validation (CV) [30] was deemed to be the most suitable and efficient. In this study, k = 4 is conveniently chosen for all experiments.
Basically, this four-fold CV scheme divides the available data, which can be either the extracted vectors database or the output from a previous stage in the hybrid case, into four mutually exclusive subsets (or 'folds') of equal size, consisting of 50 samples each (i.e. five samples from each class of activity). For each of the four CV runs, the classifier is trained on three of the subsets and tested on the remaining one. Hence, instead of reserving a portion of the available data solely for testing purposes, the rotation of the role of test data among the subsets enables the use of every sample for training. This greatly improves the efficiency of the limited dataset and, at the same time, gives a more accurate estimation of the classifier's performance. The CV estimate of accuracy is obtained from the overall number of correct classifications from all four runs, divided by the total number of instances in the dataset, which is 200 in this study.
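A minimal sketch of this four-fold CV loop (the train/test callables are our placeholders, and we assume the data are ordered so that each fold is stratified with five samples per class):

```python
import numpy as np

def four_fold_cv(features, labels, train_fn, test_fn, k=4):
    n = len(labels)                          # 200 sequences in this study
    folds = np.array_split(np.arange(n), k)  # four folds of 50 samples
    correct = 0
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(features[train], labels[train])
        # test_fn returns the number of correct classifications on the fold.
        correct += test_fn(model, features[folds[i]], labels[folds[i]])
    return correct / n                       # CV estimate of accuracy
```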
3.6 Summary

In this chapter, the formulation of the Head Tracker (HT) module has been presented; in the process, the HT algorithm has been demonstrated to be fast and accurate for real-time use.
The main focus of the module, however, is to extract the two features from the estimated head centroids for the next module, the Activity Classifier (AC). The tracking yielded two important elements for posture estimation – the x- and y-coordinates; their differences in coordinates over consecutive frames from Camera 2 were used to form our feature vectors and to construct our database for the training and testing of the module AC.
The chapter ended with some discussion of how this database can be used efficiently for more accurate performance evaluation of the AC. In the following two chapters, we will see how the module AC can be implemented using either the traditional NNR and HMM classifiers or the proposed NN-based approaches.
Chapter 4 Traditional Activity Classifiers

The NNR classifies a test sample by comparing it, in terms of some distance, with examples in the training set. If we define $T = \{T_1, T_2, \ldots, T_R\}$ to be a set of R labeled training prototypes and let $D^* \in T$ denote the prototype nearest to a test sample $D$ in the feature space, then, in a J-class problem, the NNR algorithm assigns $D$ the same class label $\omega_j$ as $D^*$, where $j \in \{1, 2, \ldots, J\}$. Using the notation in equation (3.2) from section
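A minimal sketch of this decision rule (the 54-dimensional concatenated feature vectors of eq. (3.4) are assumed as input; the function name is ours):

```python
import numpy as np

def nnr_classify(D, prototypes, proto_labels):
    # Assign D the label of its nearest prototype D* (Euclidean distance).
    dists = np.linalg.norm(prototypes - D, axis=1)
    return proto_labels[int(np.argmin(dists))]
```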