AUTOMATED HUMAN ACTIVITY RECOGNITION IN SMART ROOM
HENRY C C TAN
B.Eng (Hons)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2003
Acknowledgments

My heartfelt gratitude goes to Dr Liyanage C De Silva, whose gentle guidance and strong belief in my accomplishing the project have made writing this thesis possible.

Words are just not enough to express my sincere gratitude to my family and my fiancée, Siok Luan, who have been constantly encouraging me and making sacrifices to see me through my postgraduate study during the last couple of years.

I am also indebted to my fellow colleagues in the 'Smart Room' project and the Communications Lab, especially Mr Chathura R De Silva, Mr Jia Kui, Mr Soh Thian Ping, Miss Tin Lay Nwe, Dr Cao Yewen, Dr Huang Lei and Dr Zhi Wanjun. They were my "run to" guys whenever I had a doubt or just wanted a quick answer.

I am very appreciative of the 20 volunteers who enthusiastically participated as subjects for the construction of the human activity database used in this project.

To everyone, I owe you my big THANK YOU.
Table of Contents

Abstract
List of Abbreviations
List of Figures & Tables
Chapter 1 Introduction
1.1 Background
1.2 Human Activity Recognition (HAR)
1.2.1 Related HAR Systems
1.2.2 Proposed HAR Classifiers
1.3 System Proposal
1.4 Thesis Organization
Chapter 2 Formulation of the Person Detector
2.1 Overview
2.2 Review of Motion Segmentation
2.3 Review of Human Object Verification
2.3.1 Object Classification
2.3.2 Face Detection
2.4 The Proposed Person Detector
2.4.1 Motion Segmentation
2.4.2 Object Classification
2.4.3 Face Detection
2.5 Summary
Chapter 3 Formulation of the Head Tracker
3.1 Overview
3.2 Review of Human Tracking
3.2.1 Model-based Tracking
3.2.2 Region-based Tracking
3.2.3 Active-contour-based Tracking
3.2.4 Feature-based Tracking
3.3 The Proposed Head Tracker
3.3.1 Human Object Segmentation & Head Centroid Estimation
3.3.2 Locations Estimation
3.3.3 HAR Features Extraction
3.4 Database from Feature Vectors Extraction
3.5 Method of Classification Performance Estimation
3.6 Summary
Chapter 4 Traditional Activity Classifiers
4.1 Overview
4.2 Nearest Neighbor Rule (NNR)
4.2.1 Review of NNR and k-NNR
4.2.2 Applying k-NNR to HAR
4.3 Hidden Markov Model (HMM)
4.3.1 Review of Discrete HMM
4.3.2 Applying HMM to HAR
4.4 Summary
Chapter 5 Proposed Activity Classifiers
5.1 Overview
5.2 Review of Neural Network (NN)
5.3 Elman Network (EN)
5.3.1 Motivation of using EN
5.3.2 Applying EN to HAR
5.4 HMM-NN Hybrid
5.4.1 Motivation of using HMM-NN
5.4.2 Applying HMM-NN hybrid to HAR
5.5 NN-HMM Hybrid
5.5.1 Motivation of using NN-HMM
5.5.2 Applying NN-HMM hybrid to HAR
5.6 Summary
Chapter 6 Results and Discussions
6.1 Overview
6.2 Experiments and Results
6.2.1 Recognition using the k-NNR
6.2.2 Recognition using the HMM
6.2.3 Recognition using the EN
6.2.4 Recognition using the HMM-NN
6.2.5 Recognition using the NN-HMM
6.3 Discussions
Chapter 7 Conclusions
7.1 The Person Detector
7.2 The Head Tracker
7.3 The Activity Classifier
7.4 Concluding Remarks
Chapter 8 Future Work
8.1 Enhancing the Person Detector
8.2 Enhancing the Head Tracker
8.2.1 True Identity
8.2.2 Automated Systems
8.2.3 Multiple Subjects Tracking
8.3 Enhancing the Activity Classifier
8.3.1 Model Selection
8.3.2 Training Algorithms Issues
8.3.3 Expanding the Database
8.3.4 Continuous Complex HAR
8.3.5 Real-time Activity Classifier
Author's Related Publications
References
Abstract

Traditionally, human activity recognition has been achieved mainly by statistical pattern recognition techniques such as the Nearest Neighbor Rule (NNR), and by state-space methods, e.g. the Hidden Markov Model (HMM). In this work, we propose three novel approaches – the use of the Elman partial Recurrent Neural Network (EN) and two hybrids of Neural Network (NN) and HMM, i.e. HMM-NN and NN-HMM – to recognize ten distinct human activities, e.g. walking, sitting and squatting, in a smart room environment. To achieve this, a three-level framework has been suggested, which first detects and verifies the presence of a person, then tracks the subject's head movement over consecutive frames to extract the difference in coordinates as a feature vector that is invariant to the person's sex, race and physique, and finally classifies the activities performed using one of the three proposed classifiers. For performance comparison, the two traditional classifiers using the NNR and HMM methods were also implemented. Experimental results based on our database of 200 human activity color image sequences show that all three proposed approaches perform better than the conventional methods. The traditional NNR classifier implemented has the lowest recognition accuracy, of only 85.5%, whilst the proposed HMM-NN hybrid attained the best performance, of 96.5%. Estimated time-complexity comparison also indicates that the HMM-NN and NN-HMM hybrids are only a few orders higher than the traditional HMM method. Given the higher trainability, flexibility and discriminative power of the NN, the encouraging results not only reveal the significant performance improvement from augmenting the traditional HMM with an NN, but also demonstrate the greater potential of our proposals over the traditional classifiers in realizing recognition of continuous and complex activity in the increasingly popular human-activity-based applications.
List of Abbreviations

k-NNR   Generalized Nearest Neighbor Rule
YCbCr   Luminance (Y) and Chrominance (CbCr) separated color space
2D, 3D  2- or 3-dimensional
List of Figures & Tables

Figure 1.1: Plan-view of the smart room and the camera locations
Figure 1.2: The three-module framework of our proposed system
Figure 2.1: Flow chart of our proposed Person Detector (PD) module
Figure 2.2: Motion segmentation and human object verification from image captured
Figure 2.3: Our human-skin-color model – area bounded by the four straight lines
Figure 3.1: Flow chart of our proposed Head Tracker (HT) module
Figure 3.2: Human object segmentation and head centroid estimation
Figure 3.3: Geometrical analysis using Cam 1 & 2 images to estimate the absolute locations
Figure 3.4: Exploiting the correspondence between image planes and the floor plan to estimate the absolute location based on the intersection of two x-planes
Figure 3.5: Different cameras offer different views that help to handle occlusions
Figure 3.6: A plan-view of the paths taken by two subjects in the room
Figure 3.7: The differences in coordinates over consecutive frames are extracted as the two features for HAR
Figure 3.8: Snapshots of three activity sequences performed by our subjects of different gender, race and physique
Figure 4.1: Flow chart of k-NNR HAR classifier
Figure 4.2: The graph structure of the three-state ergodic HMM employed
Figure 4.3: Block diagram of the HMM-based HAR classifier
Figure 5.1: Structure of a single-hidden-layer Multi-Layer Perceptron (MLP)
Figure 5.2: Structure of the Elman partial recurrent neural network (EN)
Figure 5.3: Block diagram of EN-based HAR classifier
Figure 5.4: Block diagram of the HMM-NN hybrid classifier for HAR
Figure 5.5: Block diagram of the NN-HMM hybrid classifier for HAR
Figure 6.1: k-NNR recognition rate as a function of the number of nearest neighbors used, k
Figure 6.2: HMM recognition rate as a function of the number of states, S
Figure 6.3: Initial search: EN recognition rate as a function of the number of hidden units
Figure 6.4: Refined search: EN recognition rate as a function of the number of hidden units
Figure 6.5: HMM-NN recognition rate as a function of the number of MLP hidden units
Figure 6.6: NN-HMM recognition rate and labelers' classification rate as functions of the number of MLP hidden units
Table 6.1: Recognition rate and time complexity comparisons for the five HAR classifiers
Table 6.2: Classification results of 200 activity sequences using the proposed HMM-NN hybrid
Chapter 1 Introduction
The primary objective of this thesis is to show that human activity can be recognized efficiently and accurately using the artificial Neural Network (NN) and its combinations with the Hidden Markov Model (HMM), in the forms of HMM-NN and NN-HMM hybrids.
1.1 Background
The attention given to human-motion-based research from the computer-vision community has been on the rise. This is because of the rapid technological development of image-capturing software and hardware, in addition to the omnipresence of reasonably low-cost, high-performance personal computers. These new technological advances have made vision-based research much more affordable and efficient than ever before. The main motivation, however, comes from its application in a myriad of different challenging but rewarding problems that include, but are not limited to, automated surveillance systems, human-machine interaction, content-based retrieval, military simulation, clinical gait analysis and sports [1,2]. Any of these projects usually involves one or more of the major vision-based research areas, such as motion detection, human presence verification, human object tracking, people identification by face recognition or other biometric means, advanced user-interface via facial expression recognition, action logging through motion and posture classification, and human activity recognition.

Growing efforts have been put into combining the various research areas such that more 'intelligence' is bestowed on the computer and its environment, so as to enable them to understand and interact with the human users. Some examples of recent work include the smart classroom by Ren and Xu [3] that recognizes the teacher's natural complex action; the real-time distributed system using multiple calibrated cameras to track 3D locations of people in a smart room by Focken and Stiefelhagen [4]; and the EasyLiving project by Krumm et al. [5] that reports the location and identity of the people in an intelligent living room equipped with two sets of color stereo cameras. It is apparent that there is a common desire to come up with more intelligent machines and discerning vision systems, which synergize to produce 'smart' results unobtainable by each entity itself.
Here at our laboratory, a 'smart room' has also been set up. It is an enclosed office environment that aims to offer services to its users based on the information it perceives. Several major areas of research are conducted here. There is a facial expression recognition project that tries to understand the mood of the users judging from their expressions. Similar to it, but using audio information instead, a speech recognition system distinguishes the users' emotional states. Work on a face recognition system that tries to maintain the identity of the users is also in progress. There are also gesture recognition projects that interpret the users' interaction based on their body and hand gestures. The areas of interest for this thesis research project are the tracking of human objects and the recognition of human activity within the smart room. Knowing the identity of people, their locations and the activity taking place in the room is the most vital prerequisite for many of the services that the smart room can provide. Examples of services are displaying an appropriate message on an LCD panel for a particular user when he enters the room; zooming in for a close-up video capture when someone has been loitering around the cabinet containing 'Confidential/Secret' documents for too long; sending an alarm to the security department if user X is found sitting and using the computer at the desk of user Y who has just left the room – signaling a case of gaining unauthorized access; etc. All these require knowledge of the whereabouts and identities of the users as well as recognition of the human activity in the smart room.
To this end, we have developed algorithms for real-time detection and tracking of persons (in C++) and offline classifiers (in Matlab) for simple human activity in our smart room equipped with multiple cameras. We focus principally on the detection and verification that humans are indeed present, tracking and estimating their locations while maintaining their identities, and, ultimately, recognition of ten distinct activities, which are walking, sitting down, getting up, squatting down and standing up, in both lateral and frontal views.
1.2 Human Activity Recognition (HAR)
According to Polana and Nelson [6], actions can be classified into three categories, namely events, temporal textures and activities. Events are isolated simple motions that do not exhibit any temporal or spatial repetition; temporal textures exhibit only statistical regularity, being motion patterns of indeterminate spatial and temporal extent; whereas activities consist of motion patterns that are temporally periodic and possess compact spatial structure. They cite 'opening a door', 'ripples on water' and 'walking' as examples for the three groups, respectively.
Adhering to the definition of activities, and including also whole-body motion events, e.g. standing up and sitting down, we deal with the recognition of this group of human activity in this work and compare our methods with some approaches in the same area.
1.2.1 Related HAR Systems

More recent studies include Sun et al. [7], wherein a model-based approach is used for motion estimation. The likelihood of the observed motion parameters is computed based on a multivariate Gaussian probabilistic model. The temporal change of the likelihood is modeled using the HMM, which is then applied to recognize eight simple human activities (turning of the body from left to front, front to left, right to front and front to right; stand up; sit down; going to sit but returning to standing; and going to get up but returning to a sitting position) and eight more complex martial art activities. A high recognition rate, above 91%, on 160 test sequences has been reported.
The system by Ben-Arie et al. [8] uses a novel template matching approach called Expansion Matching (EXM) for human body tracking, and a method of multidimensional hash table indexing followed by a sequence-based voting scheme for view-based recognition of eight human activities (jump, kneel, pick, put, run, sit, stand and walk) in various viewing directions. It gives 100% recognition with 40 test videos.
In the work of Ali and Aggarwal [9], using background subtraction and skeletonization to generate stick figure models for tracking, they recognize seven continuous activities (walk, sit, stand up, bend, get up, squat and rise) in 20 sequences in lateral view based on the Nearest Neighbor Rule (NNR). They achieve 76.92% accuracy.
Another NNR classifier, developed by Madabhushi and Aggarwal [10], classifies nine activities in lateral view (stand up, sit down, bend down, get up, walk, rise, squat, side bend and hug) and three in frontal view (frontal rise, squat and side bend), based on the movement of the human head. With 41 test sequences, their classification rate is 83%.
In their earlier work, Madabhushi and Aggarwal [11] also track the head movement, but using a Bayesian framework instead, to recognize ten activities (sit, stand, bend, get up, walk, hug, bend sideways, squat, rise from squat, fall down), some in lateral view and some in frontal view. Using a database of 77 action sequences, of which 39 are used for testing, the success rate is 79.74%.
1.2.2 Proposed HAR Classifiers
As we have observed, the connectionist techniques, and their hybrids in the form of HMM-NN or NN-HMM, have neither been exploited nor reported in the literature of HAR. It is thus the objective of this thesis to approach this long-standing problem with three solutions based on the artificial Neural Network, and eventually compare their performance with that of the popular traditional HAR classifiers – NNR and HMM.
In the first proposal, the classifier based on the Elman model of the partial Recurrent Neural Network (RNN), or simply the Elman Network (EN), is advocated. It is chosen for its internal representation of time; its ability to remember input from the previous frame and to develop an 'understanding' of the context of the input makes it a suitable candidate for the time-varying recognition problem at hand.
The second classifier consists of ten HMMs (each one trained specifically for a class of activity) and a single-hidden-layer Multi-Layer Perceptron (MLP) NN. The MLP, known for its better classification capability than the HMMs, is used to classify the activity based on the likelihood functions for the ten classes computed by the HMMs. This combination is known as the HMM-NN hybrid.
In our final proposal, an NN-HMM hybrid, two MLPs are trained as labelers for ten HMMs, which are time-scale-invariant classifiers at the sequence level. The MLP, being both trainable and discriminative, is better than the ordinary vector quantizer used in the traditional HMM; hence, this proposed hybrid is also expected to perform better than the traditional HMM classifier.
1.3 System Proposal

Our smart room has a setup that includes three fixed Sony CCD color camera installations, one Sony camera adaptor, three Euresys Picolo frame grabbers of rate 25 fps, and one Pentium-4 PC server running on Win 2000, installed with the digital video recording and processing software Video Savant v3.0. Not calibrated for stereo vision, all cameras are used monocularly. Cameras are installed as shown in Figure 1.1. Camera 1 is fixed on the wall directly facing the door; the other two are on one side of the room, Camera 2 closer to the door and Camera 3 farther away. They are all located at about the same height, approximately 1.85 m from the floor. Furniture in the room includes desks with PCs, three cupboards and a filing cabinet.
Figure 1.1: Plan-view of the smart room and the camera locations.
Before human activity can be recognized and interpreted from the digital video recordings, human detection, tracking and posture estimation, from one frame to another in the image sequence, must generally be accomplished [12]. Deriving from this concept, we propose a system that follows a three-level framework for human activity recognition. At the lowest level, the goal is to detect and verify that a human is entering the smart room. At the intermediate level, the aim is to keep track of the human's whereabouts in the room and extract meaningful features for posture estimation and for subsequent activity classification at the topmost level.

To facilitate discussion and implementation, we make each of the three levels a module and name the low, intermediate and top levels the Person Detector (PD), the Head Tracker (HT) and the Activity Classifier (AC) respectively, as depicted in Figure 1.2.
[Figure 1.2 diagram: Video Streams → Person Detector (PD) → Head Tracker (HT) & Feature Estimation → Activity Classifier (AC): EN, HMM-NN or NN-HMM (also NNR or HMM)]

Figure 1.2: The three-module framework of our proposed system.
Briefly, they work as follows. By using background subtraction for the motion segmentation, and shape-based object classification and feature-invariant face detection techniques for human object verification, module PD verifies whether a moving object entering the room is human. As soon as it confirms that a person is indeed present, module HT starts to perform feature-based human tracking for the location estimation and posture estimation needed for subsequent activity recognition.
Adopting the assumption made in the Hydra project [13], we set the constraint that our heads are always "above" our torsos and use the approximated head centroid as the sub-feature for human tracking. This method yields two important elements (the x- and y-coordinates of the approximated head centroid) for posture estimation, and extracts the difference in coordinates over consecutive frames in an activity sequence as the features for recognition of this particular activity via AC.
In addition, as a means to keep track of the whereabouts of the person, or of up to two persons presently, a straightforward geometrical analysis of the image planes and their correspondence to the floor plan is used to estimate the absolute locations of the people in the room. The ability to track and identify the various individuals would be useful for the study of multiple-subject activity recognition and understanding their interactions. The module HT uses the mean Red and Green values of the different subjects for maintaining their identities and handling occlusions.
Finally, the offline module AC, which classifies single-person activity, is implemented based on whichever of our three proposed HAR classifiers (EN, HMM-NN and NN-HMM) gives the highest performance. For comparison purposes, the two traditional HAR algorithms (NNR and HMM) are also used to implement the module AC.
For the training and performance evaluation of all the classifiers, we constructed a common database of 200 activity sequences, comprising ten activities performed by 20 subjects.
1.4 Thesis Organization
The remainder of this thesis is structured as follows.

Chapter 2 studies the techniques available for human detection and explains the formulation of the module PD and how its objectives of detecting and verifying that a person is entering the room are met.

Chapter 3 begins with a survey of the various human tracking approaches for posture estimation and activity recognition. The focus is then on formulating the module HT using the feature-based tracking method to achieve location estimation and feature extraction for the ensuing module, AC.

Chapter 4 reviews the NNR and HMM algorithms and shows how these popular traditional HAR classifiers are integrated into our system, for subsequent comparison with our proposed classifiers.
Chapter 5 starts with an introduction to NN, specifically the MLP. It then describes the motivation and algorithms of the three proposed HAR classifiers. The implementation of AC using these proposals is also presented. In addition, the modified EBP learning rule used to achieve faster convergence for the EN and MLP is detailed.
Chapter 6 documents the main experiments conducted and presents the results obtained. Comparisons between the various classifiers' recognition rates and time complexities are made. The chapter ends with some discussions of these results and of how the proposed classifiers can be improved.
Chapter 7 summarizes the main conclusions of this thesis and, finally,

Chapter 8 sketches our future directions of research.
Chapter 2 Formulation of the Person Detector

2.1 Overview

Human detection is the cornerstone of a human motion analysis system, since the subsequent tracking and recognition processes depend heavily on it. This process usually involves motion segmentation and human object verification.
2.2 Review of Motion Segmentation
Motion segmentation in video sequences is known to be a significant and difficult problem, which primarily aims at detecting regions corresponding to moving objects, such as people, in the scenes. This provides a focus of attention for later processes, because only those changing pixels need to be examined. However, changes from illumination, shadow, repetitive motion from clutter, and even weather if it is an outdoor scene, make motion segmentation all the more difficult to process quickly and reliably.
Many techniques are available for motion segmentation. Using either temporal or spatial information of the images, they can be classified roughly into one of four main approaches. In an example of the statistical motion segmentation method by Wren et al. [14], the human subject is segmented by modeling it as a connected set of blobs, each of which has its own spatial and color statistics. Alternatively, segmentation of the mobile subject can be achieved via optical flow estimation of the object's motion and position. For example, Meygret and Thonnat [15] combine both stereovision and optical flow motion information to segment out mobile 3D objects. Equally popular is the temporal differencing approach to motion segmentation, such as the work by Huwer and Niemann [16], where the difference images of consecutive frames are used to detect continuously moving subjects. Last but not least, the simple but effective background subtraction is also widely used to detect moving human objects in video images. As an example, Seki et al. [17] evaluate the Mahalanobis distance between the averages of the background image vectors and the newly observed image vectors to detect objects. As background subtraction was found to be computationally efficient and robust enough for our use in the indoor office environment, it was adopted as the means of motion segmentation in our PD implementation.
2.3 Review of Human Object Verification
Different moving regions may correspond to different moving objects in a scene. For instance, the image sequences captured in our office environment may include an office cleaner (human) guiding a vacuum cleaner (non-human) to suck dirt from the floor, shoving roller office chairs (non-human) around and sending them into motion. Shadows and moving objects can be mistaken for a mobile human if motion is the only criterion applied. To analyze human activity, it is necessary that we correctly distinguish the actor from the other, non-human moving objects. This can be accomplished in many different ways. We shall only focus on uncomplicated automated vision-based approaches suitable for real-time use, such as employing object classification to verify that a moving object is human and then double-checking the result using face detection techniques.
2.3.1 Object Classification
The goal of object classification is to extract the region corresponding to people from all the moving blobs obtained by the motion segmentation methods discussed before. There are two main categories of approaches to moving object classification, namely shape-based and motion-based classification. In shape-based classification, different descriptions of the shape information of motion regions, such as representations of points, boxes, silhouettes and blobs, are available for classifying moving objects [12]. On the other hand, motion-based classification uses the periodic property of non-rigid articulated human motion as the cue for moving object classification, as in the work of Cutler and Davis [18]. Since not all of our activities of interest are periodic or self-similar, we rely on the shape-based classification method for a clue as to whether the moving object is human or not.
2.3.2 Face Detection
Knowing that the moving object in the scene could be a person, the presence of a human face can almost always confirm it. While there are many image processing techniques to delineate a human face in an image, the feature-invariant approach is the most commonly used. It is relatively faster than many other popular techniques such as the knowledge-based, template-matching and appearance-based methods [19]. The feature-invariant method seeks to localize invariant features of human faces for detection. The underlying assumption is based on the observation that humans can effortlessly detect faces and objects in different poses and lighting conditions, so there must exist properties or features that are invariant over this variability. Using features such as facial features (e.g. eyebrows, eyes, nose, mouth and hairline), facial texture, skin color, and the shape and size of the face, many methods have been proposed and demonstrated to be efficient in localizing and detecting faces [19]. One of our previous studies found that detecting faces via skin color is the fastest among all facial features [20]. Although skin color varies in people from different ethnic groups, studies have shown that all skin colors can be approximated by a map in YCbCr space [21,22]. It was also found that the luminance value (Y) of the skin color is heavily dependent on the camera and the environment, and all major differences between varying skin tones were observed to lie in their color intensities rather than in the facial skin color itself. This implies that the luminance signal Y does not contain useful information pertaining to facial skin-color detection. In this work, we discard the Y signal and employ only the more manageable CbCr color space for skin-color detection, in addition to the facial shape and size feature-invariant face detection methods.
2.4 The Proposed Person Detector
Putting together the above techniques and concepts, we formulate our first module of the system for human detection. The flow chart of our implementation is shown in Figure 2.1.
Figure 2.1: Flow chart of our proposed Person Detector (PD) module.
2.4.1 Motion Segmentation
Motion segmentation is achieved by background subtraction to extract the moving object from the background in each frame captured, as follows. The background, as shown in Figure 2.2(a), was modeled by computing the mean for each pixel in the color images over a sequence of 50 frames, which were taken prior to the presence of any motion, e.g. Figure 2.2(b).
Figure 2.2: Motion segmentation and human object verification from image captured. (a) Background model; (b) motion detection; (c) background subtraction; (d) thresholding, filtering and morphology; (e) skin color detection; (f) face size and shape check.
Next, the background-subtracted image, Figure 2.2(c), was subjected to image thresholding, median filtering to remove 'salt and pepper' noise and, finally, some standard morphological operations to segment out the moving object, as shown in Figure 2.2(d).
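As a concrete illustration, the following is a minimal sketch of this segmentation step (our actual implementation is in C++; the function names, the difference threshold and the morphology settings here are assumptions for illustration only):

```python
import numpy as np
from scipy import ndimage

def build_background_model(frames):
    # Per-pixel mean of ~50 motion-free color frames, cf. Figure 2.2(a).
    return np.stack(frames).mean(axis=0)

def segment_moving_object(frame, background, diff_thresh=30.0):
    # Background subtraction on the color image, summed over channels.
    diff = np.abs(frame.astype(float) - background).sum(axis=2)
    mask = diff > diff_thresh                   # thresholding
    mask = ndimage.median_filter(mask, size=3)  # remove 'salt and pepper' noise
    # Standard morphology: opening then closing to clean up the blob.
    mask = ndimage.binary_opening(mask, iterations=2)
    mask = ndimage.binary_closing(mask, iterations=2)
    return mask
```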
2.4.2 Object Classification
From this segmented region, a straightforward shape-based object classification is carried out. Using the average number of pixels of ten human subjects entering the door (which is the farthest point in the room opposite Camera 1) as the threshold, a moving object is classified as 'possibly a human' if its total number of pixels in the image is greater than our preset threshold, T, which is heuristically chosen as 18000 pixels (out of the possible 442368 pixels for our resolution of 576x768 pixels per image).
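In code, this check reduces to a blob-size test on the segmented mask (a sketch; the helper name is ours):

```python
T = 18000  # preset threshold for a 576x768 image, as chosen above

def possibly_human(mask):
    # 'Possibly a human' if the moving blob exceeds T pixels in total.
    return int(mask.sum()) > T
```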
2.4.3 Face Detection
From the 'possibly a human' segmented region, feature-invariant face detection then follows. Here, a three-tiered approach is proposed. We first detect the skin-color feature in the 'possibly human' blob. Then, for all the skin-color regions detected, if there are any, the face size and then the face shape are compared with our face model to ascertain that the moving object is indeed a human.
A Skin color detection
For the development of our hypothesis, we gathered from the internet skin samples belonging to different ethnic groups such as Caucasian, Negroid, Asian, etc., to form our human-skin-color model. Hence, the resultant skin region in the CbCr space could be approximated by the area surrounded by the four lines, as plotted in Figure 2.3.
Therefore, by applying the image of the segmented blob as a mask over the original image when motion was detected, and examining only the pixels under the mask area in the original image, a pixel is identified as having skin color if its Cr color value, r, satisfies the four straight-line bounds, (2.1), that enclose the skin region of Figure 2.3, where b is the Cb color value of that pixel.

Figure 2.3: Our human-skin-color model – area bounded by the four straight lines.

In this manner, all the skin color regions of the input image taken from Camera 1 are extracted, as shown in Figure 2.2(e).
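A hedged sketch of this skin test follows; the RGB-to-CbCr conversion uses the standard ITU-R BT.601 formulas, but the rectangular CbCr bounds below are illustrative placeholders standing in for the four straight-line bounds of (2.1), whose exact coefficients are given by the model in Figure 2.3:

```python
import numpy as np

def rgb_to_cbcr(img):
    r, g, b = (img[..., i].astype(float) for i in range(3))
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b  # ITU-R BT.601
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr

def skin_mask(img, blob_mask):
    cb, cr = rgb_to_cbcr(img)
    # Placeholder box standing in for the four straight-line bounds of (2.1).
    inside = (cb > 77) & (cb < 127) & (cr > 133) & (cr < 173)
    return inside & blob_mask  # examine only pixels under the blob mask
```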
B Face size check
To differentiate a potential facial region from skin regions that are too small to be considered as a face, a threshold S of an empirical value of about 700 pixels is used. This value is obtained from the average number of pixels of the faces of ten people entering the door. Thus, only the connected pixels of area greater than S will be further examined for consideration as being a human face.
C Elliptical face shape model
Having passed the skin-color and face size checks, the extracted 'potential face area' is further evaluated using a human face shape model. By considering the contours of merged skin color regions that are greater than 700 pixels and approximating them by ellipses, the face can be easily singled out from other skin-colored body parts, e.g. arms, shoulders, etc. As established in [23], if an approximated ellipse has the following property, it is a candidate for a face:

(Major Axis ÷ Minor Axis) < 1.7    (2.2)

As shown in Figure 2.2(f), the major and minor axes of the bounding box are used to approximate those of the 'face' area. If (2.2) is satisfied, we say that a human face is present and a person has been detected.
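Combining the size check (B) and the shape check (C), a minimal sketch of the face verification might look as follows (the bounding-box approximation of the ellipse axes follows Figure 2.2(f); the function names are ours):

```python
from scipy import ndimage

S = 700  # empirical face-size threshold in pixels

def contains_face(skin):
    labels, n = ndimage.label(skin)      # connected skin-color regions
    for i in range(1, n + 1):
        region = labels == i
        if region.sum() <= S:            # face size check (B)
            continue
        ys, xs = region.nonzero()
        h = ys.max() - ys.min() + 1      # bounding-box sides approximate
        w = xs.max() - xs.min() + 1      # the ellipse's major/minor axes
        if max(h, w) / min(h, w) < 1.7:  # elliptical shape check, eq. (2.2)
            return True
    return False
```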
2.5 Summary
In this chapter, the role of human detection as the cornerstone of human activity recognition has been stressed. It is typically achieved by motion segmentation followed by human object verification. In this research, we adopted background subtraction for motion segmentation, and blob-size shape-based object classification together with our recommended three-tiered feature-invariant face detection, using a new skin-color model, face size and face shape, for the human object verification. As there was no complex computation involved, our PD implementation has been shown to be efficient as a real-time algorithm.
Chapter 3 Formulation of the Head Tracker

3.2 Review of Human Tracking
It is a challenge to understand the postures of human objects in the scenes that make up an image sequence. This is because a moving human is a non-rigid object, of deformable form and highly capable of adopting many different poses. Human tracking is particularly useful in human activity recognition, since it serves as a means to prepare data for posture estimation and the ensuing activity recognition. In contrast to the human detection discussed in the previous chapter, human tracking is a higher-level computer vision problem. However, the tracking algorithms within the study of human motion analysis usually have considerable intersection with motion segmentation during processing.
Tracking can be divided into various categories according to different criteria [24]. As far as tracked objects are concerned, tracking may be classified into tracking of human body parts, such as the face, hand and leg, and tracking of the whole human body. If the dimension of the tracking space is of particular interest, there is 2D versus 3D tracking. And if the number of views is considered, there are single-view, multiple-view and omni-directional-view tracking. In addition, tracking can be grouped according to other criteria such as the number of tracked humans (single subject, multiple subjects, human groups), the tracking environment (indoors vs outdoors), the multiplicity of the sensor (monocular vs stereo), the state of the camera (moving vs stationary), etc.

Nevertheless, there are essentially four widely used methods of tracking motion, namely model-based, region-based, active-contour-based and feature-based tracking.
3.2.1 Model-based Tracking
In the model-based tracking method, where the geometric structure of the human body is used, the body segments can be approximated as lines (in the case of the stick figure model), 2D ribbons (in the 2D contour model) or 3D volumes (in the volumetric model). The motion of joints provides a key to motion estimation and recognition in stick-figure model-based tracking. For example, Guo et al. [25] represent the human body structure in the silhouette by a stick figure model which has ten sticks articulated with six joints. The motion estimation problem is transformed into finding a stick figure with minimal energy in a potential field. Prediction and angle constraints on individual joints were also introduced to reduce the complexity of the matching process. In the 2D contour model, the human body representation is directly related to the human body's projection in the image plane. In such a description, human body segments are analogous to 2D ribbons or blobs. For instance, Ju et al. [26] propose a cardboard people model that represents the human limbs by a set of connected planar patches. The parameterized image motion of these patches is constrained to enforce articulated movement and is used to deal with the analysis of articulated motion of human limbs. As the camera's angle poses some restrictions on the 2D model, a considerable amount of research has been done to depict the geometric human body structure in more detail using 3D models, such as elliptical cylinders, cones, spheres, etc. Although the more complex 3D volumetric models yield better results, they require many more parameters, resulting in a much more computationally expensive matching process. However, the advantage of the 3D human model lies in its ability to handle occlusion, and more significant data can be obtained for activity analysis. An example is the work of Kakadiaris and Metaxas [27], wherein they use three calibrated, mutually orthogonal cameras to track human motion in the presence of occlusion. The selection of a time-varying set of cameras at each time frame is based on the visibility of each part and the observability of its predicted motion from a given camera.
3.2.2 Region-based Tracking
This second means of tracking motion is to identify a connected region associated with each moving object in an image and then track it over time using a cross-correlation measure. For example, Pfinder [14] uses small blob features to track a person in an indoor environment. The human body is considered as a combination of blobs representing various body parts such as the head, torso and four limbs. Modeling the human body and background scene with Gaussian distributions, the pixels belonging to the body are assigned to the different body-part blobs using a log-likelihood measure. Hence, by tracking the region of each small blob, the moving person can be tracked. Generally, the region-based tracking approach works well except in two bad situations. The first is when blobs associated with separate people are connected up due to their long shadows. Fortunately, this may be resolved to some extent by making use of color or texture, because shadows are devoid of these properties. The second and more serious issue is the congested situation where people partially occlude one another instead of being spatially isolated. The task of segmenting individual humans may then require tracking systems using multiple cameras.
3.2.3 Active-contour-based Tracking

The active-contour-based tracking method aims at directly extracting the shape of the subjects based on active contour models, also known as snakes. The idea is to have a representation of the bounding contour of the object and to keep updating it dynamically over time. For instance, Peterfreund [28] uses image gradient and optical flow measurements along the contour as the system measurement; his Kalman-filter-based active contour model tracks rigid and non-rigid moving targets, such as people, in spatio-velocity space in the presence of occlusions and image clutter. Clearly, as is the case with other active contour models, it requires a good initial fit – a well-separated contour for each moving object to be tracked. Nevertheless, it is usually less computationally complex than the region-based tracking approach.
3.2.4 Feature-based Tracking
The key idea in the last of the four commonly used motion tracking methods, feature-based tracking, is to track only sub-features instead of the object as a whole. Distinguishable sub-features such as points or lines on the object are selected because they usually remain visible even in the presence of partial occlusion; tracking then involves feature extraction and feature matching. Low-level features such as points are easier to extract than higher-level features like lines and blobs. Thus, there is always a trade-off between feature complexity and tracking efficiency. As a good example of point-feature tracking, Polana and Nelson [29] use the centroid of the rectangular box that bounds the human as the feature point for tracking. As long as the velocities of the centroids can be distinguished, tracking of two subjects is still successful even when occlusion occurs during tracking.
In a nutshell, depending upon whether information about the object's shape is employed in the motion analysis, tracking over time involves feature extraction and establishing feature correspondence between consecutive frames, using either the strategy of matching an a priori shape model or the estimation of features related to position, velocity, shape, texture, color, etc.
3.3 The Proposed Head Tracker
Since, in our case, the head is observed to be always above the torso while executing any of our ten pre-defined classes of activity, and its position and relative movement in the video sequences are very indicative of the activity performed, we choose the head as the sub-feature for tracking. To be exact, the approximated head centroids, represented by x-y coordinate pairs extracted from the image frames captured by Camera 2, are used to form the basis of our feature vectors for tracking, posture estimation and subsequent human activity recognition. The flow chart for the module HT is shown in Figure 3.1.
Figure 3.1: Flow chart of our proposed Head Tracker (HT) module.
3.3.1 Human Object Segmentation & Head Centroid Estimation
As alluded to earlier, human tracking algorithms usually have considerable intersection with motion segmentation during processing; our module HT is no exception. Background subtraction is used again to detect the moving human object. And as before, standard image processing techniques like thresholding, filtering and morphological operations are applied to obtain the human blob of interest, as shown in Figure 3.2(a) to (d).

To obtain the approximated head centroid in each frame, we located the tip of the blob in the frame and started to box-bound the blob until the width ratio of the current row (of pixels belonging to the human blob) to the previous row was smaller than low or larger than high. These two thresholds, low and high, were chosen empirically as 0.91 and 1.07 respectively. In cases where the neck portion was obvious in the image, the low threshold would automatically be the one triggered; otherwise the high threshold would be, which usually implied that the shoulder of the subject had been reached. So, in either case, only the region belonging to the head of the human blob would be bound, as illustrated in Figure 3.2(e). The x- and y-coordinates of the centroid of this bounding box were then used to approximate the centroid of the human head in that frame. Following this, the module goes on to fulfill its next two functions, location estimation and HAR feature extraction.
Figure 3.2: Human object segmentation and head centroid estimation. (a) Background model; (b) motion detection; (c) background subtraction; (d) standard image processing; (e) head centroid estimation.
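A minimal sketch of the head-bounding scan just described (the thesis thresholds 0.91 and 1.07 are used; the helper name and the centroid convention are our assumptions):

```python
import numpy as np

LOW, HIGH = 0.91, 1.07  # empirical row-width ratio thresholds

def head_centroid(blob):
    rows = np.flatnonzero(blob.any(axis=1))
    top, bottom, prev_width = rows[0], rows[0], None  # start at the blob tip
    for y in rows:
        xs = np.flatnonzero(blob[y])
        width = xs[-1] - xs[0] + 1
        if prev_width is not None and not (LOW <= width / prev_width <= HIGH):
            break                      # neck narrowing or shoulder widening
        bottom, prev_width = y, width
    ys, xs = blob[top:bottom + 1].nonzero()
    # Centroid of the box bounding the head region, cf. Figure 3.2(e).
    return (xs.min() + xs.max()) / 2, top + (ys.min() + ys.max()) / 2
```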
3.3.2 Locations Estimation
This function aims to keep track of the people's whereabouts in the room and to have their trajectories plotted onto a 2D floor plan automatically. Because all our cameras are stationary, there exist linear relationships between the pixels in all three image planes. By using simple geometrical analysis, as shown in Figure 3.3, and knowing the correspondence between the image planes and the floor plan, we can estimate the absolute location of each subject on the floor plan based on the intersection of the two x-planes (i.e. image columns) of the human head centroids in the images, as depicted in Figure 3.4.
Figure 3.3: Geometrical analysis using Cam 1 & 2 images to estimate the absolute locations.

Figure 3.4: Exploiting the correspondence between image planes and the floor plan to estimate the absolute location based on the intersection of two x-planes.
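The geometry reduces to intersecting two lines on the floor plan, one per camera, each determined by the column in which the head centroid appears. A minimal sketch, under the assumption that each camera's column-to-floor-line mapping has been established beforehand (the helper names are ours):

```python
import numpy as np

def floor_line(cam_position, column_direction):
    # Line on the floor plan traced by one image column (x-plane):
    # a point (the camera's plan-view position) plus a direction vector.
    return np.asarray(cam_position, float), np.asarray(column_direction, float)

def absolute_location(line1, line2):
    (p1, d1), (p2, d2) = line1, line2
    # Solve p1 + t1*d1 = p2 + t2*d2 for the 2D intersection point.
    t = np.linalg.solve(np.column_stack([d1, -d2]), p2 - p1)
    return p1 + t[0] * d1
```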
A Maintaining identities

When there are two persons in the room, their identities need to be maintained at all times. To achieve this, the mean Red and Green values of the different human blobs in the first frame of the input sequence from all cameras are calculated and saved for comparison and differentiation of the subjects appearing in the later frames, so as to preserve each individual's identity throughout the video sequence. The RGB scheme is chosen because our images are inherently RGB; thus, color conversion is not necessary. And because we found that the mean Blue values of different persons do not vary much, the Blue channel is discarded so as to expedite the whole identity resolution process. To further streamline the real-time algorithm, this and the following occlusion-handling steps are skipped if there is only one human blob detected in the first frame – one person is then assumed for the rest of the sequence.
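A sketch of this identity bookkeeping (the signature and matching helpers are our names; per the thesis, only the mean Red and Green values are kept, with Blue discarded):

```python
import numpy as np

def color_signature(frame, blob):
    pixels = frame[blob]                             # pixels under the blob mask
    return pixels[:, 0].mean(), pixels[:, 1].mean()  # mean R, mean G

def match_identity(sig, stored):
    # Assign the blob to the stored subject with the closest (R, G) signature.
    dists = [np.hypot(sig[0] - s[0], sig[1] - s[1]) for s in stored]
    return int(np.argmin(dists))
```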
B Handling occlusions
Because there are three cameras in our system and each provides different information (see Figure 3.5), occlusion can be easily handled by evaluating the absolute location of each subject using the geometrical analysis on any two cameras as before (default: Cam 1 & 2).

Figure 3.5: Different cameras offer different views that help to handle occlusions.

Figure 3.6: A plan-view of the paths taken by two subjects in the room.
3.3.3 HAR Features Extraction
To understand human postures for activity recognition from images, knowing the absolute positions of the subjects in plan-view is not enough; we also need height information. Applying the 3D people tracking method of Focken and Stiefelhagen [4] is certainly a solution, but we observed that extracting 2D features is sufficient to meet our primary objective of studying the feasibility of connectionist HAR classifiers. We also found that when a camera has a good view of most of the work space, and the human activity is carried out within that space, information from just that camera is enough to classify the activity quite satisfactorily. For the study of HAR, we therefore choose only the images from Camera 2, which sees most of the work space. The estimated head centroids in these images are used to estimate the postures and form the basis of our HAR feature vectors.
On average, we observed that every class of the ten selected human activities could be completed comfortably within 28 frames by all actors. Hence, the head centroid estimation (i.e. bounding of the head and extraction of the x- and y-coordinates) was repeated for 28 frames in every activity sequence to form a 28x2 matrix for every subject. Thus, if we let A be the activity matrix for each activity j performed by an individual i, we could write:

$$A_{i,j} = [\,\mathbf{y}\;\;\mathbf{x}\,]_{i,j},\qquad(3.1)$$

where i = 1, 2, …, 20; j = 1, 2, …, 10; and $\mathbf{y}$ and $\mathbf{x}$ are the 28-element column vectors of the y- and x-coordinates, respectively, of the approximated centroid of the head in all the 28 frames.
To make the system invariant to the build, height and position of the subjects in the images, we conditioned each $A_{i,j}$ by taking the difference in coordinates over consecutive frames. This gave a new 27x2 difference matrix, $D_{i,j}$, or two 27-element feature vectors for an activity j performed by an individual i, $\mathbf{Y}_{i,j}$ and $\mathbf{X}_{i,j}$, i.e.

$$D_{i,j} = [\,y_{k+1}-y_k \;\;\; x_{k+1}-x_k\,]_{i,j} = [\,\mathbf{Y}\;\;\mathbf{X}\,]_{i,j} = [\,\mathbf{Y}_{i,j}\;\;\mathbf{X}_{i,j}\,],\qquad(3.2)$$

where k = 1, 2, …, 27 indexes the consecutive frame pairs in an activity sequence.
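In code, this conditioning is a single differencing step along the time axis (a sketch; the function name is ours):

```python
import numpy as np

def condition_features(A):
    # A is the 28x2 activity matrix [y x] of eq. (3.1) for one sequence.
    D = np.diff(A, axis=0)   # 27x2 difference matrix of eq. (3.2)
    return D[:, 0], D[:, 1]  # the feature vectors Y and X
```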
Figure 3.7: The differences in coordinates over consecutive frames are extracted as the two features for HAR.
As illustrated in Figure 3.7, the x- and y-coordinates of the estimated head centroid indirectly imply the posture that the subject has adopted in a scene. Thus, by tracking the head of a person walking across the room through a sequence of frames and taking the difference in coordinates of the approximated head centroid (i.e. the centroid of the bounding box, represented by the dotted cross hair) over consecutive frames, the results can be extracted to form the feature vectors $\mathbf{Y}_{i,j}$ and $\mathbf{X}_{i,j}$ for recognition of a walking activity in lateral view.
3.4 Database from Feature Vectors Extraction
For training and evaluation of the HAR classifiers, we built a database of 200 human activity sequences – ten different activities performed by 20 subjects, ten from each gender. Subjects are of various heights, builds and races. Refer to Figure 3.8 for some snapshots of the recorded sequences for three of the ten activities.

Figure 3.8: Snapshots of three activity sequences performed by our subjects of different gender, race and physique.

From these 20 subjects and ten classes of activity sequences per subject, we extracted 200 of the 27-element feature vectors, $\mathbf{Y}_{i,j}$ and $\mathbf{X}_{i,j}$, using the steps explained in the previous sub-section. Hence, we could form two feature matrices $D_Y$ and $D_X$ for all our training and testing purposes, where
$$D_Y = [\,\mathbf{Y}_{i,j}\,],\;\forall\,i,j \qquad\text{and}\qquad D_X = [\,\mathbf{X}_{i,j}\,],\;\forall\,i,j.\qquad(3.3)$$

Whenever it is more efficient to represent both features as only one feature vector (e.g. in the NNR and EN cases), the two feature matrices, $D_Y$ and $D_X$, are concatenated and the new feature matrix is denoted as

$$D = \begin{bmatrix} D_Y \\ D_X \end{bmatrix},\qquad(3.4)$$

which is of dimension 54 rows by 200 columns.
3.5 Method of Classification Performance Estimation
In order to accurately evaluate the performance of the proposed HAR classifiers (as well as that of the traditional classifiers), the training samples must not be used to test the classifiers. These feature matrices must be divided into independent training and test datasets. There are many ways to achieve this, but for our relatively small sample size, k-fold Cross-Validation (CV) [30] was deemed to be the most suitable and efficient. In this study, k = 4 is conveniently chosen for all experiments.
Basically, this four-fold CV scheme divides the available data, which can be either the extracted vectors database or the output from a previous stage in the hybrid case, into four mutually exclusive subsets (or 'folds') of equal size, consisting of 50 samples each (i.e. five samples from each class of activity). For each of the four CV runs, the classifier is trained on three of the subsets and tested on the remaining one. Hence, instead of reserving a portion of the available data solely for testing purposes, the rotation of the role of test data among the subsets enables the use of every sample for training. This greatly improves the efficiency of the limited dataset and, at the same time, gives a more accurate estimation of the classifier's performance. The CV estimate of accuracy is obtained from the overall number of correct classifications from all four runs, divided by the total number of instances in the dataset, which is 200 in this study.
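A minimal sketch of this four-fold CV loop (the train/test callables are our placeholders, and we assume the data are ordered so that each fold is stratified with five samples per class):

```python
import numpy as np

def four_fold_cv(features, labels, train_fn, test_fn, k=4):
    n = len(labels)                          # 200 sequences in this study
    folds = np.array_split(np.arange(n), k)  # four folds of 50 samples
    correct = 0
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(features[train], labels[train])
        # test_fn returns the number of correct classifications on the fold.
        correct += test_fn(model, features[folds[i]], labels[folds[i]])
    return correct / n                       # CV estimate of accuracy
```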
3.6 Summary

In this chapter, the formulation of the Head Tracker (HT) module has been presented; in the process, the HT algorithm has been demonstrated to be fast and accurate for real-time use.
The main focus of the module, however, is to extract the two features from the estimated head centroids for the next module, the Activity Classifier (AC). The tracking yielded two important elements for posture estimation – the x- and y-coordinates; their differences in coordinates over consecutive frames from Camera 2 were used to form our feature vectors and to construct our database for the training and testing of the module AC.
The chapter ended with some discussion of how this database can be used efficiently for more accurate performance evaluation of the AC. In the following two chapters, we will see how the module AC can be implemented using either the traditional NNR and HMM classifiers or the proposed NN-based approaches.
Chapter 4 Traditional Activity Classifiers

The NNR classifies a test sample by comparing it, in terms of some distance, with examples in the training set. If we define $T = \{T_1, T_2, \ldots, T_R\}$ to be a set of R labeled training prototypes and let $D^* \in T$ denote the prototype nearest to a test sample $D$ in the feature space, then, in a J-class problem, the NNR algorithm assigns $D$ the same class label $\omega_j$ as $D^*$, where $j \in \{1, 2, \ldots, J\}$. Using the notation in equation (3.2) from section
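A minimal sketch of this decision rule (the 54-dimensional concatenated feature vectors of eq. (3.4) are assumed as input; the function name is ours):

```python
import numpy as np

def nnr_classify(D, prototypes, proto_labels):
    # Assign D the label of its nearest prototype D* (Euclidean distance).
    dists = np.linalg.norm(prototypes - D, axis=1)
    return proto_labels[int(np.argmin(dists))]
```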