National Chung Cheng University
Department of Electrical Engineering
ADVANCED MACHINE PERCEPTION MODEL FOR
ACTIVITY-AWARE APPLICATION
Student: Manh-Hung Ha
Advisor: Professor Oscal Tzu Chiang Chen
June 2021
Acknowledgments
It would have been impossible to work on a thesis without the support of many people, who all have my deep and sincere gratitude. First of all, I would like to thank Professor Oscal T.-C. Chen, my academic supervisor, for having taken me on as a PhD student and having trusted me for years. I could not find any better advisor than Professor Chen, who gave me ideas, advice, and motivation, and, above all, let me pursue my thoughts freely. He was an excellent theoretical point of reference and critical to testing my theories. He continues to impress me with his systematic style, his compassion, and his humility, and, above all, he pushes me to become a better researcher as well as a better person. Thank you for introducing me to the world of computer vision and research and for taking me on as a young researcher in my formative years.
Thank you to my thesis committee, Prof. Wei-Yang Lin, Prof. Oscal T.-C. Chen, Prof. Wen-Nung Lie, Prof. Sung-Nien Yu, and Prof. Rachel Chiang, for their time, kind advice, and feedback on my thesis. You have taught me a lot and guided me in improving the quality of my research.
I was incredibly happy to have amazing mentors at CCU over the years. I am grateful to Professor Wen-Nung Lie and Professor Gerald Rau, who helped me learn the ropes in the first few years. I am thankful to Professor Alan Liu, Professor Sung-Nien Yu, Professor Norbert Michael Mayer, and Professor Rachel Chiang for taking me on in their academic courses, helping me to expand my field of study, and teaching me to be a systematic experimentalist.
I would like to express my gratitude to all of my friends and colleagues in the Department of Electrical Engineering. There were great staff members and researchers in my field who taught me much: Hoang Tran, Hung Nguyen, Luan Tran, and many others; I thank them for working with me, and I hope that there are even more opportunities to learn from them in the future. I would like to thank my many laboratory partners in the VLSI DSP group, including Wei-Chih Lai, Ching-Han Tsai, Yi-Lun Lee, Yen-Cheng Lu, Han-Wen Liu, Yu-Ting Liu, and others. I learned many things from each of them, and their support and guidance helped me overcome some stressful times. Finally, thank you for sharing your thoughts, documentation, datasets, and code. None of this work would have been conceivable without my many colleagues working in computer vision and machine learning, and I also want to thank my many amazing co-authors.
I was also incredibly lucky to have made many friends during my time at CCU. Thanks also to the badminton team of the EE department and the VSACCU badminton team, who showed me all the awesome activities around the PhD journey!
Finally, nothing would have been possible without the valuable support, encouragement, love, and patience of my family, especially my parents, Ha Sinh Hai and Tran Thi Theu. From my first years as a student, they prepared me for this by showing me the importance of hard work, critical thinking, and patience. I thank them for their support, trust, and innumerable sacrifices; I have worked on this study thousands of kilometers from home and missed many significant life events. And speaking of love and encouragement, I am grateful to my wife, Nguyen Hong Quyen, and my daughter, Ha Hai Ngan, for all our many wonderful years, weekends, and vacations, and above all for staying by my side, even when a long distance was between us. Thank you for making me who I am today and for having always been there and trusted me.
Abstract
People are among the most important entities that computer vision systems need to understand in order to be useful and omnipresent in various applications. Much of this awareness is based on the recognition of human activities for homecare systems that observe and support elderly people. Human beings do this well on their own: we look at others and describe each action in detail. Moreover, we can reason about those actions over time and even predict possible actions in the future. Computer vision algorithms, on the other hand, remain well behind this capability. In this study, my research aim is to create learning models that can automatically induce representations of human activities, especially their structure and feature meaning, in order to solve several higher-level action tasks and approach a context-aware engine for various action-recognition applications.
In this dissertation, we explore techniques to improve human action understanding from video inputs, which are common in daily settings such as surveillance, traffic, education, movies, and sports, on challenging large-scale benchmark datasets and our own panoramic video dataset. This dissertation targets the action recognition and action detection of humans in videos. The most important insight is that actions depend on global features parameterized by the scene, objects, and other subjects, apart from their own local features parameterized by body-pose characteristics. Additionally, modeling the temporal features by optical flow from the motions of people and objects in the scene can further help in recognizing human actions. These dependencies are exploited in five key parts: (1) detecting moving subjects using the background subtraction scheme and tracking the extracted subjects using the Kalman filter; (2) building a recognition system with a lightweight model capable of learning on a portable device; (3) using capsule networks and skeleton-based map generation to attend to the subjects and build their correlation and attention context; (4) exploring an integrated action recognition model based on correlations and attention of subjects and scene; and (5) developing systems based on the refined highway aggregating model.
In summary, this dissertation presents several novel and significant solutions for efficient DNN architecture analysis, acquisition, and distribution on large-scale video data. We show that DNNs using multiple streams, combined models, hybrid structures on conditional context, feature input representations, global features, local features, spatiotemporal attention, and the modified-belief CapsNet efficiently achieve high-quality results. The consistent improvements from these components of our DNNs lead to state-of-the-art results on popularly used datasets. Furthermore, we observe that the largest improvements are indeed achieved in action classes involving human-to-human and human-to-object interactions, and visualizations of our networks show that they focus on scene context that is intuitively relevant to action recognition.
Keywords: attention mechanism, activity recognition, action detection, deep
neural network, convolutional neural network, recurrent neural network, capsule
network, spatiotemporal attention, skeleton
TABLE OF CONTENTS
PAGES
ACKNOWLEDGMENTS i
ABSTRACT iii
TABLE OF CONTENTS v
LIST OF FIGURES viii
LIST OF TABLES xi
I INTRODUCTION 1
II MULTI-MODAL MACHINE LEARNING APPROACHES LOCALLY ON SINGLE OR MULTIPLE SUBJECTS FOR HOMECARE 8
2.1 Introduction……… 9
2.2 Technical Approach 11
2.2.1 Handcraft feature extraction by locally body subject estimation 12
2.2.2 Proposed action recognition on single subject 17
2.2.3 Proposed action recognition on multiple subjects 20
2.3 Experiment Results and Discussion 27
2.3.1 Effectiveness of our proposal to single subject on action recognition 28
2.3.2 Effectiveness of our proposal to multiple subjects on action recognition 31
2.4 Summary and Discussion 35
III ACTION RECOGNITION USING A LIGHTWEIGHT MODEL 38
3.1 Introduction 37
3.2 Related Work 40
3.3 Action recognition by a lightweight model 41
3.5 Summary and Discussion 49
IV ATTENTIVE RECOGNITION LOCALLY, GLOBALLY, TEMPORALLY, USING DNN AND CAPSNET 50
4.1 Introduction 51
4.2 Related Previous Work 55
4.2.1 Diverse Spatio-Temporal Feature Generation 55
4.2.2 Capsule Neural Network 59
4.3 Proposed DNNs for Action Recognition 59
4.3.1 Proposed Generic DNN with Spatiotemporal Attentions 59
4.3.2 Proposed CapsNet-Based DNNs 68
4.4 Experiments, Comparisons of Proposed DNN 72
4.4.1 Datasets and Parameter Setup for Simulations 72
4.4.2 Analyses and Comparisons of Experimental Results 74
4.4.3 Analyses of Computational Time, and Cost 84
4.4.4 Visualization 85
4.5 Summary and Discussion 88
V ACTION RECOGNITION ENHANCED BY CORRELATIONS AND ATTENTION OF SUBJECTS AND SCENE 89
5.1 Introduction 89
5.2 Related work 91
5.3 Proposed DNN 92
5.3.1 Projection of SBB to ERB in the Feature Domain 92
5.3.2 Map Convolutional Fused-Depth Layer 93
5.3.3 Attention Mechanisms in SA and TE Layers 93
5.3.4 Generation of Subject Feature Maps 95
5.4 Experiments and Discussion 97
5.4.1 Datasets, Parameter Setup, and Implementation Details 97
5.4.2 Analyses and Comparisons of Experimental Results 98
5.5 Summary and Discussion 100
VI SPATIO-TEMPORALLY WITH AND WITHOUT LOCALIZATION ON MULTIPLE LABELS FOR ACTION PERCEPTION, USING VIDEO CONTEXT 101
6.1 Introduction 102
6.2 Related work 107
6.2.1 Action Recognition with DNNs 107
6.2.2 Attention Mechanisms 108
6.2.3 Bounding Boxes Detector for Action Detection 109
6.3 Proposed Methodology 110
6.3.1 Action Refined-Highway Network 112
6.3.2 Action Detection 118
6.3.3 End-to-End Network Architecture on Action Detection 123
6.4 Experimental Results and Discussion 123
6.4.1 Datasets 123
6.4.2 Implementation Details 125
6.4.3 Ablation Studies 127
6.5 Summary and Discussion 133
VII CONCLUSION AND FUTURE WORK 135
REFERENCES 139
APPENDIX A 152
LIST OF FIGURES
2.1 Schematic diagram of height estimation 13
2.2 Distance estimation at the situation (I) 14
2.3 Estimated Distance at the situation (II) 15
2.4 Distance curve pattern of the measure of the standing subject 16
2.5 Proposal flow chart of our action recognition system 18
2.6 Flowchart detection of TV on/off 19
2.7 Proposed activity recognition system 21
2.8 Example of shape BLOBs generation from foreground 22
2.9 Illustration of tracking by the Kalman filter method 23
2.10 Proposed FSM 24
2.11 Estimates of activity states in the overlapping interval 26
2.12 Proposed incremental majority voting 26
2.13 Room layout and experiment scenario 27
2.14 Examples of five activities recorded from the panoramic camera 28
2.15 Total accuracy rate (A) versus p and r 32
2.16 Total accuracy (A) versus p at r=0.001, 0.01, 0.05, 0.1, and 0.2 32
3.1 Proposed recognition system 41
3.2 Functional blocks of the proposed MoBiG 42
3.3 Proposed finite state machine 43
3.4 Proposed incremental majority voting 44
3.5 Confusion matrix of MoBiG identifying four activities 48
4.1 CapsNets integrated in a generic DNN 55
4.2 Block diagram of the proposed generic DNN 58
4.3 Three skeleton channels 61
4.4 One example of the transformed skeleton maps from an input segment 63
4.5 Block diagrams of the proposed AJA and AJM 64
4.6 Block diagram of the proposed A_RNN 67
4.7 Proposed capsule network for TC_DNN and MC_DNN 69
4.8 Block diagrams of the proposed CapsNet-based DNNs 71
4.9 Examples of the panoramic videos about 12 actions where subjects are marked by red rectangular dash-line boxes for observation only 74
4.10 Visualization of the outputs from the intermediate layers of the proposed TC_DNN 87
4.11 Visualization of the outputs from the intermediate layers of two A_RNNs 87
5.1 Block diagram of the proposed deep neural network 92
5.2 Block diagram of the SA generation layer 95
5.3 Comparison of the performance of the AFS and ROS streams for each action class 98
5.4 JHMDB21 confusion matrix 99
6.1 Refined highway block for 3D attention 104
6.2 Overview of the proposed architecture for action recognition and detection 110
6.3 Temporal bilinear inception module 113
6.4 RH block in RNN structures, like the standard RNN, LSTM, GRU and variant RNNs 114
6.5 Schematic recurrent 3D Refined-Highway depth by three RH block 116
6.6 3DConvDM layer correlating the feature map X 118
6.7 The details of GAA module 122
6.8 Per-category mAP on AVA 132
6.12 Qualitative results of R_FSRH on action recognition and detection on the JHMDB21 dataset 140
6.13 Qualitative results of top predictions for some of the classes using proposed
model on Ava 141
A.1 12 category visualization of panoramic camera data in two and three
dimensions with the scatter plot filter 151
A.2 T-SNE test data visualization where each of data points represented by the
shots of a frame sequence on UCF101 151
LIST OF TABLES
1.1 List of published and submitted papers in this dissertation 7
2.1 Features used for posture recognition 18
2.2 States and transitions of our FSM 24
2.3 State estimates at the overlapping interval 25
2.4 Confusion matrix of TV on/off detection 28
2.5 Confusion matrix of activity recognition (I) 29
2.6 Comparison of features and performance of the proposal and conventional activity recognition 30
2.7 Average accuracies of four activities at the type-I experiment 33
2.8 Confusion matrix of activity recognition (II) 34
2.9 Example I of activity accuracies of two subjects at the type-II experiment 34
2.10 Example II of activity accuracies of two subjects at the type-II experiment 34
2.11 Example III of activity accuracies of three subjects at the type-II experiment 35
3.1 Features of original mobilenetV2 and MOBIG 46
3.2 Performance comparison of pre-trained mobilenetV2 and MOBIG using the panoramic camera dataset 47
3.3 Accuracies improved by MOBIG plus FSM and IMV 48
3.4 Accuracy, complexity, and model size of the proposed system, and the other DNNs 49
4.1 Performance of three types of DNNs using the datasets of UCF101, HMDB51, and panoramic videos 75
4.2 Accuracies of 12 actions recognized by three types of DNNs using the panoramic video dataset
4.4 Performance of the proposed generic DNN and three CapsNet-based DNNs
with one, two and three input streams 79
4.5 Average accuracies of the proposed TC_DNN at four merging models and
different values of F 81
4.6 Performance comparisons of TC_DNN using different approaches of
generating Tske maps 82
4.7 Performance comparisons of the proposed and conventional DNNs using only an RGB stream 82
4.8 Performance comparisons of the proposed and conventional DNNs using the HMDB51 and UCF101 datasets 83
4.9 Parameter amounts and inference time of the generic DNN, MC_DNN,
DC_DNN, and TC_DNN 84
5.1 Accuracies of 8 actions recognized by three types 99
6.1 Comparisons of DNNs with the R_FSRH layers at different numbers of RH
modules on TP, JHMDB-21 and UCF101-24 datasets 128
6.2 Results of the DNN based on I3D+ R_FSRH using different scalar values, γ,
A.1 Comparison of the features and performance of our proposed and generic
activity recognition systems on panoramic video dataset 152
I INTRODUCTION
The main goal of visual understanding, image classification, computer vision, and artificial intelligence is to help people do their work more efficiently. From support requests to assistive services, the potential impact of vision and AI on an aging society is immeasurable, and it has grown at an unprecedented rate in recent years. While these applications are still an active field of study, note that they share a common theme: they all require systems to interact with and understand humans. Hence, developing technologies capable of understanding people is critical to achieving the goal of ubiquitous AI. Human understanding, however, cannot be done in isolation by just observing the person, because we are influenced by the objects we interact with as well as by the environment we exist in.
The development of deep learning has led to rapid improvements in various fundamental vision problems. Given large labeled datasets, CNNs are able to learn robust and efficient representations that exceed human performance on video classification and perform exceedingly well in action recognition, object detection, and keypoint estimation tasks. But what about tasks that do not have well-defined datasets or are much harder to label, such as the 3D structures of objects or all actions afforded by a scene? Moreover, scaling models up toward a higher-level understanding of human intent and actions over a continuous video stream remains a limitation.
This thesis takes a step toward building and improving systems capable of understanding human actions and intentions from both benchmark large-scale datasets and our own dataset. These systems need to reason about the scene layout in unison with the humans to perform well. The thesis explores five directions: (i) handcrafted features with traditional machine learning for homecare; (ii) a computation-affordable lightweight model for smart devices; (iii) local cues (poses), temporal cues (optical flow), and appearance cues (RGB) combined with capsule networks as the classifier; (iv) an incremental action recognition model using correlations and attention of subjects and scenes; and (v) a refined highway aggregating global and local context for action recognition with and without localization.
Homecare action recognition with single or multiple subjects, using handcrafted features and traditional machine learning: The homecare of elderly people has become an important issue that requires activity recognition of multiple subjects at home. We start by understanding humans via focusing directly on them and, especially, their body poses. Features from body pose, typically obtained using a background subtraction and estimation method, provide a useful signal of a person's external action state. As handcrafted features, extracting just a few body keypoints over time can be enough to recognize human actions. Toward that goal, we build systems to detect and track these features efficiently and accurately in videos. We then use those features to directly classify and recognize the actions of single and multiple subjects. We show that the new model using a panoramic camera can enable several novel applications in home-care systems.
Computation-affordable recognition system with a lightweight model: Activity recognition systems are widely used and experiencing rapid growth in different areas, including healthcare and security, as well as entertainment. Currently, edge devices are very resource limited, which does not allow for heavy computation. Previous work has successfully developed various DNN models for action recognition, but these models are computationally intensive and therefore inefficient to deploy on edge devices. One of the most promising solutions today is cloud storage, to which data can be transferred by several methods for further analyses. However, continuously sending signals to the cloud demands increasing bandwidth. Routine activities can instead be identified at the edge before being sent to the cloud, improving response time and lowering cost. Accordingly, we propose a lightweight DNN model for action recognition that requires less processing power, making it suitable for use in portable devices. The evaluation was conducted on five daily activities from videos recorded by a panoramic camera. As a result, the proposed model performs better than current techniques.
Human action recognition using capsule networks and skeleton-based attention on the subjects and their correlation attention context: The topic of activity recognition has received intensive study in recent years. Our DNN devises a spatial-temporal skeleton-based attention mechanism and incorporates multiple CapsNet mechanisms to better understand human activity in videos with rich spatial and temporal context. The primary schemes used in activity recognition are presented. First, the popularly used DNN for activity recognition on the two inputs of RGB and optical flow is briefly reviewed. Second, we propose a joint-point-based spatial attention mechanism to weight the skeleton for action recognition. Finally, how to increase the inference classifiers in capsule networks is illustrated.
Incremental action recognition model using correlations and attention of subjects and scenes: To that end, we develop systems for efficient and accurate video recognition, together with an analysis of the current state of human action understanding, which consists of three stages. At the first stage, we developed an adaptive YOLOv4 object detector for detecting humans in videos and forming a temporal stream based on human features. Second, the human regions are projected onto the corresponding feature maps from which a 3DCNN extracts the structural and semantic information of subjects and objects individually. The attention mechanisms in the spatial attention and temporal encoder layers are implemented to find meaningful action information in the spatial and temporal domains to enhance recognition performance.
Human action recognition on multiple labels, with or without localization, using a refined highway aggregating global and local contextual information over time: Finally, we explore the applicability of global and local contextual temporal reasoning for recognizing actions. Having focused on people, we now zoom out and consider people in their context. Scenes, objects, and other people in a scene all strongly condition what a human does. In fact, humans perform actions either to change their own state or to change the state of their environment, often interacting with or responding to other humans in the scene. In both cases, and especially the latter, scene and object affordances directly define what actions a human will perform. We propose the Refined Highway (RH) model and Global Atomic Spatial Attention (GAA) to investigate the role of residual and highway connections in DNNs for enhancing activity recognition. Residual and skip connections are very similar in nature to action behavior and play a significant role in current generative models, which are among the most commonly employed recognition enhancement approaches. RH, on the other hand, can be thought of as an attention-driven mash-up of transformed and carried features. The proposed unified DNN is able to localize and recognize actions based on a 3DCNN. The video is first sliced into equal segments, and region features are then created using a Region Proposal Network (RPN). For this, we extend the 3DCNN with the R_FSRH framework and GAA module to handle RoI localization as an action recognition problem. First, the regions of interest are detected on each frame, and then their respective maps are converted into bounding boxes. The recognition system can also be used for the general problem of human recognition. Compared to the state of the art, our experiments show superior results on large-scale datasets.
In this thesis, we verified our proposed DNNs on multiple public large-scale datasets and our panoramic video dataset.
Dataset collection for homecare assistance targets the recognition of actions using a portable panoramic camera that captures four 720 x 480-pixel images to produce 360-degree-view stitched images. Input data are collected and labeled with annotated ground truth and detailed evaluation methods, which makes the methods comparable with each other. It is also possible to use existing datasets, and this is what has been done in these experiments. Typically, a dataset is divided into three parts: training, validation, and testing. In this thesis, we name the dataset recorded by the panoramic camera PanoV; it is used in Sections 2, 3, and 4.
The absence of standards and benchmarks makes comparing different deep learning models harder. In recent years, UCF101 and HMDB51 have been the most popular large-scale datasets, and scientific research is usually done on one or both of them, but not without problems. Looking at the number of generated frames, UCF101 seems comparable in magnitude to ImageNet, but it lacks diversity, so generalization can be difficult. HMDB51 suffers from the same issues and has a greater number of frames than the average. The Traffic Police (TP) dataset, which contains videos of Chinese traffic police with 8 kinds of commanding poses analyzed from visual information, was also used for simulations. To solve the action recognition problems in this study, we used three datasets to evaluate the proposed models in Sections 4, 5, and 6.
Precise person boxes are inferred in JHMDB-21 from human silhouettes, along with the ground truths of the labelled actions. The UCF101-24 and JHMDB-21 annotations are accompanied by spatiotemporal ground truths containing the corresponding spatial and temporal annotations. Additionally, each video can contain multiple action instances of the same class label but with different spatial and temporal boundaries. AVA and Charades also feature scenes with multiple actors. Each video clip is untrimmed and contains multiple action labels inside overlapping temporal durations. Both of these phenomena are embedded in the two datasets, which are reasonable and challenging, since they provide a variety of contextual information that aids action detection. We conducted detailed ablation studies in this work using the AVA v2.1 and Charades datasets.
Recently, several techniques using deep learning have emerged for different modalities. Many computer vision benchmarks show that the disparity between advanced algorithms and human performance is growing. These networks learn hierarchies of features; as the depth increases, these hierarchies can describe increasingly abstract concepts. These advances motivated us to study the application of such training methods to the recognition and detection of actions. This thesis is based on a preliminary study and five articles, which contribute to the field of action recognition as shown in Table 1.1.
The relevant publications for each section are as follows:
Section 2: Efficient handcrafted feature estimation for single human classification [1]; effective handcrafted feature estimation for multiple-subject action recognition [2].
Section 3: Lightweight model for action recognition [3].
Section 4: Attentional capsule network for action recognition [4].
Section 5: Enhanced model by correlations and attention of subject and scene [5].
Section 6: Action recognition and detection using the RH network [6].
TABLE 1.1 LIST OF PUBLISHED AND SUBMITTED PAPERS IN THIS DISSERTATION

[1] Activity Recognition Using a Panoramic Camera for Homecare. IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017.
[2] Activity Recognition of Multiple Subjects for Homecare. International Conference on Knowledge and Smart Technology (KST), 2018.
[3] Computation-Affordable Recognition System for Activity Identification Using a Smart Phone at Home. IEEE International Symposium on Circuits and Systems (ISCAS), 2020.
[4] Deep Neural Networks Using Capsule Networks and Skeleton-Based Attention for Action Recognition. IEEE Access, 2020.
[5] M.-H. Ha and Oscal T.-C. Chen, "Action Recognition Improved by Correlations and Attention of Subjects and Scene."
[6] Deep Neural Network Using Fast Slow Refined Highway for Human Action Detection and Classification. IEEE Transactions on Multimedia, 2021 (in preparation for submission).
II MULTI-MODAL APPROACHES LOCALLY ON SINGLE AND MULTIPLE SUBJECTS FOR HOMECARE
To understand human action recognition, we begin in this section by considering homecare-oriented human activities, specifically applying handcrafted features and traditional machine learning to a single subject captured by a Portable Panoramic Video (PPV) camera in the living room. The variables related to the subject height, the distances between the centroid and the subject contour, and the distance from the camera to the subject are extracted for motion analysis. As home care for the elderly becomes an important issue that necessitates recognizing the activities of multiple subjects, we next use a PPV camera indoors to capture and process daily actions that are then analyzed using background subtraction. Each identified single subject is described by six parameters for classification to obtain an initial activity estimate for each image. Specifically, when the subjects overlap, the previous activity state associated with each subject is maintained; the activity of each subject is not updated until the subjects are separated. In addition, the state of an overlapping period is fairly inferred using the states of the boundary image segments neighboring the start and end of the overlapping interval.
Similarly to the single-subject case, for the daily activity recognition of multiple subjects we adopt a PPV camera located indoors to capture daily subject actions by subtracting the background, generating Binary Large Objects (BLOBs), using the Kalman filter for tracking, producing local handcrafted features, applying a Gaussian Mixture Model (GMM) for classification, and adopting incremental majority voting and a Finite State Machine (FSM). Based on the features extracted for a single subject, six kinds of parameters fed to the GMM model provide the initial activity estimate per subject in the framework. A majority voting approach is applied to a 10-frame segment of input pictures, and the finite state machine is then used to further validate the identified action and rule out infeasible states. Specifically, when subjects overlap, the previous activity state of each subject is maintained; the activity of each subject is not updated until the subjects are separated. In addition, the states of the segments neighboring the boundary determine the states of the overlapping interval, and incremental majority voting, established and illustrated by a probability model, is used to increase the performance of action recognition. The simulation results show that the average accuracy over four categories can reach 93.0% when there are multiple subjects in the living room. Compared to existing work, our system demonstrates comparable outcomes for home-care applications.
2.1 Introduction
In recent years, home-care applications have received increasing attention. With the development of medical technology and an aging society, life expectancy and the elderly population have clearly increased. As a result, there is a pressing need for home care for the elderly. Additionally, living alone has become more frequently reported by elderly people today. In previous work, most non-invasive or discreet detection methods used a camera to capture moving objects and then perform activity detection.
To understand human action recognition, many spatiotemporal recognition schemes have recently been proposed [6-12]. In [6], a point-selection scheme was introduced; systems for low-resolution monitoring and 3D depth representation were used to perceive human action [9], [10]. Chen et al. [13] used the Gaussian Mixture Model (GMM) to classify the three actions of falling, standing, and sitting via a panoramic camera. However, most people's actions in daily life are complicated, so we must consider how humans live in their broader surroundings in a complex environment. In addition, there are completely different ways to describe an activity. Besides, it is also important to be aware of the activities of multiple subjects to realize home care with a camera. This recognition task can be summarized as a mix of feature generation, tracking, subject extraction, person monitoring, machine learning, and activity detection. To extract features effectively, many researchers have used a variety of strategies to track multiple subjects [14-17]. In addition, activity identification [18], behavior analysis [19], and pose estimation [1], [13] have been investigated. As a result, these techniques can be used in home care to understand multiple subjects' daily behaviors and to pre-diagnose activity symptoms.
This section focuses on the issue of recognizing human behavior in panoramic video, where the characteristics of each person are evaluated in the time domain. There are numerous difficulties, such as changes in pose, occlusions, multiple overlaps, and specific behavioral categories. Capturing multiple people with one camera raises many problems that must be solved, such as counting the number of people, handling interactions among people, and tracking people. We estimate the number of people by counting the number of Binary Large Objects (BLOBs) [19] and determine subject paths with the Kalman filter across adjacent images [20]. Subject interaction here means the overlapping of subjects in the frame, whose state must be precisely identified, because only one blended appearance can be extracted from overlapping subjects for recognition, leading to misjudgments. Moreover, illumination variation, scene changes, moving unrelated objects, and other background effects can also yield wrong subjects. A key problem is determining the actions of overlapping subjects against a complex, time-varying background. In this task, we use a portable panoramic camera to capture a 360-degree-view image. Background subtraction on the recorded images is used to detect moving objects and to identify BLOBs. The Kalman filter tracks the centroid points of these BLOBs, associated with subjects in a video clip, to remove noise and understand subject interactions. When a single subject is fully extracted, the six features related to that subject are formed and fed to the Gaussian Mixture Model (GMM) for activity classification. To avoid misjudgments from single or few images, action recognition is carried out for each subject over a 10-image segment. In addition, we apply a finite state machine to constrain the state transitions among the falling, sitting, walking, and standing actions precisely. When subjects overlap, there is only one extracted BLOB matching the group of subjects; after separating the overlapped subjects, the activity of each subject is recognized and evaluated through the proposed FSM. The boundary segments at the start and end of the overlap interval are the most important for distinguishing the expanding and stationary states, while the transition states mostly depend on the amount of period changes.
To achieve higher performance, majority voting is used to determine the state of the boundary image segment, for which a probabilistic model is constructed. As a result of the simulations, the proposed system shows good performance, reaching 93.0% at the video level with multiple subjects at home.
2.2 Technical Approach
In this study, we describe in detail how to extract the features by the handcrafted approach.
2.2.1 Handcrafted Feature Extraction by Local Body Subject Estimation
Based on the background image, the moving subjects and objects are discovered. Additionally, erosion and dilation operations are applied to minimize the noise and obtain connected objects. In particular, the shadows of moving objects are eliminated to some extent by an adequate threshold of the GMM during generation of the background image. After obtaining a moving subject, the system checks whether the head and feet of the subject are located within the upper and lower boundaries of the captured picture. If so, the height estimate can refer to the top and bottom boundaries of the vertical pixels in the subject area, and the triangular measurement method estimates the distance between the subject and the camera. Otherwise, the proposed system only estimates the distance from the variations of the body width as the subject moves. To recognize posture, the outline feature selection scheme computes the distances between the centroid point and the contour points of a moving subject. Here, 36 representative contour points are adopted, where neighboring contour points relative to the centroid point have a 10-degree difference. These heights, distances, and outline features of moving subjects are used to construct a model for action recognition.
Detection of moving areas: We apply the background subtraction method to capture the moving areas of a panoramic video clip and remove shadows effectively. Here, the parameters of a GMM model are updated by pixel-level recursive equations for each pixel. In this general background subtraction scheme, one begins by constructing Gaussian distributions, evaluating the background models, and then adjusting the background models to identify foreground instances. Five Gaussian distributions associated with the first 100 frames produce a histogram for each pixel. Furthermore, to ensure both the foreground and background remain identifiable, a background model is built. The environmental context is tracked by the updated background model against gradual illumination changes, in which the learning weight is set to 0.001 and the background weight threshold is 0.7. Then we use erosion and morphological dilation procedures to remove noise, followed by morphological closing to connect multiple small related areas into larger ones. Each resulting large area is considered to be a moving subject.
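As an illustrative sketch (the implementation language of the original system is not specified), the moving-area detection above maps onto OpenCV's MOG2 background subtractor in Python. The learning rate of 0.001, the background ratio of 0.7, and the five Gaussians per pixel come from the text; the morphological kernel size and the input file name are assumptions.

```python
import cv2

# GMM background model: five Gaussians per pixel, shadow detection enabled.
subtractor = cv2.createBackgroundSubtractorMOG2(history=100, detectShadows=True)
subtractor.setNMixtures(5)           # five Gaussian distributions per pixel
subtractor.setBackgroundRatio(0.7)   # background weight threshold from the text

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))  # assumed kernel size

def extract_moving_area(frame):
    """Return a denoised foreground mask for one panoramic frame."""
    fg = subtractor.apply(frame, learningRate=0.001)    # learning weight 0.001
    fg[fg == 127] = 0                                   # discard pixels flagged as shadow
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)   # erosion then dilation removes noise
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)  # merge small neighbouring regions
    return fg

cap = cv2.VideoCapture("panoramic_clip.mp4")            # hypothetical input clip
ok, frame = cap.read()
while ok:
    mask = extract_moving_area(frame)
    ok, frame = cap.read()
```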
Height estimation: An estimated height is important for improved performance in this research. As shown in Fig. 2.1, regardless of the depth of field, the middle row of the captured picture corresponds to the same real-world height, which allows us to determine the subject height. The height estimation scheme determines a moving subject whose head and legs are within the image and then calculates the height from the number of vertical pixels, h, in the person's contour, the number of vertical pixels between the middle row and the bottom of the person's contour, z, and the mounting height of the proposed panoramic camera platform, H, as follows:
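The expression itself is not reproduced in the extracted text; a plausible reconstruction from the definitions of h, z, and H above, given as an assumption rather than the verified original, is

\[ P = H \times \frac{h}{z}. \tag{2.1} \]

Here the z pixels between the middle row and the feet correspond to the real camera height H, so scaling by h/z converts the subject's pixel height into a real height.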
Fig 2.1 Schematic sketch of the geometry involved in height estimation
With Equation (2.1), the estimated height of a subject, P, can be obtained whenever the subject is located, tracked, and identified as a moving object in consecutive images and the head and feet are completely within the frame boundaries of the captured image.
Distance estimation: The proposed distance estimation considers two situations: (I) the subject's feet are within the frame's boundary, and (II) the subject's feet are outside the frame's boundary. In case (I), shown in Fig. 2.2, the distance estimation associated with subject variations uses a triangular measurement method across different frames. In case (II), shown in Fig. 2.3, the distance measurement depends on the subject widths in different frames.
Fig 2.2 Distance estimation in situation (I)
In Fig. 2.2, the camera is configured with a fixed height, H, and a known vertical view angle. The distance D_C is acquired by the triangular measurement method when the feet of the subject reach the frame boundary. The feet of the subject become closer to the middle row as the subject moves from point C to point A. Thus, the distance D_A can be calculated from the vertical pixel shift of the subject contour below the middle row in conjunction with the triangulation process:
\[ \frac{p_1}{H_1} = \frac{p_2}{H_2}, \tag{2.3} \]

and

\[ D_A = \frac{H_2}{\tan\theta_1} = \frac{H_1 \times p_2}{p_1 \times \tan\theta_1}, \tag{2.4} \]
where p1 is the vertical pixel count of the subject contour from the feet to the middle row, and p2 is the vertical pixel count of the subject contour from the bottom of the image to the middle row. The camera height is denoted by H1, which equals H, and H2 is the real height corresponding to p2 in the image.
Fig 2.3 Distance estimation in situation (II)
The feet are outside the frame when the subject is close to the camera, so Equation (2.4) cannot be applied. As shown in Fig. 2.3, we instead calculate the distance from the variations of the subject width across frames. Uppercase symbols denote real lengths and lowercase letters denote numbers of pixels in the frame. To begin, the horizontal view width of the camera at point C can be calculated as follows:
\[ W = 2 \times D_C \times \tan\theta_2, \tag{2.5} \]

where 2θ₂ is the horizontal view angle of the camera. The ratio of the horizontal width, W, to the number of pixels across the width of an image, f₁, can be used to determine the body width of a subject, B_W:

\[ B_W = \frac{W \times b_1}{f_1}, \tag{2.6} \]

where b₁ is the number of pixels associated with the width of the subject contour at point C. When the subject moves to point B, the environmental width, W₂, can be found by

\[ W_2 = \frac{f_2 \times B_W}{b_2}, \tag{2.7} \]

where b₂ equals the width in pixels of the subject contour at point B, and f₂ equals f₁. As a result, the distance between the camera and the subject, denoted by D_B, is

\[ D_B = \frac{W_2}{2 \tan\theta_2} = \frac{B_W \times f_2}{2 \, b_2 \tan\theta_2}. \tag{2.8} \]
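As a quick numerical illustration of Equations (2.4) and (2.8), the small helper below evaluates both distance estimates; the variable names mirror the symbols in the text, and the example values are invented for demonstration only.

```python
import math

def distance_feet_visible(H1, p1, p2, theta1):
    """Situation (I), Eq. (2.4): feet inside the frame.
    H1: camera height [m]; p1, p2: vertical pixel counts; theta1: angle [rad]."""
    return H1 * p2 / (p1 * math.tan(theta1))

def distance_feet_outside(B_w, f2, b2, theta2):
    """Situation (II), Eq. (2.8): feet outside the frame.
    B_w: real body width [m]; f2: image width [px]; b2: subject width [px]."""
    return (B_w * f2) / (2.0 * b2 * math.tan(theta2))

# Invented example: 0.9 m camera height and half view angles from the 45/95-degree FOV.
print(distance_feet_visible(H1=0.9, p1=120, p2=180, theta1=math.radians(22.5)))
print(distance_feet_outside(B_w=0.45, f2=2880, b2=600, theta2=math.radians(47.5)))
```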
Outline feature generation: The posture feature is generated by calculating the Euclidean distances between the contour points of a moving subject and its centroid, which are then converted into a distance curve [22]. This distance curve is constructed from the distances between 36 contour points and the centroid, beginning at the bottom left of the contour and working clockwise through the 36 selected contour points. The angle at the centroid between neighboring contour points is approximately 10 degrees. Three peaks in the distance curve correspond to the head and feet of a standing subject, as illustrated in Fig. 2.4.
Fig 2.4 Distance curve pattern measured for a standing subject
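A minimal sketch of the outline-feature computation, assuming OpenCV and a binary subject mask: it samples the contour in 10-degree angular bins around the centroid and records the centroid-to-contour distances that form the curve in Fig. 2.4. Taking the farthest point per bin is an assumption; the original selection rule may differ.

```python
import cv2
import numpy as np

def distance_curve(mask, n_points=36):
    """Return n_points centroid-to-contour distances, one per 360/n_points-degree bin."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float32)
    cx, cy = contour.mean(axis=0)                       # centroid of the subject contour
    dx, dy = contour[:, 0] - cx, contour[:, 1] - cy
    angles = np.degrees(np.arctan2(dy, dx)) % 360.0     # angle of each contour point
    dists = np.hypot(dx, dy)                            # Euclidean distances to the centroid
    step = 360.0 / n_points
    curve = np.zeros(n_points, dtype=np.float32)
    for k in range(n_points):
        in_bin = (angles >= k * step) & (angles < (k + 1) * step)
        if in_bin.any():
            curve[k] = dists[in_bin].max()              # farthest point per bin (assumed rule)
    return curve
```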
2.2.2 Proposed action recognition on single subject
A panoramic camera platform consisting of a set of four cameras, each with a fixed focal length and 45-degree vertical and 95-degree horizontal view angles, is used for the proposed action recognition system. The camera platform is ideally placed on a table or furniture at a height of about 90 centimeters to effectively capture the body height. By stitching the four images, the panoramic camera can cover a 360-degree scene.
Fig. 2.5 reveals the flowchart of the proposed action classification system. First of all, a video clip is recorded using the proposed panoramic camera platform. Background subtraction is used to capture a moving subject from the video clip, from which the features of height, distance between the camera and the subject, and contour are extracted. The six formulated features associated with the action characteristics of height, distance, and contour are calculated and classified by the SVM scheme, in which the four activities of falling, standing, walking, and sitting are recognized. To be accurate for human action recognition, we adopt action recognition based on an input segment of 10 pictures. For the watching-TV category, TV switching on/off detection determines whether the TV is turned on; the TV-related actions depend on whether the subject is sitting or not.
Fig. 2.5 shows the proposed flowchart of our action recognition system based on 10 pictures of an input segment for the four actions and TV on/off detection. From the handcrafted feature extraction, the six designed formulations are implemented and defined as shown in Table 2.1. The support vector machine model from the LIBSVM toolbox is used as the classifier based on the six types of features, and action recognition relies on cross-validation to derive the objective task [23]. Finally, majority voting is used to classify the five actions in a picture segment, with watching TV determined by sitting and the television being turned on.
Fig 2.5 Proposed flowchart of our action recognition system
TABLE 2.1 FEATURES EXTRACTED FOR ACTION RECOGNITION

d_min / d_max : d_max and d_min are the largest and smallest distances from the contour points to the centroid point, respectively. The ratio of d_min to d_max reveals a stretched or shrunk body.
h / b : h and b denote the height and width of a subject, respectively. The postures of standing, sitting, and lying are likely to have large, medium, and small ratios of h/b, respectively.
b x D : b multiplied by D shows the differences among the three postures, where D is the estimated distance from the camera to the subject.
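For illustration, the classification step can be sketched with scikit-learn's SVM as a stand-in for the LIBSVM toolbox cited above (both implement the same classifier). The six-dimensional feature vector here is a random placeholder for quantities such as d_min/d_max, h/b, and b x D, and the grid-searched hyper-parameters and label coding are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: one 6-dimensional handcrafted feature vector per frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = rng.integers(0, 4, size=400)     # 0=falling, 1=standing, 2=walking, 3=sitting (assumed coding)

# RBF-kernel SVM with cross-validated C and gamma, mirroring typical LIBSVM usage.
grid = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10, 100], "gamma": ["scale", 0.1]}, cv=5)
grid.fit(X, y)

# Frame-level predictions over a 10-frame segment, then majority voting per segment.
segment = rng.normal(size=(10, 6))
frame_labels = grid.predict(segment)
segment_label = np.bincount(frame_labels).argmax()
print(segment_label)
```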
Detection of TV on/off: The primary goal of the TV identification task is to determine whether the TV screen is being turned on or off. Using the moving-area detection described above, changes in the scene around the TV screen are detected, where a closed-form boundary is obtained. The approximately four lines associated with the horizontal and vertical edges are analyzed to determine the intersection points of the shape. Then we calculate the adjacent corner angles of this shape and analyze them together to see whether the shape is close to a parallelogram. If so, the TV is being turned on or off; otherwise, this object is not a TV.
As shown in Fig. 2.6, the TV on/off detection is divided into four steps. First, the foreground edges of the image are extracted by the Canny algorithm from the background-subtracted foreground. Then, using the Hough transform, these edges are converted into line segments. Third, the intersection points of lines crossing each other are found and their angles are estimated from the perspective transformation; the center of these intersection points is used to detect possible boundary points. Fourth, the largest parallelogram representing the TV boundary is determined from the boundary points whose corner angles are close to 90 degrees. When the TV is being turned on, there may be multiple foreground objects in the TV area; when the TV is turned off, there are no foreground objects in the TV area in subsequent frames.
Fig 2.6 Flowchart of TV on/off detection (edge detection, line detection, intersection finding, perspective transform, and rectangle detection)
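A sketch of the four-step TV-boundary detection with OpenCV (Canny edges, probabilistic Hough lines, and a quadrilateral corner-angle test); the thresholds and the 60-120 degree corner tolerance are assumptions standing in for the parallelogram analysis described above.

```python
import cv2
import numpy as np

def detect_tv_boundary(foreground_gray):
    """Return the 4 corners of the largest near-parallelogram region, or None."""
    edges = cv2.Canny(foreground_gray, 50, 150)                     # step 1: edge map
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                            minLineLength=40, maxLineGap=10)        # step 2: line segments
    if lines is None:
        return None
    line_mask = np.zeros_like(edges)
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(line_mask, (x1, y1), (x2, y2), 255, 2)
    contours, _ = cv2.findContours(line_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    best = None
    for c in contours:                                              # steps 3-4: corners and angles
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) != 4 or not cv2.isContourConvex(approx):
            continue
        pts = approx.reshape(4, 2).astype(np.float32)
        angles = []
        for i in range(4):
            a, b = pts[i - 1] - pts[i], pts[(i + 1) % 4] - pts[i]
            cosang = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-6)
            angles.append(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))
        if all(60.0 < ang < 120.0 for ang in angles):               # roughly right-angled corners
            if best is None or cv2.contourArea(approx) > cv2.contourArea(best):
                best = approx
    return best
```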
2.2.3 Proposed Action Recognition on Multiple Subjects
Our previous work focused on the action recognition of a single moving subject for homecare [1], [13]. Herein, this task studies the action recognition of multiple moving subjects in panoramic video clips from a portable camera product that stitches four 720x480-pixel images.
Fig. 2.7 displays the flow diagram of the proposed action recognition system. To begin, the Gaussian mixture model (GMM) is an effective method for background modeling. The first GMM, from the OpenCV MOG2 toolbox, is used with an adaptive learning rate and an adjustable foreground weight. MOG2 has good adaptability, so moving objects can be detected instantly and effectively. Although this model can eliminate noise and shadows in a superior way, false detections may occur when there are fast-moving objects, such as shaking trees, or sudden changes of illumination in the background. This GMM uses five Gaussian distributions to characterize the pixels at the same position over 100 frames. The background and foreground pixel values are derived by using the expectation-maximization method on the Gaussian distributions with higher and lower intensities. Second, BLOBs associated with the foreground are tracked by Kalman filters based on their centroid points. Third, the overlapping and non-overlapping circumstances are determined: if subjects overlap, their previous activity states are kept; otherwise, the features from each BLOB are used to recognize an action using the second GMM scheme. This classifier finds the optimized number of Gaussian components for each class associated with the BLOB features based on a separability criterion and then determines the parameters of these Gaussian components by using the expectation-maximization algorithm. Fourth, incremental majority voting for the action recognition of each subject, based on the 10 images of a picture segment, is carried out to obtain a result that is further adjusted or verified by the proposed finite state machine. In particular, the proposed finite state machine also considers the subject interactions to make proper estimates of activity states.
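The second GMM scheme (the activity classifier) can be sketched as one Gaussian mixture per action class fitted with EM, with classification by the highest class log-likelihood. scikit-learn is assumed here, and the fixed component count is a placeholder for the separability-based selection described above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class PerClassGMM:
    """One Gaussian mixture per action class; classify by maximum log-likelihood."""
    def __init__(self, n_components=3):
        self.n_components = n_components          # placeholder; chosen per class in the text
        self.models = {}

    def fit(self, X, y):
        for label in np.unique(y):
            gmm = GaussianMixture(n_components=self.n_components, covariance_type="full")
            gmm.fit(X[y == label])                 # EM estimation of this class's components
            self.models[label] = gmm
        return self

    def predict(self, X):
        labels = sorted(self.models)
        scores = np.stack([self.models[l].score_samples(X) for l in labels], axis=1)
        return np.asarray(labels)[scores.argmax(axis=1)]

# Toy usage with random six-dimensional BLOB features.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 6)), rng.integers(0, 4, size=200)
clf = PerClassGMM().fit(X, y)
print(clf.predict(X[:5]))
```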
Fig 2.7 Proposed action recognition system
To generate BLOBs, the foreground is refined by morphology and dependable BLOBs within a certain range of sizes are found; eight-connected component analysis is then performed to create arbitrarily large objects, for which the size, convex hulls, bounding circles, bounding boxes, and bounding box ratios are calculated. These features are used to detect the appropriate subject that meets those parameters and to fill the holes of the subject considered as a BLOB. An example of BLOB generation is shown in Fig. 2.8.
Fig 2.8 Example of BLOB generation from the foreground
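A sketch of the BLOB-generation step using OpenCV's eight-connected component analysis; the area range and hole-filling kernel are assumptions standing in for the exact parameters of the original system.

```python
import cv2
import numpy as np

def extract_blobs(foreground_mask, min_area=800, max_area=50000):
    """Return bounding boxes, centroids, and filled masks of size-filtered BLOBs."""
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(
        foreground_mask, connectivity=8)
    blobs = []
    for i in range(1, num):                                # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if min_area <= area <= max_area:                   # assumed size range for a person
            x, y = stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_TOP]
            w, h = stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT]
            filled = cv2.morphologyEx((labels == i).astype(np.uint8) * 255,
                                      cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))
            blobs.append({"bbox": (x, y, w, h), "centroid": tuple(centroids[i]),
                          "mask": filled, "box_ratio": w / float(h)})
    return blobs
```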
Trajectory modeling: Multiple standard techniques, such as mean shift, SURF-based matching, and optical flow, are used for trajectory tracking [24]. In this work, we use the Kalman filter to keep track of multiple subjects across sequential frames. First, in the currently processed frame, the center point of the subject BLOB is determined. Second, the Kalman filter predicts the location of the center point of the respective BLOB in the next frame and optimizes the estimate by using a motion model with constant movement vectors between neighboring frames in order to reduce noise, as shown in Fig. 2.9.
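The centroid tracking can be sketched with OpenCV's Kalman filter under a constant-velocity motion model, one filter per subject; nearest-centroid matching is a simplification of the full data association, and the noise covariances are assumed values.

```python
import cv2
import numpy as np

def make_centroid_kalman(x, y):
    """Constant-velocity Kalman filter on a 2-D centroid (state: x, y, vx, vy)."""
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2       # assumed noise levels
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    kf.statePost = np.array([[x], [y], [0], [0]], np.float32)
    return kf

def track_step(kf, detected_centroids):
    """Predict the next centroid, then correct it with the nearest detected centroid."""
    pred = kf.predict()[:2].ravel()
    if detected_centroids:
        cx, cy = min(detected_centroids,
                     key=lambda c: np.hypot(c[0] - pred[0], c[1] - pred[1]))
        kf.correct(np.array([[cx], [cy]], np.float32))
    return pred
```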
Finite state machine: Three circumstances, separation, interaction, and overlapping, describe the subject interactions. In this study, the Gaussian mixture models, majority voting, and the finite state machine handle the action recognition of subjects in the separation and interaction circumstances. The proposed finite state machine predicts the intermediate activity state of a subject based on its conditions before and after the overlapping circumstance. According to the state transition graph among the four actions of walking, falling, sitting, and standing, the proposed finite state machine is designed as described in Fig. 2.10.
Fig 2.9 Illustration of tracking by the Kalman filter method: (a) original image; (b) its foreground; (c) subject boundary extracted; (d) two subjects separated; (e) two subjects in contact; (f) two subjects overlapped
Fig. 2.10 shows the FSM as a constraint diagram in which edges and nodes correspond to transitions and states. The initial state is created when a subject appears in the frame sequence. Afterwards, the action recognition state is constrained by the proposed finite state machine, since some state transitions cannot occur. For example, the falling state cannot directly transfer to the walking or standing state; the sitting state must be passed through.
Fig 2.10 Proposed FSM
Table 2.2 lists the states and transitions of the proposed FSM.
Notably, the background subtraction system cannot generate a foreground for a subject staying still for a while; the proposed FSM then sustains its previous state because the subject has not left. The proposed FSM in Fig. 2.10 can be widely used for most ordinary people in various environments. Besides, the states of an overlapping interval can also be fairly inferred when subjects overlap. We predict a short path between the states before and after the overlapping interval based on the appropriate state transformations, as shown in Table 2.3. As seen in Fig. 2.11, the activity states in the overlapping interval are estimated in this way. There are many cases that cannot be handled with a short-path procedure; nevertheless, this remains a difficult challenge due to the lack of information.
TABLE 2.2 STATES AND TRANSITIONS OF OUR FINITE STATE MACHINE
(rows: current state; columns: newly classified input state; entries: admissible state sequence)

                 Walking      Standing     Sitting          Falling
Walking (W)      W            Std, W       Sit, Std, W      F, Sit, Std, W
Standing (Std)   W, Std       Std          Sit, Std         F, Sit, Std
Sitting (Sit)    Std, Sit     Std, Sit     Sit              F, Sit
Falling (F)      W, F         Std, F       Sit, F           F
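The transition constraint can be sketched as a small lookup of feasible next states, so that, for example, a falling subject cannot jump straight back to walking or standing without passing through sitting. The exact transition sets below are an assumed reading of Fig. 2.10 and Table 2.2, not a verified reproduction.

```python
# Assumed direct transitions between activity states (a reading of Fig. 2.10).
ALLOWED = {
    "walking":  {"walking", "standing"},
    "standing": {"standing", "walking", "sitting"},
    "sitting":  {"sitting", "standing", "falling"},
    "falling":  {"falling", "sitting"},
}

def fsm_filter(prev_state, classified_state):
    """Accept the classifier output only if the transition is feasible; otherwise hold."""
    if classified_state in ALLOWED[prev_state]:
        return classified_state
    return prev_state          # infeasible jump (e.g. falling -> walking): keep previous state

state = "walking"
for observed in ["walking", "falling", "sitting", "falling", "walking"]:
    state = fsm_filter(state, observed)
    print(observed, "->", state)
```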
Incremental majority voting: Precise prediction of the activity states of the boundary picture segments yields reliable estimates of the states in the overlapping interval. To increase the accuracy of state prediction, the incremental majority voting approach is proposed. According to Fig. 2.11, not only the state of the boundary picture segment, P1, is considered, but the states of P2, P3, P4, and/or P5 are also consulted to check whether their states are consistent, because state transitions do not occur frequently in most situations. Accordingly, we first check whether the states of P1 and P2 are the same. If yes, the state of the boundary picture segment is confirmed. Otherwise, P3 joins P1 and P2 in the majority voting. If they all differ, P4 participates in the voting. When P1, P2, P3, and P4 belong to four distinct activity states, P5 helps to make the final decision. Fig. 2.12 displays the operational flow and probability model of the proposed incremental majority voting, where steps I, II, III, and IV indicate the majority voting over (P1, P2), (P1, P2, P3), (P1, P2, P3, P4), and (P1, P2, P3, P4, P5), respectively. Here, p is the average accuracy rate of state predictions among the four activities, and r is the probability of a state transition at a picture segment. Owing to the incremental majority voting, the accuracy probabilities of the four steps are accumulated into an overall accuracy expressed in terms of p and r.
Fig 2.11 Estimates of activity states in the overlapping interval
Fig 2.12 Proposed incremental majority voting
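One way to code the incremental voting rule described above: start from the two segments nearest the boundary and add one segment at a time until a strict majority appears, letting the last segment decide if all five disagree. This follows the textual description, but the tie-breaking details and the fallback are assumptions, and the probability model in terms of p and r is not reproduced here.

```python
from collections import Counter

def incremental_majority_vote(segment_states):
    """segment_states: predicted states of P1..P5, ordered outward from the boundary."""
    for k in range(2, len(segment_states) + 1):      # steps I-IV use 2, 3, 4, then 5 segments
        state, votes = Counter(segment_states[:k]).most_common(1)[0]
        if votes > k / 2:                            # strict majority reached
            return state
    return segment_states[-1]                        # all distinct: last segment decides (assumed)

print(incremental_majority_vote(["sitting", "sitting", "walking", "sitting", "standing"]))
print(incremental_majority_vote(["sitting", "walking", "standing", "falling", "walking"]))
```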
Feature generation for Gaussian mixture models: With reference to our previous work [1], [13], the features related to the subject height and distance are located and extracted. In addition, the features from the motion vectors of the centroid points, together with the features concerning activity characteristics derived from the height, distance, contour, and motion information, are used to train the Gaussian mixture models to recognize the four activities.
2.3 Experiment Results and Discussion
These experiments were conducted with 10 volunteers, 6 women and 4 men, from whom video clips were captured for each subject, covering the standing, falling, sitting, and watching-TV activities. Fig. 2.13 shows the configuration of the room and the experimental scenarios. The experimental room is 3.9 m by 3.1 m by 3.0 m. The five actions in these video clips are shown in Fig. 2.14, which indicates a quite complicated background.
Fig 2.13 Room layout and experiment scenario
A single subject's movement was tracked, identified, and understood by background subtraction when the entire body was captured within a frame. Our experiments yield good height estimates for 5,175 effective frames out of 6,000; 1,243 and 2,583 of these frames have errors of less than 2% and 5%, respectively. Finally, the outcome shows that the average error rate for a subject with a height of 170 cm is around 5.5%, which corresponds to an estimated height between 161 cm and 179 cm. We asked the subjects to move around the camera at 6 separate distances of 3 m, 2.5 m, 2 m, 1.5 m, 1 m, and 0.5 m, indicated by the dotted circles in Fig. 2.13, for their distance estimates. For