National Chung Cheng University
Department of Electrical Engineering
ADVANCED MACHINE PERCEPTION MODEL FOR
ACTIVITY-AWARE APPLICATION
Student: Manh-Hung Ha
Advisor: Professor Oscal Tzu Chiang Chen
June 2021
Acknowledgments
It would have been impossible to work on a thesis without the support of many people, who all have my deep and sincere gratitude. First of all, I would like to thank Professor Oscal T.-C. Chen, my academic supervisor, for having taken me on as a PhD student and having trusted me for years. I could not find any better advisor than Professor Chen, who gave me ideas, advice, and motivation, and, above all, let me pursue my thoughts freely. He was an excellent theoretical point of reference and critical to testing my theories. He continues to impress me with his systematic style, his compassion, and his humility, and, above all, he pushes me to become a better researcher as well as a better person. Thank you for introducing me to the world of computer vision and research and for taking me on as a young researcher in my formative years.
Thank you to my thesis committee, Prof. Wei-Yang Lin, Prof. Oscal T.-C. Chen, Prof. Wen-Nung Lie, Prof. Sung-Nien Yu, and Prof. Rachel Chiang, for their time, kind advice, and feedback on my thesis. You have taught me a lot and guided me in improving the quality of my research.
I was incredibly happy to have amazing mentors at CCU over the years. I am grateful to Professor Wen-Nung Lie and Professor Gerald Rau, who helped me learn the ropes in the first few years. I am thankful to Professor Alan Liu, Professor Sung-Nien Yu, Professor Norbert Michael Mayer, and Professor Rachel Chiang for taking me on in their academic courses, helping me to expand my field of study, and teaching me to be a systematic experimentalist.
I would like to express my gratitude to all of my friends and colleagues in the Department of Electrical Engineering. There were great staff members and researchers in my field who taught me much: Hoang Tran, Hung Nguyen, Luan Tran, and many others; I thank them for working with me, and I hope that there are even more opportunities to learn from them in the future. I would like to thank my many laboratory partners in the VLSI DSP group, including Wei-Chih Lai, Ching-Han Tsai, Yi-Lun Lee, Yen-Cheng Lu, Han-Wen Liu, Yu-Ting Liu, and others. I learned many things from each of them, and their support and guidance helped me overcome some stressful times. Finally, thank you for sharing your thoughts, documentation, datasets, and code. None of this work would have been conceivable without my many colleagues working in computer vision and machine learning, and I also want to thank my many amazing co-authors.
I was also incredibly lucky to have made many friends during my time at CCU. Thanks also to the badminton team of the EE department and the VSACCU badminton team, who showed me all the awesome activities around the PhD journey!
Finally, nothing would have been possible without the valuable support, encouragement, love, and patience of my family, especially my parents, Ha Sinh Hai and Tran Thi Theu. From my first years as a student, they prepared me for this by showing me the importance of hard work, critical thinking, and patience. I thank them for their support, trust, and innumerable sacrifices; I have worked on this study thousands of kilometers from home and missed many significant life events. And speaking of love and encouragement, I am grateful to my wife, Nguyen Hong Quyen, and my daughter, Ha Hai Ngan, for all our many wonderful years, weekends, and vacations, and above all for staying by my side, even when a long distance was between us. Thank you for making me who I am today and for having always been there and trusted me.
Abstract
People are among the most important entities that computer vision systems need to understand in order to be useful and omnipresent in various applications. Much of this awareness is based on the recognition of human activities for homecare systems that observe and support elderly people. Human beings do this well on their own: we look at others and describe each action in detail. Moreover, we can reason about those actions over time and even predict possible actions in the future. Computer vision algorithms, on the other hand, remain well behind this capability. In this study, my research aim is to create learning models that can automatically induce representations of human activities, especially their structure and feature meaning, in order to solve several higher-level action tasks and approach a context-aware engine for various action-recognition applications.
In this dissertation, we explore techniques to improve human action understanding from video inputs, which are common in daily settings such as surveillance, traffic, education, movies, and sports, on challenging large-scale benchmark datasets and our own panoramic video dataset. This dissertation targets the action recognition and action detection of humans in videos. The most important insight is that actions depend on global features parameterized by the scene, objects, and other subjects, apart from their own local features parameterized by body-pose characteristics. Additionally, modeling the temporal features by optical flow from the motions of people and objects in the scene can further help in recognizing human actions. These dependencies are exploited in five key parts: (1) detecting moving subjects using the background subtraction scheme and tracking the extracted subjects using the Kalman filter; (2) building a recognition system with a lightweight model capable of learning on a portable device; (3) using capsule networks and skeleton-based map generation to attend to the subjects and build their correlation and attention context; (4) exploring an integrated action recognition model based on correlations and attention of subjects and scene; and (5) developing systems based on the refined highway aggregating model.
In summary, this dissertation presents several novel and significant solutions for efficient DNN architecture analysis, acquisition, and distribution on large-scale video data. We show that DNNs using multiple streams, combined models, hybrid structures on conditional context, feature input representations, global features, local features, spatiotemporal attention, and the modified-belief CapsNet efficiently achieve high-quality results. The consistent improvements from these components of our DNNs lead to state-of-the-art results on popularly used datasets. Furthermore, we observe that the largest improvements are indeed achieved in action classes involving human-to-human and human-to-object interactions, and visualizations of our networks show that they focus on scene context that is intuitively relevant to action recognition.
Keywords: attention mechanism, activity recognition, action detection, deep
neural network, convolutional neural network, recurrent neural network, capsule
network, spatiotemporal attention, skeleton
TABLE OF CONTENTS
PAGES
ACKNOWLEDGMENTS i
ABSTRACT iii
TABLE OF CONTENTS v
LIST OF FIGURES viii
LIST OF TABLES xi
I INTRODUCTION 1
II MULTI-MODAL MACHINE LEARNING APPROACHES LOCALLY ON SINGLE OR MULTIPLE SUBJECTS FOR HOMECARE 8
2.1 Introduction……… 9
2.2 Technical Approach 11
2.2.1 Handcraft feature extraction by locally body subject estimation 12
2.2.2 Proposed action recognition on single subject 17
2.2.3 Proposed action recognition on multiple subjects 20
2.3 Experiment Results and Discussion 27
2.3.1 Effectiveness of our proposal to single subject on action recognition 28
2.3.2 Effectiveness of our proposal to multiple subjects on action recognition 31
2.4 Summary and Discussion 35
III ACTION RECOGNITION USING A LIGHTWEIGHT MODEL 38
3.1 Introduction 37
3.2 Related Work 40
3.3 Action recognition by a lightweight model 41
3.5 Summary and Discussion 49
IV ATTENTIVE RECOGNITION LOCALLY, GLOBALLY, TEMPORALLY, USING DNN AND CAPSNET 50
4.1 Introduction 51
4.2 Related Previous Work 55
4.2.1 Diverse Spatio-Temporal Feature Generation 55
4.2.2 Capsule Neural Network 59
4.3 Proposed DNNs for Action Recognition 59
4.3.1 Proposed Generic DNN with Spatiotemporal Attentions 59
4.3.2 Proposed CapsNet-Based DNNs 68
4.4 Experiments, Comparisons of Proposed DNN 72
4.4.1 Datasets and Parameter Setup for Simulations 72
4.4.2 Analyses and Comparisons of Experimental Results 74
4.4.3 Analyses of Computational Time, and Cost 84
4.4.4 Visualization 85
4.5 Summary and Discussion 88
V ACTION RECOGNITION ENHANCED BY CORRELATIONS AND ATTENTION OF SUBJECTS AND SCENE 89
5.1 Introduction 89
5.2 Related work 91
5.3 Proposed DNN 92
5.3.1 Projection of SBB to ERB in the Feature Domain 92
5.3.2 Map Convolutional Fused-Depth Layer 93
5.3.3 Attention Mechanisms in SA and TE Layers 93
5.3.4 Generation of Subject Feature Maps 95
5.4 Experiments and Discussion 97
5.4.1 Datasets, Parameter Setup, and Implementation Details 97
5.4.2 Analyses and Comparisons of Experimental Results 98
5.5 Summary and Discussion 100
VI SPATIO-TEMPORALLY WITH AND WITHOUT LOCALIZATION ON MULTIPLE LABELS FOR ACTION PERCEPTION, USING VIDEO CONTEXT 101
6.1 Introduction 102
6.2 Related work 107
6.2.1 Action Recognition with DNNs 107
6.2.2 Attention Mechanisms 108
6.2.3 Bounding Boxes Detector for Action Detection 109
6.3 Proposed Methodology 110
6.3.1 Action Refined-Highway Network 112
6.3.2 Action Detection 118
6.3.3 End-to-End Network Architecture on Action Detection 123
6.4 Experimental Results and Discussion 123
6.4.1 Datasets 123
6.4.2 Implementation Details 125
6.4.3 Ablation Studies 127
6.5 Summary and Discussion 133
VII CONCLUSION AND FUTURE WORK 135
REFERENCES 139
APPENDIX A 152
LIST OF FIGURES
2.1 Schematic diagram of height estimation 13
2.2 Distance estimation at the situation (I) 14
2.3 Estimated Distance at the situation (II) 15
2.4 Distance curve pattern of the measure of the standing subject 16
2.5 Proposal flow chart of our action recognition system 18
2.6 Flowchart detection of TV on/off 19
2.7 Proposed activity recognition system 21
2.8 Example of shape BLOBs generation from foreground 22
2.9 Illustration of tracking by the Kalman filter method 23
2.10 Proposed FSM 24
2.11 Estimates of activity states in the overlapping interval 26
2.12 Proposed incremental majority voting 26
2.13 Room layout and experiment scenario 27
2.14 Examples of five activities recorded from the panoramic camera 28
2.15 Total accuracy rate (A) versus p and r 32
2.16 Total accuracy (A) versus p at r=0.001, 0.01, 0.05, 0.1, and 0.2 32
3.1 Proposed recognition system 41
3.2 Functional blocks of the proposed MoBiG 42
3.3 Proposed finite state machine 43
3.4 Proposed incremental majority voting 44
3.5 Confusion matrix of MoBiG identifying four activities 48
4.1 CapsNets integrated in a generic DNN 55
4.2 Block diagram of the proposed generic DNN 58
4.3 Three skeleton channels 61
4.4 One example of the transformed skeleton maps from an input segment 63
4.5 Block diagrams of the proposed AJA and AJM 64
4.6 Block diagram of the proposed A_RNN 67
4.7 Proposed capsule network for TC_DNN and MC_DNN 69
4.8 Block diagrams of the proposed CapsNet-based DNNs 71
4.9 Examples of the panoramic videos about 12 actions where subjects are marked by red rectangular dash-line boxes for observation only 74
4.10 Visualization of the outputs from the intermediate layers of the proposed TC_DNN 87
4.11 Visualization of the outputs from the intermediate layers of two A_RNNs 87
5.1 Block diagram of the proposed deep neural network 92
5.2 Block diagram of the SA generation layer 95
5.3 Comparison of the performance of the AFS and ROS streams for each action class 98
5.4 JHMDB21 confusion matrix 99
6.1 Refined highway block for 3D attention 104
6.2 Overview of the proposed architecture for action recognition and detection 110
6.3 Temporal bilinear inception module 113
6.4 RH block in RNN structures, like the standard RNN, LSTM, GRU and variant RNNs 114
6.5 Schematic recurrent 3D Refined-Highway depth by three RH block 116
6.6 3DConvDM layer correlating the feature map X 118
6.7 The details of GAA module 122
6.8 Per-category mAP on AVA 132
6.12 Qualitative results of R_FSRH on action recognition and detection on the JHMDB21 dataset 140
6.13 Qualitative results of top predictions for some of the classes using proposed
model on Ava 141
A.1 12 category visualization of panoramic camera data in two and three
dimensions with the scatter plot filter 151
A.2 T-SNE test data visualization where each of data points represented by the
shots of a frame sequence on UCF101 151
LIST OF TABLES
1.1 List of published and submitted papers in this dissertation 7
2.1 Features used for posture recognition 18
2.2 States and transitions of our FSM 24
2.3 State estimates at the overlapping interval 25
2.4 Confusion matrix of TV on/off detection 28
2.5 Confusion matrix of activity recognition (I) 29
2.6 Comparison of features and performance of the proposal and conventional activity recognition 30
2.7 Average accuracies of four activities at the type-I experiment 33
2.8 Confusion matrix of activity recognition (II) 34
2.9 Example I of activity accuracies of two subjects at the type-II experiment 34
2.10 Example II of activity accuracies of two subjects at the type-II experiment 34
2.11 Example III of activity accuracies of three subjects at the type-II experiment 35
3.1 Features of original mobilenetV2 and MOBIG 46
3.2 Performance comparison of pre-trained mobilenetV2 and MOBIG using the panoramic camera dataset 47
3.3 Accuracies improved by MOBIG plus FSM and IMV 48
3.4 Accuracy, complexity, and model size of the proposed system, and the other DNNs 49
4.1 Performance of three types of DNNs using the datasets of UCF101, HMDB51, and panoramic videos 75
4.2 Accuracies of 12 actions recognized by three types of DNNs using the panoramic video dataset
4.4 Performance of the proposed generic DNN and three CapsNet-based DNNs
with one, two and three input streams 79
4.5 Average accuracies of the proposed TC_DNN at four merging models and
different values of F 81
4.6 Performance comparisons of TC_DNN using different approaches of
generating Tske maps 82
4.7 Performance comparisons of the proposed and conventional DNNs using only an RGB stream 82
4.8 Performance comparisons of the proposed and conventional DNNs using the HMDB51 and UCF101 datasets 83
4.9 Parameter amounts and inference time of the generic DNN, MC_DNN,
DC_DNN, and TC_DNN 84
5.1 Accuracies of 8 actions recognized by three types 99
6.1 Comparisons of DNNs with the R_FSRH layers at different numbers of RH
modules on TP, JHMDB-21 and UCF101-24 datasets 128
6.2 Results of the DNN based on I3D+ R_FSRH using different scalar values, γ,
A.1 Comparison of the features and performance of our proposed and generic
activity recognition systems on panoramic video dataset 152
I INTRODUCTION
The main goal of visual understanding, image classification, computer vision, and artificial intelligence is to help people do their work more efficiently. From support requests to assistive services, the potential impact of vision and AI on an aging society is immeasurable, and it has grown at an unprecedented rate in recent years. While these applications are still an active field of study, note that they share a common theme: they all require systems to interact with and understand humans. Hence, developing technologies capable of understanding people is critical to achieving the goal of ubiquitous AI. Human understanding, however, cannot be done in isolation by just observing the person, because we are influenced by the objects we interact with as well as by the environment we exist in.
The development of deep learning has led to rapid improvements in various fundamental vision problems. Given large labeled datasets, CNNs are able to learn robust and efficient representations that exceed human performance on video classification and perform exceedingly well in action recognition, object detection, and keypoint estimation tasks. But what about tasks that do not have well-defined datasets or are much harder to label, such as the 3D structures of objects or all actions afforded by a scene? Moreover, scaling models up toward a higher-level understanding of human intent and actions over a continuous video stream remains a limitation.
This thesis takes a step toward building and improving systems capable of understanding human actions and intentions from both benchmark large-scale datasets and our own dataset. These systems need to reason about the scene layout in unison with the humans to perform well. The thesis explores five directions: (i) handcrafted features with traditional machine learning for homecare; (ii) a computation-affordable lightweight model for smart devices; (iii) local cues (poses), temporal cues (optical flow), and appearance cues (RGB) combined with capsule networks as the classifier; (iv) an incremental action recognition model using correlations and attention of subjects and scenes; and (v) a refined highway aggregating global and local context for action recognition with and without localization.
Homecare action recognition with single or multiple subjects, using handcrafted features and traditional machine learning: The homecare of elderly people has become an important issue that requires activity recognition of multiple subjects at home. We start by understanding humans via focusing directly on them and, especially, their body poses. Features from body pose, typically obtained using a background subtraction and estimation method, provide a useful signal of a person's external action state. As handcrafted features, extracting just a few body keypoints over time can be enough to recognize human actions. Toward that goal, we build systems to detect and track these features efficiently and accurately in videos. We then use those features to directly classify and recognize the actions of single and multiple subjects. We show that the new model using a panoramic camera can enable several novel applications in home-care systems.
Computation-affordable recognition system with a lightweight model: Activity recognition systems are widely used and experiencing rapid growth in different areas, including healthcare and security, as well as entertainment. Currently, edge devices are very resource limited, which does not allow for heavy computation. Previous work has successfully developed various DNN models for action recognition, but these models are computationally intensive and therefore inefficient to deploy on edge devices. One of the most promising solutions today is cloud storage, to which data can be transferred by several methods for further analyses. However, continuously sending signals to the cloud demands increasing bandwidth. Routine activities can instead be identified at the edge before being sent to the cloud, improving response time and lowering cost. Accordingly, we propose a lightweight DNN model for action recognition that requires less processing power, making it suitable for use in portable devices. The evaluation was conducted on five daily activities from videos recorded by a panoramic camera. As a result, the proposed model performs better than current techniques.
Human action recognition using capsule networks and skeleton-based attention on the subjects and their correlation attention context: The topic of activity recognition has received intensive study in recent years. Our DNN devises a spatial-temporal skeleton-based attention mechanism and incorporates multiple CapsNet mechanisms to better understand human activity in videos with rich spatial and temporal context. The primary schemes used in activity recognition are presented. First, the popularly used DNN for activity recognition on the two inputs of RGB and optical flow is briefly reviewed. Second, we propose a joint-point-based spatial attention mechanism to weight the skeleton for action recognition. Finally, how to increase the inference classifiers in capsule networks is illustrated.
Incremental action recognition model using correlations and attention of subjects and scenes: To that end, we develop systems for efficient and accurate video recognition, together with an analysis of the current state of human action understanding, which consists of three stages. At the first stage, we developed an adaptive YOLOv4 object detector for detecting humans in videos and forming a temporal stream based on human features. Second, the human regions are projected onto the corresponding feature maps from which a 3DCNN extracts the structural and semantic information of subjects and objects individually. The attention mechanisms in the spatial attention and temporal encoder layers are implemented to find meaningful action information in the spatial and temporal domains to enhance recognition performance.
Human action recognition on multiple labels, with or without localization, using a refined highway aggregating global and local contextual information over time: Finally, we explore the applicability of global and local contextual temporal reasoning for recognizing actions. Having focused on people, we now zoom out and consider people in their context. Scenes, objects, and other people in a scene all strongly condition what a human does. In fact, humans perform actions either to change their own state or to change the state of their environment, often interacting with or responding to other humans in the scene. In both cases, and especially the latter, scene and object affordances directly define what actions a human will perform. We propose the Refined Highway (RH) model and Global Atomic Spatial Attention (GAA) to investigate the role of residual and highway connections in DNNs for enhancing activity recognition. Residual and skip connections are very similar in nature to action behavior and play a significant role in current generative models, which are among the most commonly employed recognition enhancement approaches. RH, on the other hand, can be thought of as an attention-driven mash-up of transformed and carried features. The proposed unified DNN is able to localize and recognize actions based on a 3DCNN. The video is first sliced into equal segments, and region features are then created using a Region Proposal Network (RPN). For this, we extend the 3DCNN with the R_FSRH framework and GAA module to handle RoI localization as an action recognition problem. First, the regions of interest are detected on each frame, and then their respective maps are converted into bounding boxes. The recognition system can also be used for the general problem of human recognition. Compared to the state of the art, our experiments show superior results on large-scale datasets.
In this thesis, we verified our proposed DNNs on multiple public large-scale datasets and our panoramic video dataset.
Dataset collection for homecare assistance targets the recognition of actions using a portable panoramic camera that captures four 720 x 480-pixel images to produce 360-degree-view stitched images. Input data are collected and labeled with annotated ground truth and detailed evaluation methods, which makes the methods comparable with each other. It is also possible to use existing datasets, and this is what has been done in these experiments. Typically, a dataset is divided into three parts: training, validation, and testing. In this thesis, we name the dataset recorded by the panoramic camera PanoV; it is used in Sections 2, 3, and 4.
The absence of standards and benchmarks makes comparing different deep learning models harder. In recent years, UCF101 and HMDB51 have been the most popular large-scale datasets, and scientific research is usually done on one or both of them, but not without problems. Looking at the number of generated frames, UCF101 seems comparable in magnitude to ImageNet, but it lacks diversity, so generalization can be difficult. HMDB51 suffers from the same issues and has a greater number of frames than the average. The Traffic Police (TP) dataset, which contains videos of Chinese traffic police with 8 kinds of commanding poses analyzed from visual information, was also used for simulations. To solve the action recognition problems in this study, we used three datasets to evaluate the proposed models in Sections 4, 5, and 6.
Precise person boxes are inferred in JHMDB-21 from human silhouettes, along with the ground truths of the labelled actions. The UCF101-24 and JHMDB-21 annotations are accompanied by spatiotemporal ground truths containing the corresponding spatial and temporal annotations. Additionally, each video can contain multiple action instances of the same class label but with different spatial and temporal boundaries. AVA and Charades also feature scenes with multiple actors. Each video clip is untrimmed and contains multiple action labels inside overlapping temporal durations. Both of these phenomena are embedded in the two datasets, which are reasonable and challenging, since they provide a variety of contextual information that aids action detection. We conducted detailed ablation studies in this work using the AVA v2.1 and Charades datasets.
Recently, several techniques using deep learning have emerged for different modalities. Many computer vision benchmarks show that the disparity between advanced algorithms and human performance is growing. These networks learn hierarchies of features; as the depth increases, these hierarchies can describe increasingly abstract concepts. These advances motivated us to study the application of such training methods to the recognition and detection of actions. This thesis is based on a preliminary study and five articles, which contribute to the field of action recognition as shown in Table 1.1.
The relevant publications for each section are as follows:
Section 2: Efficient handcrafted feature estimation for single human classification [1]; effective handcrafted feature estimation for multiple-subject action recognition [2].
Section 3: Lightweight model for action recognition [3].
Section 4: Attentional capsule network for action recognition [4].
Section 5: Enhanced model by correlations and attention of subject and scene [5].
Section 6: Action recognition and detection using the RH network [6].
TABLE 1.1 LIST OF PUBLISHED AND SUBMITTED PAPERS IN THIS DISSERTATION

[1] Activity Recognition Using a Panoramic Camera for Homecare. IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017.
[2] Activity Recognition of Multiple Subjects for Homecare. International Conference on Knowledge and Smart Technology (KST), 2018.
[3] Computation-Affordable Recognition System for Activity Identification Using a Smart Phone at Home. IEEE International Symposium on Circuits and Systems (ISCAS), 2020.
[4] Deep Neural Networks Using Capsule Networks and Skeleton-Based Attention for Action Recognition. IEEE Access, 2020.
[5] M.-H. Ha and Oscal T.-C. Chen, "Action Recognition Improved by Correlations and Attention of Subjects and Scene."
[6] Deep Neural Network Using Fast Slow Refined Highway for Human Action Detection and Classification. IEEE Transactions on Multimedia, 2021 (in preparation for submission).
II MULTI-MODAL APPROACHES LOCALLY ON SINGLE AND MULTIPLE SUBJECTS FOR HOMECARE
To understand human action recognition, we begin in this section by considering homecare-oriented human activities, specifically applying handcrafted features and traditional machine learning to a single subject captured by a Portable Panoramic Video (PPV) camera in the living room. The variables related to the subject height, the distances between the centroid and the subject contour, and the distance from the camera to the subject are extracted for motion analysis. As home care for the elderly becomes an important issue that necessitates recognizing the activities of multiple subjects, we next use a PPV camera indoors to capture and process daily actions that are then analyzed using background subtraction. Each identified single subject is described by six parameters for classification to obtain an initial activity estimate for each image. Specifically, when the subjects overlap, the previous activity state associated with each subject is maintained; the activity of each subject is not updated until the subjects are separated. In addition, the state of an overlapping period is fairly inferred using the states of the boundary image segments neighboring the start and end of the overlapping interval.
Similarly to the single-subject case, for the daily activity recognition of multiple subjects we adopt a PPV camera located indoors to capture daily subject actions by subtracting the background, generating Binary Large Objects (BLOBs), using the Kalman filter for tracking, producing local handcrafted features, applying a Gaussian Mixture Model (GMM) for classification, and adopting incremental majority voting and a Finite State Machine (FSM). Based on the features extracted for a single subject, six kinds of parameters fed to the GMM model provide the initial activity estimate per subject in the framework. A majority voting approach is applied to a 10-frame segment of input pictures, and the finite state machine is then used to further validate the identified action and rule out infeasible states. Specifically, when subjects overlap, the previous activity state of each subject is maintained; the activity of each subject is not updated until the subjects are separated. In addition, the states of the segments neighboring the boundary determine the states of the overlapping interval, and incremental majority voting, established and illustrated by a probability model, is used to increase the performance of action recognition. The simulation results show that the average accuracy over four categories can reach 93.0% when there are multiple subjects in the living room. Compared to existing work, our system demonstrates comparable outcomes for home-care applications.
2.1 Introduction
In recent years, home-care applications have received increasing attention. With the development of medical technology and an aging society, life expectancy and the elderly population have clearly increased. As a result, there is a pressing need for home care for the elderly. Additionally, living alone has become more frequently reported by elderly people today. In previous work, most non-invasive or discreet detection methods used a camera to capture moving objects and then perform activity detection.
To understand human action recognition, many spatiotemporal recognition schemes have recently been proposed [6-12]. In [6], a point-selection scheme was introduced; systems for low-resolution monitoring and 3D depth representation were used to perceive human action [9], [10]. Chen et al. [13] used the Gaussian Mixture Model (GMM) to classify the three actions of falling, standing, and sitting via a panoramic camera. However, most people's actions in daily life are complicated, so we must consider how humans live in their broader surroundings in a complex environment. In addition, there are completely different ways to describe an activity. Besides, it is also important to be aware of the activities of multiple subjects to realize home care with a camera. This recognition task can be summarized as a mix of feature generation, tracking, subject extraction, person monitoring, machine learning, and activity detection. To extract features effectively, many researchers have used a variety of strategies to track multiple subjects [14-17]. In addition, activity identification [18], behavior analysis [19], and pose estimation [1], [13] have been investigated. As a result, these techniques can be used in home care to understand multiple subjects' daily behaviors and to pre-diagnose activity symptoms.
This section focuses on the issue of recognizing human behavior in panoramic video, where the characteristics of each person are evaluated in the time domain. There are numerous difficulties, such as changes in pose, occlusions, multiple overlaps, and specific behavioral categories. Capturing multiple people with one camera raises many problems that must be solved, such as counting the number of people, handling interactions among people, and tracking people. We estimate the number of people by counting the number of Binary Large Objects (BLOBs) [19] and determine subject paths with the Kalman filter across adjacent images [20]. Subject interaction here means the overlapping of subjects in the frame, whose state must be precisely identified, because only one blended appearance can be extracted from overlapping subjects for recognition, leading to misjudgments. Moreover, illumination variation, scene changes, moving unrelated objects, and other background effects can also yield wrong subjects. A key problem is determining the actions of overlapping subjects against a complex, time-varying background. In this task, we use a portable panoramic camera to capture a 360-degree-view image. Background subtraction on the recorded images is used to detect moving objects and to identify BLOBs. The Kalman filter tracks the centroid points of these BLOBs, associated with subjects in a video clip, to remove noise and understand subject interactions. When a single subject is fully extracted, the six features related to that subject are formed and fed to the Gaussian Mixture Model (GMM) for activity classification. To avoid misjudgments from single or few images, action recognition is carried out for each subject over a 10-image segment. In addition, we apply a finite state machine to constrain the state transitions among the falling, sitting, walking, and standing actions precisely. When subjects overlap, there is only one extracted BLOB matching the group of subjects; after separating the overlapped subjects, the activity of each subject is recognized and evaluated through the proposed FSM. The boundary segments at the start and end of the overlap interval are the most important for distinguishing the expanding and stationary states, while the transition states mostly depend on the amount of period changes.
To achieve higher performance, majority voting is used to determine the state of the boundary image segment, for which a probabilistic model is constructed. As a result of the simulations, the proposed system shows good performance, reaching 93.0% at the video level with multiple subjects at home.
2.2 Technical Approach
In this study, we describe in detail how to extract the features by the handcrafted approach.
2.2.1 Handcrafted Feature Extraction by Local Body Subject Estimation
Based on the background image, the moving subjects and objects are discovered. Additionally, erosion and dilation operations are applied to minimize the noise and obtain connected objects. In particular, the shadows of moving objects are eliminated to some extent by an adequate threshold of the GMM during generation of the background image. After obtaining a moving subject, the system checks whether the head and feet of the subject are located within the upper and lower boundaries of the captured picture. If so, the height estimate can refer to the top and bottom boundaries of the vertical pixels in the subject area, and the triangular measurement method estimates the distance between the subject and the camera. Otherwise, the proposed system only estimates the distance from the variations of the body width as the subject moves. To recognize posture, the outline feature selection scheme computes the distances between the centroid point and the contour points of a moving subject. Here, 36 representative contour points are adopted, where neighboring contour points relative to the centroid point have a 10-degree difference. These heights, distances, and outline features of moving subjects are used to construct a model for action recognition.
Detection of moving areas: We apply the background subtraction method to capture the moving areas of a panoramic video clip and remove shadows effectively. Here, the parameters of a GMM model are updated by pixel-level recursive equations for each pixel. In this general background subtraction scheme, one begins by constructing Gaussian distributions, evaluating the background models, and then adjusting the background models to identify foreground instances. Five Gaussian distributions associated with the first 100 frames produce a histogram for each pixel. Furthermore, to ensure both the foreground and background remain identifiable, a background model is built. The environmental context is tracked by the updated background model against gradual illumination changes, in which the learning weight is set to 0.001 and the background weight threshold is 0.7. Then we use erosion and morphological dilation procedures to remove noise, followed by morphological closing to connect multiple small related areas into larger ones. Each resulting large area is considered to be a moving subject.
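As an illustrative sketch (the implementation language of the original system is not specified), the moving-area detection above maps onto OpenCV's MOG2 background subtractor in Python. The learning rate of 0.001, the background ratio of 0.7, and the five Gaussians per pixel come from the text; the morphological kernel size and the input file name are assumptions.

```python
import cv2

# GMM background model: five Gaussians per pixel, shadow detection enabled.
subtractor = cv2.createBackgroundSubtractorMOG2(history=100, detectShadows=True)
subtractor.setNMixtures(5)           # five Gaussian distributions per pixel
subtractor.setBackgroundRatio(0.7)   # background weight threshold from the text

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))  # assumed kernel size

def extract_moving_area(frame):
    """Return a denoised foreground mask for one panoramic frame."""
    fg = subtractor.apply(frame, learningRate=0.001)    # learning weight 0.001
    fg[fg == 127] = 0                                   # discard pixels flagged as shadow
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)   # erosion then dilation removes noise
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)  # merge small neighbouring regions
    return fg

cap = cv2.VideoCapture("panoramic_clip.mp4")            # hypothetical input clip
ok, frame = cap.read()
while ok:
    mask = extract_moving_area(frame)
    ok, frame = cap.read()
```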
Height estimation: An estimated height is important for improved performance in this research. As shown in Fig. 2.1, regardless of the depth of field, the middle row of the captured picture corresponds to the same real-world height, which allows us to determine the subject height. The height estimation scheme determines a moving subject whose head and legs are within the image and then calculates the height from the number of vertical pixels, h, in the person's contour, the number of vertical pixels between the middle row and the bottom of the person's contour, z, and the mounting height of the proposed panoramic camera platform, H, as follows:
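The expression itself is not reproduced in the extracted text; a plausible reconstruction from the definitions of h, z, and H above, given as an assumption rather than the verified original, is

\[ P = H \times \frac{h}{z}. \tag{2.1} \]

Here the z pixels between the middle row and the feet correspond to the real camera height H, so scaling by h/z converts the subject's pixel height into a real height.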
Fig 2.1 Schematic sketch of the geometry involved in height estimation
With Equation (2.1), the estimated height of a subject, P, can be obtained whenever the subject is located, tracked, and identified as a moving object in consecutive images and the head and feet are completely within the frame boundaries of the captured image.
Distance estimation: The proposed distance estimation considers two situations: (I) the subject's feet are within the frame's boundary, and (II) the subject's feet are outside the frame's boundary. In case (I), shown in Fig. 2.2, the distance estimation associated with subject variations uses a triangular measurement method across different frames. In case (II), shown in Fig. 2.3, the distance measurement depends on the subject widths in different frames.
Fig 2.2 Distance estimation in situation (I)
In Fig. 2.2, the camera is configured with a fixed height, H, and a known vertical view angle. The distance D_C is acquired by the triangular measurement method when the feet of the subject reach the frame boundary. The feet of the subject become closer to the middle row as the subject moves from point C to point A. Thus, the distance D_A can be calculated from the vertical pixel shift of the subject contour below the middle row in conjunction with the triangulation process:
\[ \frac{p_1}{H_1} = \frac{p_2}{H_2}, \tag{2.3} \]

and

\[ D_A = \frac{H_2}{\tan\theta_1} = \frac{H_1 \times p_2}{p_1 \times \tan\theta_1}, \tag{2.4} \]
where p1 is the vertical pixel count of the subject contour from the feet to the middle row, and p2 is the vertical pixel count of the subject contour from the bottom of the image to the middle row. The camera height is denoted by H1, which equals H, and H2 is the real height corresponding to p2 in the image.
Fig 2.3 Distance estimation in situation (II)
The feet are outside the frame when the subject is close to the camera, so Equation (2.4) cannot be applied. As shown in Fig. 2.3, we instead calculate the distance from the variations of the subject width across frames. Uppercase symbols denote real lengths and lowercase letters denote numbers of pixels in the frame. To begin, the horizontal view width of the camera at point C can be calculated as follows:
\[ W = 2 \times D_C \times \tan\theta_2, \tag{2.5} \]

where 2θ₂ is the horizontal view angle of the camera. The ratio of the horizontal width, W, to the number of pixels across the width of an image, f₁, can be used to determine the body width of a subject, B_W:

\[ B_W = \frac{W \times b_1}{f_1}, \tag{2.6} \]

where b₁ is the number of pixels associated with the width of the subject contour at point C. When the subject moves to point B, the environmental width, W₂, can be found by

\[ W_2 = \frac{f_2 \times B_W}{b_2}, \tag{2.7} \]

where b₂ equals the width in pixels of the subject contour at point B, and f₂ equals f₁. As a result, the distance between the camera and the subject, denoted by D_B, is

\[ D_B = \frac{W_2}{2 \tan\theta_2} = \frac{B_W \times f_2}{2 \, b_2 \tan\theta_2}. \tag{2.8} \]
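As a quick numerical illustration of Equations (2.4) and (2.8), the small helper below evaluates both distance estimates; the variable names mirror the symbols in the text, and the example values are invented for demonstration only.

```python
import math

def distance_feet_visible(H1, p1, p2, theta1):
    """Situation (I), Eq. (2.4): feet inside the frame.
    H1: camera height [m]; p1, p2: vertical pixel counts; theta1: angle [rad]."""
    return H1 * p2 / (p1 * math.tan(theta1))

def distance_feet_outside(B_w, f2, b2, theta2):
    """Situation (II), Eq. (2.8): feet outside the frame.
    B_w: real body width [m]; f2: image width [px]; b2: subject width [px]."""
    return (B_w * f2) / (2.0 * b2 * math.tan(theta2))

# Invented example: 0.9 m camera height and half view angles from the 45/95-degree FOV.
print(distance_feet_visible(H1=0.9, p1=120, p2=180, theta1=math.radians(22.5)))
print(distance_feet_outside(B_w=0.45, f2=2880, b2=600, theta2=math.radians(47.5)))
```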
Outline feature generation: The posture feature is generated by calculating the Euclidean distances between the contour points of a moving subject and its centroid, which are then converted into a distance curve [22]. This distance curve is constructed from the distances between 36 contour points and the centroid, beginning at the bottom left of the contour and working clockwise through the 36 selected contour points. The angle at the centroid between neighboring contour points is approximately 10 degrees. Three peaks in the distance curve correspond to the head and feet of a standing subject, as illustrated in Fig. 2.4.
Fig 2.4 Distance curve pattern measured for a standing subject
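A minimal sketch of the outline-feature computation, assuming OpenCV and a binary subject mask: it samples the contour in 10-degree angular bins around the centroid and records the centroid-to-contour distances that form the curve in Fig. 2.4. Taking the farthest point per bin is an assumption; the original selection rule may differ.

```python
import cv2
import numpy as np

def distance_curve(mask, n_points=36):
    """Return n_points centroid-to-contour distances, one per 360/n_points-degree bin."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float32)
    cx, cy = contour.mean(axis=0)                       # centroid of the subject contour
    dx, dy = contour[:, 0] - cx, contour[:, 1] - cy
    angles = np.degrees(np.arctan2(dy, dx)) % 360.0     # angle of each contour point
    dists = np.hypot(dx, dy)                            # Euclidean distances to the centroid
    step = 360.0 / n_points
    curve = np.zeros(n_points, dtype=np.float32)
    for k in range(n_points):
        in_bin = (angles >= k * step) & (angles < (k + 1) * step)
        if in_bin.any():
            curve[k] = dists[in_bin].max()              # farthest point per bin (assumed rule)
    return curve
```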
2.2.2 Proposed action recognition on single subject
A panoramic camera platform consisting of a set of four cameras, each with a fixed focal length and 45-degree vertical and 95-degree horizontal view angles, is used for the proposed action recognition system. The camera platform is ideally placed on a table or furniture at a height of about 90 centimeters to effectively capture the body height. By stitching the four images, the panoramic camera can cover a 360-degree scene.
Fig. 2.5 reveals the flowchart of the proposed action classification system. First of all, a video clip is recorded using the proposed panoramic camera platform. Background subtraction is used to capture a moving subject from the video clip, from which the features of height, distance between the camera and the subject, and contour are extracted. The six formulated features associated with the action characteristics of height, distance, and contour are calculated and classified by the SVM scheme, in which the four activities of falling, standing, walking, and sitting are recognized. To be accurate for human action recognition, we adopt action recognition based on an input segment of 10 pictures. For the watching-TV category, TV switching on/off detection determines whether the TV is turned on; the TV-related actions depend on whether the subject is sitting or not.
Fig. 2.5 shows the proposed flowchart of our action recognition system based on 10 pictures of an input segment for the four actions and TV on/off detection. From the handcrafted feature extraction, the six designed formulations are implemented and defined as shown in Table 2.1. The support vector machine model from the LIBSVM toolbox is used as the classifier based on the six types of features, and action recognition relies on cross-validation to derive the objective task [23]. Finally, majority voting is used to classify the five actions in a picture segment, with watching TV determined by sitting and the television being turned on.
Fig 2.5 Proposed flowchart of our action recognition system
TABLE 2.1 FEATURES EXTRACTED FOR ACTION RECOGNITION

d_min / d_max : d_max and d_min are the largest and smallest distances from the contour points to the centroid point, respectively. The ratio of d_min to d_max reveals a stretched or shrunk body.
h / b : h and b denote the height and width of a subject, respectively. The postures of standing, sitting, and lying are likely to have large, medium, and small ratios of h/b, respectively.
b x D : b multiplied by D shows the differences among the three postures, where D is the estimated distance from the camera to the subject.
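For illustration, the classification step can be sketched with scikit-learn's SVM as a stand-in for the LIBSVM toolbox cited above (both implement the same classifier). The six-dimensional feature vector here is a random placeholder for quantities such as d_min/d_max, h/b, and b x D, and the grid-searched hyper-parameters and label coding are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: one 6-dimensional handcrafted feature vector per frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = rng.integers(0, 4, size=400)     # 0=falling, 1=standing, 2=walking, 3=sitting (assumed coding)

# RBF-kernel SVM with cross-validated C and gamma, mirroring typical LIBSVM usage.
grid = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10, 100], "gamma": ["scale", 0.1]}, cv=5)
grid.fit(X, y)

# Frame-level predictions over a 10-frame segment, then majority voting per segment.
segment = rng.normal(size=(10, 6))
frame_labels = grid.predict(segment)
segment_label = np.bincount(frame_labels).argmax()
print(segment_label)
```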
Detection of TV on/off: The primary goal of the TV identification task is to determine whether the TV screen is being turned on or off. Using the moving-area detection described above, changes in the scene around the TV screen are detected, where a closed-form boundary is obtained. The approximately four lines associated with the horizontal and vertical edges are analyzed to determine the intersection points of the shape. Then we calculate the adjacent corner angles of this shape and analyze them together to see whether the shape is close to a parallelogram. If so, the TV is being turned on or off; otherwise, this object is not a TV.
As shown in Fig. 2.6, the TV on/off detection is divided into four steps. First, the foreground edges of the image are extracted by the Canny algorithm from the background-subtracted foreground. Then, using the Hough transform, these edges are converted into line segments. Third, the intersection points of lines crossing each other are found and their angles are estimated from the perspective transformation; the center of these intersection points is used to detect possible boundary points. Fourth, the largest parallelogram representing the TV boundary is determined from the boundary points whose corner angles are close to 90 degrees. When the TV is being turned on, there may be multiple foreground objects in the TV area; when the TV is turned off, there are no foreground objects in the TV area in subsequent frames.
Fig 2.6 Flowchart of TV on/off detection (edge detection, line detection, intersection finding, perspective transform, and rectangle detection)
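A sketch of the four-step TV-boundary detection with OpenCV (Canny edges, probabilistic Hough lines, and a quadrilateral corner-angle test); the thresholds and the 60-120 degree corner tolerance are assumptions standing in for the parallelogram analysis described above.

```python
import cv2
import numpy as np

def detect_tv_boundary(foreground_gray):
    """Return the 4 corners of the largest near-parallelogram region, or None."""
    edges = cv2.Canny(foreground_gray, 50, 150)                     # step 1: edge map
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                            minLineLength=40, maxLineGap=10)        # step 2: line segments
    if lines is None:
        return None
    line_mask = np.zeros_like(edges)
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(line_mask, (x1, y1), (x2, y2), 255, 2)
    contours, _ = cv2.findContours(line_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    best = None
    for c in contours:                                              # steps 3-4: corners and angles
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) != 4 or not cv2.isContourConvex(approx):
            continue
        pts = approx.reshape(4, 2).astype(np.float32)
        angles = []
        for i in range(4):
            a, b = pts[i - 1] - pts[i], pts[(i + 1) % 4] - pts[i]
            cosang = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-6)
            angles.append(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))
        if all(60.0 < ang < 120.0 for ang in angles):               # roughly right-angled corners
            if best is None or cv2.contourArea(approx) > cv2.contourArea(best):
                best = approx
    return best
```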
2.2.3 Proposed Action Recognition on Multiple Subjects
Our previous work focused on the action recognition of a single moving subject for homecare [1], [13]. Herein, this task studies the action recognition of multiple moving subjects in panoramic video clips from a portable camera product that stitches four 720x480-pixel images.
Fig. 2.7 displays the flow diagram of the proposed action recognition system. To begin, the Gaussian mixture model (GMM) is an effective method for background modeling. The first GMM, from the OpenCV MOG2 toolbox, is used with an adaptive learning rate and an adjustable foreground weight. MOG2 has good adaptability, so moving objects can be detected instantly and effectively. Although this model can eliminate noise and shadows in a superior way, false detections may occur when there are fast-moving objects, such as shaking trees, or sudden changes of illumination in the background. This GMM uses five Gaussian distributions to characterize the pixels at the same position over 100 frames. The background and foreground pixel values are derived by using the expectation-maximization method on the Gaussian distributions with higher and lower intensities. Second, BLOBs associated with the foreground are tracked by Kalman filters based on their centroid points. Third, the overlapping and non-overlapping circumstances are determined: if subjects overlap, their previous activity states are kept; otherwise, the features from each BLOB are used to recognize an action using the second GMM scheme. This classifier finds the optimized number of Gaussian components for each class associated with the BLOB features based on a separability criterion and then determines the parameters of these Gaussian components by using the expectation-maximization algorithm. Fourth, incremental majority voting for the action recognition of each subject, based on the 10 images of a picture segment, is carried out to obtain a result that is further adjusted or verified by the proposed finite state machine. In particular, the proposed finite state machine also considers the subject interactions to make proper estimates of activity states.
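The second GMM scheme (the activity classifier) can be sketched as one Gaussian mixture per action class fitted with EM, with classification by the highest class log-likelihood. scikit-learn is assumed here, and the fixed component count is a placeholder for the separability-based selection described above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class PerClassGMM:
    """One Gaussian mixture per action class; classify by maximum log-likelihood."""
    def __init__(self, n_components=3):
        self.n_components = n_components          # placeholder; chosen per class in the text
        self.models = {}

    def fit(self, X, y):
        for label in np.unique(y):
            gmm = GaussianMixture(n_components=self.n_components, covariance_type="full")
            gmm.fit(X[y == label])                 # EM estimation of this class's components
            self.models[label] = gmm
        return self

    def predict(self, X):
        labels = sorted(self.models)
        scores = np.stack([self.models[l].score_samples(X) for l in labels], axis=1)
        return np.asarray(labels)[scores.argmax(axis=1)]

# Toy usage with random six-dimensional BLOB features.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 6)), rng.integers(0, 4, size=200)
clf = PerClassGMM().fit(X, y)
print(clf.predict(X[:5]))
```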
Fig 2.7 Proposed action recognition system
To generate BLOBs, the foreground is refined by morphology and dependable BLOBs within a certain range of sizes are found; eight-connected component analysis is then performed to create arbitrarily large objects, for which the size, convex hulls, bounding circles, bounding boxes, and bounding box ratios are calculated. These features are used to detect the appropriate subject that meets those parameters and to fill the holes of the subject considered as a BLOB. An example of BLOB generation is shown in Fig. 2.8.
Fig 2.8 Example of BLOB generation from the foreground
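A sketch of the BLOB-generation step using OpenCV's eight-connected component analysis; the area range and hole-filling kernel are assumptions standing in for the exact parameters of the original system.

```python
import cv2
import numpy as np

def extract_blobs(foreground_mask, min_area=800, max_area=50000):
    """Return bounding boxes, centroids, and filled masks of size-filtered BLOBs."""
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(
        foreground_mask, connectivity=8)
    blobs = []
    for i in range(1, num):                                # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if min_area <= area <= max_area:                   # assumed size range for a person
            x, y = stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_TOP]
            w, h = stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT]
            filled = cv2.morphologyEx((labels == i).astype(np.uint8) * 255,
                                      cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))
            blobs.append({"bbox": (x, y, w, h), "centroid": tuple(centroids[i]),
                          "mask": filled, "box_ratio": w / float(h)})
    return blobs
```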
Trajectory modeling: Multiple standard techniques, such as mean shift, SURF-based matching, and optical flow, are used for trajectory tracking [24]. In this work, we use the Kalman filter to keep track of multiple subjects across sequential frames. First, in the currently processed frame, the center point of the subject BLOB is determined. Second, the Kalman filter predicts the location of the center point of the respective BLOB in the next frame and optimizes the estimate by using a motion model with constant movement vectors between neighboring frames in order to reduce noise, as shown in Fig. 2.9.
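The centroid tracking can be sketched with OpenCV's Kalman filter under a constant-velocity motion model, one filter per subject; nearest-centroid matching is a simplification of the full data association, and the noise covariances are assumed values.

```python
import cv2
import numpy as np

def make_centroid_kalman(x, y):
    """Constant-velocity Kalman filter on a 2-D centroid (state: x, y, vx, vy)."""
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2       # assumed noise levels
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    kf.statePost = np.array([[x], [y], [0], [0]], np.float32)
    return kf

def track_step(kf, detected_centroids):
    """Predict the next centroid, then correct it with the nearest detected centroid."""
    pred = kf.predict()[:2].ravel()
    if detected_centroids:
        cx, cy = min(detected_centroids,
                     key=lambda c: np.hypot(c[0] - pred[0], c[1] - pred[1]))
        kf.correct(np.array([[cx], [cy]], np.float32))
    return pred
```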
Finite state machine: Three circumstances, separation, interaction, and overlapping, describe the subject interactions. In this study, the Gaussian mixture models, majority voting, and the finite state machine handle the action recognition of subjects in the separation and interaction circumstances. The proposed finite state machine predicts the intermediate activity state of a subject based on its conditions before and after the overlapping circumstance. According to the state transition graph among the four actions of walking, falling, sitting, and standing, the proposed finite state machine is designed as described in Fig. 2.10.
Fig 2.9 Illustration of tracking by the Kalman filter method: (a) original image; (b) its foreground; (c) subject boundary extracted; (d) two subjects separated; (e) two subjects in contact; (f) two subjects overlapped
Fig. 2.10 shows the FSM as a constraint diagram in which edges and nodes correspond to transitions and states. The initial state is created when a subject appears in the frame sequence. Afterwards, the action recognition state is constrained by the proposed finite state machine, since some state transitions cannot occur. For example, the falling state cannot directly transfer to the walking or standing state; the sitting state must be passed through.
Fig 2.10 Proposed FSM
Table 2.2 lists the states and transitions of the proposed FSM.
Notably, the background subtraction system cannot generate a foreground for a subject staying still for a while; the proposed FSM then sustains its previous state because the subject has not left. The proposed FSM in Fig. 2.10 can be widely used for most ordinary people in various environments. Besides, the states of an overlapping interval can also be fairly inferred when subjects overlap. We predict a short path between the states before and after the overlapping interval based on the appropriate state transformations, as shown in Table 2.3. As seen in Fig. 2.11, the activity states in the overlapping interval are estimated in this way. There are many cases that cannot be handled with a short-path procedure; nevertheless, this remains a difficult challenge due to the lack of information.
TABLE 2.2 STATES AND TRANSITIONS OF OUR FINITE STATE MACHINE
(rows: current state; columns: newly classified input state; entries: admissible state sequence)

                 Walking      Standing     Sitting          Falling
Walking (W)      W            Std, W       Sit, Std, W      F, Sit, Std, W
Standing (Std)   W, Std       Std          Sit, Std         F, Sit, Std
Sitting (Sit)    Std, Sit     Std, Sit     Sit              F, Sit
Falling (F)      W, F         Std, F       Sit, F           F
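The transition constraint can be sketched as a small lookup of feasible next states, so that, for example, a falling subject cannot jump straight back to walking or standing without passing through sitting. The exact transition sets below are an assumed reading of Fig. 2.10 and Table 2.2, not a verified reproduction.

```python
# Assumed direct transitions between activity states (a reading of Fig. 2.10).
ALLOWED = {
    "walking":  {"walking", "standing"},
    "standing": {"standing", "walking", "sitting"},
    "sitting":  {"sitting", "standing", "falling"},
    "falling":  {"falling", "sitting"},
}

def fsm_filter(prev_state, classified_state):
    """Accept the classifier output only if the transition is feasible; otherwise hold."""
    if classified_state in ALLOWED[prev_state]:
        return classified_state
    return prev_state          # infeasible jump (e.g. falling -> walking): keep previous state

state = "walking"
for observed in ["walking", "falling", "sitting", "falling", "walking"]:
    state = fsm_filter(state, observed)
    print(observed, "->", state)
```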
Incremental majority voting: Precise prediction of the activity states of the boundary picture segments yields reliable estimates of the states in the overlapping interval. To increase the accuracy of state prediction, the incremental majority voting approach is proposed. According to Fig. 2.11, not only the state of the boundary picture segment, P1, is considered, but the states of P2, P3, P4, and/or P5 are also consulted to check whether their states are consistent, because state transitions do not occur frequently in most situations. Accordingly, we first check whether the states of P1 and P2 are the same. If yes, the state of the boundary picture segment is confirmed. Otherwise, P3 joins P1 and P2 in the majority voting. If they all differ, P4 participates in the voting. When P1, P2, P3, and P4 belong to four distinct activity states, P5 helps to make the final decision. Fig. 2.12 displays the operational flow and probability model of the proposed incremental majority voting, where steps I, II, III, and IV indicate the majority voting over (P1, P2), (P1, P2, P3), (P1, P2, P3, P4), and (P1, P2, P3, P4, P5), respectively. Here, p is the average accuracy rate of state predictions among the four activities, and r is the probability of a state transition at a picture segment. Owing to the incremental majority voting, the accuracy probabilities of the four steps are accumulated into an overall accuracy expressed in terms of p and r.
Fig 2.11 Estimates of activity states in the overlapping interval
Fig 2.12 Proposed incremental majority voting
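One way to code the incremental voting rule described above: start from the two segments nearest the boundary and add one segment at a time until a strict majority appears, letting the last segment decide if all five disagree. This follows the textual description, but the tie-breaking details and the fallback are assumptions, and the probability model in terms of p and r is not reproduced here.

```python
from collections import Counter

def incremental_majority_vote(segment_states):
    """segment_states: predicted states of P1..P5, ordered outward from the boundary."""
    for k in range(2, len(segment_states) + 1):      # steps I-IV use 2, 3, 4, then 5 segments
        state, votes = Counter(segment_states[:k]).most_common(1)[0]
        if votes > k / 2:                            # strict majority reached
            return state
    return segment_states[-1]                        # all distinct: last segment decides (assumed)

print(incremental_majority_vote(["sitting", "sitting", "walking", "sitting", "standing"]))
print(incremental_majority_vote(["sitting", "walking", "standing", "falling", "walking"]))
```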
Feature generation for Gaussian mixture models: With reference to our previous work [1], [13], the features related to the subject height and distance are located and extracted. In addition, the features from the motion vectors of the centroid points, together with the features concerning activity characteristics derived from the height, distance, contour, and motion information, are used to train the Gaussian mixture models to recognize the four activities.
2.3 Experiment Results and Discussion
These experiments were conducted with 10 volunteers, 6 women and 4 men, from whom video clips were captured for each subject, covering the standing, falling, sitting, and watching-TV activities. Fig. 2.13 shows the configuration of the room and the experimental scenarios. The experimental room is 3.9 m by 3.1 m by 3.0 m. The five actions in these video clips are shown in Fig. 2.14, which indicates a quite complicated background.
Fig 2.13 Room layout and experiment scenario
A single subject's movement was tracked, identified, and understood by background subtraction when the entire body was captured within a frame. Our experiments yield good height estimates for 5,175 effective frames out of 6,000; 1,243 and 2,583 of these frames have errors of less than 2% and 5%, respectively. Finally, the outcome shows that the average error rate for a subject with a height of 170 cm is around 5.5%, which corresponds to an estimated height between 161 cm and 179 cm. We asked the subjects to move around the camera at 6 separate distances of 3 m, 2.5 m, 2 m, 1.5 m, 1 m, and 0.5 m, indicated by the dotted circles in Fig. 2.13, for their distance estimates. For