MINISTRY OF EDUCATION AND TRAINING HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
PHAM DINH TAN
A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA
DOCTORAL DISSERTATION IN COMPUTER ENGINEERING
Hanoi—2022
MINISTRY OF EDUCATION AND TRAINING HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
PHAM DINH TAN
A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA
Major: Computer Engineering
Code: 9480106
DOCTORAL DISSERTATION IN COMPUTER ENGINEERING
SUPERVISORS:
1. Assoc. Prof. Vu Hai
2. Assoc. Prof. Le Thi Lan
Hanoi—2022
DECLARATION OF AUTHORSHIP
I, Pham Dinh Tan, declare that the dissertation titled "A study on deep learning techniques for human action representation and recognition with skeleton data" has been entirely composed by myself. I assure some points as follows:
• This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• The work has not been submitted for any other degree or qualifications at Hanoi University of Science and Technology or any other institution.
• Appropriate acknowledgment has been given within this dissertation, where reference has been made to the published work of others.
• The dissertation submitted is my own, except where work in collaboration has been included. The collaborative contributions have been indicated.
Hanoi, May 8, 2022
Ph.D. Student
Pham Dinh Tan
SUPERVISORS
1. Assoc. Prof. Vu Hai
2. Assoc. Prof. Le Thi Lan
ACKNOWLEDGEMENT
This dissertation is composed during my Ph.D. at the Computer Vision Department, MICA Institute, Hanoi University of Science and Technology. I am grateful to all people who contribute in different ways to my Ph.D. journey.
First, I would like to express sincere thanks to my supervisors Assoc. Prof. Vu Hai and Assoc. Prof. Le Thi Lan for their guidance and support.
I would like to thank all MICA members for their help during my Ph.D. study. My sincere thanks to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Assoc. Prof. Tran Thi Thanh Hai for giving me a lot of support and valuable advice. Many thanks to Dr. Nguyen Thuy Dinh, Nguyen Hong Quan, Hoang Van Nam, Nguyen Tien Nam, Pham Quang Tien, and Nguyen Tien Thanh for their support.
I would like to thank my colleagues at the Hanoi University of Mining and Geology for their support during my Ph.D. study. Special thanks to my family for understanding my hours glued to the computer screen.
Hanoi, May 18, 2022
Ph.D. Student
ABSTRACT
Human action recognition (HAR) from color and depth sensors (RGB-D), especially derived information such as skeleton data, receives the research community's attention due to its wide applications. HAR has many practical applications such as abnormal event detection from camera surveillance, gaming, human-machine interaction, elderly monitoring, and virtual/augmented reality. In addition to the advantages of fast computation, low storage, and immutability with human appearance, skeleton data have shortcomings. The shortcomings include pose estimation errors and skeleton noise. The proposed methods are evaluated on four datasets, including CMDFALL, a challenging dataset with noise in skeleton data, and NTU RGB+D, a worldwide benchmark among the large-scale datasets. Therefore, these datasets cover different dataset scales as well as the quality of skeleton data.
To overcome the limitations of the skeleton data, the dissertation presents techniques in different approaches. First, as joints have different levels of engagement in each action, techniques for selecting joints that play an important role in human actions are proposed, including both preset joint subset selection and automatic joint subset selection. Two frameworks are evaluated to show the performance of using a subset of joints for action representation. The first framework employs Dynamic Time Warping (DTW) and Fourier Temporal Pyramid (FTP), while the second one uses Covariance Descriptors extracted on joint position and velocity. Experimental results show that joint subset selection helps improve action recognition performance on datasets with noise in skeleton data.
However, HAR using handcrafted feature extraction could not exploit the inherent graph structure of the human skeleton. Recent Graph Convolutional Networks (GCNs) are studied to handle these issues. Among GCN models, the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) is used as the baseline model. AAGCN achieves state-of-the-art performance on large-scale datasets such as NTU RGB+D and Kinetics. However, AAGCN employs only joint information. Therefore, a Feature Fusion (FF) module is proposed in this dissertation. The new model is named FF-AAGCN. The performance of FF-AAGCN is evaluated on the large-scale dataset NTU RGB+D and on CMDFALL. The evaluation results show that the proposed method is robust to noise and invariant to skeleton translation. Particularly, FF-AAGCN achieves remarkable results on challenging datasets. Finally, as the computing capacity of edge devices is limited, a lightweight deep learning model is expected for application deployment. A lightweight GCN architecture is proposed to show that the complexity of the GCN architecture can still be reduced depending on the dataset's characteristics. The proposed lightweight model is suitable for application development on edge devices.
Hanoi, May 18, 2022
Ph.D. Student
TABLE OF CONTENTS
1.2 An overview of action recognition
1.3 Data modalities for action recognition
1.4.1 Data collection from motion capture systems
1.4.2 Data collection from RGB+D sensors
1.4.3 Data collection from pose estimation
1.6.1.1 Joint-based action recognition
1.6.1.2 Body part-based action recognition
1.6.2 Deep learning-based methods
2.2.1 Preset Joint Subset Selection
2.2.1.1 Spatial-Temporal Representation
2.2.1.2 Dynamic Time Warping
2.2.2 Automatic Joint Subset Selection
2.2.2.1 Joint weight assignment
2.2.2.2 Most informative joint selection
2.2.2.3 Human action recognition based on MIJ
2.3.1 Evaluation metrics
2.3.2 Preset Joint Subset Selection
2.3.3 Automatic Joint Subset Selection
2.4 Conclusion of the chapter
CHAPTER 4. THE PROPOSED LIGHTWEIGHT GRAPH CONVOLUTIONAL NETWORK
4.2 Related work on Lightweight Graph Convolutional Networks
4.3 Proposed method
ABBREVIATIONS
AAGCN: Attention-enhanced Adaptive Graph Convolutional Network
AGCN: Adaptive Graph Convolutional Network
AMIJ: Adaptive number of Most Informative Joints
AS: Action Set
AS-GCN: Actional-Structural Graph Convolutional Network
BN: Batch Normalization
BPL: Body Part Location
CAM: Channel Attention Module
CCTV: Closed-Circuit Television
CNN: Convolutional Neural Network
CovMIJ: Covariance Descriptor on Most Informative Joints
CPU: Central Processing Unit
CS: Cross-Subject
CV: Cross-View
DFT: Discrete Fourier Transform
DTW: Dynamic Time Warping
FC: Fully Connected
FF: Feature Fusion
FLOP: Floating Point Operation
FMIJ: Fixed number of Most Informative Joints
FPS: Frames per second
FTP: Fourier Temporal Pyramid
GCN: Graph Convolutional Network
GCNN: Graph-based Convolutional Neural Network
GPU: Graphical Processing Unit
GRU: Gated Recurrent Unit
HAR: Human Action Recognition
HCI: Human-Computer Interaction
JSS: Joint Subset Selection
LARP: Lie Algebra Relative Pair
LSTM: Long Short-Term Memory
MIJ: Most Informative Joint
Mocap: Motion capture system
MRF: Markov Random Field
MTLN: Multi-Task Learning Network
OL: Overlapping
RA-GCN: Richly Activated Graph Convolutional Network
ReLU: Rectified Linear Unit
ResNet: Residual Neural Network
RJP: Relative Joint Position
RNN: Recurrent Neural Network
RVM: Relevance Vector Machine
SAM: Spatial Attention Module
SDK: Software Development Kit
SE(3): Special Euclidean group
SO(3): Special Orthogonal group
ST-GCN: Spatial-Temporal Graph Convolutional Network
STC-AM: Spatial-Temporal-Channel Attention Module
SVM: Support Vector Machine
t-SNE: t-Distributed Stochastic Neighbor Embedding
TAM: Temporal Attention Module
TCD: Temporal Covariance Descriptor
TCN: Temporal Convolutional Network
UAV: Unmanned Aerial Vehicle
VFDT: Very Fast Decision Trees
SYMBOLS
A: The adjacency matrix of the graph
Â: Normalized adjacency matrix
C: The number of action classes
COV(S_P): Temporal covariance descriptor of joint positions
COV(S_V): Temporal covariance descriptor of joint velocities
COV_sample(S_P): The sample covariance descriptor of joint positions
COV_sample(S_V): The sample covariance descriptor of joint velocities
d: The cost function in Dynamic Time Warping
D: The degree matrix of the graph
E_S: The set of intra-skeleton edges
E_F: The set of inter-frame edges
E: The set of graph edges
F_c: Input feature to the Channel Attention Module
F_s: Input feature to the Spatial Attention Module
F_t: Input feature to the Temporal Attention Module
K: The number of layers in the temporal hierarchy
K_s: Spatial kernel size
M: The number of most informative joints
M_c: Output of the Channel Attention Module
M_s: Output of the Spatial Attention Module
M_t: Output of the Temporal Attention Module
N: The number of joints in the skeleton
N_c: The number of samples in the c-th action class
p_i(t): Position of the i-th joint at the t-th frame
ReLU: The Rectified Linear Unit activation function
S_P(t): The coordinate vector of informative joints at the t-th frame
T: The length of a skeleton sequence
T_max: The maximum length of skeleton sequences
tanh: The hyperbolic tangent activation function
w_{i,j}: The weight of the i-th joint for the j-th sample in an action class (1 <= j <= N_c)
w_i: The weight of the i-th joint
W: Matrix of trainable weights
v_{t,i}: The node corresponding to the i-th joint at the t-th frame
v(t): Joint velocity at the t-th frame
V: The set of vertexes of the graph
x_i(t): Coordinate of the i-th joint along the x-axis at the t-th frame
y_i(t): Coordinate of the i-th joint along the y-axis at the t-th frame
z_i(t): Coordinate of the i-th joint along the z-axis at the t-th frame
LIST OF TABLES
Comparison among data modalities
Datasets with different data modalities: Skeleton (S), Depth (D), Acceleration (A)
List of actions in MSR-Action3D
List of actions in CMDFALL
List of actions in NTU RGB+D
Accuracy (%) comparison on MSR-Action3D. The reference [58] refers to the paper with correction available at: http://ravitejav.weebly.com
Ablation study on MSR-Action3D by accuracy (%)
Computational time (ms) of Preset JSS on MSR-Action3D
Performance evaluation for Preset JSS on CMDFALL
The accuracy (%) obtained by the proposed method with different numbers of layers and features for MSR-Action3D and CMDFALL
Accuracy (%) evaluation on MSR-Action3D. The reference [58] refers to the paper with correction available at: http://ravitejav.weebly.com
Performance evaluation on CMDFALL
Computational time (ms) of FMIJ/AMIJ on MSR-Action3D
Comparison between Preset JSS using Covariance Descriptors and AMIJ
Dataset parameters
Input and output data shapes of FF-AAGCN on MICA-Action3D
Software package version information on Ubuntu 18.04
Ablation study with AAGCN [104] implementation on CMDFALL
Performance of FF-AAGCN on CMDFALL using velocity with different frame offsets
Comparison of computation time between FF-AAGCN and the baseline AAGCN on CMDFALL. Testing time is calculated per sample
Ablation study by accuracy (%) on NTU RGB+D
Performance evaluation by accuracy (%) on NTU RGB+D
Comparison of training/testing time between FF-AAGCN and the baseline on NTU RGB+D with cross-subject (CS) benchmark. Training time is calculated in hours (h). Testing time per sample is calculated in milliseconds (ms)
Comparison of training/testing time between FF-AAGCN and the baseline on NTU RGB+D with cross-view (CV) benchmark. Training time is calculated in hours (h). Testing time per sample is calculated in milliseconds (ms)
Accuracy (%) comparison on MSR-Action3D. The reference [58] refers to the paper with correction available at: http://ravitejav.weebly.com
Performance evaluation on MICA-Action3D
Ablation study of FF-AAGCN with different numbers of basic blocks on CMDFALL
Input and output data shapes on MICA-Action3D
Ablation study on CMDFALL. Performance scores are in percentage. Abbreviations in use include Feature Fusion (FF), Lightweight (LW), and Joint Subset Selection (JSS)
Performance comparison on CMDFALL. Performance scores are in percentage
Model parameters and computation requirement on CMDFALL
Memory cost comparison on CMDFALL
Training and testing time on CMDFALL. Testing time is calculated per sample
Accuracy (%) comparison of LW-FF-AAGCN with other methods on MSR-Action3D. The reference [58] refers to the paper with correction available at: http://ravitejav.weebly.com
Performance evaluation on MICA-Action3D. Performance scores are in percentage
Comparison on model parameters, FLOPs, and accuracy (%) on NTU RGB+D
Training time and testing time on NTU RGB+D with cross-subject (CS) benchmark. Testing time is calculated per sample
Training time and testing time on NTU RGB+D with cross-view (CV) benchmark. Testing time is calculated per sample
Performance evaluation on subsets of NTU RGB+D with CS benchmark
LIST OF FIGURES
Approaches for human action recognition [12]
Sample depth data in MSR-Action3D [17]
A skeleton representation of the hammer action in MSR-Action3D
Sample acceleration data in CMDFALL [20]
Microsoft Kinect sensor (left) and human-computer interaction (right) [10]
Noise in the skeleton data of MSR-Action3D is marked with red boxes
Skeleton data estimated by a Kinect sensor with 20 joints
Temporal hierarchy for covariance descriptors [45]
Features formed by skeleton pose, speed, and acceleration [34]
Subsets of 7 joints (a), 11 joints (b), and 15 joints (c) selected from the skeleton model
Lie algebra-based action recognition [8]
Relative position, speed, and acceleration features [39]
Most informative joint selection using joint angles [44]
Temporal Convolutional Network (TCN) with residuals [66]
Skeleton representation in the SkeleMotion method
Skeleton-to-image mapping using a tree structure [75]
Attentional Recurrent Relational Network-LSTM [76]
Joint trajectories of the two hand wave action in MICA-Action3D
Joint Subset Selection techniques can be used in combination with different action representation and recognition methods
System diagram of the proposed Preset JSS
Preset JSS: a subset of 14 blue joints is selected from the 20-joint skeleton model; red joints are omitted
Joints in FMIJ are marked in red for the high throw action in MSR-Action3D
System diagram of the proposed FMIJ/AMIJ
Two-layer hierarchy with (a) non-overlapping and (b) overlapping (OL) windows
Confusion matrix on MSR-Action3D AS1
Confusion matrix on MSR-Action3D AS2
Confusion matrix on MSR-Action3D AS3
Accuracies (%) for MSR-Action3D and CMDFALL when changing the number of most informative joints with FMIJ
Accuracies (%) for MSR-Action3D and CMDFALL with different thresholds
Accuracy (%) on three subsets AS1, AS2, and AS3 of MSR-Action3D with different numbers of layers and features
F1 score (%) on CMDFALL with different numbers of layers and features
Accuracy (%) of MICA-Action3D with different pose estimation methods
Confusion matrix of MICA-Action3D when applying OpenPose for skeleton estimation from color data
Confusion matrix of MICA-Action3D when applying Kinect SDK for skeleton estimation
Trajectories of draw X and draw circle actions in MICA-Action3D
Graph modeling in ST-GCN
System diagram of AS-GCN [89]
SGE-EGR framework [100]
MS-G3D system diagram [102]
System diagram of EfficientGCN [103]
AAGCN system diagram [104]
The system diagram consists of a Feature Fusion (FF) module, a BN layer, ten Spatial-Temporal Basic Blocks, a GAP layer, and a Softmax layer
Spatial-Temporal Basic Block [104]
RJP for (a) Microsoft Kinect v1 skeleton with 20 joints and (b) Microsoft Kinect v2 skeleton with 25 joints
Joint velocity is defined as two-sided position displacement [109]
Components in the Feature Fusion module
Adaptive mechanism in AAGCN
Confusion matrix on CMDFALL using AAGCN
Confusion matrix on CMDFALL using FF-AAGCN
Similarity between left fall and right fall actions in CMDFALL
Distribution of 20 action classes in CMDFALL obtained by AAGCN (left) and FF-AAGCN (right) using t-SNE
Loss and accuracy curves of AAGCN on CMDFALL
Loss and accuracy curves of FF-AAGCN on CMDFALL
Loss and accuracy curves of AAGCN on MICA-Action3D
Loss and accuracy curves of FF-AAGCN on MICA-Action3D
Attention map for actions in MICA-Action3D using the jet colormap of Matplotlib [114]. Each row in the attention map corresponds to each joint. Dark color represents a large weight and light color represents a small weight
Confusion matrix on MICA-Action3D using AAGCN
Confusion matrix on MICA-Action3D using FF-AAGCN
Graph definition in WGCN [119]
System diagram of the proposed LW-FF-AAGCN
(a) Preset selection of 13 blue joints from the skeleton model of 20 joints. (b) Graph type A (JSS-A) is defined by the solid green edges connecting 13 blue joints. (c) Graph type B (JSS-B) is defined by combining JSS-A with edges between symmetrical joints. Only the spatial dimension of the graph is shown for simplicity
(a) Preset selection of 18 blue joints from the skeleton model of 25 joints. (b) Graph type A (JSS-A) is defined by the solid green edges connecting 18 blue joints. (c) Graph type B (JSS-B) is defined by combining JSS-A with edges between symmetrical joints. Only the spatial dimension of the graph is shown for simplicity
Loss and accuracy curves of the proposed method on CMDFALL
Loss and accuracy curves of the proposed method on MICA-Action3D
Confusion matrix on CMDFALL using LW-FF-AAGCN
Distribution of 20 action classes obtained by AAGCN (left) and LW-FF-AAGCN (right) using t-SNE
Confusion matrix on MICA-Action3D using LW-FF-AAGCN
System diagram of the demo
MediaPipe pose estimation with a 33-landmark model [129]
Screen captures from the demo
INTRODUCTION
Nowadays, HAR has been an active research topic in the fields of computer vision, robotics, and multimedia. Even though a lot of effort has been made, action recognition remains challenging due to the diversity, intra-class variations, and inter-class similarities of human actions [1]. Various data modalities, such as color, depth, and skeleton data, can be used for action recognition. Researchers have been working on action recognition using 2D images for decades. Color data in 2D images were used in the early research on action recognition. Action recognition from images and videos is still difficult because of background clutter, occlusion, viewpoint and lighting variations, execution speed, and biometric variance. According to psychological research, skeleton data are compact and discriminative for representing human actions. In a study on human visual perception in 1973 [2], Swedish psychologists pointed out that human actions such as walking and running could be distinguished by 10-12 markers placed on body joints. In other words, the perception of motion patterns can be specified by some joints, which are more important than others. These joints contribute to forming unique characteristics of action patterns. The human body is modeled as rigid segments connected by joints. Consequently, human action can be described by joint trajectories along the temporal dimension. HAR algorithms utilizing skeleton data can avoid the weaknesses of video data like lighting conditions and background changes. Most importantly, skeleton data are efficient in terms of computation and storage. Advances in pose estimation using color/depth data also contribute to the popularity of skeleton-based action recognition. As a result, skeleton-based action recognition is currently a promising approach.
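To make this representation concrete, the short sketch below stores a skeleton sequence as an array of joint trajectories; the shapes (T frames, N joints, three coordinates) follow the description above, while the joint index and the random data are illustrative placeholders.

```python
import numpy as np

# A skeleton sequence as joint trajectories: T frames x N joints x 3 coordinates.
T, N = 60, 20                       # e.g., a 60-frame clip of a 20-joint skeleton
sequence = np.random.rand(T, N, 3)  # stand-in for real captured data

def joint_trajectory(seq: np.ndarray, joint: int) -> np.ndarray:
    """Return the (T, 3) trajectory of one joint along the temporal dimension."""
    return seq[:, joint, :]

trajectory = joint_trajectory(sequence, joint=7)  # joint index 7 is arbitrary here
print(trajectory.shape)  # (60, 3)
```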
Skeleton data can be collected directly by motion capture systems and depth sensors or indirectly estimated from color/depth data. Among these methods, motion capture systems can output high-quality skeleton data. However, motion capture systems are expensive and inconvenient to implement. The popularity of low-cost RGB-D sensors and advances in pose estimation from color/depth data make the skeleton data easy to acquire. The research objective of this dissertation is action recognition using skeleton data collected by RGB-D sensors. In this work, four skeleton datasets MSR-Action3D, MICA-Action3D, CMDFALL, and NTU RGB+D are used for evaluation. These datasets contain segmented samples collected by Microsoft Kinect sensors. It is supposed that each skeleton sequence only contains one action. Action recognition aims to predict the action label from the skeleton data. Action recognition can be extended to action detection. Action detection focuses on labeling skeleton sequences that contain multiple actions. In action detection, segmentation is required to detect each action's starting and ending points. Noise is inherent in the skeleton data. Various methods are proposed to improve action recognition performance. A lightweight model is required to implement action recognition on devices. The dissertation focuses on improving action recognition performance on noisy datasets and designing lightweight models for application development.
Challenges
Action recognition is still an attractive research topic due to many challenges. Four major challenges are discussed, including (1) intra-class variations and inter-class similarities, (2) noise in skeleton data, (3) occlusion caused by other body parts or other objects/persons, and (4) insufficient labeled data.
• Firstly, when performing the same action, different people may generate different pose sequences. A person can run fast or leisurely, with arms waving in various ways, for the same action class running. As a result, each action category may comprise a variety of body motion types. For the same action, samples might be shot at different viewpoints, such as in front of the subject or at the side of the subject. Furthermore, there are similarities in human poses across distinct action classes. The two action classes running and walking, for example, have similar human motion patterns but with different execution speeds. These similarities make it difficult to distinguish the two actions, resulting in performance degradation in action recognition. All of these issues might result in wrong action recognition. In real-world implementations, these variations will be even more severe. This prompts researchers to look into more powerful action recognition methods implemented in real-life applications.
• The second challenge is to cope with noise in skeleton data. It is challenging to capture high-quality skeleton sequences using depth sensors. Pose estimation may output noisy skeleton data for subjects in non-standing poses. Action recognition will suffer from performance degradation when dealing with noisy skeleton data. As a result, action recognition with noisy skeleton data is a challenge.
• The third challenge is the occlusion. Joints may be occluded by other body parts or other objects/persons. This results in missing or poorly estimated joints in skeleton data.
• The fourth challenge is the insufficient labeled data. Data labeling is an essential step in data collection for deep learning. Deep learning architectures can achieve good performance when large-scale datasets are available. However, data labeling is tedious, time-consuming, and expensive. Large-scale datasets are required to transfer the action recognition model to real-world applications.
Objectives
The objectives are as follows:
• A compact representation of human actions: Joints in the skeleton model play different roles in each action. The first objective is to find an efficient representation for action recognition.
• Improving action recognition performance on noisy skeleton data: The second objective is to design a deep learning architecture that achieves high-performance action recognition on noisy skeleton data.
• Proposing a lightweight model for action recognition: Computation capacity is limited on edge devices. A lightweight deep learning model for action recognition is required for application development. Constructing an efficient, lightweight model for action recognition is the third objective of this dissertation.
Context and constraints
In this dissertation, some context and constraints are listed as follows.
• Three public datasets and one self-collected dataset are used for evaluation. The datasets contain segmented skeleton sequences collected by Microsoft Kinect sensors. A list of human actions is pre-defined for each dataset. The datasets contain actions performed by a single person or interactions between two persons. Other datasets are not considered and/or evaluated in this work.
• Only actions of daily living are considered in the dissertation. Action classes in art performance or any other specific domains are not in the scope of this work.
• For all four datasets, the training/testing data split and evaluation protocols are kept the same as in the relevant works where the datasets are introduced.
• Cross-subject benchmark is performed on all datasets, with half of the subjects for training and the other half for testing (a minimal split sketch follows this list).
• Cross-view benchmark is performed on the NTU RGB+D dataset. Sequences captured by camera numbers 2 and 3 are used for training. Sequences from camera number 1 are used for testing. Only single-view data are used. Multi-view data processing is not considered in this work.
• The study aims at deploying an application using the proposed method. This application is developed to assess the performance of a person who does yoga exercises. Pose estimation for a single person is implemented using the public tool Google MediaPipe [3]. Due to the time limitation, only the result of the action recognition module is introduced. The related modules such as action spotting, human pose evaluation, and exercise scoring/assessment are out of the scope of this study.
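As a minimal illustration of the cross-subject and cross-view protocols listed above, the sketch below splits a sample list by subject and camera IDs; the metadata fields and subject IDs are hypothetical, but the split logic mirrors the two benchmarks.

```python
# Hypothetical sample metadata; real datasets encode these in file names or labels.
samples = [
    {"subject": 1, "camera": 1, "label": 0, "path": "s001c001.skeleton"},
    {"subject": 2, "camera": 2, "label": 5, "path": "s002c002.skeleton"},
    # ... one entry per recorded sequence
]

TRAIN_SUBJECTS = {1, 2, 4, 5}  # half of the subjects; IDs here are placeholders

# Cross-subject (CS): split by the performing subject.
cs_train = [s for s in samples if s["subject"] in TRAIN_SUBJECTS]
cs_test = [s for s in samples if s["subject"] not in TRAIN_SUBJECTS]

# Cross-view (CV): cameras 2 and 3 for training, camera 1 for testing.
cv_train = [s for s in samples if s["camera"] in (2, 3)]
cv_test = [s for s in samples if s["camera"] == 1]
```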
Contributions
The three main contributions are introduced in this dissertation.
• Contribution 1: Propose two JSS methods for human action recognition: the Preset JSS and the automatic MIJ selection methods.
• Contribution 2: Propose a Feature Fusion (FF) module to combine spatial and temporal features for the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) using the Relative Joint Position and the joint velocity. The proposed method is named FF-AAGCN. The proposed method outperforms the baseline method on the challenging datasets with noise in the skeleton data.
• Contribution 3: Propose a lightweight model LW-FF-AAGCN with much fewer model parameters than the baseline method and with competitive performance in action recognition. The proposed method enables the deployment of applications using lightweight models on edge devices.
The dissertation is structured as follows:
• Introduction: This section provides the motivation, objectives, challenges, constraints, and contributions of the dissertation.
• Chapter 1 entitled "Literature Review": This chapter is a brief review of the existing literature to obtain a comprehensive understanding of human action recognition.
• Chapter 2 entitled "Joint Subset Selection for Skeleton-based Human Action Recognition": This chapter presents Preset JSS and automatic MIJ selection.
• Chapter 3 entitled "Feature Fusion for the Graph Convolutional Network": The graph-based deep learning model FF-AAGCN is proposed for human action recognition using AAGCN as the baseline. FF-AAGCN outperforms the baseline method on CMDFALL, a challenging dataset with noisy skeleton data.
• Chapter 4 entitled "The Proposed Lightweight Graph Convolutional Network": The lightweight model LW-FF-AAGCN is proposed with fewer parameters than the baseline AAGCN. LW-FF-AAGCN is suitable for application development on edge devices with limited computation capacity.
CHAPTER 1. LITERATURE REVIEW
Human actions can be used in a wide range of applications. Instead of using standard input peripherals, actions can be used to control robots or computers [4]. Human action recognition (HAR), the ability to recognize and understand human actions, is critical for a variety of applications such as human-computer interaction, camera surveillance, gaming, and healthcare. As a result, the recognition of human actions is becoming increasingly important. Comprehensive surveys on action recognition can be found in [5]. The demand for systems that can understand human actions and alert patients about potential physical and mental health concerns is rapidly increasing. Medical experts can recommend diet, exercise, and medication by identifying changes in daily actions [10]. Action recognition in video surveillance can automatically detect suspicious actions and alert the authorities for preventive actions.
This chapter is structured as follows. Section 1.2 is an overview of human action recognition and its applications. Section 1.3 discusses data modalities used for action recognition. Among data modalities, the skeleton data modality is compact and efficient for action representation, so this work focuses on action recognition using skeleton data. Section 1.4 describes methods for skeleton data collection. Skeleton datasets used in this dissertation are described in Section 1.5. Section 1.6 reviews action recognition methods using the skeleton data. Section 1.7 highlights the research works on action recognition in Vietnam. Section 1.8 concludes the chapter.
1.2 An overview of action recognition
Researchers have been working on action recognition from images and videos for decades. Action recognition methods can be categorized as in Figure 1.1. The mechanism of human vision is the key direction that researchers have been following for action recognition. The human vision system can observe the motion and shape of the human body in a short time. The observations are then transferred to the human perception system for classification. The human visual perception system is highly reliable and precise in action recognition. Over the last few decades, researchers have aimed at a similar level of the human visual system with a computer-based recognition system. Unfortunately, computer-based systems are still far from the level of the human visual system due to several issues, such as environmental complexity and intra-class variance.
Figure 1.1: Approaches for human action recognition [12]
There are numerous challenges in developing an effective and reliable system for HAR. The system is expected to recognize a large number of actions with high accuracy, and it should be able to operate in real-time [4]. The action recognition system consists of data collection, pre-processing, feature extraction, and classification. Data collected from sensors are pre-processed with methods such as energy-based segmentation. Following pre-processing, feature extraction is used to reduce the number of features to improve classification efficiency. Feature extraction can be completed using either handcrafted or deep learning methods. Actions are then classified based on the extracted features. Typical application domains of action recognition are discussed below.
• Video surveillance: Human operators can hardly watch video streams from a large number of cameras simultaneously. As a result, video surveillance footage can be used to track criminal suspects but rarely avert potentially risky situations. Action recognition systems allow video surveillance cameras to be continuously evaluated and alerts to be activated in the event of suspicious or criminal actions. Such a system fully utilizes the capability of surveillance camera networks while also considerably improving safety by early alerts to the authorities for immediate action.
• Healthcare: Action recognition is becoming increasingly significant and utilized to monitor patients. Remote monitoring of a patient allows the doctor to keep track of the patient's health and intervene before major health problems occur. Action recognition systems can track a patient's health and spot potential physical or mental issues early. These systems monitor the actions of the elderly in daily life for a safe and comfortable environment. Typically, these devices capture an elderly's continuous movement, automatically recognize their actions, and detect any irregularity as it occurs, such as a fall, a stroke, or breathing problems. The elderly can stay at home longer without being hospitalized thanks to such systems, improving their comfort and health.
• Human-Computer Interaction: Human-Computer Interaction (HCI) began with keyboards and other peripheral devices that make communication easier, faster, and more comfortable. With recent advances in computer vision and camera sensors, the computer can now recognize human actions and initiate appropriate responses. Television sets have already been equipped with remote control capability using action recognition. In the gaming business, an Xbox game console with a Microsoft Kinect sensor can interpret whole-body motion, resulting in a new level of gaming experience.
• Robotics: Robotics is the study of the design of robots and the computer systems that control them. These technologies are being used to create machines that can act like humans. Robots can be employed in any circumstance and for any purpose. Many robots are now working in hazardous locations, manufacturing processes, and other situations where humans cannot survive. Many robotics applications rely on a robot's ability to recognize human actions. It is expected that robots can comprehend human actions and act accordingly. Assistive robot systems are robotics systems that can help humans with a variety of everyday tasks like housework. Patients, the elderly, and disabled persons can also benefit from the support.
• Self-driving vehicles: A good driver is recognized for observing and anticipating driving situations. In the future of autonomous vehicles, accurate interpretation of the actions of other traffic participants would be critical. Autonomous vehicles, for example, must assess and predict pedestrian intentions when crossing a road. This would allow self-driving cars to avoid potentially hazardous situations. Action recognition can be one of the most critical building blocks in autonomous driving. A vehicle with an action prediction algorithm can forecast a pedestrian's action or motion trajectory in the next few seconds, which could be essential in avoiding a collision in an emergency [9]. Drivers' secondary tasks, such as answering the phone, texting, eating, or drinking while driving, have been documented to create inattentiveness, leading to accidents [11].
• Video retrieval: Online services are growing rapidly, as well as social media services. On the Internet, people can easily post and share videos. However, because most search engines employ the accompanying text data to manage video data, maintaining and retrieving videos according to video content is becoming difficult. Text data such as tags, titles, descriptions, and keywords can be inaccurate, obfuscated, or irrelevant, making video retrieval impossible. Analyzing human behaviors in videos is an alternate strategy, given the bulk of these videos feature some human actions [4]. High-speed Internet is available to a large number of cell phone users. High-resolution cameras are built into phones and other mobile devices. The amount of video data is extremely large, requiring automatic video retrieval systems. Systems that can rapidly analyze videos and deliver reliable search results or suggestions become vital as data grow.
1.3 Data modalities for action recognition
Data modalities can be loosely categorized into two groups: visual modalities and non-visual modalities. Visual modalities such as color, depth, and skeleton are visually logical for describing actions. The trajectories of joints are encoded in skeleton data. The skeleton data efficiently represent actions when the action does not include objects or scene context. Visual modalities are popular for action recognition. Meanwhile, non-visual modalities like acceleration are not visually intuitive for describing human actions. However, these modalities can also be used for action recognition in robotics and related domains.
1.3.1 Color data
The color modality refers to images or videos captured by cameras. Color data are usually easy to obtain. It offers rich information about the captured scene. However, action recognition from color data might be complex because of backgrounds, viewpoints, and lighting conditions. RGB input contains a lot of texture information and appears to be similar to the information processed by humans. Color data contain a lot of information that is unnecessary for action recognition. Color data are highly dependent on lighting conditions. In this situation, techniques that operate with RGB data must detect important regularities and separate them from non-related regularities like lighting changes [15]. Furthermore, color data have large volumes, which results in substantial computing costs for representing human actions. RGB-based approaches rely on sequential RGB images. However, it is still difficult to recognize actions in the wild due to high intra-class variance, scaling, occlusion, and clutter [16].
1.3.2 Depth data
Depth maps are images in which the pixel values describe the distance between the depth sensor and the scene's points. The depth modality, which is resistant to color and texture changes, can be employed for action recognition since it delivers reliable structural and shape information about the subjects. Different types of depth sensors have been commercialized, including Time-of-Flight and structured-light-based sensors. These sensors emit infrared radiation and then detect the reflected energy from the objects to obtain depth data, as shown in Figure 1.2. Depth data are free of some of the issues that exist in RGB. For example, a depth map is insensitive to changes in illumination, makes it simpler to discern foreground from background scenes, and offers 3D information about the scene. However, there are drawbacks to using a depth map: it does not offer texture information and typically produces a lot of measurement noise. This is true in the case of low-cost RGB-D sensors. The depth map indicates the distance between each pixel and the sensor. The precision of the sensor is a non-linear function of distance; therefore, the further an object is from the sensor, the less precise the measurement becomes. Due to occlusions and the fact that many materials absorb sensor beams, the depth map acquired from the sensor is frequently noisy and contains numerous missing data [15]. Depth-based approaches rely primarily on features collected from the space-time volume, either local or global. Unlike visual data, depth maps provide geometric measures unaffected by lighting. Using depth maps to develop a system for action identification that is both successful and efficient is still a difficult task. However, depth can make the foreground and background separation easier [16].
Trang 29tation It is robust against changes in clothing and backgroun Motion capture
(Mocap) systems can be used to collect skeleton data Mocap systems can offer high-
quality skeleton data as these systems are insensitive to view and lighting Howe
er, Mocap systems are expensive and inconvenient As a result, skeleton data obtained
from depth maps or color image are employed in many situations, With the pop-
ularity of low:
cost depth sensors and advances in pose estimation, skeleton data are
y to acquire Skeleton data can be collected directly from depth sensors or by ap-
by appearance and background changes, makin
Sample frames of the
hammer action in MSR-Actiou3D are shown in Figure 1.3
Trang 30
jon of the hammer action in MSR-Action3D,
Figure 1.3: A skeleton represen
1.3.4 Other modalities
• Acceleration modality: Acceleration signals are collected by wearable sensors while the subject is performing an action; hence the acceleration signal does not exhibit noticeable variances for the same action. Because action recognition employing acceleration data can achieve excellent accuracy in most cases, it has been utilized for remote monitoring systems while addressing privacy concerns. The acceleration modality can be used for elderly monitoring in healthcare. However, the subject must wear the sensors, which are typically inconvenient [19]. Sample acceleration data in CMDFALL are shown in Figure 1.4.
Figure 1.4: Sample acceleration data in CMDFALL [20]
• Point cloud modality: The spatial distribution and surface features are represented by a point cloud comprising many points. There are two methods for collecting point cloud data. By rotating the point cloud in 3D space, point cloud-based action recognition techniques can be made insensitive to the viewpoint. However, point clouds frequently contain noise and have a non-uniform distribution of points, making robust action recognition difficult. Furthermore, it is computationally expensive to process point cloud data.
Action recognition based cn a single modelity has been the mest popular Hewever, each data modalixy has pros and cons, as shown in ‘lable 1.1 Current research focuses
on the fusion of clifferent data modalities and the transter of information across modal
ities to improve the accuracy and robustness of action recognition Features con be
combined trom two ar more modalities for action recognition Acceleration data, for
sxaiaple, can be used in coujumclion with skeleton uta, Huns frequently iuler the actions in a wulti-modal wsanner
Tyhle 1.1: Cunparisoa amoug dala ewdalilies,
1 | Color Riek context afarimsuon Easy to access ‘Sonsitive to lighting ‘background
2 | bạ Insensitive to lighting/ background ‘ SD ie Noisy
anepaL 3D/3D inloriastion
Insensitive to lighting/ background
No eppearance information Wort by subjects Coulis 3D information Compu.ation complexicy Insensitive te lighting/background Noisy
4 | Acceleration Privacy protection,
Point cloud
Multi-modal snachine lacuing is a method that aims to analyze wud late seusory data from several modalities Multimodal machine learning may typically give a mare robust and accurate HAR by combining the advantages of different data modalities Fu- sion and co learning are the two primary forms of multi modality learning approaches Fusion refers to iulegzaling information [evn two or sure modalities for (raining aud
inference, while co learning refers to the transmission of knowledge between distinct:
data modalities [19]
« Multi-modulity Fusion: Because each data modalit:
natural to use fusion to combine the benefits of data modalities to improve HAR
y ha its advantoge, it is
performance Score fusion and feature fusion are two multi-modality fusion tech-
niqnes in HAT The score fusion incorporates the decisions made individually
besed on multiple modalities to obtain the final classification results, On the
other hand, feature fision merges data from several modalities te provide ARATE
gated features that are discriminarive for action recognition
« Multi-modality Co-learning: Transierring, knowledge across modalities can
lel improve aetion recoguition perloraiauce Co-learding is the sludy of wpply-
ing, knowledge gained trom anviliary modalities to learning 2 model on a different
modality Unlike fusion approaches, data from auxiliary modalities is only neces-
„
Trang 32sary during, training, rather than testing This is especial'y useful when specific
inudalilies are unavailable duciag lesting
1.4, Skeleton data collection
Skeleton dave are teuporal sequences of joiut positions Joints are connected ia tle
kinetic mode! by the natural stricture of the human body Tt is convenient ta model
actious using Uhe kinetic node Skuletou data can be collucted using Mecap syeteins, RGB | D sensors, or color/depth-based pose estimation
1.4.1 Data collection from motion capture systems
Tn motion capmnre aystems, markers are placed on joint positions These systema
use cither visual cauictas to Wack refleclive markers mouuled to a subject's bedy
or inertial sensors to estimate bedy component rotations with reference to a fixed
point Existing Moeap devices are equipped with software that allows for the precise capture of 3D skeletal data Llowever, most Mocap devices can only be employed in controlled environnente |2I| Skelcton data collected by Mocap systems are of high
accuracy Towever, Mocap systems are expensive and inconventent for many practical
ayplivations Su thie work Jocus ou the skeleton data collected ly the popular low-cost depth sensors
1.4.2, Data collection from RGB+D sensors
The availability of low-cost depth cameras has cided in extracting human skeleton
data Microsoft, Kinect anc ASUS Xtion PRO LIVE are popular depth sensors These
seusors live au infrared projeclox, an infrared vatwere for capluriug depth data, aud a camera for capturing color images The depth cameras offer 3D depth data of the scene, which helps estimate the 3D joint posivions of tne human skeleton (LU) Inirared ([R) light beams are emitted by an IR emitter Reftected IR beams from the environment retum to the LK depth sensor ‘The reflected beams are used to calculate the distances
between scene points anc the sensor The sensor har can he tilted vertically uring the
tilt motor The three modalities of RGB-D sensor data are color, depth, and skeleton
Color data are the most important data with rich ‘nformation, which allow for the
detection of interesting paints and the extraction of optical flows ‘lhe depth modality is
insensitive to lighting changes, invariant to color and texture changes, and trustworthy for estimating the skeleton It gives extensive 4D structural information of the scene compared to color data Skeleton data are high level features for action representation compared 1 color and depth date, It can be invariant vo scale and illumination [16] An example cf using an RGR D sensor far human computer interaction in entertainment
is shown ia Figure L6
Trang 33KINECT & Ome
Figure 1.5; Microsoft Kinect sensor (left) and human-computer interaction (right) [10]
RGB-D sensors estimate skeleton data from depth maps. The depth information makes the skeleton extraction more practical and stable when compared with pose estimation using color images. The core idea behind pose estimation is to use dense probabilistic labeling to segment the human depth image into several body segments. This segmentation can be seen as a classification task for each pixel in the depth image. The spatial modes of pixel distribution are used to calculate the 3D joint positions. In [22], Shotton et al. propose a pose estimation method from a single depth image. The per-pixel classification results from a randomized decision forest classifier are used to classify human body parts. The mean-shift approach then infers the joint positions from the classified pixels. Skeleton data estimated by RGB-D sensors is less accurate than Mocap systems [4]. RGB-D sensors can be categorized as Structured-Light sensors and Time-of-Flight sensors.
• Structured-Light sensors: Structured-light sensors, which employ infrared light to collect depth information, are used to acquire 3D skeletal data directly. The device determines the depth by projecting light in a predetermined pattern through the infrared sensor and observing the distortion in the pattern when it meets a subject. Most RGB-D sensors are affordable, making them suitable for various applications and widely used in HAR research. When skeleton data are estimated from depth sensors, noise is inevitable. Examples of noise in the skeleton data of MSR-Action3D are shown in Figure 1.6.
• Time-of-Flight sensors: By generating light and measuring the time it takes for the light to return, Time-of-Flight sensors [48] gain 3D information. With high frame rates, these sensors capture exceptionally accurate 3D data compared with structured-light sensors [23]. RGB data can be retrieved and analyzed in conjunction with depth data [21]. Microsoft Kinect v2 is a Time-of-Flight sensor with 25 joints in the skeleton model.
Figure 1.6: Noise in the skeleton data of MSR-Action3D is marked with red boxes
1.4.3 Data collection from pose estimation
Many methods for estimating human joints and recognizing poses from available data have been developed. These methods use depth images or additional information provided by the visual sensing device. The majority of the methods rely on identifying body parts, which are then fed into models that extract specific positions for those parts. This section presents an overview of visual data-based methods for constructing human skeletons. The first approach is to use depth images for pose estimation. A human skeleton can be estimated from a single depth image or a series of depth images acquired over time. Because of the additional geometric information provided by depth pictures, this method is commonly employed in collecting action data. The second approach is to use color images for 2D pose estimation. Most RGB image-based algorithms extract visual cues using deep learning architectures and other methods to match poses of segmented silhouettes for body part identification. It is worth noting that 2D poses have limits compared to 3D skeletons. For example, 2D poses do not allow any rotation in 3D space, whereas skeletal data do. As a result, many skeleton-based techniques that are explicitly or implicitly predicated on the advantages of 3D space cannot be applied directly to the 2D scenario [24].
The limitation of using depth sensors is the range within a few meters. So depth sensors are typically used for indoor applications. When a large range is required for outdoor applications, pose estimation from color data is important. An example application is for surveillance and rescue using the Unmanned Aerial Vehicle (UAV). Pose estimation from color data is applied due to the large distance between the UAV and the subjects. Skeleton data can be estimated from color data using pose estimation tools such as OpenPose [25], MediaPipe [26], AlphaPose [27], and DeepLabCut [28]. Examples of pose estimation using OpenPose and MediaPipe are shown in Figure 1.8 and Figure 1.9, respectively.
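As a small illustration of this approach, the snippet below extracts pose landmarks from one image with the MediaPipe Python solution; the image path is a placeholder, and the calls follow MediaPipe's publicly documented Pose API.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# static_image_mode=True suits single frames; use False for video streams.
with mp_pose.Pose(static_image_mode=True) as pose:
    image = cv2.imread("person.jpg")  # placeholder path
    # MediaPipe expects RGB input, while OpenCV loads images as BGR.
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # The model outputs 33 landmarks with normalized x, y and a relative depth z.
    for idx, lm in enumerate(results.pose_landmarks.landmark):
        print(idx, lm.x, lm.y, lm.z, lm.visibility)
```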
Trang 36by RGB-D sensors MSR-Action3D, MICA-Action3D, and CMDFALL are collected
by Kinect v1 sensors while NTU RGB} D is collected by Kinect v2 sensors MSR-
MSR-Action3D is introduced by Li et al [17] There are 20 actions performed by
ten subjects Each subject performs an action two or three times, There are 20 joi
in the skeleton model The frame rate is 15 fps There are 557 action sequen ‘sin
Action Set 1 (AS1),
total, Actions in MSR-Action’D are grouped into three subset:
18
Trang 37Action Set 2 (AS2), and Acrion Set 3 (AS3) as in Table 1.3 Fach subset consists of
eight actions, sv yom actions appear in wore than one subset
Twala [As List of actions in MST-ActionaN,
Action Set 1 (A81) | Action Set 2 (AS2} Action Set 3 (A53)
aĨ srm wawe | L húgh nh wawe — 0 hình throw
Ua, 14, forward kick,
10, sand clap 09 draw ecrele 3T tenris swing T3 send TT tee hand wave
18, cennis seve 12, side boxing 2U pickup & teow 14, forward kick
1.5.2 MICA-Action3D
Action sequences are collected Ly u Kinect v1 seusor, "The Gutasel is built for eross-
dataset, evaluation with the same list of 20 actions as in MSR Action3D [17] Fach
action is repeated two or hres times by cach subject Tweuty subjects participate in
data collection ta generate 1,196 action samples I'he frame rate is 20 fps
1.5.3 CMDFALL
‘Uhe dataset is introduced in [All] to evaluate methods for hnman falling detection
in healtheare For dala cullectiog, seven Kiucet seasors aru installed weress thie sur- roundings ‘Ihe frame rate is 2U ps ‘hase actions are performed by 5U people ranging
sh 20 fomales and 30 maics} Table 1.4 lists the 20 action
classes in the CMDFALL dataset ‘'here are many poses with subjects lying on the
ground in the CMDFALL dataset so there is severe noise in the skeleton data.) The
in age from 21 to 40 (
reason ‡s that the Kinect sensor could not get on accurate pose estimation for sub-
jects (rom non-standing poses [39] As a result, CMDPALL is a challenging dataset for
action recognition Action recognition is conducted using skeleton Gata from Kinect
view 3 in this study, the same as in the original paper on CMDFALL [20| There axe 1.¥63 sequences in total, Half of the subjects are used for training with ¥74 sequences
‘The remaining arc used Zor testing wilh 989 sequences CMDFALL ie an imbalanced dataset with different numbers of samples in action classes
Trang 38Luble La: List of aetiora In CMDEALL
1 [Action Name [1 Aetion Name
2 | cun slow'y 13 eral 3_| jump ie place 13 _sit on chair then stané up
4 [move band and tog [14 mave choir
3 [Tele baud pick up [15 sit oa chair hen fall lel
6 | right band pick up | 16 sit on chair then dall right
ï | magger 17_ sit om bed then stané up
8 | hat fal 18 Reøn bại then si ap
9 | back fall Wie on bed then fall ef:
0 | ten, tall 20 tie on bad Uber fall eight
skeleton model ‘here are 56,480 sequences in the dataset, categorized into 6U action classes Table 1.5 shows the list of action classes in the NTU RGB4+D dataset The actions are performed by 40 peop'e Three Kinect sensors are monnted at the same height but at various wugles The datuset is collected using 17 camera setups with different ‘neights anc dstances ‘The frame rate is 30 fps ‘I'he dataset authors rec- ounnend tựa benelusarke: (1) Crusssubject {C8}: half of the subjects ave used for training, and the remaining haif are used for testing ‘I'here are 40,320 sequences in the training sct and 16,560 scquences in the testing sct (2) Cross-view (CV): the training set includes 37,920 sequences (from cameras 2 and 3}, and the testing ses consists of
18,960 scqucaces (Lom camera 1)
1.6 Skeleton-based action recognition methods
Skeleton data are more compact and discriminative for action representation than the traditional color data. Skeleton data have a lot of advantages, such as computation efficiency and robustness against variations in clothing texture and background. Skeleton data are simple to obtain thanks to the widespread use of depth sensors and advances in human pose estimation with color/depth data. Various methods for skeleton-based action recognition have been proposed in the literature, as shown in Figure 1.10. Skeleton-based action recognition can be categorized as in Figure 1.11. Moreover, skeleton-based HAR can directly benefit from advances in pose estimation. Multi-modality HAR is also a popular approach in the literature to combine the advantages of different data modalities.
Trang 39cova0) Joint Velocity €NN AGCN
Figure 1.11: Skeleton-based HAR methods. Proposed methods are marked in green.
1.6.1.1 Joint-based action recognition
Joint-based representation has been commonly used for action recognition. Joint-based representation models actions using the relationship between joint coordinates. In [43], Wang et al. use Relative Joint Position (RJP) for action representation. The Fourier Temporal Pyramid (FTP) representation is proposed to tackle temporal misalignment. As skeleton data are sequences of different lengths, the Fourier transform is applied to all sequences. Feature vectors are formed by concatenating the low-frequency Fourier coefficients from the transformation results. 3D skeleton datasets are used for performance evaluation, including MSR-Action3D, MSR-DailyActivity3D, and CMU Mocap.
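A minimal sketch of these two ingredients follows: relative joint positions computed against a reference joint, and a fixed-length feature built from low-frequency DFT coefficients. The reference joint, the number of retained coefficients, and the single-level pyramid are simplifying assumptions; the full FTP of [43] applies the transform recursively over nested temporal segments.

```python
import numpy as np

def relative_joint_positions(seq: np.ndarray, ref_joint: int = 0) -> np.ndarray:
    """RJP: express every joint relative to a reference joint (e.g., the hip).
    seq has shape (T, N, 3); the output keeps the same shape."""
    return seq - seq[:, ref_joint:ref_joint + 1, :]

def low_freq_fourier_feature(traj: np.ndarray, n_coeffs: int = 4) -> np.ndarray:
    """Keep the first low-frequency DFT coefficients of one 1-D trajectory.
    Sequences of different lengths map to a feature of the same size, which
    is how the Fourier representation copes with temporal misalignment."""
    spectrum = np.fft.fft(traj)[:n_coeffs]
    return np.concatenate([spectrum.real, spectrum.imag])

T, N = 60, 20
seq = np.random.rand(T, N, 3)              # stand-in skeleton sequence
rjp = relative_joint_positions(seq)

# One fixed-length block per joint and coordinate axis, concatenated.
feature = np.concatenate([
    low_freq_fourier_feature(rjp[:, i, c])
    for i in range(N) for c in range(3)
])
print(feature.shape)  # 20 joints * 3 axes * 8 values = (480,)
```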
In [44], Yang et al. present features that model the dynamics of individual joints. In [45], action features are formed using a temporal hierarchy of covariance descriptors (Cov3DJ). The covariance matrix is employed in this descriptor to reflect the interdependence of joint positions. Covariance matrices are computed across temporal windows in a hierarchical approach to capture the dependency between joint positions, as in Figure 1.13. The descriptor's size does not depend on the length of the sequence.
Figure 1.12: Covariance descriptor method [45]
Figure 1.13: Temporal hierarchy for covariance descriptors [45]
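The core computation behind such descriptors is compact enough to sketch: stack each frame's joint coordinates into one vector, compute the covariance over a temporal window, and keep the upper triangle (covariance matrices are symmetric). A single window is shown here; the hierarchy of Figure 1.13 repeats this over nested windows and concatenates the results.

```python
import numpy as np

def covariance_descriptor(window: np.ndarray) -> np.ndarray:
    """Covariance descriptor of joint positions over one temporal window.
    window: (T, N, 3) slice of a skeleton sequence. Returns the upper
    triangle of the (3N x 3N) covariance matrix, so the descriptor size
    is fixed regardless of the window length T."""
    T = window.shape[0]
    frames = window.reshape(T, -1)        # each frame -> one 3N-dim vector
    cov = np.cov(frames, rowvar=False)    # (3N, 3N) covariance matrix
    rows, cols = np.triu_indices(cov.shape[0])
    return cov[rows, cols]

window = np.random.rand(30, 20, 3)        # 30-frame window, 20 joints
descriptor = covariance_descriptor(window)
print(descriptor.shape)  # 3N * (3N + 1) / 2 = (1830,) for N = 20
```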
In an alternative approach, only the actively involved joints in actions are chosen for action representation. Joints are either preset or automatically selected based on statistical metrics of joint locations or angles. In [32], the histogram of 3D joint positions is represented in spherical coordinates. The Hidden Markov Model (HMM) is used to classify the temporal evolution of visual words. Adjacent joint pairs such as hand/wrist and foot/ankle are very close to each other and thus redundant for the description of human actions. The spine, neck, and shoulder do not contribute to body motion. A set of 12 joints is selected from the 20-joint skeleton model. In [46], Gaglio et al. use a subset of 11 joints to extract features for action recognition. In [47], different joint combinations such as the neck, right hip, left hip, and right waist are examined.
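To close the section, the sketch below illustrates the automatic flavor of this idea: rank the joints of a sequence by a simple statistical score (here, the total positional variance over time) and keep the top M as the most informative joints. The variance score is a deliberately simple stand-in; [32] and [44] rely on richer measures based on position histograms and joint angles.

```python
import numpy as np

def most_informative_joints(seq: np.ndarray, m: int = 8) -> np.ndarray:
    """Rank the joints of a (T, N, 3) sequence by movement and keep the top m.
    The score is the total positional variance of each joint over time, so
    joints that move a lot during the action rank highest."""
    scores = seq.var(axis=0).sum(axis=1)   # one scalar score per joint, shape (N,)
    return np.argsort(scores)[::-1][:m]    # indices of the m highest scores

seq = np.random.rand(60, 20, 3)            # stand-in skeleton sequence
selected = most_informative_joints(seq, m=8)
reduced = seq[:, selected, :]              # sequence restricted to selected joints
print(sorted(selected.tolist()))
```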