MINISTRY OF EDUCATION AND TRAINING HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
PHAM DINH TAN
A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA
DOCTORAL DISSERTATION IN COMPUTER ENGINEERING
Hanoi—2022
MINISTRY OF EDUCATION AND TRAINING HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
PHAM DINH TAN
A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA
Major: Computer Engineering
Code: 9480106
DOCTORAL DISSERTATION IN COMPUTER ENGINEERING
SUPERVISORS:
1. Assoc. Prof. Vu Hai
2. Assoc. Prof. Le Thi Lan
Hanoi—2022
DECLARATION OF AUTHORSHIP
I, Pham Dinh Tan, declare that the dissertation titled "A study on deep learning techniques for human action representation and recognition with skeleton data" has been entirely composed by myself. I assure some points as follows:
• This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• The work has not been submitted for any other degree or qualifications at Hanoi University of Science and Technology or any other institution.
• Appropriate acknowledgment has been given within this dissertation, where reference has been made to the published work of others.
• The dissertation submitted is my own, except where work in collaboration has been included. The collaborative contributions have been indicated.
Hanoi, May 8, 2022
Ph.D. Student
Pham Dinh Tan
SUPERVISORS
1. Assoc. Prof. Vu Hai
2. Assoc. Prof. Le Thi Lan
ACKNOWLEDGEMENT
This dissertation is composed during my Ph.D. at the Computer Vision Department, MICA Institute, Hanoi University of Science and Technology. I am grateful to all people who contribute in different ways to my Ph.D. journey.
First, I would like to express sincere thanks to my supervisors Assoc. Prof. Vu Hai and Assoc. Prof. Le Thi Lan for their guidance and support.
I would like to thank all MICA members for their help during my Ph.D. study. My sincere thanks to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Assoc. Prof. Tran Thi Thanh Hai for giving me a lot of support and valuable advice. Many thanks to Dr. Nguyen Thuy Dinh, Nguyen Hong Quan, Hoang Van Nam, Nguyen Tien Nam, Pham Quang Tien, and Nguyen Tien Thanh for their support.
I would like to thank my colleagues at the Hanoi University of Mining and Geology for their support during my Ph.D. study. Special thanks to my family for understanding my hours glued to the computer screen.
Hanoi, May 18, 2022
Ph.D. Student
ABSTRACT
Human action recognition (HAR) from color and depth sensors (RGB-D), especially derived information such as skeleton data, receives the research community's attention due to its wide applications. HAR has many practical applications such as abnormal event detection from camera surveillance, gaming, human-machine interaction, elderly monitoring, and virtual/augmented reality. In addition to the advantages of fast computation, low storage, and immutability with human appearance, skeleton data have shortcomings. The shortcomings include pose estimation errors and skeleton noise. The proposed methods are evaluated on four datasets, including CMDFALL, a challenging dataset with noise in skeleton data, and NTU RGB+D, a worldwide benchmark among the large-scale datasets. Therefore, these datasets cover different dataset scales as well as the quality of skeleton data.
To overcome the limitations of the skeleton data, the dissertation presents techniques in different approaches. First, as joints have different levels of engagement in each action, techniques for selecting joints that play an important role in human actions are proposed, including both preset joint subset selection and automatic joint subset selection. Two frameworks are evaluated to show the performance of using a subset of joints for action representation. The first framework employs Dynamic Time Warping (DTW) and Fourier Temporal Pyramid (FTP), while the second one uses Covariance Descriptors extracted on joint position and velocity. Experimental results show that joint subset selection helps improve action recognition performance on datasets with noise in skeleton data.
However, HAR using handcrafted feature extraction could not exploit the inherent graph structure of the human skeleton. Recent Graph Convolutional Networks (GCNs) are studied to handle these issues. Among GCN models, the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) is used as the baseline model. AAGCN achieves state-of-the-art performance on large-scale datasets such as NTU RGB+D and Kinetics. However, AAGCN employs only joint information. Therefore, a Feature Fusion (FF) module is proposed in this dissertation. The new model is named FF-AAGCN. The performance of FF-AAGCN is evaluated on the large-scale dataset NTU RGB+D and on CMDFALL. The evaluation results show that the proposed method is robust to noise and invariant to skeleton translation. Particularly, FF-AAGCN achieves remarkable results on challenging datasets. Finally, as the computing capacity of edge devices is limited, a lightweight deep learning model is expected for application deployment. A lightweight GCN architecture is proposed to show that the complexity of the GCN architecture can still be reduced depending on the dataset's characteristics. The proposed lightweight model is suitable for application development on edge devices.
Hanoi, May 18, 2022
Ph.D. Student
TABLE OF CONTENTS
1.2 An overview of action recognition
1.3 Data modalities for action recognition
1.4.1 Data collection from motion capture systems
1.4.2 Data collection from RGB+D sensors
1.4.3 Data collection from pose estimation
1.6.1.1 Joint-based action recognition
1.6.1.2 Body part-based action recognition
1.6.2 Deep learning-based methods
2.2.1 Preset Joint Subset Selection
2.2.1.1 Spatial-Temporal Representation
2.2.1.2 Dynamic Time Warping
2.2.2 Automatic Joint Subset Selection
2.2.2.1 Joint weight assignment
2.2.2.2 Most informative joint selection
2.2.2.3 Human action recognition based on MIJ
2.3.1 Evaluation metrics
2.3.2 Preset Joint Subset Selection
2.3.3 Automatic Joint Subset Selection
2.4 Conclusion of the chapter
CHAPTER 4. THE PROPOSED LIGHTWEIGHT GRAPH CONVOLUTIONAL NETWORK
4.2 Related work on Lightweight Graph Convolutional Networks
4.3 Proposed method
ABBREVIATIONS
AAGCN: Attention-enhanced Adaptive Graph Convolutional Network
AGCN: Adaptive Graph Convolutional Network
AMIJ: Adaptive number of Most Informative Joints
AS: Action Set
AS-GCN: Actional-Structural Graph Convolutional Network
BN: Batch Normalization
BPL: Body Part Location
CAM: Channel Attention Module
CCTV: Closed-Circuit Television
CNN: Convolutional Neural Network
CovMIJ: Covariance Descriptor on Most Informative Joints
CPU: Central Processing Unit
CS: Cross-Subject
CV: Cross-View
DFT: Discrete Fourier Transform
DTW: Dynamic Time Warping
FC: Fully Connected
FF: Feature Fusion
FLOP: Floating Point Operation
FMIJ: Fixed number of Most Informative Joints
FPS: Frames per second
FTP: Fourier Temporal Pyramid
GCN: Graph Convolutional Network
GCNN: Graph-based Convolutional Neural Network
GPU: Graphical Processing Unit
GRU: Gated Recurrent Unit
HAR: Human Action Recognition
HCI: Human-Computer Interaction
JSS: Joint Subset Selection
LARP: Lie Algebra Relative Pair
LSTM: Long Short-Term Memory
MIJ: Most Informative Joint
Mocap: Motion capture system
MRF: Markov Random Field
MTLN: Multi-Task Learning Network
OL: Overlapping
RA-GCN: Richly Activated Graph Convolutional Network
ReLU: Rectified Linear Unit
ResNet: Residual Neural Network
RJP: Relative Joint Position
RNN: Recurrent Neural Network
RVM: Relevance Vector Machine
SAM: Spatial Attention Module
SDK: Software Development Kit
SE(3): Special Euclidean group
SO(3): Special Orthogonal group
ST-GCN: Spatial-Temporal Graph Convolutional Network
STC-AM: Spatial-Temporal-Channel Attention Module
SVM: Support Vector Machine
t-SNE: t-Distributed Stochastic Neighbor Embedding
TAM: Temporal Attention Module
TCD: Temporal Covariance Descriptor
TCN: Temporal Convolutional Network
UAV: Unmanned Aerial Vehicle
VFDT: Very Fast Decision Trees
SYMBOLS
A: The adjacency matrix of the graph
Â: Normalized adjacency matrix
C: The number of action classes
COV(S_P): Temporal covariance descriptor of joint positions
COV(S_V): Temporal covariance descriptor of joint velocities
COV_sample(S_P): The sample covariance descriptor of joint positions
COV_sample(S_V): The sample covariance descriptor of joint velocities
d: The cost function in Dynamic Time Warping
D: The degree matrix of the graph
E_S: The set of intra-skeleton edges
E_F: The set of inter-frame edges
E: The set of graph edges
F_c: Input feature to the Channel Attention Module
F_s: Input feature to the Spatial Attention Module
F_t: Input feature to the Temporal Attention Module
K: The number of layers in the temporal hierarchy
K_s: Spatial kernel size
M: The number of most informative joints
M_c: Output of the Channel Attention Module
M_s: Output of the Spatial Attention Module
M_t: Output of the Temporal Attention Module
N: The number of joints in the skeleton
N_c: The number of samples in the c-th action class
p_i(t): Position of the i-th joint at the t-th frame
ReLU: The Rectified Linear Unit activation function
S_P(t): The coordinate vector of informative joints at the t-th frame
T: The length of a skeleton sequence
T_max: The maximum length of skeleton sequences
tanh: The hyperbolic tangent activation function
w_{i,j}: The weight of the i-th joint for the j-th sample in an action class (1 <= j <= N_c)
w_i: The weight of the i-th joint
W: Matrix of trainable weights
v_{t,i}: The node corresponding to the i-th joint at the t-th frame
v(t): Joint velocity at the t-th frame
V: The set of vertexes of the graph
x_i(t): Coordinate of the i-th joint along the x-axis at the t-th frame
y_i(t): Coordinate of the i-th joint along the y-axis at the t-th frame
z_i(t): Coordinate of the i-th joint along the z-axis at the t-th frame
LIST OF TABLES
Comparison among data modalities
Datasets with different data modalities: Skeleton (S), Depth (D), Acceleration (A)
List of actions in MSR-Action3D
List of actions in CMDFALL
List of actions in NTU RGB+D
Accuracy (%) comparison on MSR-Action3D. The reference [58] refers to the paper with correction available at: http://ravitejav.weebly.com
Ablation study on MSR-Action3D by accuracy (%)
Computational time (ms) of Preset JSS on MSR-Action3D
Performance evaluation for Preset JSS on CMDFALL
The accuracy (%) obtained by the proposed method with different numbers of layers and features for MSR-Action3D and CMDFALL
Accuracy (%) evaluation on MSR-Action3D. The reference [58] refers to the paper with correction available at: http://ravitejav.weebly.com
Performance evaluation on CMDFALL
Computational time (ms) of FMIJ/AMIJ on MSR-Action3D
Comparison between Preset JSS using Covariance Descriptors and AMIJ
Dataset parameters
Input and output data shapes of FF-AAGCN on MICA-Action3D
Software package version information on Ubuntu 18.04
Ablation study with AAGCN [104] implementation on CMDFALL
Performance of FF-AAGCN on CMDFALL using velocity with different frame offsets
Comparison of computation time between FF-AAGCN and the baseline AAGCN on CMDFALL. Testing time is calculated per sample
Ablation study by accuracy (%) on NTU RGB+D
Performance evaluation by accuracy (%) on NTU RGB+D
Comparison of training/testing time between FF-AAGCN and the baseline on NTU RGB+D with cross-subject (CS) benchmark. Training time is calculated in hours (h). Testing time per sample is calculated in milliseconds (ms)
Comparison of training/testing time between FF-AAGCN and the baseline on NTU RGB+D with cross-view (CV) benchmark. Training time is calculated in hours (h). Testing time per sample is calculated in milliseconds (ms)
Accuracy (%) comparison on MSR-Action3D. The reference [58] refers to the paper with correction available at: http://ravitejav.weebly.com
Performance evaluation on MICA-Action3D
Ablation study of FF-AAGCN with different numbers of basic blocks on CMDFALL
Input and output data shapes on MICA-Action3D
Ablation study on CMDFALL. Performance scores are in percentage. Abbreviations in use include Feature Fusion (FF), Lightweight (LW), and Joint Subset Selection (JSS)
Performance comparison on CMDFALL. Performance scores are in percentage
Model parameters and computation requirement on CMDFALL
Memory cost comparison on CMDFALL
Training and testing time on CMDFALL. Testing time is calculated per sample
Accuracy (%) comparison of LW-FF-AAGCN with other methods on MSR-Action3D. The reference [58] refers to the paper with correction available at: http://ravitejav.weebly.com
Performance evaluation on MICA-Action3D. Performance scores are in percentage
Comparison on model parameters, FLOPs, and accuracy (%) on NTU RGB+D
Training time and testing time on NTU RGB+D with cross-subject (CS) benchmark. Testing time is calculated per sample
Training time and testing time on NTU RGB+D with cross-view (CV) benchmark. Testing time is calculated per sample
Performance evaluation on subsets of NTU RGB+D with CS benchmark
LIST OF FIGURES
Approaches for human action recognition [12]
Sample depth data in MSR-Action3D [17]
A skeleton representation of the hammer action in MSR-Action3D
Sample acceleration data in CMDFALL [20]
Microsoft Kinect sensor (left) and human-computer interaction (right) [10]
Noise in the skeleton data of MSR-Action3D is marked with red boxes
Skeleton data estimated by a Kinect sensor with 20 joints
Temporal hierarchy for covariance descriptors [45]
Features formed by skeleton pose, speed, and acceleration [34]
Subsets of 7 joints (a), 11 joints (b), and 15 joints (c) selected from the skeleton model
Lie algebra-based action recognition [8]
Relative position, speed, and acceleration features [39]
Most informative joint selection using joint angles [44]
Temporal Convolutional Network (TCN) with residuals [66]
Skeleton representation in the SkeleMotion method
Skeleton-to-image mapping using a tree structure [75]
Attentional Recurrent Relational Network-LSTM [76]
Joint trajectories of the two hand wave action in MICA-Action3D
Joint Subset Selection techniques can be used in combination with different action representation and recognition methods
System diagram of the proposed Preset JSS
Preset JSS: a subset of 14 blue joints is selected from the 20-joint skeleton model; red joints are omitted
Joints in FMIJ are marked in red for the high throw action in MSR-Action3D
System diagram of the proposed FMIJ/AMIJ
Two-layer hierarchy with (a) non-overlapping and (b) overlapping (OL) windows
Confusion matrix on MSR-Action3D AS1
Confusion matrix on MSR-Action3D AS2
Confusion matrix on MSR-Action3D AS3
Accuracies (%) for MSR-Action3D and CMDFALL when changing the number of most informative joints with FMIJ
Accuracies (%) for MSR-Action3D and CMDFALL with different thresholds
Accuracy (%) on three subsets AS1, AS2, and AS3 of MSR-Action3D with different numbers of layers and features
F1 score (%) on CMDFALL with different numbers of layers and features
Accuracy (%) of MICA-Action3D with different pose estimation methods
Confusion matrix of MICA-Action3D when applying OpenPose for skeleton estimation from color data
Confusion matrix of MICA-Action3D when applying Kinect SDK for skeleton estimation
Trajectories of draw X and draw circle actions in MICA-Action3D
Graph modeling in ST-GCN
System diagram of AS-GCN [89]
SGE-EGR framework [100]
MS-G3D system diagram [102]
System diagram of EfficientGCN [103]
AAGCN system diagram [104]
The system diagram consists of a Feature Fusion (FF) module, a BN layer, ten Spatial-Temporal Basic Blocks, a GAP layer, and a Softmax layer
Spatial-Temporal Basic Block [104]
RJP for (a) Microsoft Kinect v1 skeleton with 20 joints and (b) Microsoft Kinect v2 skeleton with 25 joints
Joint velocity is defined as two-sided position displacement [109]
Components in the Feature Fusion module
Adaptive mechanism in AAGCN
Confusion matrix on CMDFALL using AAGCN
Confusion matrix on CMDFALL using FF-AAGCN
Similarity between left fall and right fall actions in CMDFALL
Distribution of 20 action classes in CMDFALL obtained by AAGCN (left) and FF-AAGCN (right) using t-SNE
Loss and accuracy curves of AAGCN on CMDFALL
Loss and accuracy curves of FF-AAGCN on CMDFALL
Loss and accuracy curves of AAGCN on MICA-Action3D
Loss and accuracy curves of FF-AAGCN on MICA-Action3D
Attention map for actions in MICA-Action3D using the jet colormap of Matplotlib [114]. Each row in the attention map corresponds to each joint. Dark color represents a large weight and light color represents a small weight
Confusion matrix on MICA-Action3D using AAGCN
Confusion matrix on MICA-Action3D using FF-AAGCN
Graph definition in WGCN [119]
System diagram of the proposed LW-FF-AAGCN
(a) Preset selection of 13 blue joints from the skeleton model of 20 joints. (b) Graph type A (JSS-A) is defined by the solid green edges connecting 13 blue joints. (c) Graph type B (JSS-B) is defined by combining JSS-A with edges between symmetrical joints. Only the spatial dimension of the graph is shown for simplicity
(a) Preset selection of 18 blue joints from the skeleton model of 25 joints. (b) Graph type A (JSS-A) is defined by the solid green edges connecting 18 blue joints. (c) Graph type B (JSS-B) is defined by combining JSS-A with edges between symmetrical joints. Only the spatial dimension of the graph is shown for simplicity
Loss and accuracy curves of the proposed method on CMDFALL
Loss and accuracy curves of the proposed method on MICA-Action3D
Confusion matrix on CMDFALL using LW-FF-AAGCN
Distribution of 20 action classes obtained by AAGCN (left) and LW-FF-AAGCN (right) using t-SNE
Confusion matrix on MICA-Action3D using LW-FF-AAGCN
System diagram of the demo
MediaPipe pose estimation with a 33-landmark model [129]
Screen captures from the demo
INTRODUCTION
Nowadays, HAR has been an active research topic in the fields of computer vision, robotics, and multimedia. Even though a lot of effort has been made, action recognition remains challenging due to the diversity, intra-class variations, and inter-class similarities of human actions [1]. Various data modalities, such as color, depth, and skeleton data, can be used for action recognition. Researchers have been working on action recognition using 2D images for decades. Color data in 2D images were used in the early research on action recognition. Action recognition from images and videos is still difficult because of background clutter, occlusion, viewpoint and lighting variations, execution speed, and biometric variance. According to psychological research, skeleton data are compact and discriminative for representing human actions. In a study on human visual perception in 1973 [2], Swedish psychologists pointed out that human actions such as walking and running could be distinguished by 10-12 markers placed on body joints. In other words, the perception of motion patterns can be specified by some joints, which are more important than others. These joints contribute to forming unique characteristics of action patterns. The human body is modeled as rigid segments connected by joints. Consequently, human action can be described by joint trajectories along the temporal dimension. HAR algorithms utilizing skeleton data can avoid the weaknesses of video data like lighting conditions and background changes. Most importantly, skeleton data are efficient in terms of computation and storage. Advances in pose estimation using color/depth data also contribute to the popularity of skeleton-based action recognition. As a result, skeleton-based action recognition is currently a promising approach.
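To make this representation concrete, the short sketch below stores a skeleton sequence as an array of joint trajectories; the shapes (T frames, N joints, three coordinates) follow the description above, while the joint index and the random data are illustrative placeholders.

```python
import numpy as np

# A skeleton sequence as joint trajectories: T frames x N joints x 3 coordinates.
T, N = 60, 20                       # e.g., a 60-frame clip of a 20-joint skeleton
sequence = np.random.rand(T, N, 3)  # stand-in for real captured data

def joint_trajectory(seq: np.ndarray, joint: int) -> np.ndarray:
    """Return the (T, 3) trajectory of one joint along the temporal dimension."""
    return seq[:, joint, :]

trajectory = joint_trajectory(sequence, joint=7)  # joint index 7 is arbitrary here
print(trajectory.shape)  # (60, 3)
```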
Skeleton data can be collected directly by motion capture systems and depth sensors or indirectly estimated from color/depth data. Among these methods, motion capture systems can output high-quality skeleton data. However, motion capture systems are expensive and inconvenient to implement. The popularity of low-cost RGB-D sensors and advances in pose estimation from color/depth data make the skeleton data easy to acquire. The research objective of this dissertation is action recognition using skeleton data collected by RGB-D sensors. In this work, four skeleton datasets MSR-Action3D, MICA-Action3D, CMDFALL, and NTU RGB+D are used for evaluation. These datasets contain segmented samples collected by Microsoft Kinect sensors. It is supposed that each skeleton sequence only contains one action. Action recognition aims to predict the action label from the skeleton data. Action recognition can be extended to action detection. Action detection focuses on labeling skeleton sequences that contain multiple actions. In action detection, segmentation is required to detect each action's starting and ending points. Noise is inherent in the skeleton data. Various methods are proposed to improve action recognition performance. A lightweight model is required to implement action recognition on devices. The dissertation focuses on improving action recognition performance on noisy datasets and designing lightweight models for application development.
Challenges
Action recognition is still an attractive research topic due to many challenges. Four major challenges are discussed, including (1) intra-class variations and inter-class similarities, (2) noise in skeleton data, (3) occlusion caused by other body parts or other objects/persons, and (4) insufficient labeled data.
• Firstly, when performing the same action, different people may generate different pose sequences. A person can run fast or leisurely, with arms waving in various ways, for the same action class running. As a result, each action category may comprise a variety of body motion types. For the same action, samples might be shot at different viewpoints, such as in front of the subject or at the side of the subject. Furthermore, there are similarities in human poses across distinct action classes. The two action classes running and walking, for example, have similar human motion patterns but with different execution speeds. These similarities make it difficult to distinguish the two actions, resulting in performance degradation in action recognition. All of these issues might result in wrong action recognition. In real-world implementations, these variations will be even more severe. This prompts researchers to look into more powerful action recognition methods implemented in real-life applications.
• The second challenge is to cope with noise in skeleton data. It is challenging to capture high-quality skeleton sequences using depth sensors. Pose estimation may output noisy skeleton data for subjects in non-standing poses. Action recognition will suffer from performance degradation when dealing with noisy skeleton data. As a result, action recognition with noisy skeleton data is a challenge.
• The third challenge is the occlusion. Joints may be occluded by other body parts or other objects/persons. This results in missing or poorly estimated joints in skeleton data.
• The fourth challenge is the insufficient labeled data. Data labeling is an essential step in data collection for deep learning. Deep learning architectures can achieve good performance when large-scale datasets are available. However, data labeling is tedious, time-consuming, and expensive. Large-scale datasets are required to transfer the action recognition model to real-world applications.
Objectives
The objectives are as follows:
• A compact representation of human actions: Joints in the skeleton model play different roles in each action. The first objective is to find an efficient representation for action recognition.
• Improving action recognition performance on noisy skeleton data: The second objective is to design a deep learning architecture that achieves high-performance action recognition on noisy skeleton data.
• Proposing a lightweight model for action recognition: Computation capacity is limited on edge devices. A lightweight deep learning model for action recognition is required for application development. Constructing an efficient, lightweight model for action recognition is the third objective of this dissertation.
Context and constraints
In this dissertation, some context and constraints are listed as follows.
• Three public datasets and one self-collected dataset are used for evaluation. The datasets contain segmented skeleton sequences collected by Microsoft Kinect sensors. A list of human actions is pre-defined for each dataset. The datasets contain actions performed by a single person or interactions between two persons. Other datasets are not considered and/or evaluated in this work.
• Only actions of daily living are considered in the dissertation. Action classes in art performance or any other specific domains are not in the scope of this work.
• For all four datasets, the training/testing data split and evaluation protocols are kept the same as in the relevant works where the datasets are introduced.
• Cross-subject benchmark is performed on all datasets, with half of the subjects for training and the other half for testing (a minimal split sketch follows this list).
• Cross-view benchmark is performed on the NTU RGB+D dataset. Sequences captured by camera numbers 2 and 3 are used for training. Sequences from camera number 1 are used for testing. Only single-view data are used. Multi-view data processing is not considered in this work.
• The study aims at deploying an application using the proposed method. This application is developed to assess the performance of a person who does yoga exercises. Pose estimation for a single person is implemented using the public tool Google MediaPipe [3]. Due to the time limitation, only the result of the action recognition module is introduced. The related modules such as action spotting, human pose evaluation, and exercise scoring/assessment are out of the scope of this study.
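As a minimal illustration of the cross-subject and cross-view protocols listed above, the sketch below splits a sample list by subject and camera IDs; the metadata fields and subject IDs are hypothetical, but the split logic mirrors the two benchmarks.

```python
# Hypothetical sample metadata; real datasets encode these in file names or labels.
samples = [
    {"subject": 1, "camera": 1, "label": 0, "path": "s001c001.skeleton"},
    {"subject": 2, "camera": 2, "label": 5, "path": "s002c002.skeleton"},
    # ... one entry per recorded sequence
]

TRAIN_SUBJECTS = {1, 2, 4, 5}  # half of the subjects; IDs here are placeholders

# Cross-subject (CS): split by the performing subject.
cs_train = [s for s in samples if s["subject"] in TRAIN_SUBJECTS]
cs_test = [s for s in samples if s["subject"] not in TRAIN_SUBJECTS]

# Cross-view (CV): cameras 2 and 3 for training, camera 1 for testing.
cv_train = [s for s in samples if s["camera"] in (2, 3)]
cv_test = [s for s in samples if s["camera"] == 1]
```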
Contributions
The three main contributions are introduced in this dissertation.
• Contribution 1: Propose two JSS methods for human action recognition: the Preset JSS and the automatic MIJ selection methods.
• Contribution 2: Propose a Feature Fusion (FF) module to combine spatial and temporal features for the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) using the Relative Joint Position and the joint velocity. The proposed method is named FF-AAGCN. The proposed method outperforms the baseline method on the challenging datasets with noise in the skeleton data.
• Contribution 3: Propose a lightweight model LW-FF-AAGCN with much fewer model parameters than the baseline method and with competitive performance in action recognition. The proposed method enables the deployment of applications using lightweight models on edge devices.
The dissertation is structured as follows:
• Introduction: This section provides the motivation, objectives, challenges, constraints, and contributions of the dissertation.
• Chapter 1 entitled "Literature Review": This chapter is a brief review of the existing literature to obtain a comprehensive understanding of human action recognition.
• Chapter 2 entitled "Joint Subset Selection for Skeleton-based Human Action Recognition": This chapter presents Preset JSS and automatic MIJ selection.
• Chapter 3 entitled "Feature Fusion for the Graph Convolutional Network": The graph-based deep learning model FF-AAGCN is proposed for human action recognition using AAGCN as the baseline. FF-AAGCN outperforms the baseline method on CMDFALL, a challenging dataset with noisy skeleton data.
• Chapter 4 entitled "The Proposed Lightweight Graph Convolutional Network": The lightweight model LW-FF-AAGCN is proposed with fewer parameters than the baseline AAGCN. LW-FF-AAGCN is suitable for application development on edge devices with limited computation capacity.
CHAPTER 1. LITERATURE REVIEW
Human actions can be used in a wide range of applications. Instead of using standard input peripherals, actions can be used to control robots or computers [4]. Human action recognition (HAR), the ability to recognize and understand human actions, is critical for a variety of applications such as human-computer interaction, camera surveillance, gaming, and healthcare. As a result, the recognition of human actions is becoming increasingly important. Comprehensive surveys on action recognition can be found in [5]. The demand for systems that can understand human actions and alert patients about potential physical and mental health concerns is rapidly increasing. Medical experts can recommend diet, exercise, and medication by identifying changes in daily actions [10]. Action recognition in video surveillance can automatically detect suspicious actions and alert the authorities for preventive actions.
This chapter is structured as follows. Section 1.2 is an overview of human action recognition and its applications. Section 1.3 discusses data modalities used for action recognition. Among data modalities, the skeleton data modality is compact and efficient for action representation, so this work focuses on action recognition using skeleton data. Section 1.4 describes methods for skeleton data collection. Skeleton datasets used in this dissertation are described in Section 1.5. Section 1.6 reviews action recognition methods using the skeleton data. Section 1.7 highlights the research works on action recognition in Vietnam. Section 1.8 concludes the chapter.
1.2 An overview of action recognition
Researchers have been working on action recognition from images and videos for decades. Action recognition methods can be categorized as in Figure 1.1. The mechanism of human vision is the key direction that researchers have been following for action recognition. The human vision system can observe the motion and shape of the human body in a short time. The observations are then transferred to the human perception system for classification. The human visual perception system is highly reliable and precise in action recognition. Over the last few decades, researchers have aimed at a similar level of the human visual system with a computer-based recognition system. Unfortunately, computer-based systems are still far from the level of the human visual system due to several issues, such as environmental complexity and intra-class variance.
Figure 1.1: Approaches for human action recognition [12]
There are numerous challenges in developing an effective and reliable system for HAR. The system is expected to recognize a large number of actions with high accuracy, and it should be able to operate in real-time [4]. The action recognition system consists of data collection, pre-processing, feature extraction, and classification. Data collected from sensors are pre-processed with methods such as energy-based segmentation. Following pre-processing, feature extraction is used to reduce the number of features to improve classification efficiency. Feature extraction can be completed using either handcrafted or deep learning methods. Actions are then classified based on the extracted features. Typical application domains of action recognition are discussed below.
• Video surveillance: Human operators can hardly watch video streams from a large number of cameras simultaneously. As a result, video surveillance footage can be used to track criminal suspects but rarely avert potentially risky situations. Action recognition systems allow video surveillance cameras to be continuously evaluated and alerts to be activated in the event of suspicious or criminal actions. Such a system fully utilizes the capability of surveillance camera networks while also considerably improving safety by early alerts to the authorities for immediate action.
• Healthcare: Action recognition is becoming increasingly significant and utilized to monitor patients. Remote monitoring of a patient allows the doctor to keep track of the patient's health and intervene before major health problems occur. Action recognition systems can track a patient's health and spot potential physical or mental issues early. These systems monitor the actions of the elderly in daily life for a safe and comfortable environment. Typically, these devices capture an elderly's continuous movement, automatically recognize their actions, and detect any irregularity as it occurs, such as a fall, a stroke, or breathing problems. The elderly can stay at home longer without being hospitalized thanks to such systems, improving their comfort and health.
• Human-Computer Interaction: Human-Computer Interaction (HCI) began with keyboards and other peripheral devices that make communication easier, faster, and more comfortable. With recent advances in computer vision and camera sensors, the computer can now recognize human actions and initiate appropriate responses. Television sets have already been equipped with remote control capability using action recognition. In the gaming business, an Xbox game console with a Microsoft Kinect sensor can interpret whole-body motion, resulting in a new level of gaming experience.
• Robotics: Robotics is the study of the design of robots and the computer systems that control them. These technologies are being used to create machines that can act like humans. Robots can be employed in any circumstance and for any purpose. Many robots are now working in hazardous locations, manufacturing processes, and other situations where humans cannot survive. Many robotics applications rely on a robot's ability to recognize human actions. It is expected that robots can comprehend human actions and act accordingly. Assistive robot systems are robotics systems that can help humans with a variety of everyday tasks like housework. Patients, the elderly, and disabled persons can also benefit from the support.
• Self-driving vehicles: A good driver is recognized for observing and anticipating driving situations. In the future of autonomous vehicles, accurate interpretation of the actions of other traffic participants would be critical. Autonomous vehicles, for example, must assess and predict pedestrian intentions when crossing a road. This would allow self-driving cars to avoid potentially hazardous situations. Action recognition can be one of the most critical building blocks in autonomous driving. A vehicle with an action prediction algorithm can forecast a pedestrian's action or motion trajectory in the next few seconds, which could be essential in avoiding a collision in an emergency [9]. Drivers' secondary tasks, such as answering the phone, texting, eating, or drinking while driving, have been documented to create inattentiveness, leading to accidents [11].
• Video retrieval: Online services are growing rapidly, as well as social media services. On the Internet, people can easily post and share videos. However, because most search engines employ the accompanying text data to manage video data, maintaining and retrieving videos according to video content is becoming difficult. Text data such as tags, titles, descriptions, and keywords can be inaccurate, obfuscated, or irrelevant, making video retrieval impossible. Analyzing human behaviors in videos is an alternate strategy, given the bulk of these videos feature some human actions [4]. High-speed Internet is available to a large number of cell phone users. High-resolution cameras are built into phones and other mobile devices. The amount of video data is extremely large, requiring automatic video retrieval systems. Systems that can rapidly analyze videos and deliver reliable search results or suggestions become vital as data grow.
1.3 Data modalities for action recognition
Data modalities can be loosely categorized into two groups: visual modalities and non-visual modalities. Visual modalities such as color, depth, and skeleton are visually logical for describing actions. The trajectories of joints are encoded in skeleton data. The skeleton data efficiently represent actions when the action does not include objects or scene context. Visual modalities are popular for action recognition. Meanwhile, non-visual modalities like acceleration are not visually intuitive for describing human actions. However, these modalities can also be used for action recognition in robotics and related domains.
1.3.1 Color data
The color modality refers to images or videos captured by cameras. Color data are usually easy to obtain. It offers rich information about the captured scene. However, action recognition from color data might be complex because of backgrounds, viewpoints, and lighting conditions. RGB input contains a lot of texture information and appears to be similar to the information processed by humans. Color data contain a lot of information that is unnecessary for action recognition. Color data are highly dependent on lighting conditions. In this situation, techniques that operate with RGB data must detect important regularities and separate them from non-related regularities like lighting changes [15]. Furthermore, color data have large volumes, which results in substantial computing costs for representing human actions. RGB-based approaches rely on sequential RGB images. However, it is still difficult to recognize actions in the wild due to high intra-class variance, scaling, occlusion, and clutter [16].
1.3.2 Depth data
Depth maps are images in which the pixel values describe the distance between the depth sensor and the scene's points. The depth modality, which is resistant to color and texture changes, can be employed for action recognition since it delivers reliable structural and shape information about the subjects. Different types of depth sensors have been commercialized, including Time-of-Flight and structured-light-based sensors. These sensors emit infrared radiation and then detect the reflected energy from the objects to obtain depth data, as shown in Figure 1.2. Depth data are free of some of the issues that exist in RGB. For example, a depth map is insensitive to changes in illumination, makes it simpler to discern foreground from background scenes, and offers 3D information about the scene. However, there are drawbacks to using a depth map: it does not offer texture information and typically produces a lot of measurement noise. This is true in the case of low-cost RGB-D sensors. The depth map indicates the distance between each pixel and the sensor. The precision of the sensor is a non-linear function of distance; therefore, the further an object is from the sensor, the less precise the measurement becomes. Due to occlusions and the fact that many materials absorb sensor beams, the depth map acquired from the sensor is frequently noisy and contains numerous missing data [15]. Depth-based approaches rely primarily on features collected from the space-time volume, either local or global. Unlike visual data, depth maps provide geometric measures unaffected by lighting. Using depth maps to develop a system for action identification that is both successful and efficient is still a difficult task. However, depth can make the foreground and background separation easier [16].
Trang 29tation It is robust against changes in clothing and backgroun Motion capture
(Mocap) systems can be used to collect skeleton data Mocap systems can offer high-
quality skeleton data as these systems are insensitive to view and lighting Howe
er, Mocap systems are expensive and inconvenient As a result, skeleton data obtained
from depth maps or color image are employed in many situations, With the pop-
ularity of low:
cost depth sensors and advances in pose estimation, skeleton data are
y to acquire Skeleton data can be collected directly from depth sensors or by ap-
by appearance and background changes, makin
Sample frames of the
hammer action in MSR-Actiou3D are shown in Figure 1.3
Trang 30
jon of the hammer action in MSR-Action3D,
Figure 1.3: A skeleton represen
1.3.4 Other modalities
• Acceleration modality: Acceleration signals are collected by wearable sensors while the subject is performing an action; hence the acceleration signal does not exhibit noticeable variances for the same action. Because action recognition employing acceleration data can achieve excellent accuracy in most cases, it has been utilized for remote monitoring systems while addressing privacy concerns. The acceleration modality can be used for elderly monitoring in healthcare. However, the subject must wear the sensors, which are typically inconvenient [19]. Sample acceleration data in CMDFALL are shown in Figure 1.4.
Figure 1.4: Sample acceleration data in CMDFALL [20]
• Point cloud modality: The spatial distribution and surface features are represented by a point cloud comprising many points. There are two methods for collecting point cloud data. By rotating the point cloud in 3D space, point cloud-based action recognition techniques can be made insensitive to the viewpoint. However, point clouds frequently contain noise and have a non-uniform distribution of points, making robust action recognition difficult. Furthermore, it is computationally expensive to process point cloud data.
Action recognition based cn a single modelity has been the mest popular Hewever, each data modalixy has pros and cons, as shown in ‘lable 1.1 Current research focuses
on the fusion of clifferent data modalities and the transter of information across modal
ities to improve the accuracy and robustness of action recognition Features con be
combined trom two ar more modalities for action recognition Acceleration data, for
sxaiaple, can be used in coujumclion with skeleton uta, Huns frequently iuler the actions in a wulti-modal wsanner
Tyhle 1.1: Cunparisoa amoug dala ewdalilies,
1 | Color Riek context afarimsuon Easy to access ‘Sonsitive to lighting ‘background
2 | bạ Insensitive to lighting/ background ‘ SD ie Noisy
anepaL 3D/3D inloriastion
Insensitive to lighting/ background
No eppearance information Wort by subjects Coulis 3D information Compu.ation complexicy Insensitive te lighting/background Noisy
4 | Acceleration Privacy protection,
Point cloud
Multi-modal snachine lacuing is a method that aims to analyze wud late seusory data from several modalities Multimodal machine learning may typically give a mare robust and accurate HAR by combining the advantages of different data modalities Fu- sion and co learning are the two primary forms of multi modality learning approaches Fusion refers to iulegzaling information [evn two or sure modalities for (raining aud
inference, while co learning refers to the transmission of knowledge between distinct:
data modalities [19]
« Multi-modulity Fusion: Because each data modalit:
natural to use fusion to combine the benefits of data modalities to improve HAR
y ha its advantoge, it is
performance Score fusion and feature fusion are two multi-modality fusion tech-
niqnes in HAT The score fusion incorporates the decisions made individually
besed on multiple modalities to obtain the final classification results, On the
other hand, feature fision merges data from several modalities te provide ARATE
gated features that are discriminarive for action recognition
« Multi-modality Co-learning: Transierring, knowledge across modalities can
lel improve aetion recoguition perloraiauce Co-learding is the sludy of wpply-
ing, knowledge gained trom anviliary modalities to learning 2 model on a different
modality Unlike fusion approaches, data from auxiliary modalities is only neces-
„
Trang 32sary during, training, rather than testing This is especial'y useful when specific
inudalilies are unavailable duciag lesting
1.4, Skeleton data collection
Skeleton dave are teuporal sequences of joiut positions Joints are connected ia tle
kinetic mode! by the natural stricture of the human body Tt is convenient ta model
actious using Uhe kinetic node Skuletou data can be collucted using Mecap syeteins, RGB | D sensors, or color/depth-based pose estimation
1.4.1 Data collection from motion capture systems
Tn motion capmnre aystems, markers are placed on joint positions These systema
use cither visual cauictas to Wack refleclive markers mouuled to a subject's bedy
or inertial sensors to estimate bedy component rotations with reference to a fixed
point Existing Moeap devices are equipped with software that allows for the precise capture of 3D skeletal data Llowever, most Mocap devices can only be employed in controlled environnente |2I| Skelcton data collected by Mocap systems are of high
accuracy Towever, Mocap systems are expensive and inconventent for many practical
ayplivations Su thie work Jocus ou the skeleton data collected ly the popular low-cost depth sensors
1.4.2, Data collection from RGB+D sensors
The availability of low-cost depth cameras has cided in extracting human skeleton
data Microsoft, Kinect anc ASUS Xtion PRO LIVE are popular depth sensors These
seusors live au infrared projeclox, an infrared vatwere for capluriug depth data, aud a camera for capturing color images The depth cameras offer 3D depth data of the scene, which helps estimate the 3D joint posivions of tne human skeleton (LU) Inirared ([R) light beams are emitted by an IR emitter Reftected IR beams from the environment retum to the LK depth sensor ‘The reflected beams are used to calculate the distances
between scene points anc the sensor The sensor har can he tilted vertically uring the
tilt motor The three modalities of RGB-D sensor data are color, depth, and skeleton
Color data are the most important data with rich ‘nformation, which allow for the
detection of interesting paints and the extraction of optical flows ‘lhe depth modality is
insensitive to lighting changes, invariant to color and texture changes, and trustworthy for estimating the skeleton It gives extensive 4D structural information of the scene compared to color data Skeleton data are high level features for action representation compared 1 color and depth date, It can be invariant vo scale and illumination [16] An example cf using an RGR D sensor far human computer interaction in entertainment
is shown ia Figure L6
Trang 33KINECT & Ome
Figure 1.5; Microsoft Kinect sensor (left) and human-computer interaction (right) [10]
RGB-D sensors estimate skeleton data from depth maps. The depth information makes the skeleton extraction more practical and stable when compared with pose estimation using color images. The core idea behind pose estimation is to use dense probabilistic labeling to segment the human depth image into several body segments. This segmentation can be seen as a classification task for each pixel in the depth image. The spatial modes of pixel distribution are used to calculate the 3D joint positions. In [22], Shotton et al. propose a pose estimation method from a single depth image. The per-pixel classification results from a randomized decision forest classifier are used to classify human body parts. The mean-shift approach then infers the joint positions from the classified pixels. Skeleton data estimated by RGB-D sensors is less accurate than Mocap systems [4]. RGB-D sensors can be categorized as Structured-Light sensors and Time-of-Flight sensors.
• Structured-Light sensors: Structured-light sensors, which employ infrared light to collect depth information, are used to acquire 3D skeletal data directly. The device determines the depth by projecting light in a predetermined pattern through the infrared sensor and observing the distortion in the pattern when it meets a subject. Most RGB-D sensors are affordable, making them suitable for various applications and widely used in HAR research. When skeleton data are estimated from depth sensors, noise is inevitable. Examples of noise in the skeleton data of MSR-Action3D are shown in Figure 1.6.
• Time-of-Flight sensors: By generating light and measuring the time it takes for the light to return, Time-of-Flight sensors [48] gain 3D information. With high frame rates, these sensors capture exceptionally accurate 3D data compared with structured-light sensors [23]. RGB data can be retrieved and analyzed in conjunction with depth data [21]. Microsoft Kinect v2 is a Time-of-Flight sensor with 25 joints in the skeleton model.
Figure 1.6: Noise in the skeleton data of MSR-Action3D is marked with red boxes
1.4.3 Data collection from pose estimation
Many methods for estimating human joints and recognizing poses from available data have been developed. These methods use depth images or additional information provided by the visual sensing device. The majority of the methods rely on identifying body parts, which are then fed into models that extract specific positions for those parts. This section presents an overview of visual data-based methods for constructing human skeletons. The first approach is to use depth images for pose estimation. A human skeleton can be estimated from a single depth image or a series of depth images acquired over time. Because of the additional geometric information provided by depth pictures, this method is commonly employed in collecting action data. The second approach is to use color images for 2D pose estimation. Most RGB image-based algorithms extract visual cues using deep learning architectures and other methods to match poses of segmented silhouettes for body part identification. It is worth noting that 2D poses have limits compared to 3D skeletons. For example, 2D poses do not allow any rotation in 3D space, whereas skeletal data do. As a result, many skeleton-based techniques that are explicitly or implicitly predicated on the advantages of 3D space cannot be applied directly to the 2D scenario [24].
The limitation of using depth sensors is the range within a few meters. So depth sensors are typically used for indoor applications. When a large range is required for outdoor applications, pose estimation from color data is important. An example application is for surveillance and rescue using the Unmanned Aerial Vehicle (UAV). Pose estimation from color data is applied due to the large distance between the UAV and the subjects. Skeleton data can be estimated from color data using pose estimation tools such as OpenPose [25], MediaPipe [26], AlphaPose [27], and DeepLabCut [28]. Examples of pose estimation using OpenPose and MediaPipe are shown in Figure 1.8 and Figure 1.9, respectively.
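As a small illustration of this approach, the snippet below extracts pose landmarks from one image with the MediaPipe Python solution; the image path is a placeholder, and the calls follow MediaPipe's publicly documented Pose API.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# static_image_mode=True suits single frames; use False for video streams.
with mp_pose.Pose(static_image_mode=True) as pose:
    image = cv2.imread("person.jpg")  # placeholder path
    # MediaPipe expects RGB input, while OpenCV loads images as BGR.
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # The model outputs 33 landmarks with normalized x, y and a relative depth z.
    for idx, lm in enumerate(results.pose_landmarks.landmark):
        print(idx, lm.x, lm.y, lm.z, lm.visibility)
```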
Trang 36by RGB-D sensors MSR-Action3D, MICA-Action3D, and CMDFALL are collected
by Kinect v1 sensors while NTU RGB} D is collected by Kinect v2 sensors MSR-
MSR-Action3D is introduced by Li et al [17] There are 20 actions performed by
ten subjects Each subject performs an action two or three times, There are 20 joi
in the skeleton model The frame rate is 15 fps There are 557 action sequen ‘sin
Action Set 1 (AS1),
total, Actions in MSR-Action’D are grouped into three subset:
18
Trang 37Action Set 2 (AS2), and Acrion Set 3 (AS3) as in Table 1.3 Fach subset consists of
eight actions, sv yom actions appear in wore than one subset
Twala [As List of actions in MST-ActionaN,
Action Set 1 (A81) | Action Set 2 (AS2} Action Set 3 (A53)
aĨ srm wawe | L húgh nh wawe — 0 hình throw
Ua, 14, forward kick,
10, sand clap 09 draw ecrele 3T tenris swing T3 send TT tee hand wave
18, cennis seve 12, side boxing 2U pickup & teow 14, forward kick
1.5.2 MICA-Action3D
Action sequences are collected Ly u Kinect v1 seusor, "The Gutasel is built for eross-
dataset, evaluation with the same list of 20 actions as in MSR Action3D [17] Fach
action is repeated two or hres times by cach subject Tweuty subjects participate in
data collection ta generate 1,196 action samples I'he frame rate is 20 fps
1.5.3 CMDFALL
‘Uhe dataset is introduced in [All] to evaluate methods for hnman falling detection
in healtheare For dala cullectiog, seven Kiucet seasors aru installed weress thie sur- roundings ‘Ihe frame rate is 2U ps ‘hase actions are performed by 5U people ranging
sh 20 fomales and 30 maics} Table 1.4 lists the 20 action
classes in the CMDFALL dataset ‘'here are many poses with subjects lying on the
ground in the CMDFALL dataset so there is severe noise in the skeleton data.) The
in age from 21 to 40 (
reason ‡s that the Kinect sensor could not get on accurate pose estimation for sub-
jects (rom non-standing poses [39] As a result, CMDPALL is a challenging dataset for
action recognition Action recognition is conducted using skeleton Gata from Kinect
view 3 in this study, the same as in the original paper on CMDFALL [20| There axe 1.¥63 sequences in total, Half of the subjects are used for training with ¥74 sequences
‘The remaining arc used Zor testing wilh 989 sequences CMDFALL ie an imbalanced dataset with different numbers of samples in action classes
Trang 38Luble La: List of aetiora In CMDEALL
1 [Action Name [1 Aetion Name
2 | cun slow'y 13 eral 3_| jump ie place 13 _sit on chair then stané up
4 [move band and tog [14 mave choir
3 [Tele baud pick up [15 sit oa chair hen fall lel
6 | right band pick up | 16 sit on chair then dall right
ï | magger 17_ sit om bed then stané up
8 | hat fal 18 Reøn bại then si ap
9 | back fall Wie on bed then fall ef:
0 | ten, tall 20 tie on bad Uber fall eight
skeleton model ‘here are 56,480 sequences in the dataset, categorized into 6U action classes Table 1.5 shows the list of action classes in the NTU RGB4+D dataset The actions are performed by 40 peop'e Three Kinect sensors are monnted at the same height but at various wugles The datuset is collected using 17 camera setups with different ‘neights anc dstances ‘The frame rate is 30 fps ‘I'he dataset authors rec- ounnend tựa benelusarke: (1) Crusssubject {C8}: half of the subjects ave used for training, and the remaining haif are used for testing ‘I'here are 40,320 sequences in the training sct and 16,560 scquences in the testing sct (2) Cross-view (CV): the training set includes 37,920 sequences (from cameras 2 and 3}, and the testing ses consists of
18,960 scqucaces (Lom camera 1)
1.6 Skeleton-based action recognition methods
Skeleton data are more compact and discriminative for action representation than the traditional color data. Skeleton data have a lot of advantages, such as computation efficiency and robustness against variations in clothing texture and background. Skeleton data are simple to obtain thanks to the widespread use of depth sensors and advances in human pose estimation with color/depth data. Various methods for skeleton-based action recognition have been proposed in the literature, as shown in Figure 1.10. Skeleton-based action recognition can be categorized as in Figure 1.11. Moreover, skeleton-based HAR can directly benefit from advances in pose estimation. Multi-modality HAR is also a popular approach in the literature to combine the advantages of different data modalities.
Trang 39cova0) Joint Velocity €NN AGCN
Figure 1.11: Skeleton-based HAR methods. Proposed methods are marked in green.
1.6.1.1 Joint-based action recognition
Joint-based representation has been commonly used for action recognition. Joint-based representation models actions using the relationship between joint coordinates. In [43], Wang et al. use Relative Joint Position (RJP) for action representation. The Fourier Temporal Pyramid (FTP) representation is proposed to tackle temporal misalignment. As skeleton data are sequences of different lengths, the Fourier transform is applied to all sequences. Feature vectors are formed by concatenating the low-frequency Fourier coefficients from the transformation results. 3D skeleton datasets are used for performance evaluation, including MSR-Action3D, MSR-DailyActivity3D, and CMU Mocap.
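A minimal sketch of these two ingredients follows: relative joint positions computed against a reference joint, and a fixed-length feature built from low-frequency DFT coefficients. The reference joint, the number of retained coefficients, and the single-level pyramid are simplifying assumptions; the full FTP of [43] applies the transform recursively over nested temporal segments.

```python
import numpy as np

def relative_joint_positions(seq: np.ndarray, ref_joint: int = 0) -> np.ndarray:
    """RJP: express every joint relative to a reference joint (e.g., the hip).
    seq has shape (T, N, 3); the output keeps the same shape."""
    return seq - seq[:, ref_joint:ref_joint + 1, :]

def low_freq_fourier_feature(traj: np.ndarray, n_coeffs: int = 4) -> np.ndarray:
    """Keep the first low-frequency DFT coefficients of one 1-D trajectory.
    Sequences of different lengths map to a feature of the same size, which
    is how the Fourier representation copes with temporal misalignment."""
    spectrum = np.fft.fft(traj)[:n_coeffs]
    return np.concatenate([spectrum.real, spectrum.imag])

T, N = 60, 20
seq = np.random.rand(T, N, 3)              # stand-in skeleton sequence
rjp = relative_joint_positions(seq)

# One fixed-length block per joint and coordinate axis, concatenated.
feature = np.concatenate([
    low_freq_fourier_feature(rjp[:, i, c])
    for i in range(N) for c in range(3)
])
print(feature.shape)  # 20 joints * 3 axes * 8 values = (480,)
```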
In [44], Yang et al. present features that model the dynamics of individual joints. In [45], action features are formed using a temporal hierarchy of covariance descriptors (Cov3DJ). The covariance matrix is employed in this descriptor to reflect the interdependence of joint positions. Covariance matrices are computed across temporal windows in a hierarchical approach to capture the dependency between joint positions, as in Figure 1.13. The descriptor's size does not depend on the length of the sequence.
Figure 1.12: Covariance descriptor method [45]
Figure 1.13: Temporal hierarchy for covariance descriptors [45]
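The core computation behind such descriptors is compact enough to sketch: stack each frame's joint coordinates into one vector, compute the covariance over a temporal window, and keep the upper triangle (covariance matrices are symmetric). A single window is shown here; the hierarchy of Figure 1.13 repeats this over nested windows and concatenates the results.

```python
import numpy as np

def covariance_descriptor(window: np.ndarray) -> np.ndarray:
    """Covariance descriptor of joint positions over one temporal window.
    window: (T, N, 3) slice of a skeleton sequence. Returns the upper
    triangle of the (3N x 3N) covariance matrix, so the descriptor size
    is fixed regardless of the window length T."""
    T = window.shape[0]
    frames = window.reshape(T, -1)        # each frame -> one 3N-dim vector
    cov = np.cov(frames, rowvar=False)    # (3N, 3N) covariance matrix
    rows, cols = np.triu_indices(cov.shape[0])
    return cov[rows, cols]

window = np.random.rand(30, 20, 3)        # 30-frame window, 20 joints
descriptor = covariance_descriptor(window)
print(descriptor.shape)  # 3N * (3N + 1) / 2 = (1830,) for N = 20
```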
In an alternative approach, only the actively involved joints in actions are chosen for action representation. Joints are either preset or automatically selected based on statistical metrics of joint locations or angles. In [32], the histogram of 3D joint positions is represented in spherical coordinates. The Hidden Markov Model (HMM) is used to classify the temporal evolution of visual words. Adjacent joint pairs such as hand/wrist and foot/ankle are very close to each other and thus redundant for the description of human actions. The spine, neck, and shoulder do not contribute to body motion. A set of 12 joints is selected from the 20-joint skeleton model. In [46], Gaglio et al. use a subset of 11 joints to extract features for action recognition. In [47], different joint combinations such as the neck, right hip, left hip, and right waist are examined.
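To close the section, the sketch below illustrates the automatic flavor of this idea: rank the joints of a sequence by a simple statistical score (here, the total positional variance over time) and keep the top M as the most informative joints. The variance score is a deliberately simple stand-in; [32] and [44] rely on richer measures based on position histograms and joint angles.

```python
import numpy as np

def most_informative_joints(seq: np.ndarray, m: int = 8) -> np.ndarray:
    """Rank the joints of a (T, N, 3) sequence by movement and keep the top m.
    The score is the total positional variance of each joint over time, so
    joints that move a lot during the action rank highest."""
    scores = seq.var(axis=0).sum(axis=1)   # one scalar score per joint, shape (N,)
    return np.argsort(scores)[::-1][:m]    # indices of the m highest scores

seq = np.random.rand(60, 20, 3)            # stand-in skeleton sequence
selected = most_informative_joints(seq, m=8)
reduced = seq[:, selected, :]              # sequence restricted to selected joints
print(sorted(selected.tolist()))
```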