Thesis: Skeleton-based human activity representation and recognition


MINISTRY OF EDUCATION AND TRAINING

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Tien Nam NGUYEN

SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND RECOGNITION

MASTER OF SCIENCE THESIS IN INFORMATION SYSTEM

Hanoi - 2019

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Tien Nam NGUYEN

SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND RECOGNITION

Speciality: Information System

MASTER OF SCIENCE THESIS IN INFORMATION SYSTEM

SOCIALIST REPUBLIC OF VIETNAM

Independence - Freedom - Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Full name of the thesis author: Nguyễn

Thesis title: Research and development of methods for skeleton-based human activity representation and recognition

Major: Information System

Student ID: CBC18019

The author, the scientific supervisor and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting held on 26/10/2019, with the following content:

No. 1
Committee's request: Merge Chapters 4 and 5.
Revision made: Chapters 4 and 5 have been merged into one chapter named Experimental results.

No. 2
Committee's request: Explain the reasons for choosing the recognition methods used in the thesis.
Revision made: The student has added details on the reasons for choosing the methods in Chapter 1, Section 3.

No. 3
Committee's request: Add the evaluation metrics Precision, Recall and F1.
Revision made: The student has added information on how the evaluation metrics are computed, presented in Chapter 4, Section 2 (Evaluation metric). Precision, Recall and F1 score can all be used to evaluate a recognition system. However, in the thesis, in order to compare with previously proposed methods, different metrics are used depending on the dataset. The MSRAction3D dataset uses accuracy, while the CMDFall dataset uses the F1 score. In the revised thesis, besides the metrics used for each dataset individually, the student has added Table 4.7 in Chapter 4, which reports the recognition results on all metrics for the two experimental datasets.

7 November 2019

CHAIR OF THE COMMITTEE

Acknowledgements

I would first like to thank my thesis advisor, Associate Professor Le Thi Lan, head of the Computer Vision Department at MICA Institute. The door of Assoc. Prof. Lan's office was always open whenever I ran into a trouble spot or had a question about my research or writing. She consistently allowed this thesis to be my own work, but steered me in the right direction whenever she thought I needed it.

I would also like to thank the experts who were involved in the validation survey for this thesis: Dr. Vu Hai, Assoc. Prof. Tran Thi Thanh Hai and PhD student Pham Dinh Tan, who participated and gave me much useful information. Without their passionate participation and input, the validation survey could not have been successfully conducted.

I would also like to acknowledge the School of Information and Communication Technology, which created the best conditions for me to complete this master thesis, and I am gratefully indebted to the teachers at SOICT for their very valuable comments on this thesis.

Finally, I must express my very profound gratitude to my parents, my sister and my colleagues at Toshiba Software Development Vietnam (Nha Dink Duc, Pham Van Thanh and many other colleagues) for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you!

Abstract

Human action recognition (HAR), which aims to predict what action a person is performing, is currently receiving increasing attention from computer vision researchers due to its wide range of potential applications in fields such as human-computer interaction, surveillance cameras, robotics and health care. Recently, the release of cost-effective depth cameras such as Microsoft Kinect and Asus Xtion PRO LIVE has opened new opportunities for HAR, as they provide richer information about the scene. Thanks to [...] performance computing hardware. Among hand-crafted descriptors for action representation, Cov3DJ, which uses the covariance matrix of 3D joint positions, has proven its effectiveness and computational efficiency [2]. To take into account the variation in action duration, a temporal hierarchy representation with multiple layers is introduced. However, the disadvantage of Cov3DJ is that it uses all joints of the skeleton, which causes a computational burden and may become ineffective, as each joint has a certain level of engagement in an action. Moreover, the authors employ only joint positions as joint features, which does not seem sufficient to represent actions. Therefore, other features (joint velocities) are investigated and combined with joint positions to create more discriminative features for each action. This [...] public datasets (MSRAction3D [3] and CMDFall [4]). On MSRAction3D, the experimental results show that the proposed method obtains a 6.17% improvement over the original method and outperforms many state-of-the-art methods. On the CMDFall dataset, the proposed method, with an F1 score of 0.64, outperforms the deep learning networks ResTCN (F1 score: 0.39) [4] and LSTM (F1 score: 0.46) [5]. The contributions of the thesis have been published in an international conference.

Challenges and open issues in skeleton-based HAR
State of the Art
The proposed approach
The most informative joints detection
3.2.1 Strategy 1 (FMIJ) for most informative joints detection
3.2.1.1 Detect candidate joints for each action
3.2.1.2 Select the most informative joints of each action
3.2.2 Strategy 2 (AMIJ) for most informative joints detection
3.3 Action representation by covariance descriptor
Evaluation of features used for joint representation
4.4.1 Results on MSRAction3D dataset
4.4.1.1 ActionSet1
4.4.1.2 ActionSet2
4.4.1.3 ActionSet3
4.4.2 Results on CMDFall dataset
4.5 Evaluation of the most informative joints selection
4.5.1 The effect of the number of most informative joints
4.5.2 Comparison between two strategies
Comparison with state-of-the-art methods
References

2.1 Id of the joints of a skeleton in MSRAction3D dataset
2.2 Coordinate system in Kinect
2.3 Illustration of the temporal normalization [6]
2.4 Illustration of the Fourier Temporal Pyramid [7]
2.5 The Cov3DJ descriptor [2]
2.6 The construction of the CovP3DJ [8]
2.7 Informative joints determined for MSRAction3D dataset [9]
2.8 Representation in Lie group and mapping to Lie algebra
2.9 Two input streams architecture [10]
2.10 Hierarchical RNNs for action representation learning [11]
2.11 Part-aware LSTM for action recognition [12]
2.12 Action recognition by combining CNN and LSTM [13]
3.1 Overview of the proposed method
3.2 Skeleton sequence of Action Throw in MSRAction3D
3.3 Candidate joints of Action 6 (Throw) in MSRAction3D dataset. The candidate joints are marked in red colour.

The statistical length of each action in MSRAction3D dataset
The statistical number of occurrences of each action in MSRAction3D
Confusion matrix obtained for ActionSet1
Accuracy obtained with different features on ActionSet2
Confusion matrix obtained for ActionSet2
Accuracy obtained with different features on ActionSet3
Confusion matrix of ActionSet3
Results obtained for CMDFall dataset
Confusion matrix of CMDFall
Comparison of the distribution of each class based on covariance features on position in ActionSet1 of MSRAction3D
Comparison of the distribution of each class based on covariance features on velocity in ActionSet1 of MSRAction3D
Comparison of the distribution of each class based on covariance features on combination of position and velocity in ActionSet1 of MSRAction3D

20 actions and 3 subsets of MSRAction3D dataset
List of action classes and categorization
The selected joints on AS3 of MSRAction3D dataset with two strategies
Accuracy (%) of state-of-the-art methods on MSRAction3D dataset. The three best results are in bold. * The authors' experiments did not use the whole dataset.
F1 score of state-of-the-art methods on CMDFall dataset

List of Acronyms

AMIJ Adaptive Most Informative Joints

C3D Convolutional Three Dimensional

CNN Convolutional Neural Network

DTW Dynamic Time Warping

FMIJ Fixing Most Informative Joints

HAR Human Action Recognition

LSTM Long Short-Term Memory

MIJs Most Informative Joints

NN Neural Network

RNN Recurrent Neural Network

SVM Support Vector Machine

Machines will increasingly collaborate with people in many fields, and to do this, understanding human behavior is the core technology.

The main aim of human action recognition (HAR) is to recognize the type of human action using information captured by sensors. These actions can be executed by one person, two people or a group of people. Human action recognition from video refers to methods that recognize human actions from one or several cameras. Although numerous approaches have been proposed for understanding human actions [15], human action recognition remains a challenge for researchers due to the high number of actions and subjects and the huge variation in action rate and environment. The release of cost-effective depth cameras such as Microsoft Kinect [14] and Asus Xtion PRO LIVE has opened new opportunities for HAR, as they provide richer information about the scene. Besides color images, depth and skeleton information are also available. Moreover, the latest research results on human pose estimation in RGB video show that the human pose and skeleton can be accurately estimated even in complex scenes. Using skeleton information for human action recognition has several advantages in comparison with using color and depth information. Firstly, unlike color, the skeleton is robust to variation in human appearance. Secondly, HAR based on skeleton is efficient in terms of memory requirement and computational time (e.g., a human skeleton contains 20 joints with 3 coordinates (x, y, z)). Finally, Johansson's study has shown that human vision is able to recognize human [...] and the velocity of different motion patterns by observing the movement of the human skeleton's joints [17].

As a result, a wide range of methods for HAR using skeleton information have been introduced [18], [1]. Previous studies have shown that, to a certain extent, skeleton-based HAR methods have solved some of the problems of human action recognition with RGB cameras or video, and have demonstrated good recognition performance on several benchmark datasets. However, on more challenging datasets, the recognition rates are still low.

My thesis aims at improving the performance of human action recognition based on skeleton. In the next section, the challenges and open issues in skeleton-based HAR will be discussed. Then, the objectives and the contributions of my thesis will be provided.

1.2 Challenges and open issues in skeleton-based HAR

[...] by assuming that the actions of interest have been detected by the first step. HAR from skeletal data has several difficulties and challenges:

+ As the quality of skeleton estimation is not always perfect, especially for non-standing postures, the skeletal data usually contains noise. Fig. 1.2 illustrates one example of the Walk action in CMDFall where some joints are not accurately estimated.

+ [...] a high inter-class similarity in the action set. For example, three actions (draw X, draw circle, draw tick) in MSRAction3D are very similar.

+ As each subject has his/her own manner of performing actions in terms of speed and phase, the action dataset has a high intra-class variation.

+ In some datasets, for instance CMDFall, the beginning moment of an action can be confused with the previous action, as an action may be composed of several actions (e.g., sit on chair then stand up in the CMDFall dataset, as illustrated in Fig. 1.3).

Figure 1.2: Example of noisy data in CMDFall dataset with action Walk.

Figure 1.3: Example of continuous action in CMDFall dataset.

1.3 Objectives and Contributions

Current approaches for skeleton-based action recognition can be roughly divided into two main categories. The first category uses hand-crafted features, while the second investigates deep learning methods to automate the feature extraction process. Deep learning based techniques usually require large datasets and high performance computing hardware. Among hand-crafted descriptors for action representation, Cov3DJ, which uses the covariance matrix of 3D joint positions, has proven its effectiveness and computational efficiency [2]. To account for the time dimension of skeletal data, a temporal hierarchy representation with multiple layers is introduced in [2]. In the first layer, the covariance matrix is computed on all frames. In later layers, covariance matrices are computed on shorter temporal windows. Cov3DJ is evaluated on the MSRAction3D and MSRC12 datasets. However, the disadvantage of Cov3DJ is the use of all joints in the computation, which causes a computational burden and may become ineffective, as each joint has a certain level of engagement in an action [19]. I therefore investigate combining the idea of joint subsets from [19] with the covariance feature. Besides reducing the feature dimension, this also helps to create discriminative features between classes. Moreover, the authors employ only joint positions as joint features, which does not seem sufficient to represent actions. Therefore, other features (joint velocities) are investigated and combined with joint positions to create more discriminative features for each action.

This thesis improves the covariance-based action recognition method presented in [2] by (1) proposing two different schemes to select the most informative joints and (2) combining velocity information with the positions of the joints for action representation. To evaluate the effectiveness of the two proposed improvements, extensive experiments have been performed on two public datasets (MSRAction3D [3] and CMDFall [4]). On MSRAction3D, the experimental results show that the proposed method obtains a 6.17% improvement over the original method and outperforms many state-of-the-art methods. On the CMDFall dataset, the proposed method, with an F1 score of 0.64, outperforms the deep learning networks ResTCN (F1 score: 0.39) [4] and LSTM (F1 score: 0.46) [5]. The contributions of the thesis have been published in an international conference.

1.4 Outline of the thesis

The thesis consists of 5 main chapters. In Chapter 2, I present the characteristics of skeletal data and pre-processing techniques, and analyze in detail the advantages and disadvantages of the state-of-the-art approaches proposed for HAR using skeleton. Then, the proposed method is described in Chapter 3. Chapter 4 presents experimental results on two benchmark datasets (MSRAction3D and CMDFall). Finally, conclusions and future works are given in Chapter 5.

[...] broadly grouped into two main approaches: hand-crafted and deep learning. With the ability to learn features from data by itself, deep learning can find features that hand-crafted design cannot create. In recent years, deep learning has received tremendous interest from the research community as a strong feature engineering method. However, precisely extracting specific features from raw data is a hard task even with deep learning. That is why hand-crafting still plays an important role for this problem. This chapter analyzes the state-of-the-art methods proposed for human action recognition. I divide these methods into two main categories: hand-crafted features-based and deep learning based.

2.1 Overview of skeletal data and skeleton-based human action recognition

According to [17], a skeleton is considered a schematic model of the locations of the torso, head and limbs of the human body. The parameters and motion of such a skeleton can be used as a representation of human actions; therefore, the human body pose is defined by means of the relative locations of the joints in the skeleton.

The number of joints of a human pose depends on the device used to extract the skeleton. For example, Kinect sensor v1 provides 20 joints while 25 joints are available with Kinect v2. In this thesis, all experiments are performed on two benchmark datasets, MSRAction3D and CMDFall, which provide 20 joints. The id of each joint is shown in Fig. 2.1. In this thesis, I use the following notation:

The i-th joint of the skeleton at time t is represented by $p_i^t = (x_i^t, y_i^t, z_i^t)$, where:

» $x_i^t$, $y_i^t$ and $z_i^t$ are the coordinates of the joint;

» $i = 1, 2, \dots, K$; $K$ is the number of joints used for representing the human skeleton ($K = 20$ in this thesis);

» $t = 1, 2, \dots, T$; $T$ is the duration of the action of interest.

The origin (0, 0, 0) is located at the center of the IR sensor on the Kinect (see Fig. 2.2). The x-axis grows to the sensor's left, the y-axis grows up (note that this direction depends on the sensor's tilt) and the z-axis grows out in the direction the sensor is facing. The unit is the meter.

Given a sequence of skeletons $P = \{p_i^t\}$, $i = 1, 2, \dots, K$ and $t = 1, 2, \dots, T$, skeleton-based human action recognition aims to determine the label of the action class.

Figure 2.1: Id of the joints of a skeleton in MSRAction3D dataset.

Figure 2.3: Illustration of the temporal normalization [6]

[...] of the action sequence, the noise or missing data. Pre-processing is able to solve the above issues. I have synthesized and adopted reasonable techniques offered by the authors of skeleton-based action recognition methods. In general, to address factors such as the differences among performers and velocity, [2] and [8] normalize the skeletal data into the range [0, 1]. Another normalization approach, used in [6] and [20], sets the hip joint as the origin (0, 0, 0); the coordinates of the remaining joints are computed by subtracting the hip coordinates. Instead of subtracting the coordinates of the remaining joints, [9] rotates the skeleton and aligns the horizontal axis x with the direction from the left hip (HL) to the right hip (HR). To deal with the varying length of actions, [20] first defines the desired number of frames and then uses an interpolation algorithm based on the known frames to ensure that all actions have the same length.
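As a rough illustration of two of the pre-processing ideas above (hip-centering and resampling to a fixed number of frames), the following sketch may help. It assumes a sequence is a list of frames, each frame a list of (x, y, z) tuples; the function names, the `hip_index` parameter and the choice of linear interpolation are assumptions made for this example, not the exact procedures of [6] or [20].

```python
def center_on_hip(sequence, hip_index=0):
    """Express every joint relative to the hip joint of the same frame.

    hip_index is a hypothetical parameter: the real index depends on the
    joint numbering of the dataset.
    """
    centered = []
    for frame in sequence:
        hx, hy, hz = frame[hip_index]
        centered.append([(x - hx, y - hy, z - hz) for (x, y, z) in frame])
    return centered


def resample_linear(sequence, target_len):
    """Resample a sequence to target_len frames by linear interpolation
    between neighbouring known frames (an assumed, simple interpolation)."""
    src_len = len(sequence)
    if src_len < 2 or target_len < 2:
        raise ValueError("need at least two source and two target frames")
    resampled = []
    for k in range(target_len):
        # Fractional position of output frame k on the source time axis.
        pos = k * (src_len - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        frame = [
            tuple((1 - w) * a + w * b for a, b in zip(joint_lo, joint_hi))
            for joint_lo, joint_hi in zip(sequence[lo], sequence[hi])
        ]
        resampled.append(frame)
    return resampled
```

Applying `center_on_hip` first and `resample_linear` second would give sequences that share an origin and a length, which is what the descriptors discussed later expect as input.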

On the other hand, [6] proposes an algorithm (see Fig. 2.3) called cubic spline interpolation of kinematic features to normalize the temporal dimension. Another approach, used in [21] and [2], handles the different lengths of actions with a pyramidal approach (see Fig. 2.4). To deal with the style of the action, following [14], most authors apply dynamic time warping to compute a nominal curve representing each action [20].

Figure 2.4: Illustration of the Fourier Temporal Pyramid [7]

2.3 Hand-crafted features-based approach

In the hand-crafted approach, features are manually designed and extracted based on the characteristics of actions. The methods belonging to this approach are categorized into two groups: spatial-temporal descriptors and geometric descriptors.

2.3.1 Spatial-temporal descriptors

In this approach, the authors compute spatial-temporal features for each joint in the skeleton. The method introduced by Hussein et al. [2] belongs to this approach. The authors concatenated the coordinates of all joints $p_i^t = (x_i^t, y_i^t, z_i^t)$, with $i = 1, 2, \dots, K$, at time $t$ to create a vector $S = [x_1^t, \dots, x_K^t, y_1^t, \dots, y_K^t, z_1^t, \dots, z_K^t]$. For an entire sequence, they computed the covariance of these vectors as the representation. Moreover, the authors also proposed computing the descriptor on small temporal windows to better distinguish actions that have temporal structure. Inspired by [2], the authors of [8] computed covariance matrices, named CovP3DJ, on separate body parts instead of on all joints (see Fig. 2.6). This idea helps to reduce the memory occupation from 78.26% to 80.35%, but accuracy is not significantly improved since they did not use the relationships among joints in different body parts.
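The covariance representation described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [2]: it assumes the sample covariance (division by T - 1) and keeps only the upper triangle because the matrix is symmetric; the function names are hypothetical.

```python
def frame_vector(frame):
    """Stack one frame as S = [x_1..x_K, y_1..y_K, z_1..z_K]."""
    return [joint[axis] for axis in range(3) for joint in frame]


def covariance_matrix(vectors):
    """Sample covariance (assumed normalisation: T - 1) of T vectors."""
    T, n = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / T for d in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for v in vectors:
        diff = [v[d] - mean[d] for d in range(n)]
        for r in range(n):
            for c in range(r, n):
                cov[r][c] += diff[r] * diff[c]
    for r in range(n):
        for c in range(r, n):
            cov[r][c] /= (T - 1)
            cov[c][r] = cov[r][c]  # mirror to keep the matrix symmetric
    return cov


def upper_triangle(matrix):
    """Flatten the upper triangle (diagonal included) row by row."""
    n = len(matrix)
    return [matrix[r][c] for r in range(n) for c in range(r, n)]


def covariance_descriptor(sequence):
    """Descriptor of a whole sequence: upper triangle of cov(S)."""
    vectors = [frame_vector(frame) for frame in sequence]
    return upper_triangle(covariance_matrix(vectors))
```

With K = 20 joints the vector S has 60 entries, so the flattened descriptor has 60 * 61 / 2 = 1830 values regardless of the sequence length, which is why the representation copes with actions of different durations. The temporal hierarchy mentioned above would apply `covariance_descriptor` to shorter windows of the same sequence and concatenate the results.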

Figure 2.6: The construction of the CovP3DJ [8]

To address the shortcomings of the covariance matrix, such as being prone to singularity, having limited capability in modeling complicated feature relationships, and having a fixed form of representation, [22] uses nonlinear kernel matrices to modify the original covariance matrix. This proposal obtained promising, state-of-the-art results on benchmark datasets. Instead of employing all joints for representing action, the authors of [19] stated that each action can be discriminated by some specific joints, so they proposed to extract the most informative joints for each action. This idea has been applied in much subsequent research. In [7], the authors proposed the concept of an actionlet defined on a subset of joints. Their model is more robust to errors in the features and better characterizes the intra-class variations in actions. The features they feed to the actionlet are the relative positions, i.e., the differences between pairs of joints $p_i$, $p_j$:

$$p_{ij} = p_i - p_j, \quad i \neq j \qquad (2.1)$$
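The pairwise relative positions of Eq. (2.1) can be illustrated for a single frame as follows. This is only a sketch: the function name is hypothetical, joints are indexed from 0 rather than 1, and a frame is assumed to be a list of (x, y, z) tuples.

```python
def relative_positions(frame):
    """Return p_ij = p_i - p_j for every ordered pair of distinct joints."""
    return {
        (i, j): tuple(a - b for a, b in zip(p_i, p_j))
        for i, p_i in enumerate(frame)
        for j, p_j in enumerate(frame)
        if i != j
    }
```

For K joints this yields K(K - 1) difference vectors per frame, which illustrates why restricting the computation to a subset of joints reduces the feature size considerably.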

In [23], the authors used subsets of joints to compute two new features: displacement vectors and relative positions. Displacement vectors are defined as: [...] then concatenated. However, in this work the joint subsets are predefined instead of being computed from data. Therefore, this approach can be suitable for some datasets and may give poor results on others. In [9], the authors [...] informative joints based on differential entropy (the extension of Shannon entropy) and used a bag-of-words scheme with a linear CRF for action recognition. Figure 2.7 shows the selected joints for MSRAction3D.

[...] the matrix in Lie group, so Lie groups are used in the work of many authors. In [20], the authors defined body parts for the skeleton; the relative geometry can be described by considering a rigid-body transformation (including rotation and translation) between two rigid body parts. Moreover, these rigid-body transformations are members of the special Euclidean group SE(3). A skeleton sequence is therefore represented as a curve in the Lie group. However, since classification with curves in a Lie group is difficult, they used the Lie algebra to map the curves to vectors. To take into account rate variations among subjects, they used dynamic time warping to obtain a nominal curve for each action. Building on the Lie group representation, the author of [24] continued to use it to represent action sequences. However, they observed that with normalization the rotations do not change, so they only used the translation transformation to describe the relative geometry between body parts. This representation halves the feature length compared with the representation in [20]. They also proposed a new algorithm for mapping from the Lie group to the Lie algebra. On the other hand, [25] proposed using geodesic distance and movement energy to select key frames before using the Lie group to describe action sequences. Methods based on Lie groups are often time consuming. Moreover, multi-dimensional curves in a Lie group are difficult to visualize.

Figure 2.8: Representation in Lie group and mapping to Lie algebra. (a) Description of the relation between two body parts in Lie group [20]. (b) Illustration of the logarithm map and unwrapping while rolling [24].

2.4 Deep learning based approaches

Besides hand-crafted features, deep learning-based approaches often employ end-to-end networks to represent actions. In [26], the authors used a convolutional neural network for action representation. Moreover, they also computed the most informative joints to create a description of the similarly discriminative movements of actions. In [10], the authors also used a CNN to automatically learn the action representation. However, instead of using only the raw skeletal data as in [26], they used an additional input stream with the skeleton motion M, which is calculated by subtracting the positions of all joints at times t and t+1. They then concatenated the two input streams at the end of the convolution layers, and one more fully connected layer is used before passing to the softmax layer for classification (see Fig. 2.9).

Figure 2.10: Hierarchical RNNs for action representation learning [11]

Some variants of CNNs such as C3D and GCNs [27] are popular. A 2D CNN was assumed to encode only spatial information, whereas 3D convolution can extract motion features from the temporal sequence. [11] verifies that a 3D CNN can achieve faster and more accurate performance. In [28], the authors used 3D filters (7x7x5, 5x5x3, 3x5x3) for feature extraction. Furthermore, they used not one channel but two: one for spatial and one for temporal information. For CNN-based approaches, the networks are unstable to train, and fully training these models requires immense computational resources.
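The skeleton motion stream M mentioned for [10] is a frame-to-frame difference of joint positions. A minimal sketch, assuming the sign convention p(t+1) - p(t) (the text only says the two positions are subtracted) and a hypothetical function name:

```python
def skeleton_motion(sequence):
    """Frame-to-frame joint displacement, assumed here as p(t+1) - p(t).

    A sequence of T frames yields T - 1 motion frames.
    """
    return [
        [
            tuple(b - a for a, b in zip(joint_t, joint_next))
            for joint_t, joint_next in zip(frame_t, frame_next)
        ]
        for frame_t, frame_next in zip(sequence, sequence[1:])
    ]
```

Because the output is one frame shorter than the input, a two-stream model needs some convention (for example dropping one position frame or padding the motion) to align both streams; which convention [10] uses is not stated in this chapter.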

Human action recognition can be categorized as a temporal sequence problem, so architectures designed for this kind of problem, such as recurrent neural networks (RNN) and long short-term memory (LSTM), are often used and obtain good performance. For instance, [11] used hierarchical RNNs for representation learning. First, they divided the body into five parts, each of which is passed to one RNN. At the next layer, the outputs of linked body parts are combined. The process is repeated until the whole body is reconstructed (see Fig. 2.10). With a similar idea, the authors of [12] observed that each group of joints can be assigned to a major part of the body, and that actions can be interpreted based on the interactions between body parts or with other objects, so they modified the architecture of the general LSTM. Instead of keeping a long-term memory of the entire body's motion in the cell, they split it into part-based cells. Each part cell has its own input and forget gates, but the output gate is shared among the body parts (see Fig. 2.11).

Figure 2.11: Part-aware LSTM for action recognition [12]

Generally, approaches based on recurrent networks have the drawback that the input length is not fixed, since the duration differs from action to action. Input padding as a solution to this problem does not give reasonable results. Training these networks to convergence is therefore complex and requires care when optimizing the loss function.

Figure 2.12: Action recognition by combining CNN and LSTM [13]

Another interesting approach, applied in many similar problems, is the combination of convolutional and recurrent neural networks used in [13] (see Fig. 2.12). In the first few layers, they used a CNN with features extracted from the Lie group. However, they used only a 1D CNN instead of a 2D CNN as in [26], [10]; the output of the convolutional layers is then used as the input of an LSTM.

[...] Hand-crafted feature-based approaches are intuitive. The results obtained on some datasets are very promising. The main benefit of these approaches is that they do not need a huge amount of data for training. The training models are straightforward, with simple classifiers. However, hand-crafted features have some drawbacks. The features are extracted from the perspective of the observers, so they often reflect local properties. Moreover, the process for selecting and computing these features is sometimes unintelligible. Also, for problems with big data or requiring high precision, hand-crafted features do not seem to capture and solve the problem adequately. In contrast, in deep learning methods the relevant features are found implicitly while training the models. The performance of these architectures has recently proven promising. Despite these results, successfully training the models is a difficult task. First, in the design phase, the model has to be designed carefully. Second, with the huge number of parameters, a huge amount of data is needed for learning, along with different strategies for the training process. This is still a challenging task for every researcher.

[...] the literature [2] by proposing to detect the most informative joints for covariance matrix descriptor computation and by combining velocity with position as features for action representation. With the above-mentioned improvements, the proposed method provides a more condensed and [...] representation of action. Moreover, the velocity information along with the position allows distinguishing actions that are similar in terms of joint positions and different in terms of joint velocity. In this chapter, I introduce in detail the main steps of my proposed solution.

3.1 The proposed approach

As analyzed above, I list the main steps of the proposed method, which is based on raw skeletal data. The pipeline below assumes that the duration of the action is predetermined through an action spotting step. Figure 3.1 shows the overview of the proposed method. I briefly describe the input and output of each step in the proposed method.

Figure 3.1: Overview of the proposed method

1. Most informative joints detection

+ Input: The raw skeletal data extracted from sensors
+ Output: The most informative joints for each action

2. Compute the covariance feature using the joints' positions and velocities from the most informative joints

+ Input: The informative joints of each action
+ Output: Final features, formed by concatenating the covariance matrices of position and velocity

3. Action recognition using SVM (Support Vector Machine)

+ Input: The features from the previous step
+ Output: The trained model in the training phase and the action label in the testing phase
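Step 2 of the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: velocity is approximated by the frame-to-frame difference of positions, the sample covariance (division by T - 1) is used, only the upper triangle of each covariance matrix is kept, and all function names are hypothetical. The SVM of step 3 is left out because it would require an external library; any SVM implementation could take the returned vector as input.

```python
def select_joints(sequence, joint_ids):
    """Keep only the joints listed in joint_ids (0-based indices)."""
    return [[frame[i] for i in joint_ids] for frame in sequence]


def velocities(sequence):
    """Assumed velocity: difference between consecutive frames."""
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(sequence, sequence[1:])
    ]


def cov_upper(sequence):
    """Upper triangle of the sample covariance of the stacked frames."""
    vectors = [[j[axis] for axis in range(3) for j in frame]
               for frame in sequence]
    T, n = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / T for d in range(n)]
    feats = []
    for r in range(n):
        for c in range(r, n):
            s = sum((v[r] - mean[r]) * (v[c] - mean[c]) for v in vectors)
            feats.append(s / (T - 1))
    return feats


def action_feature(sequence, joint_ids):
    """Concatenate position and velocity covariance features.

    Needs at least three frames so that the velocity sequence has two.
    """
    sub = select_joints(sequence, joint_ids)
    return cov_upper(sub) + cov_upper(velocities(sub))
```

With m selected joints, each covariance block contributes 3m(3m + 1)/2 values, so the final vector has 3m(3m + 1) entries; selecting fewer joints shrinks the feature quadratically.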

In the following sections, I describe each step of the proposed method in detail.

3.2 The most informative joints detection

Inspired by [19], each action can be represented by some informative joints instead of all joints. Using all joints in action recognition requires large computation and storage capacity and may also degrade recognition accuracy due to noise in skeletal data. Noise in skeletal data can be mitigated by using only the joints that are mainly engaged in each action. For instance, with the action Throw shown in Fig. 3.2, we only need the information from the joints left shoulder, left elbow, left wrist and left hand; the remaining joints carry little information for the recognition process. From this observation, in this thesis, instead of using all joints for computing the covariance descriptor as in [2], I propose to first detect the most informative joints and then compute covariance descriptors from these joints.

Figure 3.2: Skeleton sequence of Action Throw in MSRAction3D

Two strategies for most informative joints detection, based on different ideas, are proposed. The first strategy, named FMIJ (Fixed Most Informative Joints), fixes the most informative joints for each action. It consists of two main steps. The first step determines the joints with the largest position variances in an action instance of a subject as candidate joints; a set of candidate joints from all subjects is then formed for each action. The second step selects the joints with the largest number of occurrences in the candidate set as the Most Informative Joints. The limitation of FMIJ is that the candidate joints for each subject and each action have to be selected in the first step. After this step, the remaining joints (those not selected) are no longer considered. Due to the presence of noise in some action instances of some subjects, this strategy may miss important joints.
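The two steps of the first strategy can be sketched as below. Several details are assumptions made only for this illustration: the position variance of a joint is taken as the sum of the variances of its x, y and z coordinates, ties are broken by the lower joint index, and the names `n_candidates` and `n_selected` are hypothetical parameters.

```python
from collections import Counter


def joint_variances(sequence):
    """Per-joint position variance: sum of the x, y and z variances."""
    T, K = len(sequence), len(sequence[0])
    result = []
    for i in range(K):
        total = 0.0
        for axis in range(3):
            values = [frame[i][axis] for frame in sequence]
            mean = sum(values) / T
            total += sum((v - mean) ** 2 for v in values) / T
        result.append(total)
    return result


def candidate_joints(sequence, n_candidates):
    """Step 1: joints with the largest variance in one action instance."""
    var = joint_variances(sequence)
    order = sorted(range(len(var)), key=lambda i: (-var[i], i))
    return order[:n_candidates]


def most_informative_joints(instances, n_candidates, n_selected):
    """Step 2: joints occurring most often among all candidate sets."""
    counts = Counter()
    for sequence in instances:
        counts.update(candidate_joints(sequence, n_candidates))
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return sorted(ranked[:n_selected])
```

Here `instances` would hold all training sequences of one action class across subjects. The limitation noted above is visible in the code: a joint that never enters a candidate set in step 1 has no count and cannot be chosen in step 2.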

To overcome the limitation of the first strategy, I propose a second strategy named AMIJ (Adaptive Most Informative Joints). In detail, instead of choosing subset joints

Title	Skeleton-based human activity representation and recognition
Author	Tien Nam Nguyen
Supervisor	Assoc. Prof. Thi Lan Le
University	Hanoi University Of Science And Technology
Major	Information System
Document type	Master's thesis
Year of publication	2019
City	Hanoi

Format
Number of pages	70
File size	1.76 MB