MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Tien Nam NGUYEN
SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND RECOGNITION
MASTER OF SCIENCE THESIS IN INFORMATION SYSTEM
Hanoi - 2019
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Tien Nam NGUYEN
SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND RECOGNITION
Speciality: Information System
MASTER OF SCIENCE THESIS IN INFORMATION SYSTEM
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
CONFIRMATION OF MASTER'S THESIS REVISION
Full name of the thesis author: Nguyễn
Thesis title: Research and development of a method for skeleton-based human activity representation and recognition
Speciality: Information System
Student ID: CBC18019
The author, the scientific supervisor and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting on 26/10/2019, with the following contents:
No. 1
Committee's request: Merge chapters 4 and 5.
Revision made: Chapters 4 and 5 have been merged into one chapter named "Experimental results".
No. 2
Committee's request: Explain the reasons for choosing the recognition methods used in the thesis.
Revision made: The student has added details on the reasons for choosing the methods in Chapter 1, Section 3.
No. 3
Committee's request: Add the evaluation metrics Precision, Recall and F1.
Revision made: The student has added information on how the evaluation metrics are computed, presented in Chapter 4, Section 2 (Evaluation metric). Precision, Recall and F1 score can all be used to evaluate a recognition system. However, in the thesis, to allow comparison with previously proposed methods, different metrics are used depending on the dataset. The MSRAction3D dataset uses accuracy while the CMDFall dataset uses the F1 score. In the revised thesis, besides the metrics used for each dataset, the student has added Table 4.7 in Chapter 4 with the recognition results on all metrics for the two experimental datasets.
Date: 7 November 2019
CHAIR OF THE COMMITTEE
Acknowledgements
I would first like to thank my thesis advisor, Associate Professor Le Thi Lan, head of the Computer Vision Department at MICA Institute. The door to Assoc. Prof. Lan's office was always open whenever I ran into a trouble spot or had a question about my research or writing. She consistently allowed this thesis to be my own work, but steered me in the right direction whenever she thought I needed it.
I would also like to thank the experts who were involved in the validation survey for this thesis: Dr. Vu Hai, Assoc. Prof. Tran Thi Thanh Hai and PhD student Pham Dinh Tan, who participated and gave me much useful information. Without their passionate participation and input, the validation survey could not have been successfully conducted.
I would also like to acknowledge the School of Information and Communication Technology, which created the best conditions for me to complete this master thesis, and I am gratefully indebted to the teachers in SOICT for their very valuable comments on this thesis.
Finally, I must express my very profound gratitude to my parents, my sister and also to my colleagues at Toshiba Software Development VietNam (Nha Dink Duc, Pham Van Thanh and many colleagues) for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you!
Abstract
Human action recognition (HAR), which aims to predict what action a person is performing, is currently receiving increasing attention from computer vision researchers due to its wide range of potential applications in many fields such as human-computer interaction, surveillance cameras, robotics and health care. Recently, the release of cost-effective depth cameras such as Microsoft Kinect and Asus Xtion PRO LIVE has opened new opportunities for HAR as they provide richer information about the scene. Thanks to [...] performance computing hardware. Among hand-crafted descriptors for action representation, Cov3DJ, built on the covariance matrix of 3D joint positions, has proved its effectiveness and computational efficiency [2]. To take into account the duration variation of actions, a temporal hierarchy representation with multiple layers is introduced. However, the disadvantage of Cov3DJ is that it uses all joints in the skeleton, which causes a computational burden and may become ineffective as each joint has a certain level of engagement in an action. Moreover, the authors employ only joint positions as joint features, which does not seem sufficient to represent an action. Therefore, other features for action representation (joint velocities) are investigated and combined with joint positions to create more discriminative features for each action. This [...] public datasets (MSRAction3D [3] and CMDFall [4]). On MSRAction3D, the experimental results show that the proposed method obtains an improvement of 6.17% over the original method and outperforms many state-of-the-art methods. On the CMDFall dataset, the proposed method, with an F1 score of 0.64, outperforms the deep learning networks ResTCN (F1 score: 0.39) [4] and LSTM (F1 score: 0.46) [5]. The contributions of the thesis have been published in an international conference.
Challenges and open issues in skeleton-based HAR
State of the Art
The proposed approach
The most informative joints detection
3.2.1 Strategy 1 (FMIJ) for most informative joints detection
3.2.1.1 Detect candidate joints for each action
3.2.1.2 Select the most informative joints of each action
3.2.2 Strategy 2 (AMIJ) for most informative joints detection
3.3 Action representation by covariance descriptor
Evaluation of features used for joint representation
4.4.1 Results on MSRAction3D dataset
4.4.1.1 ActionSet1
4.4.1.2 ActionSet2
4.4.1.3 ActionSet3
4.4.2 Results on CMDFall dataset
4.5 Evaluation of the most informative joints selection
4.5.1 The effect of the number of most informative joints
4.5.2 Comparison between two strategies
Comparison with state-of-the-art methods
References
2.1 Id of the joints of a skeleton in MSRAction3D dataset
2.2 Coordinate System in Kinect
2.3 Illustration of the temporal normalization [6]
2.4 Illustration of the Fourier Temporal Pyramid [7]
2.5 The Cov3DJ descriptor [2]
2.6 The construction of the CovP3DJ [8]
2.7 Informative joints determined for MSRAction3D dataset [9]
2.8 Representation in Lie group and mapping to Lie algebra
2.9 Two input streams architecture [10]
2.10 Hierarchical RNNs for action representation learning [11]
2.11 Part-aware LSTM for action recognition [12]
2.12 Action recognition by combining CNN and LSTM [13]
3.1 Overview of the proposed method
3.2 Skeleton sequence of Action Throw in MSRAction3D
3.3 Candidate joints of Action 6 (Throw) in MSRAction3D dataset. The candidate joints are marked in red colour
The statistical length of each action in MSRAction3D dataset
The statistical number of occurrences of each action in MSRAction3D
Confusion matrix obtained for ActionSet1
Accuracy obtained with different features on ActionSet2
Confusion matrix obtained for ActionSet2
Accuracy obtained with different features on ActionSet3
Confusion matrix of ActionSet3
Results obtained for CMDFall dataset
Confusion matrix of CMDFall
Comparison of the distribution of each class based on covariance features on position in ActionSet1 of MSRAction3D
Comparison of the distribution of each class based on covariance features on velocity in ActionSet1 of MSRAction3D
Comparison of the distribution of each class based on covariance features on combination of position and velocity in ActionSet1 of MSRAction3D
20 actions and 3 subsets of MSRAction3D dataset
List of action classes and categorization
The selected joints on AS3 of MSRAction3D dataset with two strategies
Accuracy (%) of state-of-the-art methods on MSRAction3D dataset. The three best results are in bold. * indicates experiments in which the authors did not use the whole dataset
F1 score of state-of-the-art methods on CMDFall dataset
List of Acronyms
AMIJ Adaptive Most Informative Joints
C3D Convolutional Three Dimensional
CNN Convolutional Neural Network
DTW Dynamic Time Warping
FMIJ Fixed Most Informative Joints
HAR Human Action Recognition
LSTM Long Short-Term Memory
MIJs Most Informative Joints
NN Neural Network
RNN Recurrent Neural Network
SVM Support Vector Machine
Machines will collaborate with people in many fields, and to do this, understanding human behavior is the core technology.
The main aim of human action recognition (HAR) is to recognize the type of human action using information captured by sensors. These actions can be executed by one person, two people or a group of people. Human action recognition from video refers to methods that recognize human actions from one or several cameras. Although numerous approaches have been proposed for understanding human action [15], human action recognition is still a challenge for researchers due to the high number of actions and subjects and the huge variation in action rate and environment. The release of cost-effective depth cameras such as Microsoft Kinect [14] and Asus Xtion PRO LIVE has opened new opportunities for HAR as they provide richer information about the scene. Besides color images, depth and skeleton information are also available. Moreover, the latest research results on human pose estimation in RGB video show that the human pose and skeleton can be accurately estimated even in complex scenes. Using skeleton information for human action recognition has several advantages in comparison with using color and depth information. Firstly, unlike color, the skeleton is robust to variation in human appearance. Secondly, HAR based on skeleton is efficient in terms of memory requirement and computational time (e.g., a human skeleton contains 20 joints with 3 coordinates (x, y, z)). Finally, Johansson's study has shown that human vision is able to recognize human [...] and the velocity of the different motion patterns by observing the movement of the human skeleton's joints [17].
As a result, a wide range of methods for HAR using skeleton information have been introduced [18], [1]. Previous studies have shown that, to a certain extent, skeleton-based HAR methods have solved some of the problems of human action recognition with RGB cameras or video, and have demonstrated good recognition performance on several benchmark datasets. However, on more challenging datasets, the recognition rates are still low.
My thesis aims at improving the performance of human action recognition based on skeleton. In the next section, the challenges and open issues in skeleton-based HAR will be discussed. Then, the objectives and the contributions of my thesis will be provided.
1.2 Challenges and open issues in skeleton-based HAR
[...] by assuming that the actions of interest have been detected by the first step. HAR from skeletal data faces the following difficulties and challenges:
+ As the quality of skeleton estimation is not always perfect, especially for non-standing postures, skeletal data is usually noisy. Fig. 1.2 illustrates one example of the Walk action in CMDFall where some joints are not accurately estimated.
+ [...] a high inter-class similarity in the action set. For example, three actions (draw X, draw circle, draw tick) in MSRAction3D are very similar.
+ As each subject has his/her own manner of performing actions in terms of speed and phase, the action dataset has a high intra-class variation.
+ In some datasets, for instance CMDFall, the beginning of an action can be confused with the previous action as the action is composed of several actions (e.g., sit on chair then stand up in the CMDFall dataset, as illustrated in Fig. 1.3).
Figure 1.2: Example of noisy data in CMDFall dataset with action Walk.
Figure 1.3: Example of continuous action in CMDFall dataset.
1.3 Objectives and Contributions
Current approaches for skeleton-based action recognition can be roughly divided into two main categories. The first category uses hand-crafted features while the second category investigates deep learning methods to automate the feature extraction process. Deep learning based techniques usually require large datasets and high performance computing hardware. Among hand-crafted descriptors for action representation, Cov3DJ, built on the covariance matrix of 3D joint positions, has proved its effectiveness and computational efficiency [2]. To account for the time dimension of skeletal data, a temporal hierarchy representation with multiple layers is introduced in [2]. In the first layer, the covariance matrix is computed on all frames. In later layers, covariance matrices are computed on shorter temporal windows. Cov3DJ is evaluated on the MSRAction3D and MSRC12 datasets. However, the disadvantage of Cov3DJ is the use of all joints in the computation, which causes a computational burden and may become ineffective as each joint has a certain level of engagement in an action [19]. I therefore investigate whether the idea of using a subset of joints from [19] can be combined with the covariance feature. Besides reducing the feature dimension, this also helps to create features that discriminate between classes. Moreover, the authors employ only joint positions as joint features, which does not seem sufficient to represent an action. Therefore, other features for action representation (joint velocities) are investigated and combined with joint positions to create more discriminative features for each action.
This thesis improves the covariance-based action recognition method presented in [2] by (1) proposing two different schemes to select the most informative joints and (2) combining velocity information with the positions of the joints for action representation. To evaluate the effectiveness of the two proposed improvements, extensive experiments have been performed on two public datasets (MSRAction3D [3] and CMDFall [4]). On MSRAction3D, the experimental results show that the proposed method obtains an improvement of 6.17% over the original method and outperforms many state-of-the-art methods. On the CMDFall dataset, the proposed method, with an F1 score of 0.64, outperforms the deep learning networks ResTCN (F1 score: 0.39) [4] and LSTM (F1 score: 0.46) [5]. The contributions of the thesis have been published in an international conference.
1.4 Outline of the thesis
The thesis consists of 5 main chapters. In Chapter 2, I present the characteristics of skeletal data and pre-processing techniques, and analyze in detail the advantages and disadvantages of the state-of-the-art approaches proposed for HAR using skeletons. Then, the proposed method is described in Chapter 3. Chapter 4 presents experimental results on two benchmark datasets (MSRAction3D and CMDFall). Finally, conclusions and future work are given in Chapter 5.
[...] broadly grouped into two main approaches: hand-crafted and deep learning. With the ability to learn features from data by itself, deep learning can find features that hand-crafted design cannot create. In recent years, deep learning has received tremendous interest from the research community as a strong feature engineering method. However, precisely extracting specific features from raw data is a hard task even with deep learning. That is why hand-crafting still plays an important role for this problem. This chapter analyzes the state-of-the-art methods proposed for human action recognition. I divide these methods into two main categories: hand-crafted feature-based and deep learning based.
2.1 Overview of skeletal data and skeleton-based human action recognition
According to [17], a skeleton is considered as a schematic model of the locations of the torso, head and limbs of the human body. The parameters and motion of such a skeleton can be used as a representation of human actions and therefore, the human body pose is defined by means of the relative location of the joints in the skeleton.
The number of joints of a human pose depends on the device used to extract the skeleton. For example, the Kinect v1 sensor provides 20 joints while 25 joints are available with Kinect v2. In this thesis, all experiments are performed on two benchmark datasets, MSRAction3D and CMDFall, which provide 20 joints. The id of each joint is shown in Fig. 2.1. In this thesis, I use the following notations:
The i-th joint in the skeleton at time t is represented by $p_i^t = (x_i^t, y_i^t, z_i^t)$ where:
+ $x_i^t$, $y_i^t$ and $z_i^t$ are the coordinates of the joint;
+ $i = 1, 2, \ldots, K$; K is the number of joints used for representing the human skeleton (K = 20 in this thesis);
+ $t = 1, 2, \ldots, T$; T is the duration of the action of interest.
The origin (0, 0, 0) is located at the center of the IR sensor on Kinect (see Fig. 2.2). The x-axis grows to the sensor's left, the y-axis grows up (note that this direction is based on the sensor's tilt) and the z-axis grows out in the direction the sensor is facing. The unit used is the meter.
Given a sequence of skeletons $P = \{p_i^t\}$, $i = 1, 2, \ldots, K$ and $t = 1, 2, \ldots, T$, skeleton-based human action recognition aims to determine the label of the action class.
2.2 Pre-processing techniques
[...] of the action sequence, the noise or missing data. Pre-processing is able to address these issues. I have synthesized and adopted reasonable techniques offered by the authors of skeleton-based action recognition studies. In general, to address factors such as the differences among performers and velocity, in [2], [8] the authors used normalization to scale skeletal data into the range [0, 1]. In another approach to normalization, in [6], [20] the hip joint is set as the origin (0, 0, 0) and the coordinates of the remaining joints are computed by subtracting the hip coordinate. Instead of subtracting the hip coordinate from the remaining joints, in [9] the authors rotate the skeleton and align the horizontal axis x with the direction from the left hip (HL) to the right hip (HR). To deal with the varying length of actions, in [20] the authors first defined the desired number of frames and used an interpolation algorithm based on the known frames to ensure that all actions have the same length.
On the other hand, the authors of [6] proposed an algorithm called cubic spline interpolation of kinematic features for temporal normalization (see Fig. 2.3). Another approach to handling the different lengths of actions, used in [21], [2], is a pyramidal approach (see Fig. 2.4). To address the style of the action, following [14], most authors applied dynamic time warping to compute a nominal curve representing the action [20].
Figure 2.1: Id of the joints of a skeleton in MSRAction3D dataset.
Figure 2.3: Illustration of the temporal normalization [6]
Figure 2.4: Illustration of the Fourier Temporal Pyramid [7]
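The three simplest pre-processing steps mentioned above (hip-centering, scaling into [0, 1] and resampling to a fixed number of frames) can be sketched as follows. This is an illustrative sketch, not the code of the cited works: the array layout (T, K, 3), the function names, the hip index and the use of linear rather than cubic spline interpolation are all assumptions.

```python
import numpy as np


def center_on_hip(seq, hip_index=0):
    """Translate every frame so that the hip joint sits at the origin.

    seq has shape (T, K, 3): T frames, K joints, (x, y, z) coordinates.
    hip_index is a hypothetical index; the real one depends on the dataset.
    """
    return seq - seq[:, hip_index:hip_index + 1, :]


def scale_to_unit_range(seq):
    """Min-max scale each coordinate axis of the sequence into [0, 1]."""
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (seq - lo) / span


def resample_length(seq, target_len):
    """Linearly interpolate a sequence to target_len frames (assumed scheme)."""
    n_frames = seq.shape[0]
    old_t = np.linspace(0.0, 1.0, n_frames)
    new_t = np.linspace(0.0, 1.0, target_len)
    flat = seq.reshape(n_frames, -1)
    columns = [np.interp(new_t, old_t, flat[:, d]) for d in range(flat.shape[1])]
    return np.stack(columns, axis=1).reshape(target_len, *seq.shape[1:])
```

The steps are independent, so they can be applied in whichever order a given dataset requires, for example `resample_length(center_on_hip(seq), 30)`.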
2.3 Hand-crafted features-based approach
In the hand-crafted approach, features are manually designed and extracted based on characteristics of actions. The methods belonging to this approach are categorized into 2 groups: spatial and temporal descriptors and geometric descriptors.
2.3.1 Spatial-temporal descriptors
In this approach, the authors try to compute spatial-temporal features for each joint in the skeleton. The method introduced by Hussein et al. [2] belongs to this approach. The authors concatenated the coordinates of all joints $p_i^t = (x_i^t, y_i^t, z_i^t)$ with $i = 1, 2, \ldots, K$ at time t to create a vector $S = [x_1^t, \ldots, x_K^t, y_1^t, \ldots, y_K^t, z_1^t, \ldots, z_K^t]$. For an entire sequence, they computed the covariance features for representation. Moreover, the authors also proposed computing covariance on small windows to better distinguish actions that have temporal structure. Inspired by [2], in [8] the authors also computed a covariance matrix, named CovP3DJ, on separate parts of the body instead of on all joints (see Fig. 2.6). This idea reduces memory occupation by between 78.26% and 80.35%, but accuracy is not significantly improved since the relationships among joints in different parts of the body are not used.
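A covariance descriptor of this kind can be sketched as below. The sketch is an assumption-laden illustration rather than the implementation of [2]: the function names are hypothetical, the input layout is taken to be (T, K, 3), and the shorter windows are simplified to equal, non-overlapping halves at each level.

```python
import numpy as np


def frame_vectors(seq):
    """Turn a (T, K, 3) sequence into rows [x_1..x_K, y_1..y_K, z_1..z_K]."""
    n_frames = seq.shape[0]
    return seq.transpose(0, 2, 1).reshape(n_frames, -1)


def covariance_descriptor(seq):
    """Upper triangle (diagonal included) of the sample covariance matrix."""
    S = frame_vectors(seq)
    C = np.cov(S, rowvar=False)  # shape (3K, 3K); needs at least 2 frames
    rows, cols = np.triu_indices(C.shape[0])
    return C[rows, cols]


def hierarchical_descriptor(seq, levels=2):
    """Concatenate descriptors of the whole sequence and of shorter windows.

    Level l splits the sequence into 2**l equal, non-overlapping windows,
    which is a simplifying assumption made for this sketch.
    """
    parts = []
    for level in range(levels):
        for window in np.array_split(seq, 2 ** level, axis=0):
            parts.append(covariance_descriptor(window))
    return np.concatenate(parts)
```

For a 20-joint skeleton each window contributes 60 * 61 / 2 = 1830 values, so two levels (one full-length window plus two halves) give a vector of length 5490.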
To address the shortcomings of the covariance matrix, such as being prone to be singular, limited capability in modeling complicated feature relationships, and having a fixed form of representation, in [22] the authors used nonlinear kernel matrices to modify the original covariance matrix. This proposal obtained promising, state-of-the-art results on benchmark datasets. Instead of employing all joints for representing an action, the authors of [19] stated that each action can be discriminated by some specific joints. They therefore proposed to extract the most informative joints for each action. This idea has been applied in many studies by other authors. In [7], the authors proposed the concept of an actionlet on a subset of joints. Their model is more robust to errors in the features, and it can better characterize the intra-class variations in the actions. The features passed to the actionlet are the relative positions, given by the difference between each pair of joints $p_i$, $p_j$ as follows:
$p_{ij} = p_i - p_j, \quad i \neq j$ (2.1)
In [23], the authors used subsets of joints to compute two new features: displacement vectors and relative positions. Displacement vectors are defined as: [...] then concatenated. However, in this work, the subsets of joints are predefined instead of being calculated from data. Therefore, this approach can be suitable for some datasets and may give poor results on other datasets. In [9], the authors computed the most informative joints based on differential entropy (the extension of Shannon entropy) and used a bag-of-words scheme with a linear CRF for action recognition. Figure 2.7 shows the selected joints for MSRAction3D.
Figure 2.6: The construction of the CovP3DJ [8]
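As a small illustration of the pairwise relative positions in Eq. (2.1), the following sketch computes $p_i - p_j$ for every ordered pair of distinct joints in one frame. The function name and the (K, 3) frame layout are assumptions made for this example; the actionlet model of [7] itself is not reproduced.

```python
import numpy as np


def pairwise_relative_positions(frame):
    """Return p_i - p_j for all ordered pairs i != j of a (K, 3) frame."""
    n_joints = frame.shape[0]
    diff = frame[:, None, :] - frame[None, :, :]  # diff[i, j] = p_i - p_j
    off_diagonal = ~np.eye(n_joints, dtype=bool)  # drop the i == j pairs
    return diff[off_diagonal]                     # shape (K * (K - 1), 3)
```

For K = 20 joints this yields 380 difference vectors per frame, which illustrates why such features are usually restricted to a subset of joints.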
[...] the matrix in Lie group, so the Lie group is used in many studies. In [20], the authors defined body parts from the skeleton; the relative geometry can be described by considering a rigid-body transformation (including rotation and translation) between two rigid body parts. Moreover, these rigid-body transformations are members of the special Euclidean group SE(3), so the sequences of skeletons are represented as curves in the Lie group. However, since classification with curves in the Lie group is difficult, they used the Lie algebra to map the curves to vectors. To take into account rate variations among subjects, they used dynamic time warping to get a nominal curve for each action. Building on the Lie group, the author of [24] continued using it to represent action sequences. However, they observed that with normalization the rotation does not change, so they only used the transition transformation for describing the relative geometry between body parts. This representation halves the feature length compared with the representation in [20]. They also proposed a new algorithm for mapping from the Lie group to the Lie algebra. On the other hand, [25] proposed the geodesic distance and movement energy for key frame selection before using the Lie group for describing the action sequences. The methods based on Lie groups are often time consuming. Moreover, the multi-dimensional curves in the Lie group are difficult to visualize.
(a) Description between two body parts $e_m$ and $e_n$ in Lie group [20] (b) Illustration of the logarithm map and unwrapping while rolling [24]
Figure 2.8: Representation in Lie group and mapping to Lie algebra
2.4 Deep learning based approaches
Besides hand-crafted features, deep learning-based approaches often employ an end-to-end network to represent actions. In [26], the authors used a convolutional neural network (CNN) for action representation. Moreover, they also computed the most informative joints to create a description of the similarly discriminative movement of actions. In [10], the authors also used a CNN to automatically learn the action representation. However, instead of using only the raw skeletal data as in [26], they used an additional input stream with the skeleton motion M, which is calculated by subtracting the positions of all joints at time t and t+1. After that, they concatenated the two input streams at the end of the convolution layers; one more fully connected layer is used before passing to the softmax layer for classification (see Fig. 2.9). Some modifications of CNNs such as C3D and GCNs [27] are widely used. A 2D CNN is assumed to encode only spatial information, while 3D convolution can extract motion features from a temporal sequence. [11] verifies that a 3D CNN can achieve faster and more accurate performance. In [28], the authors used 3D filters (7x7x5, 5x5x3, 3x5x3) for feature extraction. Furthermore, they used not one channel but two: one for spatial and one for temporal information. For the approaches based on CNNs, the network is unstable to train, and fully training these models requires immense computational resources.
Figure 2.10: Hierarchical RNNs for action representation learning [11]
Figure 2.11: Part-aware LSTM for action recognition [12]
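The skeleton motion stream described for [10] amounts to a frame-to-frame difference of joint positions. A minimal sketch, assuming a (T, K, 3) array layout and a hypothetical function name, is:

```python
import numpy as np


def skeleton_motion(seq):
    """Frame-to-frame joint displacement: motion[t] = seq[t + 1] - seq[t].

    seq has shape (T, K, 3); the result has shape (T - 1, K, 3).
    """
    return np.diff(seq, axis=0)
```

The motion array is one frame shorter than the position array, so a network that consumes both streams has to crop or pad one of them; which option [10] uses is not stated here.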
Human action recognition can be categorized as a temporal sequence problem, so architectures for this kind of problem, such as recurrent neural networks (RNN) and long short-term memory (LSTM), are often used and obtain good performance. For instance, in [11] the authors used hierarchical RNNs for representation learning. At first, they divided the body into five parts, and each body part is passed to one RNN. At the next layer, the outputs of body parts that are linked together are connected. The process is repeated until the initial body is reconstructed (see Fig. 2.10). Following a similar idea, in [12], since each group of joints can be assigned to a major part of the body and actions can be interpreted based on the interactions between body parts or with other objects, the authors modified the architecture of the general LSTM. Instead of keeping a long-term memory of the entire body's motion in the cell, they split it into part-based cells. Each part cell has its own input and forget gates, but the output gate is shared among the body parts (see Fig. 2.11).
Generally, the approach based on recurrent networks has the drawback that the input length is not fixed, since the durations of actions differ. Input padding as a solution to this problem does not give reasonable results. Training these networks to convergence is therefore complex and requires care when optimizing the loss function.
Figure 2.12: Action recognition by combining CNN and LSTM [13]
Another interesting approach, applied in many similar problems, is the combination of convolutional and recurrent neural networks, used in [13] (see Fig. 2.12). In the first few layers, the authors used a CNN with the features extracted from the Lie group. However, they only used a 1D CNN instead of a 2D CNN as in [26], [10]. After that, they used the output of the convolutional layers as the input of an LSTM.
Hand-crafted feature-based approaches are intuitive. The results obtained on some datasets are very promising. The main benefit of these approaches is that they do not need a huge amount of data for training. The trained models are straightforward with simple classifiers. However, hand-crafted features have some drawbacks. The features are extracted from the perspective of the observers, so they often reflect local properties. Moreover, the process for selecting and computing these features is sometimes unintelligible. Also, for problems with big data or requiring high precision, hand-crafted features do not seem to capture and solve the problem reasonably. In contrast, in deep learning methods, the relevant features are implicitly found while training the models. The performance of these architectures has recently proved to be promising. Despite these results, successfully training the models is a really difficult task. First, in the design phase, the model has to be designed carefully. Second, with the huge number of parameters, a huge amount of data is needed for learning, as well as different strategies for the training process. This is still a challenging task for every researcher.
[...] the literature [2] by proposing to detect the most informative joints for covariance matrix descriptor computation and by combining velocity with position as features for action representation. With the above-mentioned improvements, the proposed method provides a more condensed and [...] representation of action. Moreover, the velocity information along with the position makes it possible to distinguish actions that are similar in terms of joint positions and different in terms of joint velocity. In this chapter, I introduce in detail the main steps of my proposed solution.
3.1 The proposed approach
As analyzed above, I am going to list the main steps of the proposed method based on the raw skeletal data. The pipeline below assumes that the duration of the action is predetermined through an action spotting step. Figure 3.1 shows the overview of the proposed method. I briefly describe the input and output of each step in the proposed method.
Figure 3.1: Overview of the proposed method
1. Most informative joints detection
+ Input: The raw skeletal data extracted from sensors
+ Output: The most informative joints for each action
2. Compute the covariance feature using joints' positions and velocities from the most informative joints
+ Input: The informative joints of each action
+ Output: Final features concatenated from the covariance matrices of position and velocity
3. Action recognition using SVM (Support Vector Machine)
+ Input: The features from the previous step
+ Output: The trained model in the training phase and the action label in the testing phase
In the following sections, I will describe in detail each step of the proposed method.
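Step 2 of the pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, the sequence layout is taken to be (T, K, 3), velocity is approximated by a simple frame-to-frame difference, and only the upper triangle of each covariance matrix is kept.

```python
import numpy as np


def upper_triangle_cov(X):
    """Upper triangle (diagonal included) of the covariance of the rows of X."""
    C = np.cov(X, rowvar=False)
    rows, cols = np.triu_indices(C.shape[0])
    return C[rows, cols]


def action_feature(seq, joint_ids):
    """Concatenate position and velocity covariance features.

    seq: array of shape (T, K, 3); joint_ids: indices of the most
    informative joints (hypothetical values in the usage example).
    """
    pos = seq[:, joint_ids, :]        # (T, J, 3)
    vel = np.diff(pos, axis=0)        # (T - 1, J, 3), assumed velocity estimate
    pos_vec = pos.reshape(pos.shape[0], -1)
    vel_vec = vel.reshape(vel.shape[0], -1)
    return np.concatenate([upper_triangle_cov(pos_vec),
                           upper_triangle_cov(vel_vec)])
```

With four selected joints each covariance matrix is 12 x 12, so `action_feature(seq, [4, 5, 6, 7])` returns 2 * 78 = 156 values. Such vectors could then be passed to an SVM implementation (for example the `SVC` class of scikit-learn) for step 3.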
3.2 The most informative joints detection
Inspired by [19], each action can be represented by some informative joints instead of all joints. Using all joints in action recognition requires large computation and storage capacity and may also degrade recognition accuracy due to noise in skeletal data. Noise in skeletal data can be mitigated by using only joints which mainly engage in each action. For instance, with the action Throw as shown in Fig. 3.2, only the information from the joints shoulder left, elbow left, wrist left and hand left is needed; the remaining joints do not carry much meaning for the recognition process. From this observation, in this thesis, instead of using all joints for computing the covariance descriptor as in [2], I propose to first detect the most informative joints and then compute covariance descriptors from these joints.
Figure 3.2: Skeleton sequence of Action Throw in MSRAction3D
Two strategies for most informative joints detection, based on different ideas, are proposed.
The first strategy, named FMIJ (Fixed Most Informative Joints), fixes the most informative joints for each action. It consists of two main steps. The first step determines the joints with the largest position variances in an action instance of a subject as candidate joints. Then, a set of candidate joints from all subjects is formed for each action. The second step selects the joints with the largest number of occurrences in the candidate set as the Most Informative Joints. The limitation of FMIJ is that the candidate joints for each subject and each action have to be selected in the first step. After this step, the remaining joints (the joints that are not selected) are no longer considered. Due to the presence of noise in some action instances of some subjects, this strategy may miss important joints.
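The two steps of FMIJ described above can be sketched as follows. The sketch makes several assumptions that the text does not fix: the function names are hypothetical, each instance is a (T, K, 3) array, the variance of a joint is taken as the sum of its per-axis variances over time, and the numbers of candidate and selected joints are free parameters.

```python
from collections import Counter

import numpy as np


def candidate_joints(seq, n_candidates):
    """Step 1: joints with the largest position variance in one instance."""
    # Sum of per-axis variances over time gives one score per joint (assumed).
    score = seq.var(axis=0).sum(axis=1)  # shape (K,)
    return np.argsort(score)[::-1][:n_candidates].tolist()


def fixed_most_informative_joints(instances, n_candidates, n_selected):
    """Step 2: keep the joints that occur most often among all candidates.

    instances: iterable of (T, K, 3) arrays, all belonging to the same action.
    """
    counts = Counter()
    for seq in instances:
        counts.update(candidate_joints(seq, n_candidates))
    return sorted(joint for joint, _ in counts.most_common(n_selected))
```

Because a joint is discarded for an instance as soon as it falls outside the top `n_candidates`, a noisy instance can push a genuinely informative joint out of the candidate set, which is the limitation noted above.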
To overcome the limitation of the first strategy, I propose the second strategy named AMIJ (Adaptive Most Informative Joints). In detail, instead of choosing subset joints