Tien Nam NGUYEN
SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND
RECOGNITION
MASTER OF SCIENCE THESIS IN
INFORMATION SYSTEM
Hanoi - 2019
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Tien Nam NGUYEN
SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND
RECOGNITION
Speciality: Information System
MASTER OF SCIENCE THESIS IN
INFORMATION SYSTEM
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
CONFIRMATION OF MASTER'S THESIS REVISIONS
Full name of thesis author: Nguyễn
Thesis title: Research and development of a method for skeleton-based human
activity representation and recognition
Speciality: Information System
Student ID: CBC18019
The author, the scientific supervisor, and the thesis examination committee
confirm that the author has corrected and supplemented the thesis in
accordance with the minutes of the committee meeting.
1. Request: Merge Chapters 4 and 5.
   Revision: Chapters 4 and 5 have been merged into a single chapter titled
   "Experimental results".
2. Request: Explain the reasons for choosing the recognition methods used in
   the thesis.
   Revision: The student has added details on the reasons for choosing the
   methods in Chapter 1, Section 3.
3. Request: Add the evaluation measures Precision, Recall, and F1.
   Revision: The student has added information on how the evaluation measures
   are computed, presented in Chapter 4, Section 2 (Evaluation metric).
   Precision, Recall, and F1 score can all be used to evaluate a recognition
   system. However, in this thesis, to allow comparison with previously
   proposed methods, different measures are used depending on the dataset.
   The MSRAction3D dataset uses accuracy, while the CMDFall dataset uses the
   F1 score. In the revised thesis, besides the measures used for each
   individual dataset, the student has
and may become ineffective, as each joint has a certain level of engagement
in an action. Moreover, the authors employ only joint positions as joint
features, which seems insufficient to represent an action. Other features
for action representation (joint velocities) are therefore investigated and
combined with joint positions to create a more discriminative feature for
each action. This thesis makes two contributions: (1) detecting the most
informative joints for action representation and (2) combining velocity
information with the positions of the joints for action representation. To
evaluate the effectiveness of the proposed method, extensive experiments have
been performed on two public datasets (MSRAction3D [3] and CMDFall [4]). On
MSRAction3D, the experimental results show that the proposed method obtains
an improvement of 6.17% over the original method and outperforms many
state-of-the-art methods. On the CMDFall dataset, the proposed method, with
an F1 score of 0.64, outperforms the deep learning networks ResTCN (F1 score:
0.39) [4] and LSTM (F1 score: 0.46) [5]. The contributions of the thesis have
been published in an international conference.
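The idea of describing an action by the covariance of joint positions combined
with joint velocities can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the thesis's implementation: velocity
is approximated by the frame-to-frame position difference, the sample covariance
is used, the descriptor is the upper triangle of the covariance matrix, and all
function names are hypothetical. The selection of the most informative joints is
not shown; the selected joint indices are simply passed in.

```python
from typing import List, Sequence

Frame = List[List[float]]  # one frame: a list of joints, each joint is [x, y, z]


def joint_velocities(frames: Sequence[Frame]) -> List[Frame]:
    """Approximate velocity as the frame-to-frame position difference (assumption)."""
    return [
        [[c1 - c0 for c0, c1 in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


def feature_vectors(frames: Sequence[Frame], joints: Sequence[int]) -> List[List[float]]:
    """Stack position and velocity of the selected joints, one vector per frame."""
    velocities = joint_velocities(frames)
    vectors = []
    # The first frame has no velocity, so features start at the second frame.
    for pos_frame, vel_frame in zip(frames[1:], velocities):
        vec: List[float] = []
        for j in joints:
            vec.extend(pos_frame[j])
            vec.extend(vel_frame[j])
        vectors.append(vec)
    return vectors


def covariance(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance matrix (divides by T - 1) of a list of feature vectors."""
    t, d = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / t for k in range(d)]
    centered = [[v[k] - mean[k] for k in range(d)] for v in vectors]
    return [
        [sum(c[a] * c[b] for c in centered) / (t - 1) for b in range(d)]
        for a in range(d)
    ]


def descriptor(frames: Sequence[Frame], joints: Sequence[int]) -> List[float]:
    """Keep only the upper triangle, since the covariance matrix is symmetric."""
    cov = covariance(feature_vectors(frames, joints))
    d = len(cov)
    return [cov[a][b] for a in range(d) for b in range(a, d)]


# Toy sequence: 4 frames, 3 joints; joints 0 and 2 are taken as "selected".
toy_frames = [
    [[float(t), 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, float(t * t), 0.0]]
    for t in range(4)
]
toy_descriptor = descriptor(toy_frames, joints=[0, 2])
print(len(toy_descriptor))  # 78 = 12 * 13 / 2, for 2 joints x (3 position + 3 velocity)
```

The descriptor length depends only on the number of selected joints, not on the
number of frames, so sequences of different durations map to vectors of the
same size; using fewer joints shrinks the covariance matrix quadratically.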
References
Acknowledgements
I would first like to thank my thesis advisor, Associate Professor Le Thi Lan,
head of the Computer Vision Department at MICA Institute. The door of
Assoc. Prof. Lan's office was always open whenever I ran into a trouble spot
or had a question about my research or writing. She consistently allowed
this thesis to be my own work, but steered me in the right direction
whenever she thought I needed it.
I would also like to thank the experts who were involved in the validation
survey for this thesis: Dr. Vu Hai, Assoc. Prof. Tran Thi Thanh Hai, and PhD
student Pham Dinh Tan, who participated and gave me useful information.
Without their passionate participation and input, the validation survey could
not have been successfully conducted.
I would also like to acknowledge the School of Information and Communication
Technology, which provided me with the best conditions to complete this
master's thesis, and I am gratefully indebted to the teachers at SOICT for
their very valuable comments on this thesis.
Finally, I must express my very profound gratitude to my parents, my sister,
and my colleagues at Toshiba Software Development Vietnam (Nha Dinh Duc,
Pham Van Thanh, and many others) for providing me with unfailing support and
continuous encouragement throughout my years of study and through the process
of researching and writing this thesis. This accomplishment would not have
been possible without them. Thank you!
Abstract
Human action recognition (HAR), which aims to predict what action a person is
performing, is currently receiving increasing attention from computer vision
researchers due to its wide range of potential applications in many fields
such as human-computer interaction, surveillance cameras, robotics, and
health care. Recently, the release of cost-effective depth cameras such as
Microsoft Kinect and Asus Xtion PRO LIVE has opened new opportunities for HAR,
as they provide richer information about the scene. Thanks to these sensors,
besides color images, depth and skeleton information are also available.
Moreover, the latest research results on human pose estimation in RGB video
show that the human pose and skeleton can be accurately estimated even in
complex scenes. Using skeleton information for human action recognition has
several advantages in comparison with using color and depth information. As a
result, a wide range of methods for HAR using skeleton information have been
introduced [1]. The methods proposed for skeleton-based HAR can be categorized
into two groups: hand-crafted features and deep learning. Each has its own
advantages and disadvantages. Deep learning-based techniques obtain
impressive results on several benchmark datasets. However, they usually
require large datasets and high-performance computing hardware. Among
hand-crafted descriptors for action representation, Cov3DJ, which uses the
covariance matrix of 3D joint positions, proves its effectiveness and
computational efficiency [2]. To take into account the duration variation of
actions, a temporal hierarchy representation is introduced with multiple
layers. However, the disadvantage of Cov3DJ is that it uses all joints in the
skeleton, which causes a computational burden
added Table 4.7 in Chapter 4, which reports the recognition results on all
measures for the two experimental datasets.
7 November 2019
CHAIR OF THE COMMITTEE
3.2.2 Strategy 2 (AM) for most informative joints detection
3.3 Action representation by covariance descriptor
Evaluation of features used for joint representation
4.4.1 Results on MSRAction3D dataset
4.4.1.1 ActionSet1
4.4.1.2 ActionSet2
4.4.1.3 ActionSet3
4.4.2 Results on CMDFall dataset
4.5 Evaluation of the most informative joints selection
4.5.1 The effect of the number of most informative joints
4.5.2 Comparison between two strategies
Comparison with state-of-the-art methods
Challenges and open issues in skeleton-based HAR
State of the Art
Hand-crafted features-based approach
The proposed approach
The most informative joints detection
3.2.1 Strategy 1 (MT) for most informative joints detection
3.2.1.1 Detect candidate joints for each action
3.2.1.2 Select the most informative joints of each action