IMPROVING MULTI-VIEW HUMAN ACTION RECOGNITION
WITH SPATIAL-TEMPORAL POOLING AND
VIEW SHIFTING TECHNIQUES
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
ACKNOWLEDGEMENT
First of all, I sincerely thank the teachers of the School of Information and Communication Technology, as well as all the teachers at Hanoi University of Science and Technology, who have taught me valuable knowledge and experience during the past five years.
I would like to thank my two supervisors, Dr. Nguyen Thi Oanh, lecturer in Information Systems and Communication, School of Information and Communication Technology, Hanoi University of Science and Technology, and Dr. Tran Thi Thanh Hai, MICA Research Institute, who have guided me to complete this master's thesis. I have learned a lot from them, not only knowledge in the field of computer vision but also working and studying skills such as writing papers, preparing slides, and presenting to an audience.
Finally, I would like to send my thanks to my family, friends, and everyone who has supported me throughout the process of studying and researching this thesis.
Hanoi, March 2018
Master student
Tuan Dung LE
TABLE OF CONTENTS
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS
INTRODUCTION
CHAPTER 1 HUMAN ACTION RECOGNITION APPROACHES
1.1 Overview
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words model
CHAPTER 2 PROPOSED FRAMEWORK
2.1 General framework
2.2 Combination of spatial/temporal information and Bag-of-Words model
2.2.1 Combination of spatial information and Bag-of-Words model (S-BoW)
2.2.2 Combination of temporal information and Bag-of-Words model (T-BoW)
2.3 View shifting technique
CHAPTER 3 EXPERIMENTS
3.1 Setup environment
3.2 Setup
3.3 Datasets
3.3.1 West Virginia University Multi-view Action Recognition Dataset (WVU)
3.3.2 Northwestern-UCLA Multiview Action 3D (N-UCLA)
3.4 Performance measurement
3.5 Experiment results
3.5.1 WVU dataset
3.5.2 N-UCLA dataset
CONCLUSION & FUTURE WORK
REFERENCES
APPENDIX 1
LIST OF FIGURES
Figure 1.1 a) human body in frame, b) binary silhouettes, c) 3D human pose (visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body model, g) cylindrical/ellipsoid human body model [1]
Figure 1.2 Construction of the HOG-HOF descriptor based on the SSM matrix [6]
Figure 1.3 a) Original video of walking action with viewpoints 0° and 45°, their volumes and silhouettes, b) epipolar geometry in the case of extracted actor body silhouettes, c) epipolar geometry in the case of a dynamic scene with a dynamic actor and static background, without extracting silhouettes [9]
Figure 1.4 MHI (middle row) and MEI (last row) templates [15]
Figure 1.5 Illustration of spatio-temporal interest points detected in a video of a person clapping [16]
Figure 1.6 Three ways to combine information from multiple 2D views in the BoW model [11]
Figure 2.1 Proposed framework
Figure 2.2 Dividing the spatial domain based on bounding box and centroid
Figure 2.3 Illustration of the T-BoW model
Figure 2.4 Illustration of view shifting in the testing phase
Figure 3.1 Illustration of the 12 action classes in the WVU multi-view actions dataset
Figure 3.2 Camera setup for capturing the WVU dataset
Figure 3.3 Illustration of the 10 action classes in the N-UCLA Multiview Action 3D dataset
Figure 3.4 Camera setup for capturing the N-UCLA dataset
Figure 3.5 Illustration of a confusion matrix
Figure 3.6 Confusion matrices: a) basic BoW model with codebook D3, accuracy 70.83%; b) S-BoW model with 4 spatial parts, codebook D3, accuracy 82.41%
Figure 3.7 Confusion matrices: a) S-BoW model with 6 spatial parts, codebook D3, accuracy 78.24%; b) S-BoW model with 6 spatial parts and view shifting, codebook D3, accuracy 96.67%
Figure 3.8 Confusion matrices: a) basic BoW model, codebook D3, accuracy 59.57%; b) S-BoW model with 6 spatial parts, codebook D3, accuracy 63.40%
Figure 3.9 Illustration of view shifting on the N-UCLA dataset
LIST OF TABLES
Table 3.1 Accuracy (%) of the basic BoW model on the WVU dataset
Table 3.2 Accuracy (%) of the T-BoW model on the WVU dataset
Table 3.3 Accuracy (%) of the S-BoW model on the WVU dataset
Table 3.4 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the WVU dataset
Table 3.5 Comparison with other methods on the WVU dataset
Table 3.6 Accuracy (%) of the basic model on the N-UCLA dataset
Table 3.7 Accuracy (%) of the T-BoW model on the N-UCLA dataset
Table 3.8 Accuracy (%) of the combination of the S-BoW model and view shifting on the N-UCLA dataset
Table 3.9 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the N-UCLA dataset
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS
BoW: Bag-of-Words
HOG: Histogram of Oriented Gradients
HOF: Histogram of Optical Flow
MEI: Motion Energy Image
MHI: Motion History Image
N-UCLA: Northwestern-UCLA Multiview Action 3D dataset
S-BoW: Bag-of-Words combined with spatial information
SSM: Self-Similarity Matrix
STIP: Spatio-Temporal Interest Points
SVM: Support Vector Machine
T-BoW: Bag-of-Words combined with temporal information
WVU: West Virginia University multi-view action recognition dataset
INTRODUCTION
In the growing transition from the 3.0 era (automation of information technology and electronic production) to the new 4.0 era (a convergence of technologies such as the Internet of Things, collaborative robots, 3D printing, and cloud computing, together with the emergence of new business models), the automatic collection and processing of information by computers has become essential. This leads to higher demands on the interaction between humans and machines, in both precision and speed. Thus, the problems of object recognition, motion recognition, and speech recognition now attract a lot of interest from scientists and companies around the world. Nowadays, video data is easily generated by devices such as digital cameras, laptops, and mobile phones, and spread through video-sharing websites. Human action recognition in video contributes to the automated exploitation of this rich data source.
Human action recognition has many applications. Traditional security and monitoring systems consist of camera networks monitored by humans; with the increase in the number of cameras and the deployment of these systems in multiple locations, supervisors face efficiency and accuracy problems in covering the entire system. The task of computer vision is to find solutions that can replace or assist the supervisor, and the automatic recognition of abnormal events in surveillance footage attracts a lot of research. Enhancing interaction between humans and machines is still challenging; visual cues are the most important method of non-verbal communication, and effectively exploiting gesture-based communication will create more accurate and natural human-computer interaction. A typical application in this field is the "smart home", which responds intelligently to the gestures and actions of the user. However, these applications are still incomplete and continue to attract research. In addition, human action recognition is also applied in a number of other areas, such as robotics, content-based video analysis and retrieval, video compression, video indexing, and virtual reality games.
With the aim of studying and approaching the problem of human action recognition using a combination of multiple views, we explored some recent approaches and chose to experiment with a method that combines local features and the Bag-of-Words model. After analyzing the weaknesses of this method, we propose an improvement and evaluate it experimentally. The thesis is organized as follows:
Chapter 1: This chapter surveys existing approaches to give the reader an overview of the human action recognition problem in general, and of multi-view recognition in particular. The last part of the chapter introduces a method combining local features and the Bag-of-Words model, evaluates its advantages and disadvantages, and then outlines the proposed improvements.
Chapter 2: This chapter presents an improved framework using a combination of spatial/temporal information and view shifting techniques.
Chapter 3: This chapter reports experiments with the proposed method and presents the results with some evaluation.
Conclusion and Future Work: This section reviews what has and has not been accomplished in the master's thesis, highlights the pros and cons, and outlines future development.
References
CHAPTER 1 HUMAN ACTION RECOGNITION APPROACHES
1.1 Overview
Recognition and analysis of human actions has been a subject of great interest over the past three decades and is currently an active research topic in the field of computer vision. It underpins a large number of potential applications in intelligent monitoring, video retrieval, video analysis, and human-machine interaction. Recent research has highlighted the difficulty of this problem, owing to large variations in human action data: variability in the way individuals perform actions; movement and clothing; camera angles and motion effects; lighting changes; occlusion caused by objects in the environment or by parts of the human body; and disturbances in the surroundings. Because so many factors can affect the outcome, current methods are often limited to simple scenarios with simple backgrounds, simple action classes, and stationary cameras, or they limit the variation in viewing angles.
Many different approaches have been proposed over the years for human action recognition. These approaches may be categorized by the visual information used to describe the action. Single-view methods use one camera to record the human body during the execution of the action. However, the appearance of an action is quite different when viewed from arbitrary angles, so single-view methods are usually accompanied by a basic assumption that the action is observed from the same angle in both the training data and the testing data; their efficiency is significantly reduced if this assumption does not hold. The obvious way to improve the accuracy of human action recognition is to increase the number of views per action by increasing the number of cameras, which enables us to exploit a larger amount of visual information to describe an action. The multi-view approach has been studied for only about a decade, because the limited capabilities of devices and tools in previous decades did not adequately meet the computational demands of these methods. Recent technological advances have brought powerful tools that allow the multi-view approach to become practical in a variety of application contexts. Action recognition methods can be divided into two broad families: the traditional approach using hand-crafted features and the approach based on neural networks. Neural network approaches typically require large training sets, otherwise they are ineffective; in practical applications, datasets are usually small or medium in size. Therefore, in the context of this study, we are interested in the traditional approach that utilizes hand-crafted features. In this approach, the action representation can be constructed from 2D data (2D approaches) or from 3D data (3D approaches) [1].
3D approaches
The general trend in 3D methods is to integrate visual information captured from various viewing angles, then represent actions by a 3D model. This is usually achieved by combining 2D human body poses, in the form of binary silhouettes marking the video frame pixels that belong to the human body in each camera (Fig. 1.1b). After obtaining the corresponding 3D human body representation, actions are described as sequences of successive 3D human body poses. Human body representations adopted by 3D methods include visual hulls (Fig. 1.1c), motion history volumes (Fig. 1.1d) [2], optical flow corresponding to the human body (Fig. 1.1e) [3], Gaussian blobs (Fig. 1.1f) [4], and cylindrical/ellipsoid body models (Fig. 1.1g) [5].
Figure 1.1 a) human body in frame, b) binary silhouettes, c) 3D human pose (visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body model, g) cylindrical/ellipsoid human body model [1]
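As an illustration of the visual hull idea, the following is a minimal Python sketch of silhouette-based voxel carving, assuming calibrated cameras (3x4 projection matrices) and already-extracted binary silhouettes; real systems typically use octrees or GPU implementations for efficiency, so this is a conceptual sketch rather than any cited method.

```python
import numpy as np

def visual_hull(silhouettes, projections, voxels):
    """Keep the voxels whose projection falls inside every camera's silhouette.

    silhouettes: list of (H, W) binary masks, one per camera.
    projections: list of 3x4 camera projection matrices.
    voxels: (V, 3) array of candidate voxel centers in world coordinates.
    """
    keep = np.ones(len(voxels), dtype=bool)
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])  # (V, 4)
    for mask, P in zip(silhouettes, projections):
        uvw = homog @ P.T                                   # project to image
        w = uvw[:, 2]
        safe_w = np.where(w > 1e-9, w, 1.0)  # guard; such voxels fail w > 0 below
        u = np.round(uvw[:, 0] / safe_w).astype(int)
        v = np.round(uvw[:, 1] / safe_w).astype(int)
        inside = (w > 1e-9) & (u >= 0) & (u < mask.shape[1]) \
                 & (v >= 0) & (v < mask.shape[0])
        keep &= inside                                      # carve off-image voxels
        keep[inside] &= mask[v[inside], u[inside]] > 0      # carve background voxels
    return voxels[keep]
```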
However, 3D methods have an important limitation: parts of the human body may not be visible in all cameras because they are outside of a camera's recording area or are occluded by other objects. Obviously, when there is not enough information from all the cameras, it is impossible to obtain an accurate 3D description of the human body, and false predictions result.
2D approaches
Multi-view 2D methods, on the other hand, can overcome the drawback mentioned above. 2D methods tend to look for features that are invariant across viewing angles, or to combine the per-view predictions of the action class; thus, missing information from one view does not affect the results. Multi-view 2D methods are often divided into two smaller approaches:
o View-invariant features
The first approach tries to represent the action with features that are invariant to the view [6, 7, 8, 9, 10]. Action recognition is performed on the video from each camera independently: the methods first represent the action by view-invariant features, and the action class is then determined from this invariant representation.
A view-invariant approach proposed by N. Junejo et al. [6] is to compute the similarity of a series of images over time and to exploit the stability of this pattern across multiple viewing angles. It builds a descriptor that records the structural characteristics of the similarities and time differences within an action sequence. First, from each video, the authors compute distances between pairs of frames, building a self-similarity matrix called SSM-pos for 3D data and SSM-HOG-HOF for 2D data. Next, from the SSM, the authors extract local SSM descriptors (Fig. 1.2) and feed them into the K-means clustering algorithm; each cluster corresponds to a word in the dictionary (BoW approach). Finally, an SVM classifier with a chi-squared kernel is trained with a one-vs-all strategy. The advantage of the method is that it achieves high stability under varying viewpoints.
Figure 1.2 Construction of the HOG-HOF descriptor based on the SSM matrix [6]
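To make the SSM construction concrete, the following is a minimal sketch, assuming that one feature vector per frame (for example, stacked joint positions for SSM-pos, or a per-frame HOG for the 2D case) has already been extracted; the local descriptor extraction of Fig. 1.2 is not reproduced here.

```python
import numpy as np

def self_similarity_matrix(frame_features: np.ndarray) -> np.ndarray:
    """Build the SSM of a sequence.

    frame_features: (T, d) array holding one feature vector per frame.
    Returns the (T, T) matrix whose entry (i, j) is the Euclidean
    distance between the features of frames i and j.
    """
    diff = frame_features[:, None, :] - frame_features[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy usage: 100 frames of 2D positions for 13 joints (26 values per frame).
ssm = self_similarity_matrix(np.random.rand(100, 26))
assert ssm.shape == (100, 100)
```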
Another view-invariant approach, proposed by Anwaar-ul-Haq et al. [9], is based on dense optical flow and epipolar geometry. The authors propose a novel similarity score for action matching based on properties of the segmentation matrix, or two-body fundamental matrix. It helps establish a view-invariant action matching framework without any preprocessing of the original video sequences.
Figure 1.3 a) Original video of walking action with viewpoints 0° and 45°, their volumes and silhouettes, b) epipolar geometry in the case of extracted actor body silhouettes, c) epipolar geometry in the case of a dynamic scene with a dynamic actor and static background, without extracting silhouettes [9]
o Combination of information from multiple views
The second approach combines information from different views [11, 12, 13]. Unlike the view-invariant approach, it exploits the fact that different views hold different amounts of information and may complement one another; for example, parts of the human body may be occluded in one view but captured in another. This combination is similar to merging views into a 3D representation as in the 3D approach, but 2D approaches can combine information at different stages of the classification problem: combining features from different views before classification, or combining the results after performing classification on each view, which may yield final classification results with higher accuracy. The two issues to address in this approach are (1) which features to use to characterize the action and (2) how to combine information between views in order to achieve the best prediction results.
For problem (1), according to [14], action representations can be divided into global representations and local-feature representations. Global representations often record the structure, shape, and movement of the human body. Two examples of global representations are MHI and MEI [15]. The idea of these two methods is to encode information about human motion and shape over a sequence of images into a single "template" image (Fig. 1.4), from which the needed information can then be exploited. Global representations were studied extensively for action recognition in the period 1997-2007, and they often preserve the spatial and temporal structure of the action.
Figure 1.4 MHI (middle row) and MEI (last row) templates [15]
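As an illustration, the following is a minimal sketch of the two templates, assuming binary foreground masks (e.g., from background subtraction) are already available; the decay parameter is illustrative.

```python
import numpy as np

def motion_templates(masks: np.ndarray, tau: float = 30.0):
    """Compute MEI and MHI templates from binary motion masks.

    masks: (T, H, W) array, 1 where motion/foreground was detected.
    MEI marks every pixel where motion occurred during the sequence;
    MHI stores a recency-weighted value: moving pixels are refreshed
    to tau, all others decay by 1 per frame (Bobick and Davis [15]).
    """
    mei = masks.any(axis=0).astype(np.float32)
    mhi = np.zeros(masks.shape[1:], dtype=np.float32)
    for mask in masks:
        mhi = np.where(mask > 0, tau, np.maximum(mhi - 1.0, 0.0))
    return mei, mhi / tau  # MHI normalized to [0, 1]
```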
Local representations of action use a pipeline consisting of interest point detection, local feature extraction, and aggregation of the local features into an action representation vector. To detect feature points in video, many detectors have been proposed: Harris3D [16], Cuboid [17], Hessian3D [18], and dense sampling. After detection, a descriptor is computed for each point based on the image intensities in its three-dimensional (space-time) neighborhood; the most commonly used descriptors are Cuboid, HOG/HOF, HOG3D, and ESURF. Next, the local descriptors are used to train a BoW model, which provides a descriptor vector for each video. Finally, this video representation is passed through a classifier to label the action. According to [19], which evaluated combinations of detectors and descriptors on two datasets, KTH achieved the highest accuracy when combining the Harris3D detector and the HOF descriptor, while UCF Sports achieved the highest accuracy when combining dense sampling and the HOG3D descriptor. In summary, it is difficult to judge which combination is best for every situation.
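The following is a minimal sketch of this pipeline under simplifying assumptions: the local descriptors of each video are given as arrays, the codebook is learned with K-means, and a linear SVM performs the labeling (detector and descriptor choices vary, as discussed above).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_histogram(descriptors, codebook, k):
    """Quantize a video's local descriptors and L1-normalize the counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def train_bow_pipeline(videos_descriptors, labels, k=500):
    """videos_descriptors: list of (n_i, d) arrays, one per training video."""
    codebook = KMeans(n_clusters=k, n_init=10).fit(np.vstack(videos_descriptors))
    X = np.stack([bow_histogram(d, codebook, k) for d in videos_descriptors])
    classifier = LinearSVC().fit(X, labels)
    return codebook, classifier
```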
For problem (2), the two most common ways of combining information from different views are early fusion and late fusion. In early fusion, the feature descriptors from the different views are concatenated into one final action vector before being fed into a common classifier. In late fusion, one classifier is trained per view and their outputs are combined to produce the final result. G. Burghouts et al. [11] use the STIP feature and the BoW model; their results show that late fusion achieved the highest accuracy on the IXMAS dataset. R. Kavi et al. [20] extract features from LMEI (similar to the MEI template but with more spatially distributed information) and feed them into an LDA classifier for each individual view; the final result is obtained by combining the outputs of the LDAs. Another method by R. Kavi et al. [21], using an LSTM ConvNet structure, reports results for both early and late fusion strategies. The results in these articles showed that late fusion achieves higher recognition accuracy than early fusion. There are two possible explanations. First, occlusion in some views causes erroneous or missing feature extraction in those views, so the vector representing the final action is less effective for classification. Second, when a multi-camera system is used and a person chooses different positions and directions to perform actions, early fusion produces vectors that represent the same action but are poorly correlated, because the person's appearance differs from camera to camera; the accuracy of early fusion therefore decreases. Late fusion is also affected by this issue; however, when a classifier is trained per view, it is enough for one view to assign a high probability to the correct action class for the final prediction to be correct. To improve efficiency when the position and direction of the performed actions differ between the training and evaluation data, R. Kavi et al. […]
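To fix ideas, the two fusion strategies can be sketched as follows; this is a schematic under the assumption that one BoW histogram per view is available for each sample, not the exact configuration of [11] or [20].

```python
import numpy as np
from sklearn.svm import SVC

def train_early_fusion(view_histograms, labels):
    """Concatenate the per-view descriptors, then train a single classifier.

    view_histograms: list of (n_samples, d) arrays, one array per view."""
    return SVC(probability=True).fit(np.hstack(view_histograms), labels)

def train_late_fusion(view_histograms, labels):
    """Train one classifier per view; fusion happens at prediction time."""
    return [SVC(probability=True).fit(X, labels) for X in view_histograms]

def predict_late_fusion(classifiers, view_histograms):
    """Average per-view posterior probabilities, then take the arg-max."""
    probabilities = np.mean(
        [clf.predict_proba(X) for clf, X in zip(classifiers, view_histograms)],
        axis=0)
    return classifiers[0].classes_[probabilities.argmax(axis=1)]
```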
Section 1.2 introduces the baseline method that we adopt and for which we propose an improvement framework.
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words model
G. Burghouts et al. [11] proposed a BoW pipeline consisting of STIP features [10] (using the Harris3D detector and the HOG/HOF descriptor) extracted from video (Fig. 1.5), a random forest model [22] to transform the features into histograms that serve as the video descriptor, and an SVM classifier to predict the action class. The authors experimented with several strategies to combine the information from multiple views: combining features (early fusion), combining video descriptors (intermediate fusion), and combining posterior probabilities (late fusion) (Fig. 1.6). Experimental results showed that averaging the prediction probabilities from all views gained the highest accuracy on the IXMAS dataset.
Figure 1.5 Illustration of spatio-temporal interest points detected in a video of a person clapping [16]
Figure 1.6 Three ways to combine information from multiple 2D views in the BoW model [11]
Assume that we have observations from $M$ views capturing human actions and a set of $N$ different actions to be recognized. For a specific view, we perform the following steps. First, STIP features are extracted from the video. For each detected local keypoint, a histogram of oriented gradients (HOG) and a histogram of optical flow (HOF) are computed over a 3 × 3 × 2 spatio-temporal block to capture shape and motion information in the local neighborhood of the point. By concatenating the HOG and HOF histograms, we obtain a descriptor with 162 values.
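This dimensionality is consistent with the standard STIP descriptor layout: the 3 × 3 × 2 grid yields 18 cells, the HOG part uses 4 orientation bins per cell (18 × 4 = 72 values), and the HOF part uses 5 bins per cell, four flow directions plus a no-motion bin (18 × 5 = 90 values), so the concatenation has 72 + 90 = 162 values.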
Secondly, for each action, a random forest model is trained to learn the codebook. A random forest is used instead of K-means clustering because of its ability to create a more discriminative codebook and its speed [23]. The training input consists of a set of positive features from the action class and a set of negative features from the other classes. For view $m$ we thus obtain a set of codebooks:

$$D_m = \{D_m^1, D_m^2, \dots, D_m^N\} \qquad (1.1)$$

where $D_m^i$ is the codebook for the $i$-th action in the $m$-th view.
The next step is to quantize the STIP features of a video into a histogram: passing the STIP features through the learned forests yields a 320-bin normalized histogram describing the video. Finally, this descriptor is used to train a binary classifier $SVM_m^i$ for the $i$-th class. Corresponding to the codebooks, we have a set of binary SVMs:

$$SVM_m = \{SVM_m^1, SVM_m^2, \dots, SVM_m^N\} \qquad (1.2)$$

where $SVM_m^i$ is the classifier for the $i$-th action in the $m$-th view.
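As an illustration, the following is a minimal sketch of the codebook learning and quantization steps, assuming STIP descriptors are given as arrays. The forest configuration (here 10 trees with at most 32 leaves each, i.e. up to 320 bins) is an assumption chosen to be consistent with the 320-bin histogram mentioned above, not the exact settings of [11].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_codebook(positive_descs, negative_descs, n_trees=10, max_leaves=32):
    """Binary forest separating one action's STIP features from negatives.
    The leaves of all trees act as visual words (up to n_trees * max_leaves)."""
    X = np.vstack([positive_descs, negative_descs])
    y = np.r_[np.ones(len(positive_descs)), np.zeros(len(negative_descs))]
    return RandomForestClassifier(
        n_estimators=n_trees, max_leaf_nodes=max_leaves).fit(X, y)

def quantize(forest, descriptors):
    """Drop each descriptor down every tree and histogram the leaves reached."""
    leaf_of = forest.apply(descriptors)  # (n_descriptors, n_trees) node ids
    per_tree_hists = []
    for t, estimator in enumerate(forest.estimators_):
        leaf_nodes = np.flatnonzero(estimator.tree_.children_left == -1)
        index = {node: i for i, node in enumerate(leaf_nodes)}
        counts = np.zeros(len(leaf_nodes))
        for node in leaf_of[:, t]:
            counts[index[node]] += 1
        per_tree_hists.append(counts)
    hist = np.concatenate(per_tree_hists)
    return hist / max(hist.sum(), 1.0)  # normalized video descriptor
```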
The SVMs are trained using a chi-squared (χ²) kernel (C = 1), and the SVM outputs are converted to posterior probabilities. In the binary case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM's scores, fit by an additional cross-validation on the training data. For each test sample in the $m$-th view, we obtain a set of probabilities:

$$P_m = \{P_m^1, P_m^2, \dots, P_m^N\} \qquad (1.3)$$

where $P_m^i$ is the probability of the $i$-th action in the $m$-th view.
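A minimal sketch of this per-view training, assuming the 320-bin video histograms and integer action labels are available: scikit-learn's `SVC(probability=True)` calibrates probabilities with Platt scaling via internal cross-validation, matching the calibration described above, and the χ² kernel is supplied as a precomputed Gram matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_view_svms(train_hists, labels, n_actions):
    """One binary chi-squared-kernel SVM per action class (one-vs-rest).

    train_hists: (n_train, 320) nonnegative histograms; labels: int array."""
    K = chi2_kernel(train_hists)  # (n_train, n_train) Gram matrix
    svms = []
    for i in range(n_actions):
        y = (labels == i).astype(int)
        svms.append(
            SVC(kernel="precomputed", C=1.0, probability=True).fit(K, y))
    return svms

def view_posteriors(svms, test_hists, train_hists):
    """P_m^i for every test video: one column per action class."""
    K = chi2_kernel(test_hists, train_hists)  # (n_test, n_train)
    # predict_proba orders columns by classes_, i.e. [0, 1]; take class 1.
    return np.column_stack([svm.predict_proba(K)[:, 1] for svm in svms])
```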
Then, the posterior probabilities from all views are combined by taking their average:

$$P^i = \frac{1}{M} \sum_{m=1}^{M} P_m^i \qquad (1.4)$$

and the action with the highest averaged probability is selected.
This method showed good performance on multi-view human action recognition, achieving 96.4% accuracy on the IXMAS dataset (with selective negative samples for the random forest). We also tested this method with random negative samples for the random forest and obtained 88% accuracy on the same dataset. However, the local STIP descriptor provides shape and motion information around the keypoint but lacks location information (both spatial and temporal coordinates). Moreover, the BoW model provides the distribution of visual words in a video but discards their order of appearance as well as the spatial correlation between them. These factors may lead to confusion between actions that have similar local information but actually differ in relative positions (such as arms versus legs) or in temporal order. Based on this observation, several methods add the spatial/temporal information of local features into the BoW model and show improved performance on single-view human action recognition [24, 25]. Parul Shukla et al. [24] divide a video into several parts based on the time domain. M. Ullah et al. [25] divide the detected spatial domain into several smaller parts using action, motion, and object detectors. The common goal is to obtain a final descriptor that carries more information by combining information from the smaller spatial/temporal parts.
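To make the idea concrete, the following is a minimal sketch in the spirit of the temporal division of [24], assuming each local feature carries a frame index and an already-quantized visual word; the thesis's own T-BoW and S-BoW variants are presented in Chapter 2.

```python
import numpy as np

def temporal_bow(frame_idx, words, n_frames, n_segments=3, k=500):
    """Concatenate one BoW histogram per temporal segment of the video.

    frame_idx: frame index of each local feature (array of ints).
    words: codebook word of each local feature, values in [0, k).
    Returns a vector of n_segments * k values, L1-normalized.
    """
    bounds = np.linspace(0, n_frames, n_segments + 1)
    segment = np.clip(np.searchsorted(bounds, frame_idx, side="right") - 1,
                      0, n_segments - 1)
    hist = np.zeros(n_segments * k)
    np.add.at(hist, segment * k + words, 1.0)  # scatter-add feature counts
    return hist / max(hist.sum(), 1.0)
```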
Based on the ideas mentioned above, we propose a framework for human action recognition using multiple views, which is described in detail in the next chapter.