IMPROVING MULTI-VIEW HUMAN ACTION RECOGNITION
WITH SPATIAL-TEMPORAL POOLING AND
VIEW SHIFTING TECHNIQUES
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
ACKNOWLEDGEMENT
First of all, I sincerely thank the teachers of the School of Information and Communication Technology, as well as all the teachers at Hanoi University of Science and Technology, who have taught me valuable knowledge and experience during the past five years.
I would like to thank my two supervisors, Dr. Nguyen Thi Oanh, lecturer in Information Systems and Communication, School of Information and Communication Technology, Hanoi University of Science and Technology, and Dr. Tran Thi Thanh Hai, MICA Research Institute, who have guided me to complete this master's thesis. I have learned a lot from them, not only knowledge in the field of computer vision but also working and studying skills such as writing papers, preparing slides, and presenting to an audience.
Finally, I would like to send my thanks to my family, friends, and everyone who has supported me throughout the process of studying and researching this thesis.
Hanoi, March 2018
Master student
Tuan Dung LE
TABLE OF CONTENTS
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS
INTRODUCTION
CHAPTER 1 HUMAN ACTION RECOGNITION APPROACHES
1.1 Overview
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words model
CHAPTER 2 PROPOSED FRAMEWORK
2.1 General framework
2.2 Combination of spatial/temporal information and Bag-of-Words model
2.2.1 Combination of spatial information and Bag-of-Words model (S-BoW)
2.2.2 Combination of temporal information and Bag-of-Words model (T-BoW)
2.3 View shifting technique
CHAPTER 3 EXPERIMENTS
3.1 Setup environment
3.2 Setup
3.3 Datasets
3.3.1 West Virginia University Multi-view Action Recognition Dataset (WVU)
3.3.2 Northwestern-UCLA Multiview Action 3D (N-UCLA)
3.4 Performance measurement
3.5 Experiment results
3.5.1 WVU dataset
3.5.2 N-UCLA dataset
CONCLUSION & FUTURE WORK
REFERENCES
APPENDIX 1
LIST OF FIGURES
Figure 1.1 a) human body in frame, b) binary silhouettes, c) 3D human pose (visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body model, g) cylindrical/ellipsoid human body model [1]
Figure 1.2 Construction of the HOG-HOF descriptor based on the SSM matrix [6]
Figure 1.3 a) Original video of walking action with viewpoints 0° and 45°, their volumes and silhouettes, b) epipolar geometry in the case of extracted actor body silhouettes, c) epipolar geometry in the case of a dynamic scene with a dynamic actor and static background, without extracting silhouettes [9]
Figure 1.4 MHI (middle row) and MEI (last row) templates [15]
Figure 1.5 Illustration of spatio-temporal interest points detected in a video of a person clapping [16]
Figure 1.6 Three ways to combine information from multiple 2D views in the BoW model [11]
Figure 2.1 Proposed framework
Figure 2.2 Dividing the spatial domain based on bounding box and centroid
Figure 2.3 Illustration of the T-BoW model
Figure 2.4 Illustration of view shifting in the testing phase
Figure 3.1 Illustration of the 12 action classes in the WVU multi-view actions dataset
Figure 3.2 Camera setup for capturing the WVU dataset
Figure 3.3 Illustration of the 10 action classes in the N-UCLA Multiview Action 3D dataset
Figure 3.4 Camera setup for capturing the N-UCLA dataset
Figure 3.5 Illustration of a confusion matrix
Figure 3.6 Confusion matrices: a) basic BoW model with codebook D3, accuracy 70.83%; b) S-BoW model with 4 spatial parts, codebook D3, accuracy 82.41%
Figure 3.7 Confusion matrices: a) S-BoW model with 6 spatial parts, codebook D3, accuracy 78.24%; b) S-BoW model with 6 spatial parts and view shifting, codebook D3, accuracy 96.67%
Figure 3.8 Confusion matrices: a) basic BoW model, codebook D3, accuracy 59.57%; b) S-BoW model with 6 spatial parts, codebook D3, accuracy 63.40%
Figure 3.9 Illustration of view shifting on the N-UCLA dataset
LIST OF TABLES
Table 3.1 Accuracy (%) of the basic BoW model on the WVU dataset
Table 3.2 Accuracy (%) of the T-BoW model on the WVU dataset
Table 3.3 Accuracy (%) of the S-BoW model on the WVU dataset
Table 3.4 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the WVU dataset
Table 3.5 Comparison with other methods on the WVU dataset
Table 3.6 Accuracy (%) of the basic model on the N-UCLA dataset
Table 3.7 Accuracy (%) of the T-BoW model on the N-UCLA dataset
Table 3.8 Accuracy (%) of the combination of the S-BoW model and view shifting on the N-UCLA dataset
Table 3.9 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the N-UCLA dataset
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS
BoW: Bag-of-Words
HOG: Histogram of Oriented Gradients
HOF: Histogram of Optical Flow
MEI: Motion Energy Image
MHI: Motion History Image
N-UCLA: Northwestern-UCLA Multiview Action 3D dataset
S-BoW: Bag-of-Words combined with spatial information
SSM: Self-Similarity Matrix
STIP: Spatio-Temporal Interest Points
SVM: Support Vector Machine
T-BoW: Bag-of-Words combined with temporal information
WVU: West Virginia University multi-view action recognition dataset
INTRODUCTION
In the growing transition from the 3.0 era (automation of information technology and electronic production) to the new 4.0 era (a convergence of technologies such as the Internet of Things, collaborative robots, 3D printing, and cloud computing, together with the emergence of new business models), the automatic collection and processing of information by computers has become essential. This leads to higher demands on the interaction between humans and machines, in both precision and speed. Thus, the problems of object recognition, motion recognition, and speech recognition now attract a lot of interest from scientists and companies around the world. Nowadays, video data is easily generated by devices such as digital cameras, laptops, and mobile phones, and spread through video-sharing websites. Human action recognition in video contributes to the automated exploitation of this rich data source.
Human action recognition has many applications. Traditional security and monitoring systems consist of camera networks monitored by humans; with the increase in the number of cameras and the deployment of these systems in multiple locations, supervisors face efficiency and accuracy problems in covering the entire system. The task of computer vision is to find solutions that can replace or assist the supervisor, and the automatic recognition of abnormal events in surveillance footage attracts a lot of research. Enhancing interaction between humans and machines is still challenging; visual cues are the most important method of non-verbal communication, and effectively exploiting gesture-based communication will create more accurate and natural human-computer interaction. A typical application in this field is the "smart home", which responds intelligently to the gestures and actions of the user. However, these applications are still incomplete and continue to attract research. In addition, human action recognition is also applied in a number of other areas, such as robotics, content-based video analysis and retrieval, video compression, video indexing, and virtual reality games.
With the aim of studying and approaching the problem of human action recognition using a combination of multiple views, we explored some recent approaches and chose to experiment with a method that combines local features and the Bag-of-Words model. After analyzing the weaknesses of this method, we propose an improvement and evaluate it experimentally. The thesis is organized as follows:
Chapter 1: This chapter surveys existing approaches to give the reader an overview of the human action recognition problem in general, and of multi-view recognition in particular. The last part of the chapter introduces a method combining local features and the Bag-of-Words model, evaluates its advantages and disadvantages, and then outlines the proposed improvements.
Chapter 2: This chapter presents an improved framework using a combination of spatial/temporal information and view shifting techniques.
Chapter 3: This chapter reports experiments with the proposed method and presents the results with some evaluation.
Conclusion and Future Work: This section reviews what has and has not been accomplished in the master's thesis, highlights the pros and cons, and outlines future development.
References
CHAPTER 1 HUMAN ACTION RECOGNITION APPROACHES
1.1 Overview
Recognition and analysis of human actions has been a subject of great interest over the past three decades and is currently an active research topic in the field of computer vision. It underpins a large number of potential applications in intelligent monitoring, video retrieval, video analysis, and human-machine interaction. Recent research has highlighted the difficulty of this problem, owing to large variations in human action data: variability in the way individuals perform actions; movement and clothing; camera angles and motion effects; lighting changes; occlusion caused by objects in the environment or by parts of the human body; and disturbances in the surroundings. Because so many factors can affect the outcome, current methods are often limited to simple scenarios with simple backgrounds, simple action classes, and stationary cameras, or they limit the variation in viewing angles.
Many different approaches have been proposed over the years for human action recognition. These approaches may be categorized by the visual information used to describe the action. Single-view methods use one camera to record the human body during the execution of the action. However, the appearance of an action is quite different when viewed from arbitrary angles, so single-view methods are usually accompanied by a basic assumption that the action is observed from the same angle in both the training data and the testing data; their efficiency is significantly reduced if this assumption does not hold. The obvious way to improve the accuracy of human action recognition is to increase the number of views per action by increasing the number of cameras, which enables us to exploit a larger amount of visual information to describe an action. The multi-view approach has been studied for only about a decade, because the limited capabilities of devices and tools in previous decades did not adequately meet the computational demands of these methods. Recent technological advances have brought powerful tools that allow the multi-view approach to become practical in a variety of application contexts. Action recognition methods can be divided into two broad families: the traditional approach using hand-crafted features and the approach based on neural networks. Neural network approaches typically require large training sets, otherwise they are ineffective; in practical applications, datasets are usually small or medium in size. Therefore, in the context of this study, we are interested in the traditional approach that utilizes hand-crafted features. In this approach, the action representation can be constructed from 2D data (2D approaches) or from 3D data (3D approaches) [1].
3D approaches
The general trend in 3D methods is to integrate visual information captured from various viewing angles, then represent actions by a 3D model. This is usually achieved by combining 2D human body poses, in the form of binary silhouettes marking the video frame pixels that belong to the human body in each camera (Fig. 1.1b). After obtaining the corresponding 3D human body representation, actions are described as sequences of successive 3D human body poses. Human body representations adopted by 3D methods include visual hulls (Fig. 1.1c), motion history volumes (Fig. 1.1d) [2], optical flow corresponding to the human body (Fig. 1.1e) [3], Gaussian blobs (Fig. 1.1f) [4], and cylindrical/ellipsoid body models (Fig. 1.1g) [5].
Figure 1.1 a) human body in frame, b) binary silhouettes, c) 3D human pose (visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body model, g) cylindrical/ellipsoid human body model [1]
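As an illustration of the visual hull idea, the following is a minimal Python sketch of silhouette-based voxel carving, assuming calibrated cameras (3x4 projection matrices) and already-extracted binary silhouettes; real systems typically use octrees or GPU implementations for efficiency, so this is a conceptual sketch rather than any cited method.

```python
import numpy as np

def visual_hull(silhouettes, projections, voxels):
    """Keep the voxels whose projection falls inside every camera's silhouette.

    silhouettes: list of (H, W) binary masks, one per camera.
    projections: list of 3x4 camera projection matrices.
    voxels: (V, 3) array of candidate voxel centers in world coordinates.
    """
    keep = np.ones(len(voxels), dtype=bool)
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])  # (V, 4)
    for mask, P in zip(silhouettes, projections):
        uvw = homog @ P.T                                   # project to image
        w = uvw[:, 2]
        safe_w = np.where(w > 1e-9, w, 1.0)  # guard; such voxels fail w > 0 below
        u = np.round(uvw[:, 0] / safe_w).astype(int)
        v = np.round(uvw[:, 1] / safe_w).astype(int)
        inside = (w > 1e-9) & (u >= 0) & (u < mask.shape[1]) \
                 & (v >= 0) & (v < mask.shape[0])
        keep &= inside                                      # carve off-image voxels
        keep[inside] &= mask[v[inside], u[inside]] > 0      # carve background voxels
    return voxels[keep]
```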
However, 3D methods have an important limitation: parts of the human body may not be visible in all cameras because they are outside of a camera's recording area or are occluded by other objects. Obviously, when there is not enough information from all the cameras, it is impossible to obtain an accurate 3D description of the human body, and false predictions result.
2D approaches
Multi-view 2D methods, on the other hand, can overcome the drawback mentioned above. 2D methods tend to look for features that are invariant across viewing angles, or to combine the per-view predictions of the action class; thus, missing information from one view does not affect the results. Multi-view 2D methods are often divided into two smaller approaches:
o View-invariant features
The first approach tries to represent the action with features that are invariant to the view [6, 7, 8, 9, 10]. Action recognition is performed on the video from each camera independently: the methods first represent the action by view-invariant features, and the action class is then determined from this invariant representation.
A view-invariant approach proposed by N. Junejo et al. [6] is to compute the similarity of a series of images over time and to exploit the stability of this pattern across multiple viewing angles. It builds a descriptor that records the structural characteristics of the similarities and time differences within an action sequence. First, from each video, the authors compute distances between pairs of frames, building a self-similarity matrix called SSM-pos for 3D data and SSM-HOG-HOF for 2D data. Next, from the SSM, the authors extract local SSM descriptors (Fig. 1.2) and feed them into the K-means clustering algorithm; each cluster corresponds to a word in the dictionary (BoW approach). Finally, an SVM classifier with a chi-squared kernel is trained with a one-vs-all strategy. The advantage of the method is that it achieves high stability under varying viewpoints.
Figure 1.2 Construction of the HOG-HOF descriptor based on the SSM matrix [6]
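To make the SSM construction concrete, the following is a minimal sketch, assuming that one feature vector per frame (for example, stacked joint positions for SSM-pos, or a per-frame HOG for the 2D case) has already been extracted; the local descriptor extraction of Fig. 1.2 is not reproduced here.

```python
import numpy as np

def self_similarity_matrix(frame_features: np.ndarray) -> np.ndarray:
    """Build the SSM of a sequence.

    frame_features: (T, d) array holding one feature vector per frame.
    Returns the (T, T) matrix whose entry (i, j) is the Euclidean
    distance between the features of frames i and j.
    """
    diff = frame_features[:, None, :] - frame_features[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy usage: 100 frames of 2D positions for 13 joints (26 values per frame).
ssm = self_similarity_matrix(np.random.rand(100, 26))
assert ssm.shape == (100, 100)
```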
Another view-invariant approach, proposed by Anwaar-ul-Haq et al. [9], is based on dense optical flow and epipolar geometry. The authors propose a novel similarity score for action matching based on properties of the segmentation matrix, or two-body fundamental matrix. It helps establish a view-invariant action matching framework without any preprocessing of the original video sequences.
Figure 1.3 a) Original video of walking action with viewpoints 0° and 45°, their volumes and silhouettes, b) epipolar geometry in the case of extracted actor body silhouettes, c) epipolar geometry in the case of a dynamic scene with a dynamic actor and static background, without extracting silhouettes [9]
o Combination of information from multiple views
The second approach combines information from different views [11, 12, 13]. Unlike the view-invariant approach, it exploits the fact that different views hold different amounts of information and may complement one another; for example, parts of the human body may be occluded in one view but captured in another. This combination is similar to merging views into a 3D representation as in the 3D approach, but 2D approaches can combine information at different stages of the classification problem: combining features from different views before classification, or combining the results after performing classification on each view, which may yield final classification results with higher accuracy. The two issues to address in this approach are (1) which features to use to characterize the action and (2) how to combine information between views in order to achieve the best prediction results.
For problem (1), according to [14], action representations can be divided into global representations and local-feature representations. Global representations often record the structure, shape, and movement of the human body. Two examples of global representations are MHI and MEI [15]. The idea of these two methods is to encode information about human motion and shape over a sequence of images into a single "template" image (Fig. 1.4), from which the needed information can then be exploited. Global representations were studied extensively for action recognition in the period 1997-2007, and they often preserve the spatial and temporal structure of the action.
Figure 1.4 MHI (middle row) and MEI (last row) templates [15]
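As an illustration, the following is a minimal sketch of the two templates, assuming binary foreground masks (e.g., from background subtraction) are already available; the decay parameter is illustrative.

```python
import numpy as np

def motion_templates(masks: np.ndarray, tau: float = 30.0):
    """Compute MEI and MHI templates from binary motion masks.

    masks: (T, H, W) array, 1 where motion/foreground was detected.
    MEI marks every pixel where motion occurred during the sequence;
    MHI stores a recency-weighted value: moving pixels are refreshed
    to tau, all others decay by 1 per frame (Bobick and Davis [15]).
    """
    mei = masks.any(axis=0).astype(np.float32)
    mhi = np.zeros(masks.shape[1:], dtype=np.float32)
    for mask in masks:
        mhi = np.where(mask > 0, tau, np.maximum(mhi - 1.0, 0.0))
    return mei, mhi / tau  # MHI normalized to [0, 1]
```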
Local representations of action use a pipeline consisting of interest point detection, local feature extraction, and aggregation of the local features into an action representation vector. To detect feature points in video, many detectors have been proposed: Harris3D [16], Cuboid [17], Hessian3D [18], and dense sampling. After detection, a descriptor is computed for each point based on the image intensities in its three-dimensional (space-time) neighborhood; the most commonly used descriptors are Cuboid, HOG/HOF, HOG3D, and ESURF. Next, the local descriptors are used to train a BoW model, which provides a descriptor vector for each video. Finally, this video representation is passed through a classifier to label the action. According to [19], which evaluated combinations of detectors and descriptors on two datasets, KTH achieved the highest accuracy when combining the Harris3D detector and the HOF descriptor, while UCF Sports achieved the highest accuracy when combining dense sampling and the HOG3D descriptor. In summary, it is difficult to judge which combination is best for every situation.
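The following is a minimal sketch of this pipeline under simplifying assumptions: the local descriptors of each video are given as arrays, the codebook is learned with K-means, and a linear SVM performs the labeling (detector and descriptor choices vary, as discussed above).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_histogram(descriptors, codebook, k):
    """Quantize a video's local descriptors and L1-normalize the counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def train_bow_pipeline(videos_descriptors, labels, k=500):
    """videos_descriptors: list of (n_i, d) arrays, one per training video."""
    codebook = KMeans(n_clusters=k, n_init=10).fit(np.vstack(videos_descriptors))
    X = np.stack([bow_histogram(d, codebook, k) for d in videos_descriptors])
    classifier = LinearSVC().fit(X, labels)
    return codebook, classifier
```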
For problem (2), the two most common ways of combining information from different views are early fusion and late fusion. In early fusion, the feature descriptors from the different views are concatenated into one final action vector before being fed into a common classifier. In late fusion, one classifier is trained per view and their outputs are combined to produce the final result. G. Burghouts et al. [11] use the STIP feature and the BoW model; their results show that late fusion achieved the highest accuracy on the IXMAS dataset. R. Kavi et al. [20] extract features from LMEI (similar to the MEI template but with more spatially distributed information) and feed them into an LDA classifier for each individual view; the final result is obtained by combining the outputs of the LDAs. Another method by R. Kavi et al. [21], using an LSTM ConvNet structure, reports results for both early and late fusion strategies. The results in these articles showed that late fusion achieves higher recognition accuracy than early fusion. There are two possible explanations. First, occlusion in some views causes erroneous or missing feature extraction in those views, so the vector representing the final action is less effective for classification. Second, when a multi-camera system is used and a person chooses different positions and directions to perform actions, early fusion produces vectors that represent the same action but are poorly correlated, because the person's appearance differs from camera to camera; the accuracy of early fusion therefore decreases. Late fusion is also affected by this issue; however, when a classifier is trained per view, it is enough for one view to assign a high probability to the correct action class for the final prediction to be correct. To improve efficiency when the position and direction of the performed actions differ between the training and evaluation data, R. Kavi et al. […]
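To fix ideas, the two fusion strategies can be sketched as follows; this is a schematic under the assumption that one BoW histogram per view is available for each sample, not the exact configuration of [11] or [20].

```python
import numpy as np
from sklearn.svm import SVC

def train_early_fusion(view_histograms, labels):
    """Concatenate the per-view descriptors, then train a single classifier.

    view_histograms: list of (n_samples, d) arrays, one array per view."""
    return SVC(probability=True).fit(np.hstack(view_histograms), labels)

def train_late_fusion(view_histograms, labels):
    """Train one classifier per view; fusion happens at prediction time."""
    return [SVC(probability=True).fit(X, labels) for X in view_histograms]

def predict_late_fusion(classifiers, view_histograms):
    """Average per-view posterior probabilities, then take the arg-max."""
    probabilities = np.mean(
        [clf.predict_proba(X) for clf, X in zip(classifiers, view_histograms)],
        axis=0)
    return classifiers[0].classes_[probabilities.argmax(axis=1)]
```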
Section 1.2 introduces the baseline method that we adopt and for which we propose an improvement framework.
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words model
G. Burghouts et al. [11] proposed a BoW pipeline consisting of STIP features [10] (using the Harris3D detector and the HOG/HOF descriptor) extracted from video (Fig. 1.5), a random forest model [22] to transform the features into histograms that serve as the video descriptor, and an SVM classifier to predict the action class. The authors experimented with several strategies to combine the information from multiple views: combining features (early fusion), combining video descriptors (intermediate fusion), and combining posterior probabilities (late fusion) (Fig. 1.6). Experimental results showed that averaging the prediction probabilities from all views gained the highest accuracy on the IXMAS dataset.
Figure 1.5 Illustration of spatio-temporal interest points detected in a video of a person clapping [16]
Figure 1.6 Three ways to combine information from multiple 2D views in the BoW model [11]
Assume that we have observations from $M$ views capturing human actions and a set of $N$ different actions to be recognized. For a specific view, we perform the following steps. First, STIP features are extracted from the video. For each detected local keypoint, a histogram of oriented gradients (HOG) and a histogram of optical flow (HOF) are computed over a 3 × 3 × 2 spatio-temporal block to capture shape and motion information in the local neighborhood of the point. By concatenating the HOG and HOF histograms, we obtain a descriptor with 162 values.
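This dimensionality is consistent with the standard STIP descriptor layout: the 3 × 3 × 2 grid yields 18 cells, the HOG part uses 4 orientation bins per cell (18 × 4 = 72 values), and the HOF part uses 5 bins per cell, four flow directions plus a no-motion bin (18 × 5 = 90 values), so the concatenation has 72 + 90 = 162 values.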
Secondly, for each action, a random forest model is trained to learn the codebook. A random forest is used instead of K-means clustering because of its ability to create a more discriminative codebook and its speed [23]. The training input consists of a set of positive features from the action class and a set of negative features from the other classes. For view $m$ we thus obtain a set of codebooks:

$$D_m = \{D_m^1, D_m^2, \dots, D_m^N\} \qquad (1.1)$$

where $D_m^i$ is the codebook for the $i$-th action in the $m$-th view.
The next step is to quantize the STIP features of a video into a histogram: passing the STIP features through the learned forests yields a 320-bin normalized histogram describing the video. Finally, this descriptor is used to train a binary classifier $SVM_m^i$ for the $i$-th class. Corresponding to the codebooks, we have a set of binary SVMs:

$$SVM_m = \{SVM_m^1, SVM_m^2, \dots, SVM_m^N\} \qquad (1.2)$$

where $SVM_m^i$ is the classifier for the $i$-th action in the $m$-th view.
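As an illustration, the following is a minimal sketch of the codebook learning and quantization steps, assuming STIP descriptors are given as arrays. The forest configuration (here 10 trees with at most 32 leaves each, i.e. up to 320 bins) is an assumption chosen to be consistent with the 320-bin histogram mentioned above, not the exact settings of [11].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_codebook(positive_descs, negative_descs, n_trees=10, max_leaves=32):
    """Binary forest separating one action's STIP features from negatives.
    The leaves of all trees act as visual words (up to n_trees * max_leaves)."""
    X = np.vstack([positive_descs, negative_descs])
    y = np.r_[np.ones(len(positive_descs)), np.zeros(len(negative_descs))]
    return RandomForestClassifier(
        n_estimators=n_trees, max_leaf_nodes=max_leaves).fit(X, y)

def quantize(forest, descriptors):
    """Drop each descriptor down every tree and histogram the leaves reached."""
    leaf_of = forest.apply(descriptors)  # (n_descriptors, n_trees) node ids
    per_tree_hists = []
    for t, estimator in enumerate(forest.estimators_):
        leaf_nodes = np.flatnonzero(estimator.tree_.children_left == -1)
        index = {node: i for i, node in enumerate(leaf_nodes)}
        counts = np.zeros(len(leaf_nodes))
        for node in leaf_of[:, t]:
            counts[index[node]] += 1
        per_tree_hists.append(counts)
    hist = np.concatenate(per_tree_hists)
    return hist / max(hist.sum(), 1.0)  # normalized video descriptor
```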
The SVMs are trained using a chi-squared (χ²) kernel (C = 1), and the SVM outputs are converted to posterior probabilities. In the binary case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM's scores, fit by an additional cross-validation on the training data. For each test sample in the $m$-th view, we obtain a set of probabilities:

$$P_m = \{P_m^1, P_m^2, \dots, P_m^N\} \qquad (1.3)$$

where $P_m^i$ is the probability of the $i$-th action in the $m$-th view.
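A minimal sketch of this per-view training, assuming the 320-bin video histograms and integer action labels are available: scikit-learn's `SVC(probability=True)` calibrates probabilities with Platt scaling via internal cross-validation, matching the calibration described above, and the χ² kernel is supplied as a precomputed Gram matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_view_svms(train_hists, labels, n_actions):
    """One binary chi-squared-kernel SVM per action class (one-vs-rest).

    train_hists: (n_train, 320) nonnegative histograms; labels: int array."""
    K = chi2_kernel(train_hists)  # (n_train, n_train) Gram matrix
    svms = []
    for i in range(n_actions):
        y = (labels == i).astype(int)
        svms.append(
            SVC(kernel="precomputed", C=1.0, probability=True).fit(K, y))
    return svms

def view_posteriors(svms, test_hists, train_hists):
    """P_m^i for every test video: one column per action class."""
    K = chi2_kernel(test_hists, train_hists)  # (n_test, n_train)
    # predict_proba orders columns by classes_, i.e. [0, 1]; take class 1.
    return np.column_stack([svm.predict_proba(K)[:, 1] for svm in svms])
```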
Then, the posterior probabilities from all views are combined by taking their average:

$$P^i = \frac{1}{M} \sum_{m=1}^{M} P_m^i \qquad (1.4)$$

and the action with the highest averaged probability is selected.
This method showed good performance on multi-view human action recognition, achieving 96.4% accuracy on the IXMAS dataset (with selective negative samples for the random forest). We also tested this method with random negative samples for the random forest and obtained 88% accuracy on the same dataset. However, the local STIP descriptor provides shape and motion information around the keypoint but lacks location information (both spatial and temporal coordinates). Moreover, the BoW model provides the distribution of visual words in a video but discards their order of appearance as well as the spatial correlation between them. These factors may lead to confusion between actions that have similar local information but actually differ in relative positions (such as arms versus legs) or in temporal order. Based on this observation, several methods add the spatial/temporal information of local features into the BoW model and show improved performance on single-view human action recognition [24, 25]. Parul Shukla et al. [24] divide a video into several parts based on the time domain. M. Ullah et al. [25] divide the detected spatial domain into several smaller parts using action, motion, and object detectors. The common goal is to obtain a final descriptor that carries more information by combining information from the smaller spatial/temporal parts.
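To make the idea concrete, the following is a minimal sketch in the spirit of the temporal division of [24], assuming each local feature carries a frame index and an already-quantized visual word; the thesis's own T-BoW and S-BoW variants are presented in Chapter 2.

```python
import numpy as np

def temporal_bow(frame_idx, words, n_frames, n_segments=3, k=500):
    """Concatenate one BoW histogram per temporal segment of the video.

    frame_idx: frame index of each local feature (array of ints).
    words: codebook word of each local feature, values in [0, k).
    Returns a vector of n_segments * k values, L1-normalized.
    """
    bounds = np.linspace(0, n_frames, n_segments + 1)
    segment = np.clip(np.searchsorted(bounds, frame_idx, side="right") - 1,
                      0, n_segments - 1)
    hist = np.zeros(n_segments * k)
    np.add.at(hist, segment * k + words, 1.0)  # scatter-add feature counts
    return hist / max(hist.sum(), 1.0)
```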
Based on the ideas mentioned above, we propose a framework for human action recognition using multiple views, which is described in detail in the next chapter.