Moving object tracking using fully convolutional neural network


HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER THESIS

Moving object tracking using Fully Convolutional neural network

Quang Minh Bui

Minh.BQCB190097@sis.hust.edu.vn

Control Engineering and Automation

HANOI, 4/2021


Acknowledgement

This thesis would not have been possible without the inspiration and support of many people, and I would like to extend my appreciation to everyone who has been a part of the journey. First and foremost, I would like to express my sincere gratitude to my research supervisors, Dr. Pham Van Truong and Dr. Tran Thi Thao at Hanoi University of Science and Technology, for their consistent guidance, encouragement and supportive advice during my research. In particular, I would like to thank Dr. Tran Thi Thao for the initial idea that I later developed and presented in this thesis.

I am grateful to Hanoi University of Science and Technology for the scholarship that encouraged me on such a long journey. I would like to thank my colleagues at Viettel Corporation for their support with HPC-related issues, which boosted the progress of my work.

Abstract

Visual object tracking is one of the most fundamental and critical tasks in computer vision due to its wide range of uses in both civilian and military applications such as video surveillance, traffic monitoring and autonomous vehicles. The central problem of visual object tracking is to precisely determine the position of an arbitrary object in a video when its initial state is given in the first frame of the sequence. Although researchers have put a lot of effort into the problem, object tracking remains challenging because of factors that arise in many scenarios, such as occlusion and illumination change.

This thesis proposes a novel method to address those problems by adopting an appearance-based object tracking approach empowered by deep features from Siamese networks. Siamese networks put forward a simple framework for tracking that nevertheless achieves a remarkable balance between accuracy and speed. However, their performance degrades under fast target appearance changes because they discriminate poorly between the target and distractors, that is, other similar objects and cluttered background. Therefore, in this work, we propose a mechanism to better represent the object of interest and enhance the discriminative capability of the tracker. We present an improvement to Siamese networks by integrating the Convolutional Block Attention Module (CBAM) into the baseline. Attention not only tells the model to concentrate on important features but also suppresses distractors, which enhances the representation of the object of interest. As a result, the discriminative capability, adaptability and robustness of the tracker are increased. Experimental results on the two popular benchmarks OTB2015 and VOT2018 show that our approach achieves remarkable accuracy and robustness while maintaining a tracking speed that is practical for real-world applications.

Master student


Table of Contents

1.1 Background
1.1.1 Challenges in visual object tracking
1.1.2 The general framework of visual object tracking
1.2 Motivation of this study
1.3 Contribution of Thesis
1.4 Outline
1.5 Publications related to this research

Chapter 2: Literature Review
2.1 Classification of tracking methods
2.2 Generative tracking methods
2.2.1 Traditional object trackers
2.2.2 CNN-based object trackers
2.3 Discriminative tracking methods
2.3.1 Traditional object trackers
2.3.2 CNN-based object trackers
2.4 Chapter summary

Chapter 3: Proposed Method
3.1 Baseline study
3.1.1 SiamFC Architecture
3.1.2 Analysis on SiamFC tracker
3.2 Proposed Network Architecture
3.2.1 Channel Attention
3.2.2 Spatial Attention
3.3 Offline Training

Chapter 4: Experimental Result and Conclusion
4.1 Benchmark datasets
4.1.1 Tracking metrics
4.1.2 Benchmark datasets
4.2 Experimental result
4.2.1 Implementation detail
4.2.2 Ablation study
4.2.3 Comparison to other object trackers
4.4 Conclusion
4.5 Future work


List of Figures

1.1.1 Challenging cases in visual object tracking
1.1.2 General Framework of Visual Object Tracking
2.1.1 Two main categories of object tracking methods
2.1.2 Classification of appearance-based tracking methods
2.2.1 General framework of traditional generative tracking methods. The method referred to in the image is presented in [1]. Image courtesy of [2].
2.2.2 Representative illustration for Siamese based tracking methods
2.3.1 Illustration of a typical discriminative tracking framework. Image courtesy of [2].
2.3.2 General framework of correlation filter based trackers. Image courtesy of [3].
2.3.3 The general framework of CNN-based classification trackers. Image courtesy of [4].
3.1.1 SiamFC network architecture. Image courtesy of [5].
3.1.2 The similarity function in SiamFC produces high confidence scores for distracting objects of the same class as the target
3.1.3 The padded values creep further into the center as we go deeper into the network
3.2.1 Proposed network architecture
3.2.2 Overview of the CBAM block. Image courtesy of [6].
3.2.3 Channel attention architecture. Image taken from [6].
3.2.4 Illustration of average and max pooling operations
3.2.5 A representative example of an MLP structure
3.2.6 Spatial attention architecture. Image courtesy of [6].
3.3.1 Pair of exemplar and search images for training. Images are cropped from the same video. If the boundary of the window extends beyond the actual content of the image, it is padded with the average color.
4.1.1 An illustration of the overlap region. Image courtesy of [7].
4.1.2 EAO measurement illustration. Image taken from [8].
4.2.1 Similarity score map comparison between the SiamFC tracker and the proposed tracker
4.2.2 Effectiveness of our proposed CBAM modules
4.2.3 Success plot comparison on the OTB100 benchmark
4.2.4 Precision plot comparison on the OTB100 benchmark
4.2.5 Qualitative comparison of different trackers on the OTB100 benchmark. Example sequences are Ironman, Motorolling and Skiing. Best viewed in color.
4.2.6 Qualitative comparison of different trackers on the VOT2018 benchmark. Example sequences are ants, gymnastic1 and iceskater1. Best viewed in color.


List of Tables

4.1 Backbone network parameters for object representation
4.2 SiamCBAM tracker performance on OTB100 with different training procedures. + denotes whether the attention module is used in the architecture or not. ∗ denotes that the module is trained when layers 4 and 5 of the backbone network are unfrozen.
4.3 SiamCBAM tracker performance on OTB100 with different values of the reduction ratio
4.4 SiamCBAM tracker performance on VOT2018 with different values of the reduction ratio
4.5 Performance comparison between our tracker and other methods on VOT2018


to automatically provide visual analysis. There are many sub-tasks in such systems, including anomalous activity detection, video reasoning, human-computer interaction and object navigation. Most of these tasks are related to, and dependent on, the result of a visual tracking algorithm. For example, successfully following the state of a tracked object provides essential information for monitoring and thoroughly understanding its activity over time, in order to raise an alarm when an abnormal event is likely to happen. Therefore, visual tracking subsystems are crucial components of any modern intelligent visual system.

Given the initial state of an arbitrary object in the first frame of a video, the objective of a visual tracking module is to accurately identify the state of that object in the rest of the sequence. To be more specific, the initial state of the object is simply a bounding box that marks the region where the object is located in the first frame. No further information about the object of interest is provided other than the raw pixels inside the initial box. From this data, the patch of the tracked object is encoded by one of numerous methods to form a model that can be used to represent and update the state of the target in subsequent frames. Depending on the requirements of a specific application, other properties of the tracking system, such as the number of objects to be tracked, may vary, which gives rise to different approaches to the problem. There are basically two categories of visual tracking problem intensively investigated by the community:

• Single Object Tracking: focuses on developing robust algorithms to track only one object in a video.

• Multiple Object Tracking: handles the trajectories of multiple objects at the same time.

In this thesis, we focus on the first category and comprehensively discuss the main problems and solutions of visual single object tracking. In the next section, we discuss several challenging scenarios and present the general framework for constructing an efficient tracker.


1.1.1 Challenges in visual object tracking

Many challenging factors should be taken into consideration when designing an object tracking algorithm:

• Appearance change: a change in the size or shape of an object that makes its appearance in subsequent frames differ greatly from its appearance in the first frame. For example, an object moving away from or toward the camera becomes smaller or bigger, which challenges the quality of the object representation. This example is referred to as scale variation. The problem becomes even more severe if the appearance of the object transforms rapidly within a few frames, which hinders the performance of trackers because the object representation is hard to update accordingly. Such scenarios are also referred to as deformation.

Another typical type of appearance change is occlusion, which happens when an object is partially or even fully hidden. In full occlusion, the object is completely absent from the images for a certain amount of time. Therefore, the capability to memorize and generalize the object representation is crucial for a tracker to successfully follow the trajectory of the object when it reappears in subsequent frames. Furthermore, this scenario requires a search mechanism to determine where in the frame the object will reappear. This is a separate problem in visual object tracking known as Long-term Visual Object Tracking. In this thesis, we concentrate only on developing a tracker for the Short-term Visual Object Tracking problem.

• Motion blur: caused by relative movement between the recording device and the scene, which changes the light exposure while an image is being recorded. This is a common problem when working with a non-stationary camera or tracking a fast-moving object. In such cases, the appearance of the object can vary sharply, which degrades the localization process.

• Illumination change: illumination variation makes an object brighter or darker, which may cause the loss of features on the object surface and hence degrade the object appearance modelling process.

• Background clutter: when the area surrounding the target contains several similar objects or has no significant boundary with the target, object trackers are easily fooled because they cannot determine exactly where the object of interest is. In this case, object trackers are prone to drift away from the actual target.

• Real-time processing requirements: a tracking algorithm is only practical if its inference is fast enough to satisfy the operating time constraint of a particular application. Otherwise, it is only meaningful for research purposes.

There may be other challenging scenarios to address in specific tracking sequences, but the issues above are not only the most common in the real world but also the main problems evaluated in many popular tracking benchmarks. Therefore, the main objective of visual tracking researchers is to develop accurate and robust trackers that handle such problems.


Some examples of the aforementioned issues are shown in Figure 1.1.1.

Figure 1.1.1: Challenging cases in visual object tracking

1.1.2 The general framework of visual object tracking

The general framework of visual object tracking is illustrated in Figure 1.1.2. As mentioned in Section 1.1, in the first frame of a video, a bounding box surrounding the target object is initialized and used to extract important details that clearly define the object of interest in the image. This information is referred to as features. In Figure 1.1.2, the object representation phase is called feature extraction. In machine learning in general, feature extraction starts from an initial set of observed data (in this case, the initial bounding box) and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretation [9]. In computer vision, there are two main ways to extract features: traditional hand-crafted feature extractors (such as HOG [10], SIFT [11], SURF [12] and ORB [13]) and deep features (using a deep neural network to learn features from images). After feature extraction, the raw data (images) is transformed into feature vectors that feed a learning algorithm in order to build an object model or a classifier that performs the tasks required by a specific application.

In Figure 1.1.2, after constructing the object model from the initial state in the first frame, we enter the tracking loop, which operates repeatedly on subsequent frames. Whenever a new frame is fed into the tracking system, a set of candidate regions is proposed, based on the object model built so far, to mark potential locations of the object in the current frame. From those candidates, we carry out the localization process to find the actual location of the target. Once the object is located, the feature extractor is used to represent the current appearance of the object and update the object model for the next iteration. Notably, some modern tracking methods do not update the model at all.

Figure 1.1.2: General Framework of Visual Object Tracking
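The tracking loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method proposed in this thesis: the names `crop`, `propose_candidates` and `track`, the `(x, y, width, height)` box convention, the grid-shaped candidate search and the linear model update controlled by `update_rate` are all hypothetical choices made for the example. The feature extractor and the similarity function are passed in as arguments, so either hand-crafted or deep features could be plugged in.

```python
import numpy as np


def crop(frame, box):
    """Return the image patch inside box = (x, y, width, height)."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]


def propose_candidates(frame_shape, box, radius=4, step=2):
    """Same-size candidate boxes on a grid around the previous box (assumed strategy)."""
    x, y, w, h = box
    height, width = frame_shape[:2]
    candidates = []
    for dy in range(-radius, radius + 1, step):
        for dx in range(-radius, radius + 1, step):
            cx, cy = x + dx, y + dy
            # Keep only candidates that lie fully inside the frame.
            if 0 <= cx <= width - w and 0 <= cy <= height - h:
                candidates.append((cx, cy, w, h))
    return candidates


def track(frames, init_box, extract_features, similarity, update_rate=0.0):
    """Initialize on the first frame, then localize the target in every later frame."""
    frames = iter(frames)
    first = next(frames)
    # Object representation: build the model from the initial bounding box.
    model = extract_features(crop(first, init_box))
    box = init_box
    trajectory = [box]
    for frame in frames:
        candidates = propose_candidates(frame.shape, box)
        if candidates:
            feats = [extract_features(crop(frame, c)) for c in candidates]
            scores = [similarity(model, f) for f in feats]
            best = int(np.argmax(scores))
            # Localization: keep the candidate most similar to the model.
            box = candidates[best]
            # Optional model update; update_rate=0.0 keeps the initial model fixed.
            if update_rate > 0.0:
                model = (1.0 - update_rate) * model + update_rate * feats[best]
        trajectory.append(box)
    return trajectory
```

With `update_rate=0.0` the model built in the first frame is never changed, which corresponds to the trackers mentioned above that skip the model update step.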

1.2 Motivation of this study

The flow chart depicted in Figure 1.1.2 is representative of tracking algorithms. Understanding this diagram helps us to better grasp the idea behind most existing tracking methods and guides us toward ways to make progress in this area. Improving any step represented by a block in Figure 1.1.2 can enhance existing tracking methods. State-of-the-art performance can be achieved by proposing more efficient object initialization, novel object representations, better object state estimation or smarter procedures for model update.

Among all these possible improvements, object representation is considered the most important issue, since a poorly represented object adversely affects the object model, which is updated over time. The better the object representation, the more robust and accurate the resulting object model. Object representation has to be designed to be invariant to appearance change in order to handle all the appearance challenges discussed in Section 1.1.1. Furthermore, it should be able to discriminate the target object (foreground) from the background in background clutter cases. A traditional way to represent an object is to use hand-crafted features, which apply various algorithms to extract information from the image itself based on expert knowledge. This approach played an important role and achieved significant results in early research on visual object tracking. However, with the advent of deep learning, the computer vision and machine learning communities are now equipped with much more powerful tools that have given rise to many breakthroughs in object recognition [14], object detection [15] and object segmentation [16]. Visual object tracking is strongly related to those areas, so applying deep learning to object tracking is a natural step. These days, state-of-the-art trackers are often built upon deep neural networks, and they achieve outstanding results in comparison with traditional approaches. The main reason for this performance is that deep learning techniques are capable of hierarchical representation learning tailored to a certain task or domain, in either a supervised or an unsupervised scheme; therefore, the object representation is, to a certain degree, itself the object modelling process.

The other important component of visual tracking is the target localization process, which estimates the state of the object through subsequent frames. To give an accurate estimate of the state of the target object, a tracker must be able to discriminate the object from the surrounding background. This can be done by training a classifier, online or offline, to predict whether the target is present in an image patch or not. Then, a search process using this classifier is carried out to find the most likely location of the object. Training this classifier successfully requires labeled data of the target object, which is a problem in visual object tracking. Unlike object detection or object classification, there is no prior information about the target other than the image patch provided in the first frame of the video. In traditional trackers using hand-crafted features, object features are extracted only from the initial image patch, which limits the richness of the target representation. In deep learning based trackers, however, deep neural networks can learn those features from massive amounts of data, which enriches the representation of the target whatever the object is, as long as that kind of object exists in the training datasets.

Inspired by these advantages of deep learning, this thesis digs deeper into the family of deep learning based trackers and proposes a novel approach to tackle the visual object tracking problem.

1.3 Contribution of Thesis

This thesis addresses the short-term single object tracking problem. Our main objective is to build a generic tracker that can track any kind of object throughout a whole sequence given only the initial state of the object in the first frame of the video.

The main contributions of this work are as follows:

• We revisit a variety of current tracking approaches to provide detailed information and discussion about how they actually work. An intensive comparison is carried out to highlight the advantages and disadvantages of each tracking method in order to propose possible improvements for this challenging problem.

• A novel deep network architecture, derived from the Siamese network and equipped with several attention mechanisms, is presented for real-time, model-free object tracking. The tracking problem is treated as a similarity learning problem, as proposed in the baseline Siamese network [5]. However, the integration of convolutional block attention modules allows the deep network to capture semantic information from the target and better represent the object of interest, which enhances the accuracy and robustness of the tracker in comparison with its baseline, SiamFC [5].

• Intensive experiments are conducted to evaluate the performance of the tracker on various popular visual object tracking benchmarks. The results demonstrate that our tracking method achieves high accuracy and robustness while maintaining a fast tracking speed that is appropriate for real-world applications. Furthermore, the comparison of results among several training strategies provides experience for training deep neural network models more effectively in the future.

1.4 Outline

The content of the thesis is organized as follows:

• Chapter 2 provides an overview of various topics related to visual object tracking. Many state-of-the-art trackers from both hand-crafted feature and deep tracking approaches are comprehensively discussed and compared to identify the pros and cons of each tracking approach.

• Chapter 3 proposes a novel deep learning architecture, based on the Siamese network, specifically designed for visual object tracking.

• Chapter 4 describes intensive experiments carried out to evaluate the performance of the proposed method on the OTB benchmark dataset [17] and the VOT2018 challenge [18]. At the end of the chapter, we present conclusions and propose potential directions for improving the current work in the future.

1.5 Publications related to this research

The following publication is a by-product of the work in this thesis:

• Dan Anh Do, Giap Nguyen Vu, Minh Quang Bui, Huong Ninh, Hai Tran Tien, "Real-Time Long-Term Tracking with Adaptive Online Searching Model," in ICCTA '20: Proceedings of the 2020 6th International Conference on Computer and Technology Applications [19].


Chapter 2

Literature Review

2.1 Classification of tracking methods

In a research area as broad as tracking, it is not easy to classify existing methods into exact categories, because many tracking methods share the same ideas with various modifications, and some hybrid methods may belong to several categories. However, no matter how complex a tracking method is, it can basically be reduced to the framework described in Section 1.1.2, which means all methods consist of the same building blocks. There are many good review works [20] [21]. From those works, tracking methods can generally be divided into two main categories: point-based and appearance-based tracking methods.

Figure 2.1.1: Two main categories of object tracking methods

Appearance-based tracking

Appearance-based tracking refers to a class of trackers in which the appearance of the target plays a central role. As discussed in Section 1.2, a typical object tracking system consists of four main building blocks: object initialization, object modelling, object state estimation and object localization. In appearance-based tracking methods, the tracker stores a model built upon the appearance of the object of interest, which is continuously updated later on.

In this chapter, we concentrate on appearance-based methods since this is the approach applied in this thesis. As stated above, object trackers in this category form the object model based on the object's appearance. The appearance can be represented by two different approaches, which we refer to as traditional trackers and CNN-based trackers. In traditional trackers, the object of interest is represented by hand-crafted feature extractors, which exploit information solely from the image itself, using various kinds of algorithms to encode the important textures of the target into feature vectors so that redundant data from the image can be removed. Popular algorithms used for this purpose include HOG [10], SIFT [11] and SURF [12]. On the other hand, object features in CNN-based trackers are learned by a deep neural network, empowered by massive amounts of data from a variety of object classes. Most research on the tracking problem can be further sorted into generative, discriminative or hybrid methods. Since hybrid methods basically combine the advantages of the first two approaches, this chapter covers only the main ideas and representative successful object trackers of the generative and discriminative tracking methods. Figure 2.1.2 shows the classification of appearance-based trackers.

Figure 2.1.2: Classification of appearance-based tracking methods

2.2 Generative tracking methods

Generative methods are based on generative models, which use observed data from the target object in the feature space to model the distribution of object appearances by calculating the joint probability of the two subsets containing positive and negative samples of the object of interest.

In other words, a generative appearance model is built for the object and, optionally, for the surrounding area. The tracking problem can then be treated as a similarity learning problem. Once the appearance model is built, search instances consisting of multiple image patches are sampled in the vicinity of the object of interest. The object location is then estimated by picking the image patch that has the highest similarity score with the appearance model. The formulation of the similarity function varies with the particular approach used to construct the appearance model. The generalization capacity of generative models is large, so many studies find it unnecessary to update the model after the object is localized. The model is updated only if specific conditions are met, especially when the target is partially or even fully occluded, which requires the tracker to be reinitialized.
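As a concrete instance of "pick the patch with the highest similarity score", the sketch below uses the raw template pixels as the appearance model and normalized cross-correlation (NCC) as the similarity function. Both are assumptions chosen for illustration, since the text leaves the similarity function open; the names `ncc` and `localize` and the exhaustive square search window are likewise hypothetical.

```python
import numpy as np


def ncc(template, patch):
    """Normalized cross-correlation between two equally sized grayscale patches."""
    t = template.astype(float) - template.mean()
    p = patch.astype(float) - patch.mean()
    denom = np.linalg.norm(t) * np.linalg.norm(p)
    # Flat patches have zero variance; treat them as uncorrelated.
    return float((t * p).sum() / denom) if denom > 0 else 0.0


def localize(frame, template, prev_xy, radius=5):
    """Search the vicinity of prev_xy = (x, y) for the best-matching top-left corner."""
    h, w = template.shape
    frame_h, frame_w = frame.shape
    px, py = prev_xy
    best_score, best_xy = -np.inf, prev_xy
    for y in range(max(0, py - radius), min(frame_h - h, py + radius) + 1):
        for x in range(max(0, px - radius), min(frame_w - w, px + radius) + 1):
            score = ncc(template, frame[y:y + h, x:x + w])
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score
```

Swapping `ncc` for another score, for example a histogram distance or a learned metric, changes the appearance model without changing the search procedure.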

Generative tracking methods can be further divided into traditional and CNN-based object trackers according to the method they use to represent the object.

2.2.1 Traditional object trackers

In traditional object trackers, hand-crafted features, features predefined for a specific object, or even the raw pixel values of images can be used to represent the object of interest. Once the object representation is calculated, the appearance model of the target can be constructed using various types of methods. Popular approaches in generative tracking research include incremental subspace learning, density estimation, sparse representation and kernel-based models. These approaches differ in detail, but all of them can be reduced to the diagram in Figure 2.2.1.

Figure 2.2.1: General framework of traditional generative tracking methods. The method referred to in the image is presented in [1]. Image courtesy of [2].

In [23], the object appearance model is learned online via an incremental subspace learning method within a particle filter framework to handle appearance changes of the target. The study also proposes a model update policy based on incremental algorithms for principal component analysis (PCA) to improve overall tracking performance. In the density estimation approach, the object representation is built from the statistics of the feature space. In [24], target appearance is represented by histogram-based density functions. A mean-shift algorithm specifically designed for the tracking problem is then used to localize the object of interest. The method showed superior performance compared with template matching trackers based on normalized correlation filter, because the statistical appearance model of the target is invariant in challenging scenarios such as out-of-plane rotation or non-rigid motion (which is a serious problem when hand-crafted features such as HOG or Haar-like features are used to represent the target). The Visual Tracking Decomposition tracker [25] uses multiple basic object observation and motion models. In this work, sparse principal component analysis (SPCA) of a set of feature templates is used to decompose the observation model into multiple basic observation models, which are expected to better summarize the object representation.
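To make the idea of a histogram-based appearance model more tangible, the following sketch builds a normalized intensity histogram for a patch and compares two histograms with the Bhattacharyya coefficient. The choice of this coefficient, the grayscale input and the names `intensity_histogram` and `bhattacharyya` are assumptions for illustration; the text does not state which similarity [24] uses, and the mean-shift search itself is not implemented here.

```python
import numpy as np


def intensity_histogram(patch, bins=16):
    """Normalized histogram of 8-bit intensities, used as a simple appearance model."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    total = hist.sum()
    return hist / total if total > 0 else hist.astype(float)


def bhattacharyya(p, q):
    """Bhattacharyya coefficient: 1.0 for identical distributions, 0.0 for disjoint ones."""
    return float(np.sum(np.sqrt(p * q)))
```

Because a histogram discards pixel positions, a rotated or deformed view of the same patch yields nearly the same model, which is the kind of invariance the paragraph above attributes to statistical appearance models.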

The success of the sparse representation approach was first seen in [26], where it was shown to effectively improve the robustness of a face recognition system. Inspired by this pioneering work, Mei et al. [27] intensively investigated its use in the object tracking problem. They formulate an L1-regularized least squares problem over object candidates. The candidate location with the smallest projection error is taken as the position of the target object in the next frame. Another tracking algorithm based on sparse representation, S-MTT, is proposed in [28], where object tracking is formulated as a multi-task learning problem. This tracker models object appearance by embedding the object representation into a particle filter framework. The authors enforce a joint sparsity constraint so that the sparse coefficients and their embedding in the particle representation are learned simultaneously. Consequently, only a few dictionary templates and their linear combination are required to model the appearance of the target object. This helps to avoid bad samples produced by misaligned patches of the target template, thanks to the inconsistency of their sparse representations with the bulk of the target templates. A follow-up to [28] is proposed in [29], in which some strict conditions of [28] are loosened in order to handle outliers of the target appearance model, and the optimization problem takes additional regularization terms into account. Furthermore, the low-rank representation is comprehensively discussed to demonstrate its effectiveness in exploiting temporal information to boost the performance of the tracker. In [30], an alignment-pooling method based on sparse representation is proposed to reduce the possibility of drift and handle partial occlusion by efficiently exploiting the spatial information of the target in order to adjust the target template even when it is occluded. Many further developments in sparsity-based trackers have been presented in the literature and show improvements in accuracy and robustness. However, trackers of this type are by and large computationally expensive, which makes them impractical for real-world applications.
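The L1-regularized least squares idea can be illustrated with a generic lasso problem, min over c of 0.5·||y − Ac||² + λ·||c||₁, where the columns of A are target templates and y is a candidate patch. The sketch below solves it with the iterative shrinkage-thresholding algorithm (ISTA). It is an assumed, simplified setting: the exact formulation and solver of [27] are not spelled out in the text, and the names `soft_threshold`, `ista` and `projection_error` and the symbols `A`, `y`, `c` and `lam` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(A, y, lam=0.01, n_iter=200):
    """Minimize 0.5 * ||y - A c||^2 + lam * ||c||_1 by proximal gradient steps."""
    # The step size is 1/L, where L is the Lipschitz constant of the smooth term.
    lipschitz = np.linalg.norm(A, 2) ** 2
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ c - y)
        c = soft_threshold(c - grad / lipschitz, lam / lipschitz)
    return c


def projection_error(A, y, lam=0.01):
    """Residual norm of a candidate y after sparse coding over the templates A."""
    c = ista(A, y, lam)
    return float(np.linalg.norm(y - A @ c)), c
```

In a tracker of this family, `projection_error` would be evaluated for every candidate patch and the candidate with the smallest error kept. Running an iterative solver per candidate and per frame also hints at why such trackers are computationally expensive.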

2.2.2 CNN-based object trackers

An important aspect of using deep neural networks is that they can address the issue of data scarcity in the visual object tracking problem. In traditional object trackers, no information about the object of interest is provided other than the initial bounding box containing the target, so the object representation and object model are built solely from a single image, which challenges the generalization capability of the object model. On the other hand, deep neural networks can take advantage of large datasets supplying huge amounts of data from a variety of object classes, which grants CNN-based object trackers better generalization capability. As mentioned at the beginning of Section 2.2, visual object tracking can be addressed by solving a similarity learning problem. Given the appearance model of the target extracted from the initial image patch, the similarity function compares the target image patch to a candidate image of the same size and returns a score that reflects the similarity between the objects that appear in the two images. All candidate positions are then tested, and the one with the highest similarity score with respect to the object template is chosen as the location of the target in subsequent frames. Owing to this generalization capability, generative models can easily achieve real-time tracking since online training can be omitted.

Recently, Siamese network architectures [5] [31] [32] [33] [34] have drawn great attention in the research community, since they are widely used in generative CNN-based object trackers that achieve remarkable results on many tracking benchmark datasets. The general framework used in those works is depicted in Figure 2.2.2.

Figure 2.2.2: Representative illustration for Siamese based tracking methods

The pioneering work is the SINT tracker [31], where a Siamese network is used to extract hierarchical deep features of the exemplar image and the instance image respectively. Candidates are randomly sampled in the instance image, and the target patch is found in a template-matching fashion: the initial image of the object of interest is compared with the candidate samples, and the one with the highest similarity score is returned. The accuracy and robustness of the SINT tracker are enhanced by utilizing an optical flow algorithm for better target estimation. Despite achieving impressive accuracy and robustness, the SINT tracker fails to meet the computational constraints of real-time applications, operating at only 4 FPS. Inspired by SINT [31], Bertinetto et al. [5] propose a fully convolutional Siamese network (SiamFC) as the backbone to learn a similarity function end to end. The tracking speed of SiamFC is greatly improved by the cross-correlation operation, which allows the network to be fed a search image much larger than the exemplar one and to compute the similarity at all candidate positions inside that search image in a single operation. The simple CNN architecture, in combination with an offline-learnt similarity metric, results in the remarkable tracking speed of SiamFC. However, since a single CNN is simultaneously used for the object representation and the object model, it can easily overfit. To address the limitations of the SiamFC tracker, CFNet [34] uses a circulant matrix to integrate a correlation filter as a layer, creating an end-to-end learning framework. This allows CFNet to enhance the discriminative capability of the tracker at the cost of limiting its generalization capacity. Another method of incorporating a Siamese network into the object tracking framework is put forward in [33]. The main idea of the proposed tracker is that the tracking problem is reformulated as a decision-making process, in which an agent is learned in a reinforcement learning fashion to decide whether to locate objects with high confidence on an early layer or to continue processing subsequent layers of the network [33]. Furthermore, the proposed tracker adaptively chooses different levels of features for different frames. To be specific, low-level features are used for easy frames, whereas deep features extracted from the learned Siamese network are used for challenging frames. Due to


the adaptability of the tracking process, the average computational cost of the tracker is reduced significantly. Observing that existing Siamese trackers do not fully utilize semantic and objectness information from networks pre-trained on the image classification task, Mohamed H. Abdelpakey et al. [32] addressed this weakness in the DSiam tracker by updating the representation of the target object online using a ridge regression network. Despite integrating a model update module that requires additional computation, the DSiam tracker achieves a real-time speed of 53 FPS on benchmark datasets. However, background clutter, occlusion, and fast motion remain critical issues that hinder the overall performance of this tracker.
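The dense cross-correlation used by SiamFC-style trackers, where one operation scores every candidate position in the search region, can be illustrated as follows. This is a minimal sketch, not the SiamFC implementation: the "features" are random arrays standing in for CNN outputs, and `cross_correlation_map` is a hypothetical helper name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlation_map(exemplar_feat, search_feat):
    """Correlate a (C, h, w) exemplar with every (C, h, w) window of a (C, H, W) search map.

    Returns a score map of shape (H - h + 1, W - w + 1).
    """
    c, h, w = exemplar_feat.shape
    # windows has shape (H - h + 1, W - w + 1, C, h, w)
    windows = sliding_window_view(search_feat, (c, h, w))[0]
    return np.einsum("ijchw,chw->ij", windows, exemplar_feat)

rng = np.random.default_rng(0)
search = rng.standard_normal((4, 20, 20))   # stand-in for search-image features
exemplar = search[:, 6:11, 9:14].copy()     # stand-in for exemplar features
score_map = cross_correlation_map(exemplar, search)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(score_map), score_map.shape))
```

The score here is an unnormalized inner product, so the peak marks the window that best matches the exemplar only when window energies are comparable, as they are in this toy data.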

2.3 Discriminative tracking methods

Discriminative methods rely on learned discriminative models, which aim to estimate the conditional probability that a candidate is the object to be found, given its appearance model in feature space.

The learning process involves training a binary classifier that is responsible for separating object and non-object instances. The input of this classifier is sets of positive and negative samples, whereas its output is confidence scores measuring the objectness of the test instances. More specifically, these sets of images are provided as labels of the object of interest. Before the training process is performed, object features are extracted from the labelled images to represent the target as vectors in a high-dimensional space. Commonly used feature extractors include HOG [10], SIFT [11] and SURF [12], among others. The appearance model of the target object is trained when the object state is initialized, and it is updated on the fly every time a new position of the target is estimated. The location of the object of interest is determined by choosing the candidate patch with the highest confidence score. A typical discriminative tracking framework is illustrated in figure 2.3.1.

Figure 2.3.1: Illustration of a typical discriminative tracking framework. Image courtesy of [2].


2.3.1 Traditional object trackers

As mentioned in section 2.2, traditional object trackers represent object appearance by handcrafted features. Discriminative trackers then use those feature vectors to train various types of classifiers, including support vector machines [35], boosting [36], random forests [37] and correlation filters [38][39].

Boosting-based methods

This type of tracker collects sets of positive and negative samples of the target to train a classifier online. In the first frame of the video, a positive sample of the object is cropped at the position where the initial state of the target is given, and a set of image patches in the surrounding area is treated as negative samples. In the subsequent frames, the classifier produces a score map over the vicinity of the target, in which each value measures the likelihood that the target appears at the corresponding location. The target is then localized by choosing the position with the highest confidence score. Every time a new location is determined, additional positive samples are cropped and added to retrain the classifier, so the size of the positive set increases over time. The quality of the trained classifier depends heavily on the correctness of the training sample labels. Mislabelling can result in tracking failure, since the number of wrong labels increases as the classifier gradually learns the object model online. This issue is referred to as the drifting problem.

To address this problem, an interesting approach was put forward in [40], where transfer learning was used in combination with Gaussian process regression. S. Wu et al. [41] divided all training samples into two subsets and improved the overall tracking performance by exploiting the intrinsic relationship among those samples. Leistner et al. [42] make use of an online version of GradientBoost in which a set of noise-insensitive loss functions is integrated into the online learner to improve the robustness of the tracker.

SVM based methods

SVM-based trackers rely on the SVM classifier, which has already proved successful in other challenging areas of computer vision, including object detection and object classification. It is therefore expected to be effective for the object tracking problem as well. In [35], a kernelized structured output SVM was proposed to incorporate the motion information of the target object into the classifier. Furthermore, a budgeting mechanism was introduced to prevent the number of support vectors from growing excessively during tracking, which makes the proposed tracker practical for real-time applications. Many follow-up studies build on the work in [35]. In [43], generic structured learning was created by combining a set of weak classifiers. Moreover, Rui Yao et al. [44] proposed a part-based, latent structured SVM model to handle the drifting problem. In addition, a dual linear SVM is put forward in [45] to speed up the optimization process, which is the main bottleneck that slows down tracking in its predecessors.

Correlation Filter based methods

The main idea of correlation filter based trackers builds on the template matching problem, rooted in the theory of the matched filter, which


is commonly used in radar applications. This idea encouraged researchers to find ways to make use of correlation filters in the visual object tracking framework. Although many early works showed decent results in simple scenarios, correlation filter based trackers have only drawn great attention from the research community since 2010, owing to the pioneering work of Bolme et al. [46]. There are many discriminative correlation filter (DCF) trackers with different levels of complexity, but all of them can be reduced to the general framework illustrated in figure 2.3.2.

Figure 2.3.2: General framework of correlation filter based tracker Image courtesy

$$w = (X^{T}X + \lambda I)^{-1}X^{T}y \tag{2.2}$$
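Equation 2.2 has the form of the closed-form solution of a ridge regression. The sketch below assumes the corresponding objective is ||Xw − y||² + λ||w||², which is the standard problem with this solution; the exact statement of problem 2.1 is not reproduced in this text, so that form is an assumption, as are the function name and the toy data.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y, i.e. Equation 2.2, without forming an inverse."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))     # 50 training samples, 8 features each
w_true = rng.standard_normal(8)
y = X @ w_true                       # noise-free regression targets
lam = 1e-3
w = ridge_closed_form(X, y, lam)

# Gradient of the assumed objective ||Xw - y||^2 + lam * ||w||^2; it vanishes at the solution.
grad = 2 * X.T @ (X @ w - y) + 2 * lam * w
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a numerical choice made for this sketch; it computes the same w as Equation 2.2.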

Notably, the aforementioned optimization problem and its solution are representative of the correlation filter tracking framework. Many later studies proposed much more sophisticated formulations of the optimization problem to enhance the overall performance. Either way, computing this solution in the spatial domain is time consuming, which hinders the overall tracking speed. Therefore, DCF trackers commonly reformulate the problem in the Fourier domain using the Fast Fourier Transform to reduce the computational cost. Initially, object features are extracted in the first frame of the video. Then, a cosine window is applied to suppress the boundary effect that arises when the Fast Fourier Transform is performed on the feature vectors in the next step. The output is taken as the template used to estimate the position of the target in subsequent frames. For the rest of the video, the image patch at the previously estimated position is cropped to extract new features, which are used to train and update the correlation filter based on the initial labeled data. Subsequently,


correlation operations are performed between the trained filter and the template patch, followed by an inverse Fast Fourier Transform, to generate a response map. The position with the highest value in the response map is chosen to determine the new state of the object of interest. Correlation filter based trackers can generally be divided into two main categories: traditional DCF trackers, which make use of handcrafted features, and modern DCF trackers, which utilize deep features and incorporate deep networks into the tracking framework. This section covers only the traditional DCF trackers; the latter are discussed in the next section.
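The detection step described above (cosine window, transform, correlation, inverse transform, peak search) can be sketched as follows. For simplicity this sketch correlates the windowed template directly with the new patch instead of using a trained filter, and it uses raw intensities of a synthetic Gaussian blob rather than extracted features; all function names are hypothetical.

```python
import numpy as np

def gaussian_blob(shape, center, sigma=3.0):
    """Synthetic image: a single Gaussian bump at the given (row, col) center."""
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    return np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))

def cosine_window(shape):
    """2D Hann window used to damp values near the patch borders."""
    return np.outer(np.hanning(shape[0]), np.hanning(shape[1]))

def detect_shift(template, patch):
    """Return the displacement at which the response map peaks, plus the map itself."""
    win = cosine_window(template.shape)
    T = np.fft.fft2(template * win)
    P = np.fft.fft2(patch * win)
    # Element-wise product in the Fourier domain = circular cross-correlation in space.
    response = np.real(np.fft.ifft2(P * np.conj(T)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return (int(dy), int(dx)), response

template = gaussian_blob((64, 64), (32, 32))   # target in the previous frame
patch = gaussian_blob((64, 64), (35, 37))      # target moved by 3 rows and 5 columns
shift, response = detect_shift(template, patch)
```

Because the window attenuates content away from the patch center, the recovered displacement can be biased slightly toward zero for large motions.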

As stated above, the critical issue when constructing an efficient DCF tracker is how to solve the optimization problem 2.1 without hindering the tracking speed, especially since a DCF tracker, unlike generative trackers, must continuously update the model of the object of interest. Researchers have developed a variety of methods to reduce the computational cost, but they generally boil down to two main ideas. The first is to construct the set of training samples efficiently, so that as many informative samples as possible are collected while remaining compact enough to be solved conveniently when fed into the optimization problem 2.1. The compact feature representation of the training samples is denoted by the matrix X in 2.1. The second idea for easing the computational burden is to transform the optimization problem into the Fourier domain using the Fast Fourier Transform (FFT): heavy convolution and correlation operations in the spatial domain become element-wise multiplications in the frequency domain, which require far fewer operations, so the time needed to solve 2.1 is sharply reduced.
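Both ideas can be checked numerically on a one-dimensional toy example. If the rows of X are all cyclic shifts of a single base sample x, the ridge solution of Equation 2.2 can be computed per frequency with element-wise operations. The sketch below assumes NumPy's DFT convention and the shift direction used in `ridge_dense`; where the complex conjugate appears depends on those conventions, so the formula here may look different from the one in a given paper. The function names are hypothetical.

```python
import numpy as np

def ridge_fft(x, y, lam):
    """Ridge regression over all cyclic shifts of x, solved element-wise in the Fourier domain."""
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    w_hat = x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)
    return np.real(np.fft.ifft(w_hat))

def ridge_dense(x, y, lam):
    """Same problem solved with the explicit n x n data matrix, following Equation 2.2."""
    n = x.size
    X = np.stack([np.roll(x, i) for i in range(n)])   # row i is x cyclically shifted by i
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(1)
x = rng.standard_normal(32)                              # base sample
y = np.exp(-0.5 * (np.arange(32) - 16) ** 2 / 4.0)       # Gaussian-shaped regression target
w_fast = ridge_fft(x, y, lam=0.1)
w_slow = ridge_dense(x, y, lam=0.1)
```

The dense route needs an n x n linear solve, while the Fourier route needs only a few length-n transforms and element-wise operations, which is the source of the speed gain described above.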

The pioneering work that encouraged further development of DCF-based trackers is presented in [46]. In this paper, Bolme et al. [46] proposed training the correlation filter by minimizing the sum of squared errors, which deviates from the common DCF optimization problem 2.1 by ignoring the regularization term. This simple training formulation allows the MOSSE tracker to operate far beyond real-time speed, at 669 FPS, while remaining relatively robust to simple variations in lighting, appearance and motion. Acknowledging the drawbacks of the tracker, the authors put forward a method to verify whether the tracker has lost the target by evaluating the Peak to Sidelobe Ratio [46]. This measurement is helpful for long-term object tracking and has been adopted by later works such as [19]. Following [46], Henriques et al. [47] modified the correlation filter by taking advantage of the kernel trick [48] to enhance the generalization power of the filter. Taking one more step, the Kernelized Correlation Filter described in [38] utilizes the circulant matrix [49] to efficiently assemble the features of the whole training set into a single matrix, which speeds up solving the regression problem even when dealing with multi-channel feature vectors. Research on DCF trackers is not limited to improving the efficiency of training the correlation filter. Several other studies propose better heuristic procedures for the general tracking framework, including using a subspace learning method [50] to handle occlusion and finding the best patches to track [51]. However, in the aforementioned methods a region surrounding the object of interest is cropped as the input, which limits the structural information that can be exploited from the target. Additionally, the use of the circulant matrix assumes that all training samples are cyclic, which causes the boundary effect when the model is converted from the spatial domain to the frequency domain. Therefore, many follow-up studies attempted to


mitigate this unwanted problem.
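The Peak to Sidelobe Ratio check mentioned above in connection with [46] can be sketched as follows. The ratio compares the response peak with the mean and standard deviation of the remaining response values (the sidelobe). The 11 x 11 exclusion window around the peak and the function name are assumptions made for this illustration.

```python
import numpy as np

def peak_to_sidelobe_ratio(response, exclude=5):
    """PSR = (peak - mean(sidelobe)) / std(sidelobe).

    The sidelobe is every pixel outside a (2 * exclude + 1)-wide square around the peak.
    """
    py, px = np.unravel_index(np.argmax(response), response.shape)
    peak = response[py, px]
    mask = np.ones(response.shape, dtype=bool)
    mask[max(py - exclude, 0):py + exclude + 1,
         max(px - exclude, 0):px + exclude + 1] = False
    sidelobe = response[mask]
    return float((peak - sidelobe.mean()) / (sidelobe.std() + 1e-12))

rng = np.random.default_rng(0)
ys, xs = np.mgrid[:64, :64]
blob = np.exp(-((ys - 30) ** 2 + (xs - 40) ** 2) / (2 * 2.0 ** 2))
noise = 0.05 * rng.standard_normal((64, 64))

psr_confident = peak_to_sidelobe_ratio(blob + noise)   # sharp peak: target visible
psr_lost = peak_to_sidelobe_ratio(noise)               # no clear peak: target likely lost
```

A tracker can compare the PSR against a threshold and skip the model update, or trigger re-detection, when the value is low; the threshold itself has to be tuned per tracker.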

In [52], Danelljan et al. added a spatial regularization term to the optimization problem 2.1 as a penalty while training the correlation filter, in order to suppress the unwanted boundary effect. It permits the filter to learn from a bigger set of negative patches without adversely affecting the positive ones. A new training strategy is adopted in which the Gauss-Seidel algorithm is used to iteratively solve the optimization problem. Although the accuracy and robustness of the tracker are improved, this comes at the cost of a lower tracking speed compared with predecessors such as KCF [38] and MOSSE [46], since the solution to the optimization problem in these older works is available in closed form, which makes them faster when operating online. Inspired by the SRDCF tracker, Feng Li et al. introduced the STRCF tracker in [53] as an upgraded version of SRDCF. In this work, the regularization accounts not only for spatial information but also for temporal information. The combination of both regularization terms greatly improves the accuracy without sacrificing speed. An interesting approach for DCF trackers was presented in [54], where the authors emphasize modelling the background to better discriminate it from the foreground. However, the BACF tracker has two critical limitations. The first concerns the potential for extending the model, since BACF adopts the augmented Lagrangian method [55] to formulate the optimization problem. The second is that BACF is only able to suppress the background outside the region of interest. If the target patch is contaminated by too much background, the tracker performance suffers. To address this issue, Chong Sun et al. [56] suggested training the correlation filter based on the discrimination and reliability models jointly. In [56], a local response consistency regularization term is introduced to penalize the influence of different regions, so that the tracker avoids being contaminated by non-informative regions.

Another popular line of work in the DCF tracker family is to efficiently estimate the scale variation of the target during tracking. Representative trackers of this type are the SAMF tracker [57] and the DSST tracker [58]. In [58], a scale estimation policy is proposed in which the tracker looks for the target at multiple scales, and the search patch with the highest confidence score, together with its scale, is chosen as the target location in the next frame. Although this strategy increases scale adaptability, its exhaustive search is time consuming, which degrades the speed of the DSST tracker. As an improvement on DSST, the MKCF tracker [59] combines the bisection search algorithm with fast evaluation of features to speed up the search in scale space. A better version of the MKCF tracker is presented in [60], where the scale of the target is updated by maximizing the posterior probability of a set of scales. Another robust scale estimation policy was proposed in [61]. In this work, the context information of a number of consecutive frames is averaged to estimate the scale of the target. However, this approach only works for isometric scale variation and fails to handle changes in the aspect ratio of the target. To overcome this limitation, Feng Li et al. [62] suggested using a family of 1D boundary correlation filters to determine the boundaries of the target, so that aspect ratio adaptability is enhanced.
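The exhaustive multi-scale search described above can be sketched as a loop over a geometric set of scale factors. The number of scales, the step of 1.05, the helper names and the toy scoring function are all assumptions for illustration; a real tracker would score each rescaled patch with its trained filter.

```python
import numpy as np

def scale_factors(num_scales=7, step=1.05):
    """Geometric scale set centered on 1.0, e.g. step**-3 ... step**3 for 7 scales."""
    exponents = np.arange(num_scales) - (num_scales - 1) / 2
    return step ** exponents

def search_scale(score_fn, base_size, factors):
    """Evaluate score_fn(height, width) at every scale and return the best factor."""
    scores = np.array([score_fn(base_size[0] * s, base_size[1] * s) for s in factors])
    k = int(np.argmax(scores))
    return float(factors[k]), scores

def toy_score(h, w, true_size=(44.0, 22.0)):
    """Hypothetical confidence: highest when the candidate size matches the true size."""
    return -abs(h - true_size[0]) - abs(w - true_size[1])

factors = scale_factors()
best_factor, scores = search_scale(toy_score, (40.0, 20.0), factors)
```

Every extra scale costs one more evaluation per frame, which is why the exhaustive search slows the tracker down. Because a single factor multiplies both height and width, this search cannot follow a change in aspect ratio.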

In the trackers reviewed so far, the appearance model is built solely from the spatial information of the target patch. Several works incorporate multiple cues to better represent the object model. Many researchers make use of color cues to improve the object representation. In [63], color information of


the target is utilized to discriminate different instances of the same object class, whereas in [64], color channels contribute to the construction of multi-dimensional features. DCF trackers are also effective for long-term tracking if equipped with an efficient heuristic strategy. In [65], a long-term memory system is integrated to deal with full occlusion. Attention mechanisms have also proved effective in the DCF tracker framework, as described in [66].

2.3.2 CNN-based object trackers

With the advent of deep convolutional neural networks, researchers are now equipped with more powerful tools to extract features from the object of interest, since CNN models are pretrained on large datasets containing thousands of object classes commonly seen in nature. This alleviates the data scarcity issue in object tracking, in contrast to traditional DCF trackers, where the object of interest is modelled from the target patches collected during tracking alone. CNNs are used in two different types of CNN-based trackers: as a replacement for handcrafted feature extractors to represent the target, or as the backbone of a whole deep tracking framework.

CNN as feature extractor

Conventional DCF trackers achieve better performance when using a CNN as the feature extractor. For example, a deep-network version of the SRDCF tracker [52] is presented in [67], where features extracted from shallow layers of the deep network are used to train the baseline SRDCF tracking framework.

It is widely reported in the literature that a convolutional neural network produces different types of feature descriptions at different depths of the network. Therefore, several works exploit these hierarchical feature responses by taking feature maps from different layers of a pretrained convolutional neural network. In [68], features extracted from the first and last layers of the network are used to capture the spatial and semantic information of the object of interest, respectively. Qi et al. [69] utilize features from a deep network to train a set of weak trackers; an adaptive Hedge algorithm is then used to combine those weak trackers into a stronger one. In [70], features from different layers of the VGG network are fed into two subnetworks that capture category and target information separately. The information from these two networks is then combined to discriminate the target from the background. Although these trackers can take advantage of pretrained deep models to enrich the target representation, this also poses a critical limitation when they are incorporated into the discriminative correlation filter tracking framework. In visual object tracking, the object to be tracked is arbitrary, so its training samples might be absent from or insufficient in the training datasets of the pretrained model. In the three aforementioned works, the correlation filter is learned separately from feature extraction. Therefore, the features extracted from the deep network may represent the target object poorly, which makes the trained correlation filter unable to produce an accurate confidence response map. This degrades tracker performance on some sequences.

To address this problem, Yibing Song et al. [71] integrate model update, feature extraction and response map generation into a single convolutional neural network that is trained in an end-to-end fashion. Furthermore, residual learning is adopted to prevent model degradation while updating online. Recently, a new approach for DCF-based trackers presented by Danelljan et al. [72] has drawn great attention


from the research community. The authors argued that the formulation of conventional DCF trackers is restricted to single-resolution feature maps, which is the main factor preventing DCF trackers from achieving better performance. In [72], a novel DCF formulation is proposed that solves the optimization problem in the continuous spatial domain, employing an implicit interpolation model to remove this restriction. As multi-resolution deep feature maps are integrated into the framework, the CCOT tracker achieved remarkable performance on several benchmark datasets. However, since the number of parameters in the model increases drastically, it is prone to overfitting, because the training data in object tracking is often insufficient. Recognizing the drawbacks of CCOT, Danelljan et al. proposed the ECO tracker [73], which not only reduces the model size of the DCF tracker by using a factorized convolution operator but also presents an efficient model update strategy that simultaneously enhances the robustness and the tracking speed of the tracker.

Since the deep network is trained on a huge amount of data, it grants trackers that use it as a feature extractor a decent generalization capacity compared with handcrafted features, which can only encode features from the target patch itself. This improves tracker performance because it enriches the feature representation of the object of interest. However, extracting features from a deep network is computationally expensive due to the huge number of parameters in the convolutional neural network. Therefore, trackers using deep features are generally slower than the same trackers using handcrafted features. This must be taken into account when choosing which to use in real-world applications.

Deep Tracking Networks for object tracking

This type of tracker tackles the visual object tracking problem by formulating it as a convolutional neural network that can be trained in an end-to-end manner. These trackers can be further divided into two groups. The first uses the network as a discriminative classifier, so that tracking is viewed as an object classification problem in which objects are classified as either target or non-target. The second group views tracking as a similarity learning problem, which has already been reviewed in section 2.2.2. Therefore, this section covers only trackers that follow the first strategy.

End-to-end convolutional neural network models trained on large datasets have shown great performance in object classification. However, the networks used for object classification cannot be applied directly to object tracking, because of a fundamental difference between the two problems. Classification tries to predict the class label of an object without regard to the differences among individuals within a class, which requires the model to generalize, whereas object tracking attempts to locate a specific instance of a class. Therefore, a deep model for tracking must be specifically designed to have enough discriminative capability to distinguish the target even when many instances of the same class are nearby. Many deep classification trackers have been proposed with different levels of complexity, but they all follow the pattern described in figure 2.3.3.


Figure 2.3.3: The general framework of CNN-based classification trackers. Image courtesy of [4].

As shown in figure 2.3.3, the object tracking problem can be solved by training a binary classifier to assign an object to either the target or the non-target class. This is the same as in a conventional DCF tracker, except that the trained model is a CNN rather than a correlation filter. Deep classification trackers can therefore be seen as combining the rich generalization capacity of deep neural networks with the discriminative ability demonstrated by DCF trackers.

In the initial stage of the framework, the baseline network architecture is designed for object tracking. Commonly, a simplified version of a CNN architecture used in object classification is adopted, since binary classification is not as hard as multi-class classification problems such as the ImageNet dataset with 1000 classes. Additionally, since heavy CNN models require a larger amount of training samples, fine-tuning them online with limited resources would result in severe overfitting.

Once the backbone of the network is determined, we enter the first stage of deep classification tracking. As illustrated in the first row of figure 2.3.3, the network is trained offline on several tracking datasets; the more objects the datasets contain, the better. In each video sequence, positive samples are collected in the neighborhood of the target and negative samples are randomly sampled from the background. The training process aims to learn the patterns of different tracking scenarios applied to the object of interest, including deformation and scale variation.
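One common way to realize this sampling is to label randomly perturbed boxes by their overlap with the ground-truth box. The sketch below is an assumption-laden illustration: the intersection-over-union criterion, the thresholds 0.7 and 0.3, the Gaussian perturbation and the function names are choices made here and are not taken from the text above.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union

def sample_training_boxes(target, n, rng, pos_thr=0.7, neg_thr=0.3, spread=0.5):
    """Draw n shifted copies of the target box and split them by overlap with the target."""
    x, y, w, h = target
    positives, negatives = [], []
    for _ in range(n):
        dx = rng.normal(0.0, spread * w)
        dy = rng.normal(0.0, spread * h)
        box = (x + dx, y + dy, w, h)
        overlap = iou(box, target)
        if overlap >= pos_thr:
            positives.append(box)      # close to the target: positive sample
        elif overlap <= neg_thr:
            negatives.append(box)      # mostly background: negative sample
        # boxes with intermediate overlap are ambiguous and are discarded
    return positives, negatives

rng = np.random.default_rng(0)
target_box = (50.0, 40.0, 20.0, 30.0)
positives, negatives = sample_training_boxes(target_box, 500, rng)
```

Discarding the boxes with intermediate overlap keeps ambiguous examples out of both sets, at the price of drawing more samples than are finally used.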

In the second stage of the framework, depicted in the second row of figure 2.3.3, the pretrained backbone network is fine-tuned online. This resembles the training process of the correlation filter in DCF trackers. The early layers of the network are often frozen during training to preserve low-level patterns of the target, while the last few layers are left trainable in order to adapt to the variation of the target while tracking.

In the last stage of the process, shown in the third row of figure 2.3.3, candidate patches are cropped around the neighborhood of the previous location of the target to


evaluate the likelihood that the target appears in those patches. The patch with the highest confidence score is picked to determine the location of the target. The candidate patches also serve as additional training samples, so that the network can be retrained in the next frame.

Many successful trackers have been developed that demonstrate the effectiveness of the aforementioned framework. In [74], Li et al. proposed a method resembling the one-shot learning Siamese network used in face recognition, where the network is trained such that, in the search area, a positive sample is closer to the target template than the background samples are. In [75], Nam and Han proposed a novel deep architecture composed of shared layers and multiple branches, each responsible for learning target information in a specific domain, to obtain a decent generalization capacity of the object representation. A similar idea was presented by Han et al. in [76], where the CNN contains multiple branches of fully connected layers that aim to learn a multi-level feature of the object of interest. The SANet tracker presented by Fan and Ling in [77] uses a Recurrent Neural Network to modify the network utilized in MDNet, increasing the discriminative ability of the tracker and thereby its robustness in challenging scenes. In [78], a deep reinforcement learning method is proposed to help the tracker recognize the action of the object of interest in order to adapt to scale variation and appearance change.

Although deep classification trackers achieve outstanding accuracy because they are trained to learn domain-specific discriminative features, this achievement comes at the expense of tracking speed. These trackers must continuously update their model throughout the sequence to adapt to the appearance variation of the target. The computationally expensive nature of this approach results in speeds of less than 1 FPS in most of these works. Consequently, these trackers are impractical for time-constrained applications.

2.4 Chapter summary

In this chapter, a wide range of visual object tracking methods has been reviewed to analyze the strengths and weaknesses of each tracking approach. Based on the analysis of previous research, the appearance model of the target plays a significant role in boosting tracker performance. Much effort has been dedicated to building different kinds of frameworks for constructing a robust appearance model that can handle the challenging scenarios described in 1.1.1. In the last decade, however, deep learning has become the dominant trend. The use of deep neural networks in the object tracking framework has given rise to unprecedented accuracy and robustness while maintaining a relatively fast tracking speed suitable for real-time applications. There is a big gap between the performance of trackers based on deep features and that of their counterparts based on traditional approaches. Inspired by the remarkable progress of deep learning in a wide range of areas beyond object tracking, more and more published research exploits the power of deep learning. In visual object tracking, CNN-based methods have partially replaced handcrafted features for representing the object of interest. This is because more and more datasets specifically designed for visual object tracking have been published, in addition to those from other areas such as object detection and object classification, which can also be used for object tracking. These massive


Title	Moving Object Tracking Using Fully Convolutional Neural Network
Author	Quang Minh Bui
Supervisors	PhD. Tran Thi Thao, Dr. Pham Van Truong
University	Hanoi University of Science and Technology
Major	Control Engineering and Automation
Type	master thesis
Year of publication	2021
City	Hanoi
