Doctoral thesis: person re-identification in an automated surveillance camera system


MINISTRY OF EDUCATION AND TRAINING

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

NGUYEN THUY BINH


PERSON RE-IDENTIFICATION

IN A SURVEILLANCE CAMERA NETWORK

Major: Electronics Engineering Code: 9520203


DECLARATION OF AUTHORSHIP

I, Nguyen Thuy Binh, declare that the thesis titled "Person re-identification in a surveillance camera network" has been entirely composed by myself. I assure some points as follows:

This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.

The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.

Appropriate acknowledgement has been given within this thesis where reference has been made to the published work of others.

The thesis submitted is my own, except where work done in collaboration has been included. The collaborative contributions have been clearly indicated.

Hanoi, 24/11/2020
PhD Student

SUPERVISORS


ACKNOWLEDGEMENT

This dissertation was written during my doctoral course at the School of Electronics and Telecommunications (SET) and the International Research Institute of Multimedia, Information, Communication and Applications (MICA), Hanoi University of Science and Technology (HUST). I am grateful to all the people who have always supported and encouraged me in completing this study.

First, I would like to express my sincere gratitude to my advisors, Assoc. Prof. Pham Ngoc Nam and Assoc. Prof. Le Thi Lan, for their effective guidance, their patience, continuous support and encouragement, and their immense knowledge.

I would like to express my gratitude to Dr. Vo Le Cuong and Dr. Ha Thi Thu Lan for their help. I would like to thank all members of SET, MICA and HUST, as well as all of my colleagues in the Faculty of Electrical-Electronic Engineering, University of Transport and Communications (UTC). They have always helped me in the research process and given helpful advice for me to overcome my own difficulties. Moreover, attending scientific conferences has always been a great experience for me, and I received many useful comments there.

During my PhD course, I have received much support from the Management Board of the School of Electronics and Telecommunications, MICA Institute, and the Faculty of Electrical-Electronic Engineering. My sincere thanks go to Assoc. Prof. Nguyen Huu Thanh, Dr. Nguyen Viet Son and Assoc. Prof. Nguyen Thanh Hai, who gave me a lot of support and help. Without their precious support, it would have been impossible to conduct this research. Thanks to my employer, UTC, for all the necessary support and encouragement during my PhD journey. I am also grateful to Vietnam's Program 911 and to HUST and UTC projects for their generous financial support. Special thanks to my family and relatives, particularly my beloved husband and our children, for their never-ending support and sacrifice.

Hanoi, 2020
Ph.D. Student


CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION

CHAPTER 1 LITERATURE REVIEW
1.1 Person ReID classifications
1.1.1 Single-shot versus Multi-shot
1.1.2 Closed-set versus Open-set person ReID
1.1.3 Supervised and unsupervised person ReID
1.2 Datasets and evaluation metrics
1.2.1 Datasets
1.2.2 Evaluation metrics
1.3 Feature extraction
1.3.1 Hand-designed features
1.3.2 Deep-learned features
1.4 Metric learning and person matching
1.4.1 Metric learning
1.4.2 Person matching
1.5 Fusion schemes for person ReID
1.6 Representative frame selection
1.7 Fully automated person ReID systems
1.8 Research on person ReID in Vietnam

CHAPTER 2 MULTI-SHOT PERSON RE-ID THROUGH REPRESENTATIVE FRAMES SELECTION AND TEMPORAL FEATURE POOLING
2.1 Introduction
2.2 Proposed method
2.2.1 Overall framework
2.2.2 Representative image selection
2.2.3 Image-level feature extraction
2.2.4 Temporal feature pooling
2.2.5 Person matching
2.3 Experimental results
2.3.1 Evaluation of representative frame extraction and temporal feature pooling schemes
2.3.2 Quantitative evaluation of the trade-off between the accuracy and computational time
2.3.3 Comparison with state-of-the-art methods
2.4 Conclusions and Future work

CHAPTER 3 PERSON RE-ID PERFORMANCE IMPROVEMENT BASED ON FUSION SCHEMES
3.1 Introduction
3.2 Fusion schemes for the first setting of person ReID
3.2.1 Image-to-images person ReID
3.2.2 Images-to-images person ReID
3.2.3 Obtained results on the first setting
3.3 Fusion schemes for the second setting of person ReID
3.3.1 The proposed method
3.3.2 Obtained results on the second setting
3.4 Conclusions

CHAPTER 4 QUANTITATIVE EVALUATION OF AN END-TO-END PERSON REID PIPELINE
4.1 Introduction
4.2 An end-to-end person ReID pipeline
4.2.1 Pedestrian detection
4.2.2 Pedestrian tracking
4.2.3 Person ReID
4.3 GOG descriptor re-implementation
4.3.1 Comparison the performance of two implementations
4.3.2 Analyze the effect of GOG parameters
4.4 Evaluation performance of an end-to-end person ReID pipeline
4.4.1 The effect of human detection and segmentation on person ReID in single-shot scenario
4.4.2 The effect of human detection and segmentation on person ReID in multi-shot scenario
4.5 Conclusions and Future work

Bibliography

No.  Abbreviation    Meaning
2    AIT             Austrian Institute of Technology
3    AMOC            Accumulative Motion Context
6    CIE             The International Commission on Illumination
11   CVPDL           Cross-view Projective Dictionary Learning
12   CVPR            Conference on Computer Vision and Pattern Recognition
13   DDLM            Discriminative Dictionary Learning Method
15   DeepSORT        Deep learning Simple Online and Realtime Tracking
19   ECCV            European Conference on Computer Vision
20   FAST 3D         Fast Adaptive Spatio-Temporal 3D
23   FPNN            Filter Pairing Neural Network
27   HUST            Hanoi University of Science and Technology
28   IBP             Indian Buffet Process
29   ICCV            International Conference on Computer Vision
30   ICIP            International Conference on Image Processing
31   IDE             ID-Discriminative Embedding
32   iLIDS-VID       Imagery Library for Intelligent Detection Systems
33   ILSVRC          ImageNet Large Scale Visual Recognition Challenge
35   KCF             Kernelized Correlation Filter
37   KISSME          Keep It Simple and Straightforward MEtric
39   KXQDA           Kernel Cross-view Quadratic Discriminant Analysis
40   LADF            Locally-Adaptive Decision Functions
43   LDFV            Local Descriptors encoded by Fisher Vector
44   LMNN            Large Margin Nearest Neighbor
45   LMNN-R          Large Margin Nearest Neighbor with Rejection
46   LOMO            LOcal Maximal Occurrence
48   LSTMC           Long Short-Term Memory network with a Coupled gate
50   MAPR            Multimedia Analysis and Pattern Recognition
51   Mask R-CNN      Mask Region with CNN
54   MCML            Maximally Collapsing Metric Learning
55   MGCAM           Mask-Guided Contrastive Attention Model
57   MLAPG           Metric Learning by Accelerated Proximal Gradient
62   MTMCT           Multi-Target Multi-Camera Tracking
63   Person ReID     Person Re-Identification
64   Pedparsing      Pedestrian Parsing
66   PRW             Person Re-identification in the Wild
67   QDA             Quadratic Discriminant Analysis
68   RAiD            Re-Identification Across indoor-outdoor Dataset
71   RHSP            Recurrent High-Structured Patches
72   RKHS            Reproducing Kernel Hilbert Space
75   SDALF           Symmetry Driven Accumulation of Local Features
76   SCNCD           Salient Color Names based Color Descriptor
78   SIFT            Scale-Invariant Feature Transform
79   SILTP           Scale Invariant Local Ternary Pattern
82   SORT            Simple Online and Realtime Tracking
85   TAPR            Temporally Aligned Pooling Representation
86   TAUDL           Tracklet Association Unsupervised Deep Learning
87   TCSVT           Transactions on Circuits and Systems for Video Technology
88   TII             Transactions on Industrial Informatics
89   TPAMI           Transactions on Pattern Analysis and Machine Intelligence
91   Two-stream MR   Two-stream Multirate Recurrent Neural Network
92   UIT             University of Information Technology
93   UTAL            Unsupervised Tracklet Association Learning
94   VIPeR           View-point Invariant Pedestrian Recognition
95   VNU-HCM         Vietnam National University - Ho Chi Minh City
97   WHOS            Weighted Histograms of Overlapping Stripes
99   XQDA            Cross-view Quadratic Discriminant Analysis

LIST OF TABLES

1.1 Benchmark datasets used in the thesis.
2.1 The matching rates (%) when applying different pooling methods on different color spaces in case of using four key frames on PRID 2011 dataset. The two best results for each case are in bold.
2.2 The matching rates (%) when applying different pooling methods on different color spaces in case of using frames within a walking cycle on PRID 2011 dataset. The two best results for each case are in bold.
2.3 The matching rates (%) when applying different pooling methods on different color spaces in case of using all frames on PRID 2011 dataset. The two best results for each case are in bold.
2.4 The matching rates (%) when applying different pooling methods on different color spaces in case of using four key frames on iLIDS-VID dataset. The two best results for each case are in bold.
2.5 The matching rates (%) when applying different pooling methods on different color spaces in case of using frames within a walking cycle on iLIDS-VID dataset. The two best results for each case are in bold.
2.6 The matching rates (%) when applying different pooling methods on different color spaces in case of using all frames on iLIDS-VID dataset. The two best results for each case are in bold.
2.7 Matching rates (%) in several important ranks when using four key frames, four random frames, and one random frame in PRID-2011 and iLIDS-VID datasets.
2.8 Comparison of the three representative frame selection schemes in term of accuracy at rank-1, computational time, and memory requirement on PRID 2011 dataset.
2.9 Comparison between the proposed method and existing works on PRID 2011 and iLIDS-VID datasets. Two best results are in bold.
3.1 Matching rates (%) in case of images-to-images on CAVIAR4REID (case B).
3.2 Matching rates (%) in case of images-to-images person ReID on the RAiD dataset.
3.3 Comparison the best matching rates at rank-1 in image-to-images case and those of in images-to-images one.
3.4 Comparison of images-to-images and image-to-images schemes at rank-1. (*) means the obtained results by applying the proposed strategies over 10 random trials in case A of CAVIAR4REID.
3.5 Comparison between the proposed method and existing works on PRID 2011 and iLIDS-VID datasets. Two best results are in bold.
4.1 Comparison of the proposed method with state of the art methods for PRID 2011 (the two best results are in bold).

LIST OF FIGURES

1 The ranked list of gallery person corresponding to the given query based on the similarities between the query and each of gallery ones.
2 An example for challenges caused by variations in a) illumination b) view-point.
3 A person has multiple images captured in different camera-views.
4 A fully-automatic person ReID system consisting of three main stages: human detection, tracking and re-identification.
1.1 Some important milestones for person ReID problem [8]. Several approaches related to this thesis are bounded by red blocks.
1.2 An example for a) single-shot (image-based) and b) multi-shot person (video-based) ReID approaches.
1.3 The differences between a) Closed-set and b) Open-set person ReID. In closed-set person ReID, an individual appears on at least two camera-views. Inversely, in open-set person ReID, a pedestrian might appear on only one camera-view.
1.4 Two popular settings for person ReID problem: a) The testing persons have appeared in the training set (represented by the same colors) b) Persons in the training and testing sets are absolutely different.
1.5 Camera layout for PRID-2011 dataset [33].
1.6 iLIDS-VID is captured by five non-overlapping cameras [36].
1.7 Some images of five datasets used for this thesis a) VIPeR b) CAVIAR4REID c) RAiD d) PRID-2011 e) iLIDS-VID. For the first three datasets (VIPeR, CAVIAR4REID, and RAiD), images in the same column belong to the same person while for the last two datasets, images in the same row represent for the same person.
1.8 An example of CMC curves obtained with two methods: Method #1 and Method #2.
1.9 Proposed framework using Siamese Convolutional Neural Network (SCNN) [54] a) Overall framework b) structure of a typical SCNN.
1.10 Structure of a) an inception block b) a typical GoogLeNet [62].
1.11 Structure of a) an inception block b) ResNet-50 [64].
1.12 ResNet architecture [64].
1.13 Example for metric learning.
1.14 Different strategies for a) early fusion b) late fusion.
2.1 The proposed framework consists of four main steps: representative image selection, image-level feature extraction, temporal feature pooling and person matching.
2.2 An example for a normal walking cycle of a pedestrian.
2.3 An example for motion of pixels in the two subsequent frames.
2.4 An example for computed Vx and Vy values on every frames in a given sequence of images. The blue and red dots present minimum and maximum values in Vx and Vy, respectively.
2.5 Representative frame selection. The first row describes an image sequence of a person, the second row indicates the related original FEP (blue curve) and the regulated FEP (red curve). A walking cycle and four key frames extracted from this cycle are shown in the third and the last rows, respectively.
2.6 An example for Gaussian filter with = 0.
2.7 a) Random walking cycles of some person in PRID-2011 datasets and b) Four key frames in a walking cycle.
2.8 (a) A person image is divided into patches and regions; (b) Pipeline for GOG feature extraction [49].
2.9 RGB color space.
2.10 CIE L*a*b* color space [127].
2.11 HSV color space [129].
2.12 Three different feature pooling techniques for person representation.
2.13 Distribution of I and and E in one projected dimension [45].
2.14 Evaluation the performance of GOG features on a) PRID 2011 dataset and b) iLIDS-VID dataset with three different representative frame selection scheme.
2.15 Matching rates when selecting 4 key frames or 4 random frames for person representation in a) PRID-2011 and b) iLIDS-VID.
2.16 The distribution of frames for each person in PRID 2011 dataset with a) camera A view and b) camera B view.
3.1 Image-to-images person ReID scheme.
3.2 Extracting KDES feature (best viewed in color).
3.3 An example for the effectiveness of GOG and ResNet features on different query persons.
3.4 Proposed framework for images-to-images person ReID without temporal linking requirement.
3.5 Evaluation the performance of three chosen features (GOG, KDES, CNN) over 10 trials on (a) CAVIAR4REID-case A (b) CAVIAR4REID-case B (c) RAiD datasets in image-to-images case.
3.6 Comparison the performance of the three fusion schemes when using two or three features over 10 trials on (a) CAVIAR4REID-case A (b) CAVIAR4REID-case B (c) RAiD datasets in image-to-images case.
3.7 CMC curves in case A of images-to-images person ReID on the CAVIAR4REID dataset.
3.8 An example result of SvsM and MvsM scenarios. Each row in SvsM scenario are the first five ranked persons for each query image obtained by using image-to-images scheme on three features. Person in red box is the true matched person.
3.9 The proposed method for video-based person ReID by combining the fusion scheme with metric learning technique.
3.10 Matching rates with different fusion schemes on PRID-2011 dataset with a) four key frames b) frames within a walking cycle c) all frames.
3.11 Matching rates with different fusion schemes on iLIDS-VID dataset a) four key frames b) frames within a walking cycle c) all frames.
3.12 Average weights for GOG and ResNet features on a random split in a) PRID-2011 and b) iLIDS-VID.
4.1 A fully person ReID pipeline including person detection, segmentation, tracking and person ReID steps.
4.2 An example for automatic person detection and segmentation results on PRID 2011 dataset.
4.3 An overview of ACF detector [109].
4.4 Fast feature pyramid in ACF detector [109].
4.5 a) An input image is divided in 7 × 7 grid cell b) The architecture of an YOLO detector [152].
4.6 The architecture of a) Faster R-CNN [150] and b) Mask R-CNN [111].
4.7 DDN architecture for Pedestrian Parsing [112].
4.8 a) ReID accuracy of the source code provided in [49] and that of the re-implementation and b) computation time (in s) for each step in extracting GOG feature on an image in C++.
4.9 The matching rates at rank-1 with different number of regions (N).
4.10 CMC curves on VIPeR dataset when extracting GOG features with the optimal parameters.
4.11 CMC curves of three evaluated scenarios on VIPeR dataset when applying the method proposed in Chapter 2.
4.12 Examples for results of a) segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, automatically segmented images of two different persons in VIPeR dataset.
4.13 CMC curves of three evaluated scenarios on PRID 2011 dataset in single-shot approach (a) Without segmentation and (b) with segmentation.
4.14 CMC curves of three evaluated scenarios on PRID 2011 dataset when applying the proposed method in Chapter 2.
4.15 Examples for results of a) human detection and segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, automatically segmented images of two different persons in PRID-2011 dataset.

INTRODUCTION

Motivation

Person ReID is the task of associating cross-view images of the same person as he/she moves through a non-overlapping camera network [1]. In recent years, along with the development of surveillance camera systems, person re-identification (ReID) has increasingly attracted the attention of the computer vision and pattern recognition communities because of its promising applications in many areas, such as public safety and security, human-robot interaction, and person retrieval. In its early years, person ReID was considered a sub-task of Multi-Camera Tracking (MCT) [2]. The purpose of MCT is to generate tracklets in every single field of view (FoV) and then associate the tracklets that belong to the same pedestrian in different FoVs. In 2006, Gheissari et al. [3] were the first to consider person ReID as an independent task. In certain respects, person ReID and Multi-Target Multi-Camera Tracking (MTMCT) are close to each other. However, the two problems are fundamentally different in terms of objective and evaluation metrics. The objective of MTMCT is to determine the position of each pedestrian over time from video streams taken by different cameras, whereas person ReID tries to answer the question: "Which gallery images belong to a certain probe person?" It returns a list of the gallery persons sorted in descending order of similarity to the given query person.

Whereas MTMCT classifies a pair of images as co-identical or not, person ReID ranks the gallery persons with respect to the given query person. Therefore, their performance is evaluated by different metrics: classification error rates for MTMCT and ranking performance for ReID. It is worth noting that in the case of an overlapping camera network, the corresponding images of the same person can be found through data association; this can be considered a person tracking problem, which is outside the scope of this thesis. In the last decade, through unremitting efforts, person ReID has reached numerous important milestones with many strong results [4, 5, 6, 7, 8]; however, it is still a challenging task that confronts various difficulties. These difficulties and challenges are presented in a later section. First of all, the mathematical formulation of person ReID is given as follows.

Problem formulation

In person ReID, the dataset is divided into two sets: probe and gallery. Note that the probe and gallery sets are captured in at least two non-overlapping camera fields of view. Consider a query person $Q_i$ and $N$ persons in the gallery $G_j$, where $j = 1, \dots, N$; $Q_i$ and each $G_j$ are represented by the images available for that person.

Depending on the number of images used for person representation, person ReID can be categorized into single-shot, where one sole image is used, or multi-shot, where several images are available.

The identity of the given query person $Q_i$ is determined as follows [9]:

$$j^{*} = \arg\min_{j} d(Q_i, G_j), \tag{0.2}$$

where $d(Q_i, G_j)$ is defined as the distance between the given query person $Q_i$ and a gallery person $G_j$. This distance can be calculated directly or learned through a metric learning method. It is worth noting that in another definition, the similarity between two pedestrians is used instead of the distance between them. In this case, the identity of the given query person $Q_i$ is defined as follows:

$$j^{*} = \arg\max_{j} \mathrm{Sim}(Q_i, G_j). \tag{0.3}$$

The returned result of person ReID is the gallery person who has the smallest distance (or largest similarity) to the given query person. However, in order to evaluate the performance of a person ReID method, a ranked list of the gallery persons is provided. This list is ranked in ascending order of distance (or descending order of similarity) to the given query person. Figure 1 shows an example of the ranked list of gallery persons corresponding to the given query, based on the similarities between the given query and each of the gallery persons.

Figure 1: The ranked list of gallery persons corresponding to the given query, based on the similarities between the query and each of the gallery ones. (Input: a probe/query person and the persons in the gallery set. Output: a list of gallery persons ranked by the similarity between each gallery person and the query person.)
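As a minimal sketch of the ranking described in the problem formulation, the following Python snippet sorts gallery persons by their distance to a query and reads off the best match. The Euclidean distance, the single feature vector per person, the function names `rank_gallery` and `rank_of_true_match`, and the toy values are all assumptions made for illustration; the formulation itself leaves the distance open (computed directly or learned through metric learning).

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by ascending distance to the query."""
    # d(Q_i, G_j) for every gallery person j (Euclidean distance is an assumption)
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    order = np.argsort(dists, kind="stable")  # ranked list, closest first
    return order, dists

def rank_of_true_match(order, true_index):
    """1-based position of the correct gallery person in the ranked list."""
    return int(np.where(order == true_index)[0][0]) + 1

# Toy data: one 2-D feature vector per gallery person (hypothetical values)
gallery = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
query = np.array([0.9, 1.2])

order, dists = rank_gallery(query, gallery)
best_match = int(order[0])  # j* = arg min_j d(Q_i, G_j)
```

The position returned by `rank_of_true_match` is the quantity that ranking-based evaluation builds on: a query counts as correctly matched at rank k when this position is at most k.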

Firstly, the strong variations in illumination, view-point, and pose are general difficulties in any image processing problem. These factors can make the appearance discrepancies of the same person even larger than those between different persons. Consequently, one of the crucial tasks in person ReID is to build a discriminative visual descriptor for person representation. This descriptor should highlight the characteristics of each individual and help to distinguish between different persons more easily. Figure 2 illustrates the variations in illumination and view-point. The figure shows that the color of a pedestrian's clothes changes significantly due to these variations. Pairs of images in the same column present the same person, and are captured in two different camera views.

Figure 2: An example for challenges caused by variations in a) illumination b) view-point.

The second challenge is the large number of images for each person in a camera view and the number of persons in the examined datasets. The number of identities as well as images in some evaluated datasets has grown rapidly in recent years. The early datasets have only hundreds of identities and thousands of images, whereas the latest datasets contain thousands of identities and millions of images. This places a significant burden on memory capacity, execution speed and computational complexity when solving the person ReID problem. Figure 3 shows some images of the same person captured in different camera views.

Besides, the number of images for each person in the existing datasets varies greatly. For example, in the PRID-2011 dataset, some persons have only 20 images, while others may have hundreds of images. This leads to an imbalance in person representation and also causes difficulties for person ReID.

Figure 3: A person has multiple images captured in different camera-views.

The third challenge is the effect of human detection and tracking results. In a fully-automatic surveillance system, the person ReID task is the last stage, whose inputs are the outcomes of the human detection and tracking stages, as illustrated in Figure 4. The performance of the two previous stages greatly affects the overall performance. Most existing studies deal with human regions of interest (ROIs) that are manually detected and segmented with well-aligned bounding boxes. Nevertheless, in an automatic surveillance system, many problems and errors appear in human detection and tracking, such as false detections, ID switches, and fragments. Consequently, these errors might cause a reduction in ReID accuracy. Though the latest methodology-driven methods surpass human-level performance on several commonly used benchmark datasets, improving accuracy for application-driven ReID is still a non-trivial task.

Figure 4: A fully-automatic person ReID system consisting of three main stages: human detection, tracking and re-identification.

Based on the above analysis, person ReID is undoubtedly an interesting but challenging task. As widely observed, person ReID has a wide range of application fields, even though it has to cope with many difficulties arising not only from realistic environments but also from hardware requirements. This motivates us to examine person ReID from different aspects. This research focuses on both methodology-driven and application-driven person ReID trends in order to provide a comprehensive understanding of this problem. The following sections discuss the research objectives, context, constraints and datasets in more detail.

Objective

The objectives of this research are as follows:

Robust person representation for multi-shot person ReID. The first objective of this dissertation is to find a novel method that reduces computation cost and memory requirements while retaining accuracy in the video-based person ReID approach. In video-based person ReID, computation cost and memory requirements are the two main issues. Working with a large number of images is a burden for any surveillance system, resulting in a high computation cost as well as a large memory footprint.

Improving accuracy through fusion schemes. Improving accuracy is the most important target of person ReID. In order to achieve this target, some existing works have concentrated on building an effective descriptor for person representation. However, each kind of feature behaves differently on a given image: a feature might be effective for one image but not for another. Besides, the complexity and diversity of current datasets keep increasing. Therefore, the second objective of this thesis is to improve person ReID accuracy through feature fusion schemes that take advantage of all kinds of features.

Evaluating the overall performance of an end-to-end person ReID pipeline. As discussed above, a practical person ReID system has three main stages: human detection, tracking, and person ReID. Most existing person ReID studies focus only on the person ReID stage and do not address the first two stages. With a view to applying the obtained person ReID results in a surveillance system, the final objective is to evaluate the overall performance of an end-to-end person ReID pipeline.

Context, constraints and datasets

Context and Constraints

There are different approaches to solving the person ReID problem in different contexts. In this thesis, the author adopts the context and constraints listed below.

Images/videos are captured in daylight conditions.

Regions of Interest (RoIs), also called bounding boxes, are generated either manually or automatically through human detection. These bounding boxes may be discrete or consecutive, and may or may not have temporal constraints.

The thesis focuses on short-term person ReID, in which the appearance and clothes of each pedestrian do not change during a certain period of time. In addition, pedestrians do not wear uniforms.

Person ReID is solved in the closed-set approach, in which each person has to appear in at least two cameras.

Contributions

This dissertation makes two main contributions.

Contribution 1: In multi-shot person ReID, a huge number of images can be captured for a person. Using all these images for person representation and matching may require high memory capacity and computational time. Moreover, we observe that a pedestrian's walking cycle repeats while he/she walks within a camera's field of view. This suggests extracting several representative frames for person representation instead of using all images. Therefore, this thesis proposes an effective method for multi-shot person ReID consisting of four main steps: representative frame selection, image-level feature extraction, temporal feature pooling and person matching. Two types of representative frames are considered in this work: the frames within a walking cycle and four key frames of a walking cycle.
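To make the idea of representative frames and feature pooling concrete, here is a small Python sketch under stated assumptions. It picks four frame indices from one walking cycle and pools their per-frame feature vectors into a single person-level vector. The even spacing of key frames, the average/max/min pooling modes, the function names `select_key_frames` and `pool_features`, and the toy features are hypothetical choices for illustration only; they are not taken from the thesis, whose actual selection rule and pooling techniques are described in Chapter 2.

```python
import numpy as np

def select_key_frames(cycle_start, cycle_end, num_keys=4):
    """Pick num_keys frame indices spread evenly over one walking cycle.

    Even spacing is an illustrative assumption, not the thesis's selection rule.
    """
    idx = np.linspace(cycle_start, cycle_end, num_keys)
    return np.round(idx).astype(int)

def pool_features(frame_feats, mode="avg"):
    """Collapse per-frame features (frames x dims) into one person-level vector."""
    if mode == "avg":
        return frame_feats.mean(axis=0)
    if mode == "max":
        return frame_feats.max(axis=0)
    if mode == "min":
        return frame_feats.min(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

# Toy sequence: 10 frames with a 3-D feature per frame (hypothetical values)
feats = np.arange(30, dtype=float).reshape(10, 3)
keys = select_key_frames(cycle_start=2, cycle_end=8)
person_vec = pool_features(feats[keys], mode="avg")
```

Whatever the pooling mode, the pooled vector has the same dimensionality as one frame feature, so the matching step works on a single vector per person regardless of how many frames were captured; this is where the savings in memory and computation come from.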

Contribution 2: Some previous studies have shown that each feature has its own discriminative power for person representation. Therefore, in order to leverage the considered features, different late fusion schemes are proposed for both settings of multi-shot person ReID. In the first setting, person ReID is treated as an information retrieval problem, in which the identity of the given query person is determined through the probability of his/her images belonging to each of the trained appearance models. In the second setting, a combination of metric learning and a fusion scheme is proposed to improve person ReID accuracy. Instead of using equal weights, feature weights are adaptively determined for each query based on the query characteristics: a larger weight is assigned to a feature that is more effective for the given query image.


Dissertation outline

In addition to the introduction and conclusion, the dissertation consists of four chapters and is structured as follows

Introduction: This chapter provides the main motivations and objectives of the thesis, as well as the contributions, constraints, and challenges of the research.

Chapter 1, entitled "Literature Review": This chapter reviews and synthesizes the existing literature in order to obtain a comprehensive understanding of previous research related to person ReID.

Chapter 2, entitled "Multi-shot person ReID through representative frames selection and temporal feature pooling": This chapter presents an effective framework for person ReID. This framework helps overcome the difficulties encountered when dealing with video-based person ReID.

Chapter 3, entitled "Person ReID performance improvement based on fusion schemes": This chapter introduces several fusion schemes for person ReID. Different kinds of features are integrated into both early and late fusion schemes. The proposed fusion schemes are evaluated in both settings of person ReID.

Chapter 4, entitled "Quantitative evaluation of an end-to-end person ReID pipeline": This chapter focuses on person ReID performance when considering the effect of the human detection and tracking steps. From this, the chapter aims to answer the question: how can the influence of the first two steps on person ReID be overcome with state-of-the-art person ReID methods?

Conclusion and future work: This section summarizes the contributions of this thesis and introduces some future work on the person ReID problem.


Chapter 1 LITERATURE REVIEW

A broad view of the person ReID problem is provided by the timeline drawn by Leng et al. [8] in Figure 1.1. As shown in this figure, several remarkable concepts for person ReID are multi-shot, metric learning, open-set ReID, end-to-end ReID, and so on. Depending on the number of images of a person, the availability of a training set, or the appearance of the query person in the gallery set, person ReID techniques are categorized into single-shot versus multi-shot, closed-set versus open-set, and supervised versus unsupervised learning. Next, the taxonomy used in person ReID is described.

Figure 1.1: Some important milestones for person ReID problem [8]. Several approaches related to this thesis are bounded by red blocks.

1.1 Person ReID classifications

1.1.1 Single-shot versus Multi-shot

The difference between the single-shot (image-based) and multi-shot (video-based) approaches is illustrated in Fig. 1.2. According to the aforementioned definition of person ReID, in single-shot scenarios the number of images for each query person and each person in the gallery set is one, while in multi-shot scenarios the values of n_i and m_j are greater than one. Early person ReID studies focused only on the single-shot approach, in which person matching mainly relies on the comparison between two images (one in the probe set and another in the gallery set) [10, 11, 12, 13]. By contrast, in multi-shot person ReID, each pedestrian is described by multiple images or sequences. The first studies on multi-shot person ReID were reported in 2010 [14, 15]. On the one hand, although the single-shot scenario seems to be far from realistic applications, the results obtained for this case are a crucial building block for the multi-shot case. The computational cost and memory storage requirements of the single-shot problem are much lower than those of multi-shot person ReID. On the other hand, multi-shot person ReID can provide richer and more useful information to improve ReID accuracy. However, multi-shot person ReID has its own issues, such as memory storage requirements and computation time. To address these issues, several studies have proposed extracting key frames that contain sufficient information to describe a pedestrian. This approach not only helps to remove redundant information but also increases calculation speed and saves memory capacity. These key frames are extracted based on clustering algorithms or on the distribution of motion energy.
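As an illustration of the clustering-based option, the following is a minimal sketch and not the method of any cited study. It assumes that each frame is already described by a feature vector, runs a plain k-means with evenly spaced initial centres, and keeps the frame closest to each centre. The function name select_key_frames and the toy features are hypothetical.

```python
import numpy as np

def select_key_frames(features, k, n_iter=20):
    """Pick k representative frame indices from an (n_frames, dim) array.

    Hypothetical helper: plain k-means with evenly spaced initial centres,
    then the frame closest to each centre is returned as a key frame.
    """
    features = np.asarray(features, dtype=float)
    n = len(features)
    # Deterministic initialisation: centres taken at evenly spaced frames.
    centres = features[np.linspace(0, n - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # Distance from every frame to every centre, shape (n, k).
        dists = np.linalg.norm(features[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[c] = members.mean(axis=0)
    dists = np.linalg.norm(features[:, None, :] - centres[None, :, :], axis=2)
    # Index of the frame nearest to each centre (duplicates are merged).
    return sorted(set(int(i) for i in dists.argmin(axis=0)))

# Toy sequence: 6 frames whose 2-D features form two tight groups.
frames = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
key_frames = select_key_frames(frames, k=2)
```

With this toy input, one key frame is taken from each group, so only two of the six frames would need to be stored and matched.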

1.1.2 Closed-set versus Open-set person ReID

In a broad view, person ReID can be formulated as a closed-set or an open-set problem. The majority of existing person ReID works belong to the former approach, with the assumption that each individual appears in both the probe and gallery sets. This task can be understood as a matching problem that seeks the occurrences of a query person (probe) in a set of person candidates (gallery), and is called closed-set person ReID. However, the hypothesis that images of the same person are captured by different cameras is not always satisfied, which limits its use in practical applications. Therefore, open-set person ReID has become an inevitable trend and has received more and more attention from researchers all over the world. Open-set person ReID aims to answer the question: "Does a query person appear in the gallery set?" In this case, the query person might or might not appear in the gallery set, and open-set person ReID turns into person verification [16, 17, 18, 19, 20]. This approach can be used in suspect retrieval if multiple images of a suspect or a criminal are available. Figure 1.3 shows an example of closed-set and open-set person ReID. In Figure 1.3a), the person appears on both cameras, while in Figure 1.3b) she appears only on Camera-A.

Figure 1.3: The differences between a) closed-set and b) open-set person ReID. In closed-set person ReID, an individual appears on at least two camera views. Conversely, in open-set person ReID, a pedestrian might appear on only one camera view.
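The practical difference between the two formulations can be sketched as follows. This is only an illustration under assumed inputs: the distances, the identity labels, and the verification threshold are hypothetical, and real open-set methods [16, 17, 18, 19, 20] are more elaborate than a single threshold.

```python
def closed_set_rank(distances):
    """Return gallery identities sorted from most to least similar."""
    return sorted(distances, key=distances.get)

def open_set_match(distances, threshold):
    """Return the best gallery identity, or None if even the best match
    is farther than the (assumed) verification threshold."""
    best = min(distances, key=distances.get)
    return best if distances[best] <= threshold else None

# Hypothetical distances from one query to three gallery persons.
query_to_gallery = {"id_07": 0.82, "id_12": 0.35, "id_31": 0.64}

ranking = closed_set_rank(query_to_gallery)       # always returns a full ranking
accepted = open_set_match(query_to_gallery, 0.5)  # best match is close enough
rejected = open_set_match(query_to_gallery, 0.2)  # query judged absent from gallery
```

The closed-set function always proposes a best candidate, whereas the open-set function may decide that the query person is not in the gallery at all.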

1.1.3 Supervised and unsupervised person ReID

Based on the availability of matched pairs used in the training phase, person ReID can be divided into supervised and unsupervised approaches. In general, training is the crucial phase in pattern recognition problems and helps improve the performance of the recognition process. Accordingly, pedestrian models across camera views are learned from a previously collected dataset, and the matching pairs are labelled for the training phase. This requirement not only creates a burden of human labor but also limits the scalability of person ReID. Moreover, the data labeling process becomes even more difficult, and eventually infeasible, when dealing with a large-scale dataset. In order to overcome this difficulty, some recent studies have followed the unsupervised approach, which exploits unlabelled data [21, 22, 23, 24, 25]. With this approach, person ReID is closer to a realistic context in which thousands of people appear in a camera view every day in public spaces such as airports, railway stations, and supermarkets. In this case, labeling is a self-taught process, without supervision. On the one hand, unsupervised methods help reduce human labor and move toward realistic systems. On the other hand, the matching rates of these methods are often much lower than those of the supervised ones. This is because, without manually labelled matching pairs across camera views, existing unsupervised models cannot learn the appearance transformation of the same person between different camera views.

As this research focuses on the supervised person ReID approach, two settings, in which a certain pedestrian's model is either previously trained or not, are further discussed. In the former setting, person ReID turns into person search [26, 27, 28]: the identity of a person of interest is determined based on the similarity between this person and each of the trained models. This means that the identities of the pedestrians in the testing set are the same as those in the training set. This setting is suitable for and employed in several real situations, such as human management and monitoring, suspect/criminal search, etc. Obviously, when a pedestrian's model is learned in advance, person ReID performance is greatly improved. By contrast, in the latter setting, the model of each individual in the testing phase is not learned in advance. Specifically, the identities of the pedestrians in the test phase are different from those in the training phase. This assumption is closer to a realistic context, and the majority of existing person ReID studies adopt the second evaluation setting. Figure 1.4 shows the differences between the two evaluation settings. Fig. 1.4a) uses the same colors to indicate that the given query persons exist in the gallery set, which means that their models are previously learned. Fig. 1.4b) illustrates the second setting, in which the query persons' appearance models are not previously known; this is depicted by query persons whose colors differ from those of the gallery persons.

In the following, some prominent studies on person ReID are analyzed in more detail to provide a better understanding of the existing approaches to this problem. As in any pattern recognition problem, feature extraction and metric learning are the two indispensable components of person ReID. Most existing person ReID works have tried to exploit the effectiveness of one or both of them, with the aim of improving ReID performance. The roles of these two steps are discussed in more detail in Sections 1.3 and 1.4. Section 1.5 introduces several outstanding studies with different fusion strategies to achieve higher person ReID accuracy. Section 1.6 discusses representative frame selection and feature pooling, the two key issues in video-based person ReID systems. Finally, end-to-end person ReID systems are presented in Section 1.7. In particular, not only ReID accuracy but also efficiency (time consumption) and memory requirements are considered.

1.2 Datasets and evaluation metrics

1.2.1 Datasets

Figure 1.4: Two popular settings for person ReID problem: a) The testing persons have appeared in the training set (represented by the same colors). b) Persons in the training and testing sets are completely different.

In the literature, there are numerous datasets used for person ReID evaluation. Some datasets are set up for either the single-shot or the multi-shot approach, while others are set up for both scenarios. In the single-shot approach, each person has only one image in each of the probe and gallery sets. Meanwhile, in the multi-shot approach, each person is represented by multiple images in both the probe and gallery sets. Additionally, two settings for person ReID are considered in this thesis. In the first setting, the persons in the testing set have appeared in the training set. Conversely, in the second setting, the testing and training sets are completely different. These concepts are described in more detail in Chapter 1. The five benchmark datasets used for performance evaluation of the proposed methods in this thesis are as follows.

Viewpoint Invariant Pedestrian Recognition (VIPeR) [29]: This is one of the most challenging datasets, with strong variations in pose, illumination, viewpoint, and occlusion. The dataset contains 1,264 images of 632 persons, with an image resolution of 128 × 48. Each person has a pair of images captured by two different non-overlapping cameras. This dataset is used for the single-shot case.

CAVIAR4REID [30]: This dataset contains 1,220 images of 72 pedestrians captured from two non-overlapping cameras in a shopping mall. However, only 50 persons appear in both cameras. This dataset was generated to maximize the variation in illumination conditions, occlusions, image resolutions, and viewpoints. In this dataset, the image resolutions vary strongly, from 17 × 32 to 72 × 144, which also causes difficulty for person ReID.

RAiD [31]: This dataset comprises 6,920 images of 43 individuals appearing in four cameras (two indoor, two outdoor). Only 41 of the 43 persons appear in all cameras. All images in this dataset are normalized to the same size of 64 × 128. The large illumination variations caused by collecting images from different scenarios are one of the challenges when working on this dataset. In this thesis, the gallery set is generated by randomly selecting 5 images for each person, and the remaining images are used for the probe set.

PRID-2011 (Person ReID) [32]: This dataset was created at the Austrian Institute of Technology (AIT) for experiments on person ReID. Images in this dataset are extracted from multiple pedestrian trajectories captured by two static non-overlapping cameras. These images suffer from large variations in illumination, viewpoint, pose, etc. Figure 1.5 shows the camera layout for the PRID-2011 dataset: two cameras are installed on two sides of the building at AIT with different viewpoints. The original videos have 475 pedestrians from one view and 856 from the other view, of which 245 persons appear in both views. Some images are filtered out due to strong occlusions, sudden disappearance/appearance, or because the number of reliable images for a person in a camera view is less than five. After filtering, there are 385 persons in camera view A and 749 persons in camera view B. The first 200 persons appear in both views and are used in person ReID experiments. It is worth noting that this thesis follows the experimental setting in [34], in which only the 178 persons having more than 21 images in an image sequence are chosen for evaluation. The data is randomly divided into two halves, one for training and the other for testing. This random division is repeated 10 times, and the reported result is the average over these 10 splits.

Figure 1.5: Camera layout for PRID-2011 dataset [33].
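The evaluation protocol described above (random half/half division, repeated 10 times and averaged) can be sketched as follows. The function names, the fixed seed, and the placeholder evaluation callback are assumptions made for illustration; the actual training and testing code is not part of this sketch.

```python
import random
from statistics import mean

def random_half_split(person_ids, rng):
    """Shuffle the identities and cut them into two equal halves."""
    ids = list(person_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

def evaluate_over_splits(person_ids, evaluate, n_splits=10, seed=0):
    """Average a score over n_splits random train/test divisions.

    `evaluate(train_ids, test_ids)` is a hypothetical callback that trains
    on one half, tests on the other and returns a matching rate.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        train_ids, test_ids = random_half_split(person_ids, rng)
        scores.append(evaluate(train_ids, test_ids))
    return mean(scores)

# Placeholder evaluation that only checks the protocol, not a real ReID model.
def dummy_evaluate(train_ids, test_ids):
    assert not set(train_ids) & set(test_ids)  # identities never overlap
    return len(test_ids) / (len(train_ids) + len(test_ids))

average_score = evaluate_over_splits(range(178), dummy_evaluate)
```

With 178 identities, each split holds 89 persons for training and 89 for testing, and the two halves never share an identity.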

iLIDS-VID (Imagery Library for Intelligent Detection Systems) [35]: This dataset was recorded at an airport arrival hall under a multi-camera CCTV network. It consists of 600 image sequences of 300 pedestrians. The length of each sequence varies from 23 to 192 images, with an average of 73. In practice, this dataset is captured by five non-overlapping cameras, as shown in Figure 1.6, which depicts five real contexts of the camera network used for capturing iLIDS-VID. In comparison with PRID-2011, iLIDS-VID is considered more challenging due to the similarities of clothes, the strong variations in illumination and viewpoint, cluttered backgrounds, and occlusions. In person ReID evaluation, this dataset is also randomly split into two halves, one for training and the other for testing. This process is performed 10 times to achieve a fair comparison.

Among the above five datasets, CAVIAR4REID and RAiD are set up following the first setting, while the three remaining datasets follow the second setting, in which one half is used for the training phase and the other half for the test phase. Table 1.1 summarizes the datasets used in the thesis.

Table 1.1: Benchmark datasets used in the thesis.

Figure 1.6: iLIDS-VID is captured by five non-overlapping cameras [36].

Figure 1.7: Some images from the five datasets used in this thesis: a) VIPeR, b) CAVIAR4REID, c) RAiD, d) PRID-2011, e) iLIDS-VID. For the first three datasets (VIPeR, CAVIAR4REID, and RAiD), images in the same column belong to the same person, while for the last two datasets, images in the same row belong to the same person.


In order to evaluate the proposed methods for person ReID, Cumulative Matching Characteristic (CMC) curves [37] are used. CMC is built from a ranked list of retrieved persons based on the similarity between each gallery person and a query person. The value of the CMC curve at each rank is the ratio of the number of queries whose true match is found within that rank to the total number of queried persons. The matching rates at several important ranks (1, 5, 10, 20) are usually used for evaluating the effectiveness of a method. Figure 1.8 illustrates CMC curves in which the CMC of method #2 is higher than that of method #1, which means that method #2 is better than method #1. The values in the curve captions are the accuracies at rank 1.
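The CMC computation described above can be sketched as follows. This is a minimal example under stated assumptions: the closed-set case in which every query identity occurs exactly once in the gallery, and a precomputed query-to-gallery distance matrix. The function name cmc_curve and the toy distances are illustrative only.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids):
    """Cumulative matching rates for a (n_query, n_gallery) distance matrix.

    Assumes the closed-set case in which every query identity occurs
    exactly once in the gallery.
    """
    dist = np.asarray(dist, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)                 # gallery sorted per query
    hits = gallery_ids[order] == query_ids[:, None]  # True where the match sits
    first_hit = hits.argmax(axis=1)                  # 0-based rank of true match
    counts = np.bincount(first_hit, minlength=len(gallery_ids))
    return np.cumsum(counts) / len(query_ids)

# Toy example: 3 queries against a gallery of 3 persons.
dist = [[0.1, 0.9, 0.8],   # query 0: true match ranked first
        [0.2, 0.5, 0.3],   # query 1: true match ranked third
        [0.7, 0.6, 0.4]]   # query 2: true match ranked first
cmc = cmc_curve(dist, query_ids=[0, 1, 2], gallery_ids=[0, 1, 2])
```

Here two of the three queries are matched correctly at rank 1, so the curve starts at 2/3, stays at 2/3 at rank 2, and reaches 1 at rank 3.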

1.3 Feature extraction

The first crucial component of any pattern recognition problem is the feature extraction step. In order to describe a pedestrian image, biometric cues (eyes, iris, gait) or visual appearance are exploited. These are considered the most useful information for person representation. However, because images/videos in person ReID are usually captured at low resolution, the information extracted from eyes or iris is not sufficient for person representation. Besides, gait is a whole-body, behavioral biometric that is considered a pedestrian's characteristic and has been studied for person ReID for a long time. However, it is not easy to extract human gait because of the complexity of realistic surveillance environments, such as airports, railway stations, and supermarkets. Additionally, human gait usually depends strongly on a person's mood and health condition. Consequently, the majority of existing person ReID studies mainly focus on the visual appearance of pedestrians [8]. In general, features are classified into two main categories: hand-designed and deep-learned features. In the early days, hand-designed features were proposed for image representation. These features are built on the experience and perception of researchers [28, 9]. In 2014, with the rapid development of Convolutional Neural Networks (CNNs), the first studies on deep-learned features were applied to person ReID. Since then, many works have paid attention to exploiting the capability of numerous deep networks. Features are also categorized into three different levels of abstraction: low-, mid-, and high-level features [38]. Low-level features contain color, texture, and shape information extracted from every pixel, while more advanced low-level features are created by computing a covariance matrix from image derivatives [39] or by searching for local key points (SIFT [40]). Mid- and high-level features are constructed by learning from pixel-level features. For example, Bag of Words (BOW) models [41], which encode low-level features into visual words, are considered mid-level features, and deep-learned features extracted from the last layers of a CNN are high-level ones.


... information of a pixel, such as the pixel coordinates, its intensity, as well as the first and second derivatives at this pixel. In this way, each pixel is represented by a 7-dimensional vector. These extracted feature vectors are then encoded into a Fisher Vector, giving Local Descriptors encoded by Fisher Vector (LDFV), for person representation. Zhao et al. [43] solved person ReID by finding salient regions in a pedestrian image. In this work, salient regions are defined as the outstanding and easily recognizable cues for distinguishing different persons. Dense patches of 10 × 10 pixels are extracted with a step size of 5 pixels, and then a 32-dimensional LAB color histogram and a 128-dimensional SIFT descriptor are computed on each patch. An adjacency-constrained search is exploited to seek the best match for a query patch in horizontal stripes at similar latitudes in gallery images. In this way, higher scores are assigned to more prominent patches. These scores are the basis for calculating the similarity between two persons.
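The 7-dimensional per-pixel description mentioned above (coordinates, intensity, first and second derivatives) can be sketched as follows. This is only an illustration: the use of np.gradient for the derivatives, the ordering of the seven components, and the absence of normalisation are assumptions, and the Fisher Vector encoding step is not shown.

```python
import numpy as np

def pixel_features(gray):
    """Stack (x, y, I, Ix, Iy, Ixx, Iyy) for every pixel of a 2-D image.

    Derivatives use np.gradient (finite differences); this choice and the
    absence of any normalisation are assumptions made for illustration.
    """
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]        # pixel coordinates
    iy, ix = np.gradient(gray)         # first derivatives along y and x
    iyy = np.gradient(iy, axis=0)      # second derivative along y
    ixx = np.gradient(ix, axis=1)      # second derivative along x
    feats = np.stack([xs, ys, gray, ix, iy, ixx, iyy], axis=-1)
    return feats.reshape(-1, 7)        # one 7-D vector per pixel

# Toy 4 x 5 image whose intensity grows linearly from left to right.
image = np.tile(np.arange(5.0), (4, 1))
features = pixel_features(image)
```

For this linear ramp, the horizontal first derivative is constant and all other derivatives vanish, which gives a quick sanity check of the layout of the feature vector.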

Considering the influence of illumination variations, the authors of [44] calculated color histograms in different color spaces and then combined these histograms to obtain a signature that is more robust to variations in illumination. In this work, the authors claimed that the performance of person representation is still not satisfactory if it relies only on color histograms. Based on this analysis, they proposed a novel Salient Color Names based Color Descriptor (SCNCD). For the SCNCD descriptor, 16 colors in RGB space are used: fuchsia, blue, aqua, lime, yellow, red, purple, navy, teal, green, olive, maroon, black, gray, silver, and white. Salient color names mean that each color has a certain probability of being assigned to its nearest color names, and a closer color name receives a higher probability. In order to overcome the challenge caused by different viewpoints, Liao et al. [45] focused on the horizontal occurrence of local features and maximized it to build a robust descriptor named LOMO. This descriptor includes color and Scale Invariant Local Ternary Pattern (SILTP) histograms [46]. Bins in the same horizontal stripe undergo max pooling, and a three-scale pyramid model is built before a log transformation. The LOMO descriptor was later employed in several works [47, 48].
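The horizontal max pooling step mentioned for LOMO can be sketched as follows. This is a simplified illustration, not the implementation of [45]: the local histograms are toy values, the pyramid is omitted, and log(1 + x) is an assumed form of the log transformation.

```python
import numpy as np

def pool_horizontal_stripe(patch_histograms):
    """Max-pool local histograms taken at the same height, then apply a log.

    patch_histograms: (n_patches_in_stripe, n_bins) array of non-negative
    bin counts. log(1 + x) is used here so that empty bins stay at zero;
    the exact transform in [45] may differ.
    """
    hists = np.asarray(patch_histograms, dtype=float)
    pooled = hists.max(axis=0)   # keep the strongest response of each bin
    return np.log1p(pooled)

# Three sliding-window positions in one stripe, 4-bin histograms (toy values).
stripe = [[3, 0, 1, 0],
          [1, 2, 0, 0],
          [0, 1, 4, 0]]
descriptor = pool_horizontal_stripe(stripe)
```

Because the maximum is taken over all positions in the stripe, reordering the patches horizontally leaves the descriptor unchanged, which is the intuition behind its robustness to viewpoint changes.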

In [39], the covariance descriptor is employed for image representation in person ReID. The aim of this descriptor is to encode information on the variances of features within a given image region, their correlations with each other, and their spatial locations. The effectiveness of this descriptor is based on the invariance properties of covariance matrices, which help to overcome changes in illumination and rotation. One more advantage of the covariance descriptor is that it can be computed on any kind of image, such as an intensity image or a color image. This descriptor is also used in the study of Matsukawa et al. [49]. In this study, the authors claimed that the mean and covariance are the most important cues for representing a person's appearance. They proposed the Gaussian of Gaussian (GOG) descriptor, which inherits the advantages of covariance-based descriptors. This descriptor is constructed at three different levels: pixel, patch, and region. A Gaussian distribution is applied twice, at the patch and region levels, hence the name Gaussian of Gaussian.
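A region covariance descriptor of the kind discussed above can be sketched in a few lines. The per-pixel features are random placeholders here, and the example only demonstrates one simple invariance (a constant offset added to a feature); it is not the full descriptor of [39] or the GOG pipeline of [49].

```python
import numpy as np

def region_covariance(pixel_feats):
    """Covariance matrix of the per-pixel feature vectors of one region.

    pixel_feats: (n_pixels, d) array. The choice of the d features
    (coordinates, colour, gradients, ...) is left to the caller.
    """
    feats = np.asarray(pixel_feats, dtype=float)
    return np.cov(feats, rowvar=False)   # (d, d), symmetric

rng = np.random.default_rng(0)
region = rng.random((200, 5))            # 200 pixels, 5 hypothetical features
cov = region_covariance(region)

# Adding a constant offset to a feature (e.g. a uniform brightness shift)
# leaves the covariance unchanged, because the mean is subtracted.
shifted = region + np.array([0.0, 0.0, 0.3, 0.0, 0.0])
cov_shifted = region_covariance(shifted)
```

Whatever the number of pixels in the region, the descriptor is a d × d matrix, which is why it can be computed on regions of any size and on any kind of image.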

Different from directly exploiting low-level color and texture features as in the above studies, some other works rely on attribute-based features, which can be considered a mid-level representation. In comparison with low-level descriptors, attribute features are more invariant and robust to changes in person images captured from cross-view cameras. In [50], the authors introduced 15 binary attributes for describing a person image, relating to attire and soft biometrics, such as shorts, skirt, sandals, backpack, long hair, short hair, gender, etc. These attribute classifiers are trained on low-level color and texture features. This information is combined with visual features extracted by the SDALF method [14] to obtain higher ReID performance.

In [51], the authors embedded the binary semantic attributes of the same person in cross-view cameras into a continuous low-rank attribute space, so that the attribute vector is more discriminative for matching. Shi et al. [52] introduced a framework to learn a semantic attribute model from existing fashion photography datasets. These attributes help person ReID achieve competitive results. Different from previous works, the authors took a generative modelling approach relying on the Indian Buffet Process (IBP). This model has several advantages, including simultaneous learning of all attributes and the ability to naturally exploit training data in both supervised and unsupervised manners. With great effort, Li et al. [53] built a large-scale dataset for pedestrian attribute recognition, namely Richly Annotated Pedestrian (RAP). This dataset was collected over a long term from real multi-camera surveillance scenarios and consists of 41,585 pedestrian samples with 72 attributes. In particular, environmental and contextual factors are also considered in this dataset.

From the above analysis, some remarkable points about hand-designed features can be drawn. Most existing studies have tried to build a descriptor that includes useful information about color, texture, or shape for person representation. To overcome difficulties caused by variations in illumination, viewpoint, and pose, this information might be extracted at different scales/levels to form a descriptor that is not only robust but also discriminative. Besides, attribute-based features are also considered and exploited as complementary information to appearance. In fact, hand-designed features are usually built based on the knowledge and experience of researchers. Therefore, a descriptor might be effective on several datasets but not on others. This is one of the disadvantages of hand-designed features. Additionally, in order to extract more and more information at different scales/levels, hand-designed features usually have a large dimension, for example approximately 26,960 dimensions for LOMO or 27,622 dimensions for GOG. High-dimensional vectors also make storage and computation in the person matching step more difficult.


1.3.2 Deep-learned features

Recent years have witnessed the impressive results of Convolutional Neural Networks (CNNs) on pattern recognition tasks. For person ReID, CNNs and their variants are employed to achieve higher performance. In 2014, the first two studies on person ReID exploiting the capability of deep networks were introduced [54, 55]. In [54], the authors proposed a framework using a Siamese deep neural network, in which appearance-based features (color, texture) and the similarity metric are learned simultaneously. The Siamese network has a symmetrical structure consisting of two sub-networks connected by a cosine layer. Each sub-network comprises two convolutional layers and a fully connected layer. In this way, the similarity metric is learned directly from features extracted from image pixels. An input image is divided into three overlapping horizontal parts, these parts are forwarded to the sub-networks, and the output of the Siamese network is the cosine-based similarity of the image pair.
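The role of the cosine layer can be illustrated with a small sketch. The embeddings below are toy vectors standing in for the outputs of the two sub-networks, and combining the three horizontal parts by summation is an assumption made for this example rather than a detail stated in the text above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_similarity(parts_a, parts_b):
    """Sum the per-part cosine similarities of two images.

    parts_a / parts_b: one embedding per horizontal part of each image.
    Summation over parts is an assumption made for illustration.
    """
    return sum(cosine_similarity(pa, pb) for pa, pb in zip(parts_a, parts_b))

# Toy embeddings for the three horizontal parts of two images.
image_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
image_b = [[2.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
score = pair_similarity(image_a, image_b)
```

In this toy case the first two parts point in the same direction (cosine 1 each) and the third pair is orthogonal (cosine 0), so the pair receives a score of 2 out of a possible 3.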

Figure 1.9: Proposed framework using Siamese Convolutional Neural Network (SCNN) [54]: a) Overall ...; b) The structure of SCNN.

Different from the above work, Li et al. [55] designed a new deep network with a ... photometric and geometric transforms, occlusion, and background clutter are jointly solved. Moreover, instead of employing hand-designed features, this deep network ...
